feat(benchmark): add append/pk table benchmark by lucasfang · Pull Request #302 · alibaba/paimon-cpp

lucasfang · 2026-05-26T01:56:12Z

Purpose

Linked issue: close #100

Add a Google Benchmark based performance suite covering:
1. BM_Write (append table write)
2. BM_Read (append table read with prefetch_parallel variants)
3. BM_PK_Write (primary key table write)
4. BM_MOR_Read (primary key MOR read with prefetch_parallel variants)
Add a benchmark entrypoint and custom CLI parsing for:
1. --paimon_source_parquet
2. --paimon_external_table_path
3. --paimon_file_format
4. --paimon_pk_columns
5. --paimon_option (repeatable)
Add benchmark helper utilities for:
1. Validation and skip behavior for invalid configurations
2. Shared read iteration execution
3. External table read mode (skip pre-write stage)
4. Source parquet loading and cache reuse
Build integration updates:
1. Add PAIMON_BUILD_BENCHMARKS option (default OFF)
2. Add benchmark subdirectory and benchmark target
3. Add add_paimon_benchmark macro and benchmark labels
4. Add benchmark dependency discovery via FindbenchmarkAlt
5. Add PAIMON_BENCHMARK_BUILD_VERSION=1.9.1 in third_party versions

Tests

UT:

CliOptionParsingTest.ConsumeCliOptionWorks
CliOptionParsingTest.ParseCsvColumnsWorks
CliOptionParsingTest.ParseCsvColumnsRejectsInvalidInput
CliOptionParsingTest.ParseDelimitedOptionsWorks
CliOptionParsingTest.ParseDelimitedOptionsRejectsInvalidInput
CliOptionParsingTest.ParseStringOptionArgWorksForEqualsAndSeparatedForms
CliOptionParsingTest.ParseStringOptionArgRejectsMissingValue
CliOptionParsingTest.ParseCsvOptionArgAndDelimitedRepeatableOptionArgWorks

Benchmark smoke:

BM_Write
BM_Read/1, BM_Read/2, BM_Read/4
BM_PK_Write
BM_MOR_Read/1, BM_MOR_Read/2, BM_MOR_Read/4

IT:

No new integration tests in this PR (benchmark and build integration scope)

API and Format

No change to public API under include.
No storage format or protocol compatibility changes.
Changes are limited to benchmark executable behavior and build wiring.

Documentation

No user-facing documentation change required for this PR.
Benchmark usage documentation can be added separately if needed.

Generative AI tooling

Generated-by: GitHub Copilot (GPT-5.3-Codex)

Copilot

Pull request overview

This PR introduces an optional Google Benchmark-based performance suite for Paimon (append-table write/read and PK-table write/MOR read), along with CMake/third-party wiring to build and run the benchmarks via CTest labels.

Changes:

Add a PAIMON_BUILD_BENCHMARKS build option and a benchmark CTest label/target integration.
Vendor/resolve Google Benchmark as a dependency (bundled/system) and add a FindbenchmarkAlt.cmake module.
Add a benchmark executable with custom CLI parsing and shared helper utilities + unit tests for CLI parsing.

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 9 comments.

Show a summary per file

File	Description
third_party/versions.txt	Adds Google Benchmark version metadata for bundled dependency download.
CMakeLists.txt	Adds `PAIMON_BUILD_BENCHMARKS` option + benchmark CTest target/labels and subdir wiring.
cmake_modules/ThirdpartyToolchain.cmake	Adds Benchmark dependency resolution and bundled build rule.
cmake_modules/FindbenchmarkAlt.cmake	Adds a “system” find module for Google Benchmark.
cmake_modules/DefineOptions.cmake	Adds `PAIMON_BUILD_BENCHMARKS` and `Benchmark_SOURCE` options.
cmake_modules/BuildUtils.cmake	Adds `add_paimon_benchmark` / `add_benchmark_case` helpers and CTest labeling.
benchmark/CMakeLists.txt	Defines benchmark executable target and CLI parsing unit test target.
benchmark/read_write_benchmark.cpp	Benchmark entrypoint with custom CLI parsing + Google Benchmark init/run.
benchmark/cli_option_parsing.h	Inline parsing helpers for custom benchmark CLI options.
benchmark/cli_option_parsing_test.cpp	GTest coverage for CLI parsing helpers.
benchmark/benchmark_suite.h	Declares benchmark runner functions and CLI helpers.
benchmark/benchmark_suite.cpp	Implements benchmark suite: table setup, write/commit, read iterations, caching, CLI handling.
benchmark/benchmark_helpers.h	Declares shared validation/skip and read-iteration helpers.
benchmark/benchmark_helpers.cpp	Implements validation/skip behavior and shared read-iteration runner.
benchmark/benchmark_case_write.cpp	Registers BM_Write benchmark.
benchmark/benchmark_case_read.cpp	Registers BM_Read benchmark variants.
benchmark/benchmark_case_pk_write.cpp	Registers BM_PK_Write benchmark.
benchmark/benchmark_case_mor_read.cpp	Registers BM_MOR_Read benchmark variants.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

zjw1111 · 2026-05-28T10:22:06Z

+
+set(PAIMON_BENCHMARK_STATIC_LINK_LIBS
+    paimon_shared
+    "-Wl,--whole-archive"


use variable such as PAIMON_LOCAL_FILE_SYSTEM_SHARED_LINK_LIBS

zjw1111 · 2026-05-28T10:24:46Z

+                         ${PAIMON_BENCHMARK_STATIC_LINK_LIBS}
+                         Threads::Threads
+                         ${CMAKE_DL_LIBS}
+                         rt


why need rt here? maybe it's cannot found at MacOS

zjw1111 · 2026-05-28T10:27:25Z

    resolve_dependency(ORC)
 endif()
+if(PAIMON_BUILD_BENCHMARKS)
+    resolve_dependency(Benchmark FIND_PACKAGE_NAME benchmark)


maybe dependency name can change from Benchmark to benchmark, then no need FIND_PACKAGE_NAME benchmark

zjw1111 · 2026-05-28T10:33:59Z

I found two issues in the current PR head:

[P1] benchmark/CMakeLists.txt unconditionally links GNU ld flags and rt (-Wl,--whole-archive, -Wl,--no-as-needed, -Wl,--as-needed, rt). This makes PAIMON_BUILD_BENCHMARKS=ON fail on non-Linux/ELF toolchains such as macOS, where those linker flags and librt are not available. Please guard these flags/libraries by platform or linker capability.
[P2] The PK/MOR benchmarks create tables with bucket=4, but MakeRecordBatch() always calls SetBucket(0), so all benchmark data is written into bucket 0. FileStoreWrite only validates that the bucket is in range; it does not re-hash rows by primary key. As a result, the PK/MOR benchmark layout does not represent a normal multi-bucket PK table. Please either distribute rows to buckets by PK hash or make the PK benchmarks explicitly use bucket=1.

The earlier checksum, URL_HASH, temp-directory cleanup, and help-exit concerns appear to be fixed in the current head.

lxy-9602 · 2026-05-29T05:58:32Z

+
+struct BenchmarkCliOptions {
+    std::string source_parquet;
+    std::string external_table_path;


Could source_parquet be renamed to source_data_file? I’m wondering whether we intend to support only Parquet as the source format here.

lxy-9602 · 2026-05-29T05:59:13Z

+
+bool HasHelpFlagImpl(int argc, char** argv) {
+    for (int i = 1; i < argc; ++i) {
+        const std::string arg(argv[i]);


int -> int32_t

lxy-9602 · 2026-05-29T06:04:09Z

+
+struct BenchmarkWorkspace {
+    explicit BenchmarkWorkspace(const std::string& prefix) {
+        std::error_code ec;


Can we use UniqueTestDirectory?

lxy-9602 · 2026-05-29T06:27:00Z

+    if (value == nullptr || std::strlen(value) == 0) {
+        return 4096;
+    }
+


Consider remove env.

lxy-9602 · 2026-05-29T06:35:09Z

+            continue;
+        }
+        batches.push_back(BuildStructArrayFromRecordBatch(record_batch));
+        total_rows += record_batch->num_rows();


Is all the data kept in memory before being written to Paimon? If so, this could lead to noticeable memory growth. It may also introduce significant memory access overhead during writing, which does not seem very consistent with the actual usage pattern.

lxy-9602 · 2026-05-29T06:38:33Z

+        .SetPrefetchMaxParallelNum(prefetch_parallel_num)
+        .EnableMultiThreadRowToBatch(false)
+        .SetRowToBatchThreadNumber(1);
+    auto read_ctx = ValueOrThrow(read_builder.Finish(), "create read context");


It would be better to make SetRowToBatchThreadNumber configurable, since in production it is usually set to 3 rather than 1.

lxy-9602 · 2026-05-29T06:52:05Z

+                              const std::string& option_name,
+                              std::vector<std::string>* columns_out) {
+    std::string parsed_value;
+    if (ConsumeCliOption(arg, option_name, &parsed_value)) {


Let’s keep output parameters consistently at the end and int->int32_t

lxy-9602 · 2026-05-29T06:54:15Z

+
+TEST(CliOptionParsingTest, ConsumeCliOptionWorks) {
+    std::string value;
+    EXPECT_TRUE(paimon::benchmark::ConsumeCliOption("--foo=bar", "--foo", &value));


Please prefer ASSERT* rather than EXPECT*

lxy-9602 · 2026-05-29T06:59:07Z

+}
+
+TEST(CliOptionParsingTest, ParseCsvColumnsWorks) {
+    const auto parsed = paimon::benchmark::ParseCsvColumns("id, name\tage", "--cols");


Why do we need to use CSV format here? If this is only for separation, would using , as the delimiter be sufficient?

lxy-9602 · 2026-05-29T07:01:43Z

 add_subdirectory(src/paimon/testing/mock)
 add_subdirectory(src/paimon/testing/utils)
 add_subdirectory(test/inte)
+add_subdirectory(benchmark)


Please also add a .rst for usage.

chore: add benchmark for append and pk table

13708a2

Copilot AI review requested due to automatic review settings May 26, 2026 01:56

Copilot started reviewing on behalf of lucasfang May 26, 2026 01:56 View session

Copilot AI reviewed May 26, 2026

View reviewed changes

lucasfang and others added 5 commits May 26, 2026 10:41

fix

9760fae

fix

19a724b

fix

2343a82

Merge branch 'main' into benchmark

d3f5835

Merge branch 'main' into benchmark

15be0f0

zjw1111 reviewed May 28, 2026

View reviewed changes

lxy-9602 reviewed May 29, 2026

View reviewed changes

Conversation

lucasfang commented May 26, 2026 • edited by lxy-9602 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Tests

API and Format

Documentation

Generative AI tooling

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zjw1111 commented May 28, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

lucasfang commented May 26, 2026 •

edited by lxy-9602

Loading