feat(benchmark): add append/pk table benchmark#302
Conversation
There was a problem hiding this comment.
Pull request overview
This PR introduces an optional Google Benchmark-based performance suite for Paimon (append-table write/read and PK-table write/MOR read), along with CMake/third-party wiring to build and run the benchmarks via CTest labels.
Changes:
- Add a
PAIMON_BUILD_BENCHMARKSbuild option and a benchmark CTest label/target integration. - Vendor/resolve Google Benchmark as a dependency (bundled/system) and add a
FindbenchmarkAlt.cmakemodule. - Add a benchmark executable with custom CLI parsing and shared helper utilities + unit tests for CLI parsing.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| third_party/versions.txt | Adds Google Benchmark version metadata for bundled dependency download. |
| CMakeLists.txt | Adds PAIMON_BUILD_BENCHMARKS option + benchmark CTest target/labels and subdir wiring. |
| cmake_modules/ThirdpartyToolchain.cmake | Adds Benchmark dependency resolution and bundled build rule. |
| cmake_modules/FindbenchmarkAlt.cmake | Adds a “system” find module for Google Benchmark. |
| cmake_modules/DefineOptions.cmake | Adds PAIMON_BUILD_BENCHMARKS and Benchmark_SOURCE options. |
| cmake_modules/BuildUtils.cmake | Adds add_paimon_benchmark / add_benchmark_case helpers and CTest labeling. |
| benchmark/CMakeLists.txt | Defines benchmark executable target and CLI parsing unit test target. |
| benchmark/read_write_benchmark.cpp | Benchmark entrypoint with custom CLI parsing + Google Benchmark init/run. |
| benchmark/cli_option_parsing.h | Inline parsing helpers for custom benchmark CLI options. |
| benchmark/cli_option_parsing_test.cpp | GTest coverage for CLI parsing helpers. |
| benchmark/benchmark_suite.h | Declares benchmark runner functions and CLI helpers. |
| benchmark/benchmark_suite.cpp | Implements benchmark suite: table setup, write/commit, read iterations, caching, CLI handling. |
| benchmark/benchmark_helpers.h | Declares shared validation/skip and read-iteration helpers. |
| benchmark/benchmark_helpers.cpp | Implements validation/skip behavior and shared read-iteration runner. |
| benchmark/benchmark_case_write.cpp | Registers BM_Write benchmark. |
| benchmark/benchmark_case_read.cpp | Registers BM_Read benchmark variants. |
| benchmark/benchmark_case_pk_write.cpp | Registers BM_PK_Write benchmark. |
| benchmark/benchmark_case_mor_read.cpp | Registers BM_MOR_Read benchmark variants. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| set(PAIMON_BENCHMARK_STATIC_LINK_LIBS | ||
| paimon_shared | ||
| "-Wl,--whole-archive" |
There was a problem hiding this comment.
use variable such as PAIMON_LOCAL_FILE_SYSTEM_SHARED_LINK_LIBS
| ${PAIMON_BENCHMARK_STATIC_LINK_LIBS} | ||
| Threads::Threads | ||
| ${CMAKE_DL_LIBS} | ||
| rt |
There was a problem hiding this comment.
why need rt here? maybe it's cannot found at MacOS
| resolve_dependency(ORC) | ||
| endif() | ||
| if(PAIMON_BUILD_BENCHMARKS) | ||
| resolve_dependency(Benchmark FIND_PACKAGE_NAME benchmark) |
There was a problem hiding this comment.
maybe dependency name can change from Benchmark to benchmark, then no need FIND_PACKAGE_NAME benchmark
|
I found two issues in the current PR head:
The earlier checksum, |
|
|
||
| struct BenchmarkCliOptions { | ||
| std::string source_parquet; | ||
| std::string external_table_path; |
There was a problem hiding this comment.
Could source_parquet be renamed to source_data_file? I’m wondering whether we intend to support only Parquet as the source format here.
|
|
||
| bool HasHelpFlagImpl(int argc, char** argv) { | ||
| for (int i = 1; i < argc; ++i) { | ||
| const std::string arg(argv[i]); |
|
|
||
| struct BenchmarkWorkspace { | ||
| explicit BenchmarkWorkspace(const std::string& prefix) { | ||
| std::error_code ec; |
There was a problem hiding this comment.
Can we use UniqueTestDirectory?
| if (value == nullptr || std::strlen(value) == 0) { | ||
| return 4096; | ||
| } | ||
|
|
| continue; | ||
| } | ||
| batches.push_back(BuildStructArrayFromRecordBatch(record_batch)); | ||
| total_rows += record_batch->num_rows(); |
There was a problem hiding this comment.
Is all the data kept in memory before being written to Paimon? If so, this could lead to noticeable memory growth. It may also introduce significant memory access overhead during writing, which does not seem very consistent with the actual usage pattern.
| .SetPrefetchMaxParallelNum(prefetch_parallel_num) | ||
| .EnableMultiThreadRowToBatch(false) | ||
| .SetRowToBatchThreadNumber(1); | ||
| auto read_ctx = ValueOrThrow(read_builder.Finish(), "create read context"); |
There was a problem hiding this comment.
It would be better to make SetRowToBatchThreadNumber configurable, since in production it is usually set to 3 rather than 1.
| const std::string& option_name, | ||
| std::vector<std::string>* columns_out) { | ||
| std::string parsed_value; | ||
| if (ConsumeCliOption(arg, option_name, &parsed_value)) { |
There was a problem hiding this comment.
Let’s keep output parameters consistently at the end and int->int32_t
|
|
||
| TEST(CliOptionParsingTest, ConsumeCliOptionWorks) { | ||
| std::string value; | ||
| EXPECT_TRUE(paimon::benchmark::ConsumeCliOption("--foo=bar", "--foo", &value)); |
There was a problem hiding this comment.
Please prefer ASSERT* rather than EXPECT*
| } | ||
|
|
||
| TEST(CliOptionParsingTest, ParseCsvColumnsWorks) { | ||
| const auto parsed = paimon::benchmark::ParseCsvColumns("id, name\tage", "--cols"); |
There was a problem hiding this comment.
Why do we need to use CSV format here? If this is only for separation, would using , as the delimiter be sufficient?
| add_subdirectory(src/paimon/testing/mock) | ||
| add_subdirectory(src/paimon/testing/utils) | ||
| add_subdirectory(test/inte) | ||
| add_subdirectory(benchmark) |
There was a problem hiding this comment.
Please also add a .rst for usage.
Purpose
Linked issue: close #100
Add a Google Benchmark based performance suite covering:
Add a benchmark entrypoint and custom CLI parsing for:
Add benchmark helper utilities for:
Build integration updates:
Tests
UT:
Benchmark smoke:
IT:
API and Format
Documentation
Generative AI tooling
Generated-by: GitHub Copilot (GPT-5.3-Codex)