feat: multi-threaded slave (MTS) parallel DML apply #1692
Thanks for the PR, @jackiesre721. When we attempted MTS before, we ran into data consistency issues. If possible, could you run some integration tests and tests under high query load to increase confidence in correctness? If I have time, I will run some tests as well.
Implements LOGICAL_CLOCK-based parallel binlog event application, mirroring MySQL 5.7 MTS scheduling. With --num-workers=N, gh-ost applies DML events to the ghost table using N concurrent workers, significantly increasing throughput for high-write tables.

Key components:
- commitBarrier: dependency tracking via last_committed/sequence_number
- mtsScheduleState: new-group detection and epoch reset handling
- dmlCoordinator: transaction grouping and dependency-aware dispatch
- dmlWorker: per-worker goroutine with independent DB connection
- Deadlock-aware retry: immediate retry on errno 1213, 1s sleep on others
- Monotonic coordinate update: prevents checkpoint regression when workers complete out of order

Backward compatible: --num-workers=1 (default) uses the original single-threaded path with zero behavioral changes.
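To illustrate the monotonic coordinate update mentioned in the commit message, here is a minimal sketch of a checkpoint that never moves backwards when workers report completion out of order. The `coords` and `checkpoint` types and the `advance` method are hypothetical names for illustration, not the PR's actual code, and the file-name comparison assumes zero-padded binlog names of equal length.

```go
package main

import "sync"

// coords is a hypothetical stand-in for a binlog position (file + offset).
type coords struct {
	File string
	Pos  int64
}

// less reports whether a precedes b. Assumption: binlog file names are
// zero-padded and of equal length, so string order matches rotation order.
func (a coords) less(b coords) bool {
	if a.File != b.File {
		return a.File < b.File
	}
	return a.Pos < b.Pos
}

// checkpoint holds the furthest coordinates reported so far.
type checkpoint struct {
	mu  sync.Mutex
	cur coords
}

// advance stores c only if it is ahead of the current value, and returns
// the value held after the call. A late report from a slow worker is
// therefore ignored instead of moving the checkpoint backwards.
func (p *checkpoint) advance(c coords) coords {
	p.mu.Lock()
	defer p.mu.Unlock()
	if p.cur.less(c) {
		p.cur = c
	}
	return p.cur
}
```

With this shape, a worker that finishes an older transaction after a newer one has already been recorded leaves the stored coordinates unchanged.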
Replace naive gap-free LWM with dispatched-subsequence tracking. gh-ost sees only one table's binlog events, so sequence_numbers are sparse (5, 9, 14, ...). A gap-free LWM would stall at the first gap.

Key changes:
- Track dispatched sequence numbers; LWM advances over the committed prefix of that subsequence
- Detect cross-table dependencies via dispatched set membership (replaces explicit parentSeenOnStream bool)
- Fix sync.Cond.Wait() not responding to context cancellation by spawning a watcher goroutine that calls Broadcast() on ctx.Done()
- Guard delegatedJobs under mu to prevent lost-wakeup deadlocks
- Add delegatedJobCount() and reset() helpers

Fixes the 16-worker deadlock where concurrent deadlock retries caused all workers to block in waitForDependency indefinitely.
The CI runs localtests/test.sh without the -g flag, so --gtid must be in the per-test extra_args file for the MTS test to activate GTID mode and use logical timestamps for dependency tracking.
Hi @meiji163, great questions. Here is a detailed summary of the testing we have done and the specific consistency issues we identified and fixed.

PR #1454 Root Cause Analysis

PR #1454 previously had data consistency issues. We traced the root cause to two bugs:

1. Naive gap-free LWM stalls on sparse sequence numbers

gh-ost only streams binlog rows for the single migrated table, but MySQL assigns

Fix: Replaced gap-free LWM with dispatched-subsequence tracking. The coordinator records every transaction it dispatches via

2. Cross-table dependency false blocking

PR #1454 tracked whether a parent transaction was "seen on stream" via an explicit boolean. This had an observe/dispatch ordering hazard.

Fix: Cross-table dependencies are now detected directly from the dispatched set: if

3.