Monday, April 28, 2025

(Yet another) blog on monitoring Multi-threaded replication in MySQL

One of the most commonly misunderstood areas of MySQL configuration relates to parallel replication threads, and more importantly, how to monitor a replica to ensure it is configured effectively.

Background on parallel replication

Traditionally, MySQL replication was single threaded: changes applied on a writer database are fetched by a replica and applied serially by a single applier thread, in the same order they were executed on the writer. Over time, optimizations were introduced to allow replicas to apply non-dependent transactions/changes in parallel.

To decide whether a change is dependent, MySQL first introduced database-level dependency tracking, where changes applied to different databases could be applied in parallel on a replica.

Then MySQL introduced the "logical clock". When an event is written to the binary log on a MySQL database, whether a replica is connected or not, each transaction/event is associated with a logical timestamp. The logical timestamp is calculated according to the type of dependency tracking configured on the source, and consists of two values: last_committed and sequence_number.

  • sequence_number - each transaction in a binlog file, starting with the first, is assigned a sequence number starting from 1 and monotonically increasing from there. In other words, the Nth transaction in a file will have sequence_number == N
  • last_committed - this field of the timestamp is the sequence number of the most recent transaction which the current transaction depends on. In other words, this is the last transaction which the current transaction cannot be applied in parallel with.

Before MySQL 8.0 there was one mechanism for determining dependencies based on the logical timestamp: COMMIT_ORDER. The core concept is that transactions can safely run in parallel if they held their locks simultaneously at commit time on the source. This is implemented by having each transaction track its dependencies through a commit_parent variable, which is updated as the transaction executes. After each non-COMMIT statement, this variable is updated to match the latest transaction in the binlog, and at commit time this value becomes the last_committed number.

The mechanism works because once a transaction has executed its last pre-COMMIT statement, it holds all the locks it needs. Any transactions that commit during this window can safely run in parallel on a replica, since they would have held their locks concurrently. This approach ensures data consistency while allowing parallel execution, though the actual parallelism gains may be modest depending on workload. There are some caveats here: for example, while a DDL is executing no other replica worker threads can be active, which blocks all other threads from scheduling work, because all subsequent transactions are dependent on the DDL.
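To make the commit-window idea concrete, here is a toy Python model of my own (not MySQL's actual code): a transaction's last_committed is the sequence number of the most recent transaction that committed before the current one executed its final pre-COMMIT statement. The event format and function name are hypothetical.

```python
# Toy model of COMMIT_ORDER dependency tracking (illustrative only; the
# tuple format and function name are my own, not MySQL internals).
def assign_logical_timestamps(transactions):
    """transactions: list of (name, last_stmt_time, commit_time).
    last_committed = sequence number of the most recent transaction that
    committed BEFORE this one executed its last pre-COMMIT statement,
    i.e. before it held all of its locks."""
    # Sort by commit time, as the binary log would record them.
    ordered = sorted(transactions, key=lambda t: t[2])
    timestamps = {}
    for seq, (name, last_stmt, _commit) in enumerate(ordered, start=1):
        last_committed = 0
        for other_seq, (_, _, other_commit) in enumerate(ordered, start=1):
            # Anything committed before our last pre-COMMIT statement is a
            # dependency candidate; keep the latest one.
            if other_commit < last_stmt:
                last_committed = max(last_committed, other_seq)
        timestamps[name] = (last_committed, seq)
    return timestamps

# T1 and T2 held their locks at the same time, so they can apply in
# parallel; T3 started after both committed, so it depends on seq 2.
txns = [("T1", 1.0, 2.0), ("T2", 1.5, 3.0), ("T3", 4.0, 5.0)]
print(assign_logical_timestamps(txns))
# {'T1': (0, 1), 'T2': (0, 2), 'T3': (2, 3)}
```

Note how T1 and T2 share last_committed=0 (they overlapped at commit time), while T3 cannot be scheduled until sequence number 2 has applied.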

In this example, we have parsed the binary log using the mysqlbinlog tool.


Here we can see:

  • create database test_db — last_committed=0, sequence_number=1. This is the first statement in a new binary log file, so last_committed is zero, and the first statement is assigned a sequence number of 1.
  • create table test_db.t1(id int) — last_committed=1, sequence_number=2. This is the second transaction/event in this binary log file, so it is assigned a sequence number of 2. last_committed is set to 1, as this is the sequence number of the previous transaction/event on which this one depends. Logically, it would not make sense to try to execute the create table (sequence number 2) in a database which does not exist yet (sequence number 1).

This implementation (COMMIT_ORDER) is heavily dependent on the concurrency/throughput of the source system: the generated timestamps depend on commit rate, so sources with few concurrent threads commonly ran into dependency information that was not granular enough. As a result, MySQL introduced a new mechanism to determine dependencies called WRITESET dependency tracking, which could be enabled via binlog_transaction_dependency_tracking (WL#9556). Rather than relying on timing-dependent COMMIT_ORDER tracking, WRITESET tracks actual data dependencies between transactions. When WRITESET is used, MySQL creates transaction "writesets", which are collections of hashes; each hash represents a row modified on the MySQL source database.

The algorithm tracks:

  • sequence_number (unique transaction ID)
  • last_committed (most recent dependent transaction)
  • Overlapping writesets (indicating dependencies)

Using this information, MySQL is able to determine dependencies based on writesets, rather than just locks held at commit time. Two transactions are deemed dependent if their writesets overlap. In the binary log from the source, last_committed now points to the most recent transaction with shared data. Using writesets, independent transactions can execute in parallel even if:

  • They were sequential on the source
  • Their group commit/commit windows didn't overlap
  • They were in different sessions
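The writeset idea can be sketched the same way (again a toy model of mine, not MySQL's implementation; MySQL hashes rows with XXHASH64, md5 here is just for illustration):

```python
import hashlib

def row_hash(schema, table, pk):
    # Hash of a modified row's identity; a stand-in for MySQL's XXHASH64
    # over (schema, table, primary key).
    return hashlib.md5(f"{schema}.{table}#{pk}".encode()).hexdigest()

def assign_writeset_timestamps(transactions):
    """transactions: list of (name, set_of_row_hashes) in commit order.
    last_committed = sequence number of the most recent earlier
    transaction whose writeset overlaps with this one."""
    history = []  # (seq, writeset) of already-logged transactions
    out = {}
    for seq, (name, writeset) in enumerate(transactions, start=1):
        last_committed = 0
        for prev_seq, prev_ws in history:
            if writeset & prev_ws:  # overlapping rows => real dependency
                last_committed = max(last_committed, prev_seq)
        out[name] = (last_committed, seq)
        history.append((seq, writeset))
    return out

# T1 and T2 touch different rows: independent even though they were
# sequential on the source. T3 touches T1's row, so it depends on seq 1.
t1 = {row_hash("test_db", "t1", 1)}
t2 = {row_hash("test_db", "t1", 2)}
t3 = {row_hash("test_db", "t1", 1)}
print(assign_writeset_timestamps([("T1", t1), ("T2", t2), ("T3", t3)]))
# {'T1': (0, 1), 'T2': (0, 2), 'T3': (1, 3)}
```

Under COMMIT_ORDER, T3 would have depended on whatever committed last before it; with writesets it depends only on the transaction that actually touched the same row.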
The introduction of WRITESET provided huge improvements to the parallelism of replica apply threads, even when write concurrency was low. As concurrency increases on the source and group commit windows become larger, the benefits may be less noticeable. This is demonstrated in Oracle's blog on "Improving the Parallel Applier with Writeset-based Dependency Tracking": WRITESET and COMMIT_ORDER perform similarly under high thread counts/parallelism on the source, while WRITESET shows better performance in environments with lower concurrency.

Since I didn't want to write "yet another" blog on dependency tracking, I may have skipped over some details above, but if you want to read more into this I would highly recommend Daniel Nichter's blog series here:

  1. Group commit and trx dependency tracking
  2. Replica preserve commit order (RPCO)
  3. Monitoring MTR lag with the Performance Schema
Hat tip to JFG too, the OG of replication deep dives and monitoring! He has many presentations and posts on the topic over the years which I would recommend checking out; they are also linked in Daniel's blog series.
 

New Performance Schema Replication Tables

In part 3 of Daniel's blog series he raised some key challenges with monitoring replication in MySQL: specifically, how do I determine if I have over- or under-provisioned replication threads (MTR thread utilization), and how do I determine if underutilization is due to source throughput/dependencies (Waiting for Commit Order)? As he stated, there is currently no view available in MySQL to see this information easily, which got me excited when I saw the following in the 9.1 release notes:

  • Replication: This release adds the MySQL Replication Applier Metrics component, which provides users with statistical information about replication formerly logged in the error log. The component adds two tables containing this information to the MySQL Performance Schema: replication_applier_metrics provides replication applier metrics for a given replication channel, and replication_applier_progress_by_worker provides similar metrics for a specific worker.

    This enhances observability of replication by gathering statistics from the entire replication pipeline, and unifying their presentation. As part of this work, some metrics which were not especially helpful have been replaced with more useful ones.

    For more information about this component, see Replication Applier Metrics Component. (WL #15620)

    References: See also: Bug #32587480.


Looking at the docs, this was the solution to our problem! But the devil is in the details: these new performance schema tables were implemented as a component which is only available in MySQL Enterprise Edition. :(

This got me thinking: could we find this information elsewhere, and add some views to create a "poor man's" replication monitoring component?





Multi-threaded replica statistics

After some digging, it turns out that MySQL does actually expose information on MTS threads (5.3.2 Monitoring Replication Applier Worker Threads), and has done so for a number of years now. In fact, Percona blogged about this in 2017:
$ grep "Multi-threaded replica statistics" node1/data/msandbox.err | head -n 4
2025-01-27T18:07:46.516639Z 6 [Note] [MY-010559] [Repl] Multi-threaded replica statistics for channel '': seconds elapsed = 250; events assigned = 13313; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 7565719600 waited (count) when Workers occupied = 44 waited when Workers occupied = 1489733600
2025-01-27T18:09:46.150048Z 6 [Note] [MY-010559] [Repl] Multi-threaded replica statistics for channel '': seconds elapsed = 120; events assigned = 466945; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 55769034800 waited (count) when Workers occupied = 15175 waited when Workers occupied = 69850249500
2025-01-27T18:12:04.031062Z 16 [Note] [MY-010559] [Repl] Multi-threaded replica statistics for channel '': seconds elapsed = 120; events assigned = 976897; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 103952309000 waited (count) when Workers occupied = 0 waited when Workers occupied = 0
2025-01-27T18:15:09.087390Z 38 [Note] [MY-010559] [Repl] Multi-threaded replica statistics for channel '': seconds elapsed = 120; events assigned = 1163265; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 111850198700 waited (count) when Workers occupied = 0 waited when Workers occupied = 0

You can see all the definitions in the MySQL docs, but at first glance we can see this prints information on worker thread utilization (waited when Workers occupied) and clock conflicts (waited at clock conflicts). So what's the catch?

  1. This only gets printed when log_error_verbosity=3. Since MySQL defaults to 2, a lot of users probably have not seen this before.
    This is a dynamic parameter which can be changed online.
  2. When introduced, there were no views exposing this information at the database level, requiring grepping and wrangling of log files. This, coupled with the log_error_verbosity default, probably led to this log message being neglected by MySQL operators over the years.
    Luckily, MySQL 8.0 changes this with the introduction of the performance_schema.error_log table. More on this below.
  3. Messages are not printed in real time; as per the docs, "The statistics are printed depending on the volume of events that the coordinator thread has assigned to applier worker threads, with a maximum frequency of once every 120 seconds". So we only have outputs, at most, every two minutes.
    This shouldn't be an issue on most replication setups though; we should be generating enough replication events in every 2-minute window.
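Since these messages are plain text, the counters can also be pulled out with a small parser outside the database. Below is a sketch of my own (not an official tool); the regex follows the log line format shown above:

```python
import re

# Parse a "Multi-threaded replica statistics" error-log line into a dict.
# The group names mirror the fields in the log lines above; this is my own
# sketch, not part of MySQL.
STATS_RE = re.compile(
    r"Multi-threaded replica statistics for channel '(?P<channel>[^']*)': "
    r"seconds elapsed = (?P<seconds_elapsed>\d+); "
    r"events assigned = (?P<events_assigned>\d+); "
    r"worker queues filled over overrun level = (?P<queues_filled>\d+); "
    r"waited due a Worker queue full = (?P<waited_worker_queue_full>\d+); "
    r"waited due the total size = (?P<waited_pending_total_size>\d+); "
    r"waited at clock conflicts = (?P<clock_conflict_waits>\d+) "
    r"waited \(count\) when Workers occupied = (?P<workers_occupied_count>\d+) "
    r"waited when Workers occupied = (?P<workers_occupied_time>\d+)")

def parse_mts_stats(line):
    m = STATS_RE.search(line)
    if not m:
        return None
    stats = {k: int(v) if v.isdigit() else v for k, v in m.groupdict().items()}
    stats["channel"] = stats["channel"] or "default"
    return stats

line = ("2025-01-27T18:07:46.516639Z 6 [Note] [MY-010559] [Repl] "
        "Multi-threaded replica statistics for channel '': seconds elapsed = 250; "
        "events assigned = 13313; worker queues filled over overrun level = 0; "
        "waited due a Worker queue full = 0; waited due the total size = 0; "
        "waited at clock conflicts = 7565719600 waited (count) when Workers "
        "occupied = 44 waited when Workers occupied = 1489733600")
print(parse_mts_stats(line)["clock_conflict_waits"])  # 7565719600
```

This is essentially the same extraction the view below performs in SQL.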

New replication Views 

Now that we have a way of getting the information we can do a proof of concept to see if we can diagnose some causes for replication lag in MySQL, the poor man’s component.

As described above, from MySQL 8.0 we can use the performance_schema.error_log table to query the error log directly from the database. To demonstrate this I will load some sysbench sample data and run some tests. After a couple of minutes with log_error_verbosity=3 on the replica, you should see some outputs.

slave1 [*********:23244] {msandbox} (performance_schema) > SELECT * FROM
  performance_schema.error_log
WHERE
  SUBSYSTEM = 'Repl'
  AND ERROR_CODE = 'MY-010559'
  AND DATA LIKE '%Multi-threaded % statistics for channel%'\G

*************************** 1. row ***************************
    LOGGED: 2025-04-25 21:28:05.406046
 THREAD_ID: 6
      PRIO: Note
ERROR_CODE: MY-010559
 SUBSYSTEM: Repl
      DATA: Multi-threaded replica statistics for channel '': seconds elapsed = 132; events assigned = 5121; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 883466700 waited (count) when Workers occupied = 8 waited when Workers occupied = 86657300
.
.
*************************** 30. row ***************************
    LOGGED: 2025-04-25 22:37:53.021052
 THREAD_ID: 16
      PRIO: Note
ERROR_CODE: MY-010559
 SUBSYSTEM: Repl
      DATA: Multi-threaded replica statistics for channel '': seconds elapsed = 120; events assigned = 58414081; worker queues filled over overrun level = 0; waited due a Worker queue full = 0; waited due the total size = 0; waited at clock conflicts = 316849996600 waited (count) when Workers occupied = 55876 waited when Workers occupied = 31639630000
30 rows in set (0.00 sec)

Now that we have a way of getting the log messages, we can do some parsing of the DATA column to present it in a neater way. To do this I've created a view called binlog_replication_coordinator_stats, which you can see below in the appendix.

Column descriptions:

  • timestamp — When these statistics were logged
  • channel — Replication channel name ('default' if unnamed channel)
  • seconds_elapsed — Time period since the last stats output in the error log
  • events_assigned — Number of events the coordinator assigned to worker threads
  • avg_events_assigned — Average events assigned per second (events_assigned/seconds_elapsed)
  • queues_filled — Number of times worker queues exceeded the overrun level in this period. High numbers indicate workers can't keep up with the coordinator
  • waited_worker_queue_full — Times the coordinator waited because a worker's queue was full
  • waited_pending_total_size — Times the coordinator waited due to the total pending jobs size limit
  • clock_conflict_waits — Times the coordinator waited due to dependency/commit order conflicts. High numbers can indicate:
    1. Transactions touching the same databases (when slave_parallel_type=DATABASE)
    2. Dependency conflicts between transactions (when slave_parallel_type=LOGICAL_CLOCK)
    3. Commit order preservation overhead with slave_preserve_commit_order=ON:
       - Workers must commit in the same order as the source binlog
       - An earlier transaction in a worker queue blocks later ones
       - Critical for preventing data inconsistency
    Can be influenced by:
       - binlog_transaction_dependency_tracking setting on the source (COMMIT_ORDER, WRITESET, or WRITESET_SESSION)
       - slave_parallel_type setting on the replica (DATABASE or LOGICAL_CLOCK)
       - transaction writeset contents
       - transaction size and complexity
  • workers_occupied_count — Number of times the coordinator found all workers busy. Indicates a potential need for more worker threads
  • workers_occupied_time — Total time waited for busy workers. Long waits suggest workers are the bottleneck
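The triage logic I apply in the flow below can be written down as a toy heuristic over these columns (the thresholds and function name are entirely illustrative, not from MySQL):

```python
# Toy triage heuristic over one interval of the view's output; the rules
# and function name are my own illustration, not part of MySQL.
def recommend(stats):
    """stats: dict with clock_conflict_waits, workers_occupied_count,
    workers_occupied_time for one reporting interval."""
    suggestions = []
    if stats["workers_occupied_count"] > 0:
        # Coordinator found every worker busy at least once.
        suggestions.append("all workers were busy at times: consider raising "
                           "replica_parallel_workers")
    if stats["clock_conflict_waits"] > 0 and stats["workers_occupied_count"] == 0:
        # Workers are free, yet transactions still cannot be scheduled.
        suggestions.append("workers idle but waiting on dependency conflicts: "
                           "review dependency tracking (e.g. WRITESET) on the source")
    return suggestions or ["no obvious applier bottleneck this interval"]

# One of the intervals from the output below: workers free, huge clock waits.
print(recommend({"clock_conflict_waits": 90936245100,
                 "workers_occupied_count": 0,
                 "workers_occupied_time": 0}))
```

It is deliberately crude, but it captures the two questions from Daniel's series: are the workers saturated, or are they starved by dependencies?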

Example troubleshooting flow:

slave1 [*********:23244] {msandbox} (repl_mon) > pager grep "Seconds_Behind_Master"
PAGER set to 'grep "Seconds_Behind_Master"'
slave1 [*********:23244] {msandbox} (repl_mon) > show slave status\G
        Seconds_Behind_Master: 51
1 row in set, 1 warning (0.00 sec)

slave1 [*********:23244] {msandbox} (repl_mon) > select * from binlog_replication_coordinator_stats;
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
| timestamp                  | channel | seconds_elapsed | events_assigned | avg_events_assigned | queues_filled | waited_worker_queue_full | waited_pending_total_size | clock_conflict_waits | workers_occupied_count | workers_occupied_time |
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
| 2025-04-28 21:25:44.328627 | default |             345 |            1025 |                   3 |             0 |                        0 |                         0 |             24208500 |                      4 |               8697300 |
| 2025-04-28 21:27:44.012512 | default |             120 |         5208064 |               43401 |             0 |                        0 |                         0 |         100624635100 |                   8256 |            4027435500 |
| 2025-04-28 21:35:57.464912 | default |             493 |         4176896 |                8472 |             0 |                        0 |                         0 |          69082991700 |                   5879 |            2292737600 |
| 2025-04-28 21:37:57.020855 | default |             120 |         5331968 |               44433 |             0 |                        0 |                         0 |         100571333200 |                   7642 |            3408264300 |
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
4 rows in set (0.00 sec)

From the above, we can see some waits on Workers occupied and also clock conflicts. In show slave status we can see Seconds_Behind_Master is 51 seconds. This is an indicator that I may have under-provisioned @@replica_parallel_workers, but also that I may need to look into dependency tracking.

Since changing replica_parallel_workers is low-hanging fruit, let's start there.

To test this out I will:

  1. Increase replica_parallel_workers to a higher value and restart the replication thread.
  2. Observe replication lag to see if it decreases.
  3. Tune replica_parallel_workers further, or continue troubleshooting.

Step 1: Change replica_parallel_workers and restart replication:

slave1 [*********:23244] {msandbox} (repl_mon) > select @@replica_parallel_workers;
+----------------------------+
| @@replica_parallel_workers |
+----------------------------+
|                          4 |
+----------------------------+
1 row in set (0.00 sec)

slave1 [*********:23244] {msandbox} (repl_mon) > set global replica_parallel_workers=20;
Query OK, 0 rows affected (0.00 sec)

slave1 [*********:23244] {msandbox} (repl_mon) > select @@replica_parallel_workers;
+----------------------------+
| @@replica_parallel_workers |
+----------------------------+
|                         20 |
+----------------------------+
1 row in set (0.00 sec)

slave1 [*********:23244] {msandbox} (repl_mon) > stop slave;start slave;
Query OK, 0 rows affected, 1 warning (0.00 sec)

Query OK, 0 rows affected, 1 warning (0.09 sec)

Step 2: Wait 4-5 minutes to observe changes in workers_occupied_count and Seconds_Behind_Master. After a couple of minutes we can see that workers_occupied_count dropped to zero, which is great! But our replication lag has not improved.

slave1 [*********:23244] {msandbox} (repl_mon) > select * from binlog_replication_coordinator_stats;
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
| timestamp                  | channel | seconds_elapsed | events_assigned | avg_events_assigned | queues_filled | waited_worker_queue_full | waited_pending_total_size | clock_conflict_waits | workers_occupied_count | workers_occupied_time |
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
| 2025-04-28 21:25:44.328627 | default |             345 |            1025 |                   3 |             0 |                        0 |                         0 |             24208500 |                      4 |               8697300 |
| 2025-04-28 21:27:44.012512 | default |             120 |         5208064 |               43401 |             0 |                        0 |                         0 |         100624635100 |                   8256 |            4027435500 |
| 2025-04-28 21:35:57.464912 | default |             493 |         4176896 |                8472 |             0 |                        0 |                         0 |          69082991700 |                   5879 |            2292737600 |
| 2025-04-28 21:37:57.020855 | default |             120 |         5331968 |               44433 |             0 |                        0 |                         0 |         100571333200 |                   7642 |            3408264300 |
| 2025-04-28 21:42:53.061556 | default |             120 |         4802561 |               40021 |             0 |                        0 |                         0 |          90936245100 |                      0 |                     0 |
| 2025-04-28 21:44:53.014055 | default |             120 |         5512192 |               45935 |             0 |                        0 |                         0 |         103869504400 |                      0 |                     0 |
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
6 rows in set (0.00 sec)

slave1 [*********:23244] {msandbox} (repl_mon) > pager grep "Seconds_Behind_Master"
PAGER set to 'grep "Seconds_Behind_Master"'
slave1 [*********:23244] {msandbox} (repl_mon) > show slave status\G
        Seconds_Behind_Master: 75
1 row in set, 1 warning (0.00 sec)

Step 3: Now we should look into dependency tracking changes. Here I will take Daniel's advice and enable WRITESET on the source instance. This can be done online, so let's try:

master [*********:23243] {msandbox} ((none)) > set global binlog_transaction_dependency_tracking='writeset';
Query OK, 0 rows affected, 1 warning (0.00 sec)

master [*********:23243] {msandbox} ((none)) > select @@binlog_transaction_dependency_tracking,@@version;
+------------------------------------------+-----------+
| @@binlog_transaction_dependency_tracking | @@version |
+------------------------------------------+-----------+
| WRITESET                                 | 8.0.42    |
+------------------------------------------+-----------+
1 row in set, 1 warning (0.00 sec)

Now we need to wait a few minutes to allow the change to propagate.

Remember that dependency tracking is applied on the source, so only binlog events generated after binlog_transaction_dependency_tracking is changed will use writesets to persist dependencies to the binary log; we need to wait for the replica to catch up to the change point before it can take advantage of it.

slave1 [*********:23244] {msandbox} (repl_mon) > show slave status\G
        Seconds_Behind_Master: 143
1 row in set, 1 warning (0.00 sec)

slave1 [*********:23244] {msandbox} (repl_mon) > \n
PAGER set to stdout
slave1 [*********:23244] {msandbox} (repl_mon) > select * from binlog_replication_coordinator_stats;
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
| timestamp                  | channel | seconds_elapsed | events_assigned | avg_events_assigned | queues_filled | waited_worker_queue_full | waited_pending_total_size | clock_conflict_waits | workers_occupied_count | workers_occupied_time |
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
| 2025-04-28 21:25:44.328627 | default |             345 |            1025 |                   3 |             0 |                        0 |                         0 |             24208500 |                      4 |               8697300 |
| 2025-04-28 21:27:44.012512 | default |             120 |         5208064 |               43401 |             0 |                        0 |                         0 |         100624635100 |                   8256 |            4027435500 |
| 2025-04-28 21:35:57.464912 | default |             493 |         4176896 |                8472 |             0 |                        0 |                         0 |          69082991700 |                   5879 |            2292737600 |
| 2025-04-28 21:37:57.020855 | default |             120 |         5331968 |               44433 |             0 |                        0 |                         0 |         100571333200 |                   7642 |            3408264300 |
| 2025-04-28 21:42:53.061556 | default |             120 |         4802561 |               40021 |             0 |                        0 |                         0 |          90936245100 |                      0 |                     0 |
| 2025-04-28 21:44:53.014055 | default |             120 |         5512192 |               45935 |             0 |                        0 |                         0 |         103869504400 |                      0 |                     0 |
| 2025-04-28 21:46:53.018936 | default |             120 |         5500928 |               45841 |             0 |                        0 |                         0 |         104048594800 |                      0 |                     0 |
| 2025-04-28 21:48:53.020615 | default |             120 |         5524480 |               46037 |             0 |                        0 |                         0 |         *********300 |                      0 |                     0 |
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
8 rows in set (0.00 sec)

Step 4: Verify the effect of binlog_transaction_dependency_tracking=writeset. After a few minutes, we can now see replication lag has disappeared!

slave1 [*********:23244] {msandbox} (repl_mon) > pager grep "Seconds_Behind_Master"
PAGER set to 'grep "Seconds_Behind_Master"'
slave1 [*********:23244] {msandbox} (repl_mon) > show slave status\G
        Seconds_Behind_Master: 0
1 row in set, 1 warning (0.00 sec)

This is great, and shows the impact that clock conflicts were having on our replication throughput.

However, if we look at binlog_replication_coordinator_stats, we can now see the bottleneck has moved back to the number of worker threads!

Can we achieve more parallelism? Let's increase replica_parallel_workers again and see:

slave1 [*********:23244] {msandbox} (repl_mon) > select * from binlog_replication_coordinator_stats;
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
| timestamp                  | channel | seconds_elapsed | events_assigned | avg_events_assigned | queues_filled | waited_worker_queue_full | waited_pending_total_size | clock_conflict_waits | workers_occupied_count | workers_occupied_time |
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
| 2025-04-28 21:25:44.328627 | default |             345 |            1025 |                   3 |             0 |                        0 |                         0 |             24208500 |                      4 |               8697300 |
| 2025-04-28 21:27:44.012512 | default |             120 |         5208064 |               43401 |             0 |                        0 |                         0 |         100624635100 |                   8256 |            4027435500 |
| 2025-04-28 21:35:57.464912 | default |             493 |         4176896 |                8472 |             0 |                        0 |                         0 |          69082991700 |                   5879 |            2292737600 |
| 2025-04-28 21:37:57.020855 | default |             120 |         5331968 |               44433 |             0 |                        0 |                         0 |         100571333200 |                   7642 |            3408264300 |
| 2025-04-28 21:42:53.061556 | default |             120 |         4802561 |               40021 |             0 |                        0 |                         0 |          90936245100 |                      0 |                     0 |
| 2025-04-28 21:44:53.014055 | default |             120 |         5512192 |               45935 |             0 |                        0 |                         0 |         103869504400 |                      0 |                     0 |
| 2025-04-28 21:46:53.018936 | default |             120 |         5500928 |               45841 |             0 |                        0 |                         0 |         104048594800 |                      0 |                     0 |
| 2025-04-28 21:48:53.020615 | default |             120 |         5524480 |               46037 |             0 |                        0 |                         0 |         104187049300 |                      0 |                     0 |
| 2025-04-28 21:50:53.017684 | default |             120 |        13111296 |              109261 |             0 |                        0 |                         0 |          38211462000 |                 148868 |           34451537500 |
| 2025-04-28 21:52:53.016574 | default |             120 |         9456640 |               78805 |             0 |                        0 |                         0 |            265316200 |                  52396 |           14188590700 |
| 2025-04-28 21:54:53.014725 | default |             120 |         7398400 |               61653 |             0 |                        0 |                         0 |             43328300 |                   4493 |            3769165700 |
+----------------------------+---------+-----------------+-----------------+---------------------+---------------+--------------------------+---------------------------+----------------------+------------------------+-----------------------+
11 rows in set (0.00 sec)

slave1 [*********:23244] {msandbox} (repl_mon) > select @@replica_parallel_workers;
+----------------------------+
| @@replica_parallel_workers |
+----------------------------+
|                         20 |
+----------------------------+
1 row in set (0.00 sec)

slave1 [*********:23244] {msandbox} (repl_mon) > set global replica_parallel_workers=40;
Query OK, 0 rows affected (0.00 sec)

slave1 [*********:23244] {msandbox} (repl_mon) > stop slave;start slave;
Query OK, 0 rows affected, 1 warning (0.00 sec)

Query OK, 0 rows affected, 1 warning (0.10 sec)

Step 5: Wait 4-5 minutes to observe changes in workers_occupied_count.

Now we can see results have improved; for fun, we even increased to 100 to see how things go.

After doing so, we can see the bottleneck swung back to clock_conflict_waits and workers_occupied_count dropped to zero.

slave1 [*********:23244] {msandbox} (repl_mon) > select * from binlog_replication_coordinator_stats;
[... previous output truncated for brevity ...]

# For fun, lets increase even more

slave1 [*********:23244] {msandbox} (repl_mon) > select @@replica_parallel_workers;
+----------------------------+
| @@replica_parallel_workers |
+----------------------------+
|                        100 |
+----------------------------+
1 row in set (0.00 sec)

Final thoughts:

  • Replication is a first-class citizen in MySQL; it would be nice to have native CE views for this information instead of relying on error log hacks.
  • The performance_schema.error_log table is limited in size, so I would suggest periodically publishing metrics to your preferred monitoring solution so data is not deleted or rotated away accidentally.
  • log_error_verbosity is a dynamic variable, so if replication logging is too noisy for your liking, you can enable it only when needed; however, historical info can be useful in such investigations, so it's a tradeoff :).
  • Starting in MySQL 8.4, writeset is enabled by default (and is the only allowed configuration); on 8.0 you should really look into enabling it.
  • Parallel replication defaults to 4 worker threads from MySQL 8.0.27. Prior to this it was disabled by default.
  • Tuning of worker threads (and replication in general!) should be done with caution; there are always trade-offs, and you should optimize based on your requirements.
    In the above example, while provisioning 100 replica_parallel_workers removed the bottleneck on worker threads, my problem was solved at ~20 threads.
    Be very careful that you are not rearranging deck chairs on the Titanic. Sensible values can help avoid fires elsewhere and act as a throttling mechanism; you don't want to saturate other parts of your system unnecessarily at peak times.
  • It's not always the replication configuration, so be careful and verify while making changes. For example:
    • There can be other bottlenecks, such as missing primary keys, DDLs, and long-running transactions blocking replication applier threads, which can lead to lag too. See Daniel's blog for advice on monitoring these. I will try to do a follow-up here too if I can find time.
    • Increasing replica_parallel_workers can introduce bottlenecks elsewhere; in my case I started to see the following messages in the error log indicating my redo log size was not adequate. So be careful to test any such changes at both peak and regular load.
slave1 [*********:23244] {msandbox} (repl_mon) > select * from performance_schema.error_log where error_code='MY-013865' order by 1 desc limit 4;
+----------------------------+-----------+---------+------------+-----------+---------------------------------------------------------------------------------------------------+
| LOGGED                     | THREAD_ID | PRIO    | ERROR_CODE | SUBSYSTEM | DATA                                                                                              |
+----------------------------+-----------+---------+------------+-----------+---------------------------------------------------------------------------------------------------+
| 2025-04-28 22:12:12.303168 |         0 | Warning | MY-013865  | InnoDB    | Redo log writer is waiting for a new redo log file. Consider increasing innodb_redo_log_capacity. |
| 2025-04-28 22:11:59.514332 |         0 | Warning | MY-013865  | InnoDB    | Redo log writer is waiting for a new redo log file. Consider increasing innodb_redo_log_capacity. |
| 2025-04-28 22:11:47.038615 |         0 | Warning | MY-013865  | InnoDB    | Redo log writer is waiting for a new redo log file. Consider increasing innodb_redo_log_capacity. |
| 2025-04-28 22:11:39.048435 |         0 | Warning | MY-013865  | InnoDB    | Redo log writer is waiting for a new redo log file. Consider increasing innodb_redo_log_capacity. |
+----------------------------+-----------+---------+------------+-----------+---------------------------------------------------------------------------------------------------+
4 rows in set (0.01 sec)
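If you see this warning, note that innodb_redo_log_capacity is dynamic from MySQL 8.0.30 onwards, so the redo log can be resized without a restart. The 8 GiB value below is purely illustrative; size it for your workload:

```sql
-- innodb_redo_log_capacity is dynamic in MySQL 8.0.30+; size here is illustrative only
SET GLOBAL innodb_redo_log_capacity = 8 * 1024 * 1024 * 1024;
SELECT @@innodb_redo_log_capacity;
```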

Appendix

View Definition

Below is the definition for the view used in this blog post. Use and test with caution; it was created for demonstration purposes :)

CREATE OR REPLACE VIEW repl_mon.binlog_replication_coordinator_stats AS
WITH repl_stats AS (
  SELECT 
    LOGGED as timestamp,
    SUBSTRING_INDEX(SUBSTRING_INDEX(DATA, 'channel \'', -1), '\':', 1) as channel,
    CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(DATA, 'seconds elapsed = ', -1), ';', 1) AS UNSIGNED) as seconds_elapsed,
    CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(DATA, 'events assigned = ', -1), ';', 1) AS UNSIGNED) as events_assigned,
    CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(DATA, 'worker queues filled over overrun level = ', -1), ';', 1) AS UNSIGNED) as queues_filled,
    CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(DATA, 'waited due a Worker queue full = ', -1), ';', 1) AS UNSIGNED) as waited_worker_queue_full,
    CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(DATA, 'waited due the total size = ', -1), ';', 1) AS UNSIGNED) as waited_pending_total_size,
    CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(DATA, 'waited at clock conflicts = ', -1), ' waited', 1) AS UNSIGNED) as clock_conflict_waits,
    CAST(SUBSTRING_INDEX(SUBSTRING_INDEX(DATA, 'waited (count) when Workers occupied = ', -1), ' waited', 1) AS UNSIGNED) as workers_occupied_count,
    CAST(SUBSTRING_INDEX(DATA, 'when Workers occupied = ', -1) AS UNSIGNED) as workers_occupied_time
  FROM performance_schema.error_log
  WHERE 
    SUBSYSTEM = 'Repl' 
    AND ERROR_CODE = 'MY-010559'
    AND DATA LIKE '%Multi-threaded % statistics for channel%'
)
SELECT 
  timestamp,
  CASE WHEN channel = '' THEN 'default' ELSE channel END as channel,
  seconds_elapsed,
  events_assigned,
  ROUND(events_assigned/NULLIF(seconds_elapsed, 0)) as avg_events_assigned,
  queues_filled,
  waited_worker_queue_full,
  waited_pending_total_size,
  clock_conflict_waits,
  workers_occupied_count,
  workers_occupied_time
FROM repl_stats;

Quick notes on common issues and interpretations

Feel free to add comments as you see them!

  • High waited_worker_queue_full: Consider increasing replica_pending_jobs_size_max
  • High waited_pending_total_size: Check overall replication memory usage and limits
  • High clock_conflict_waits: Could be improved by:
    • Using WRITESET dependency tracking on the source (binlog_transaction_dependency_tracking)
    • Using the LOGICAL_CLOCK parallel type on the replica instead of DATABASE
    • Better database partitioning (if using DATABASE parallel type)
    • Optimizing transaction batching

    Note: If replica_preserve_commit_order=ON:

    • Conflicts are necessary for consistency
    • Disabling it may improve performance but:
      • Can break data consistency
      • May cause temporary read inconsistencies
      • Not suitable if applications depend on commit order
    • See https://hackmysql.com/replica-preserve-commit-order/ for more details
  • High workers_occupied counts/times: Consider increasing replica_parallel_workers
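For the WRITESET suggestion above, here is a sketch of enabling it on an 8.0 source; transaction_write_set_extraction must be XXHASH64 (the 8.0 default), and in 8.4 the variable is removed because WRITESET is always used:

```sql
-- On the source, MySQL 8.0 (variable removed in 8.4, where WRITESET is implied)
SET GLOBAL binlog_transaction_dependency_tracking = 'WRITESET';
SELECT @@binlog_transaction_dependency_tracking;
```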

Related Configuration Variables

  • replica_parallel_workers: Number of worker threads (per channel)
  • replica_parallel_type: DATABASE or LOGICAL_CLOCK
  • replica_preserve_commit_order: Affects commit order preservation
  • replica_pending_jobs_size_max: Maximum memory for worker queues holding events not yet applied
  • binlog_transaction_dependency_tracking: How master tracks dependencies (COMMIT_ORDER, WRITESET, or WRITESET_SESSION)
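To see how a replica is currently configured, the variables above can be pulled in a single query against performance_schema:

```sql
SELECT variable_name, variable_value
FROM performance_schema.global_variables
WHERE variable_name IN (
  'replica_parallel_workers',
  'replica_parallel_type',
  'replica_preserve_commit_order',
  'replica_pending_jobs_size_max',
  'binlog_transaction_dependency_tracking'
);
```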

Write-set Based Dependency Tracking

Write-set based dependency tracking can significantly reduce conflicts by:

  • Tracking actual row-level write conflicts instead of schema-level
  • Allowing parallel execution of non-conflicting transactions
  • Supporting better transaction scheduling
  • Reducing false dependencies compared to COMMIT_ORDER

Note: Each replication channel has its own coordinator thread and worker threads, allowing different parallel replication configurations per channel.
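One way to verify this is to count applier workers per channel in performance_schema:

```sql
-- Worker threads currently running for each replication channel
SELECT channel_name, COUNT(*) AS worker_threads
FROM performance_schema.replication_applier_status_by_worker
GROUP BY channel_name;
```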

Version Information

The examples in this blog post were tested with:

  • MySQL Version: 8.0.42

Last updated: April 2025
