Best Practices for Configuring MySQL Replication

MySQL provides several replication configuration options. However, ensuring it is done correctly may take time and effort, with considerable choices. Replication is a crucial initial step to enhance availability in MySQL databases. A properly designed replication architecture can significantly impact the accessibility of your data and prevent potential management complications. This article will delve into best practices for replication, covering local and long-distance scenarios.

Active and Passive Setups

When using active/passive mode for replication in MySQL, one server acts as the source while others serve as read-only replicas. The replicas can handle read-only queries, but all writes must be directed to the source. This configuration distributes the load across all replicas. Still, in the default asynchronous replication mode, there might be a delay between when data is written to the source and when it’s available on the replica. Keep this in mind while designing your application.

Alternatively, the active/active configuration allows multiple servers to handle both read and write operations. While it may seem beneficial to have two servers for write requests, each server still needs to process the query workload of the other, making write distribution less effective. Additionally, MySQL lacks native conflict resolution logic, so conflicts can quickly arise, making it challenging to determine the true source of data without significant data loss when a rebuild is necessary.

I recommend using the active/passive configuration for replication and sharding if you require higher throughput from your database.

Enable GTIDs

Global transaction identifiers (GTIDs) are unique IDs attached to transactions in a replicated environment. By default, replicas track processed records in the binary log file of a source database. However, this method is fragile if the source crashes and logs need to be restored.

With GTIDs enabled, each transaction receives a UUID-based ID, allowing replicas to determine if a transaction has been processed. The GTID is a UUID representing the source server, followed by an auto-incrementing ID (e.g., 14a54b2f-2ad0-43b6-b803-72b5d7151d3b:1). As transactions occur on the source, the UUID remains the same while the ID increases. When a replica processes a transaction, the GTID or GTID range (e.g., 14a54b2f-2ad0-43b6-b803-72b5d7151d3b:1-10) is stored in the gtid_executed table.

This process significantly reduces the risk of data becoming unsynchronized due to unprocessed or excessively processed transactions.

Replication Modes for Optimal Configuration

MySQL offers two replication modes that behave differently and should be understood to ensure optimal configuration for your environment. By default, MySQL employs asynchronous replication, where transactions are sent to the source, read by replicas and processed independently without source validation.

Alternatively, MySQL supports semi-synchronous replication, enabled as a plugin. In this mode, transactions are received by the source and processed by replicas, but the source waits for at least one replica to accept the transaction before responding. Semi-synchronous replication provides greater data consistency, though it introduces some overhead in response time.

With semi-sync mode, there are additional options that can be adjusted. By default, the primary server waits 10 seconds for a replica with semi-sync mode enabled to acknowledge the transaction. If data consistency relies on semi-sync, consider increasing this timeout value to ensure guaranteed consistency.

It’s important to note that these modes can be mixed and matched. For instance, you can configure one replica with semi-sync and another without, ensuring that one specific server always has an up-to-date copy of your database while other replicas catch up at their own pace. This approach aids disaster recovery by easily identifying the best candidate for recovery.

Understanding the distinctions between these modes will help you make informed decisions that align with your business needs.

Optimize Replica Instances With Separate Log Storage

Replica instances in your environment rely on the binary logs of the source for accurate information, as mentioned earlier. These logs are stored on the same disk as the database by default. However, this setup can overwhelm busy databases due to high throughput, including database manipulation and reading binary logs for replication. To address this, storing the binary logs on a separate disk from the database is recommended.

This approach offers performance benefits and cost savings in cloud environments where free volumes have strict IOPS limits.

Monitor Replication and Metrics in Managed Clusters

Monitoring is crucial for all infrastructure, including replication. With monitoring, it is possible to know if data is being replicated after configuration. SolarWinds database performance (formerly VividCortex) and Prometheus are popular solutions for monitoring replication and other metrics in managed clusters. Regardless of the chosen solution, it is essential to promptly notify the appropriate individuals when issues arise and address them before they escalate.

Create a Failover Plan for Minimizing Disruptions

Inevitable software or hardware failures necessitate failover planning to minimize potential disruptions. Utilizing replication offers resilience by ensuring multiple servers house your online data simultaneously. Therefore, your team must possess a well-prepared strategy should the primary data source fail. The following example outlines an unplanned failover process:

● Ensure the downed source remains offline to avoid unexpected replication issues.
● Identify and remove the read_only option from the chosen replica that will serve as the new data source. If using semi-sync, select the replica previously configured with the plugin and source.
● Update your application to direct queries to the newly promoted source.
● Modify the remaining replicas to start replicating from the new source.

Optimize Cross-Region Replication

Cloud providers typically provide services across multiple geographic regions comprising several data centers. These data centers, known as availability zones (AZs), are strategically located within the same region but far enough to withstand disasters. Replicating databases to different physical locations is recommended but requires careful consideration. Replication introduces additional latency due to the distance the data needs to travel. Fortunately, latency between AZs is generally low, with AWS claiming single-digit millisecond latency within the same region.

In contrast, there is significant latency between regions. For example, at the time of writing, the latency between us-east-1 and us-west-1 was reported to be over 60ms by Cloudping.co. Replication across regions results in replication lag, which is the time difference between data being written to the source and being written to a replica. This lag is even more pronounced when replicating across longer distances.

Asynchronous mode should be used for replication across regions to avoid unnecessary delays for applications making requests.

A comprehensive understanding of replication configuration options is crucial. However, it is equally important to determine the optimal configuration for these options. This can significantly impact your MySQL cluster’s operational efficiency and data-serving capabilities.