DISTRIBUTED SQL: THE ARCHITECTURE BEHIND MARIADB XPAND
WHITEPAPER
April 2021
TABLE OF CONTENTS
INTRODUCTION
PROPERTIES
LOGICAL
DISTRIBUTED
TRANSACTIONAL
AVAILABLE
CONSISTENT
ELASTIC
XPAND ARCHITECTURE
SLICES
INDEXES
REPLICAS
CACHING
QUERY PATH
TRANSACTIONS
HIGH AVAILABILITY
SCALABILITY
CONCLUSION
Introduction

Early on, the scalability challenges of relational databases were addressed by manually sharding MariaDB and MySQL databases, resulting in custom solutions created for specific applications, services and queries. While sharding adds scalability, it is complex, brittle and difficult to maintain, and it places limitations on how data can be queried. That may not be a problem for businesses with virtually unlimited engineering resources, but for everyone else, it’s neither preferred nor practical.
Then there were multimillion-dollar hardware appliances like Oracle Exadata – a simpler alternative if cost was not a
factor. And finally, there was the emergence of NoSQL databases like Apache Cassandra, which provided scalability, but
at the cost of data integrity, consistency and a standard query language.
All of these approaches were workarounds that required businesses to give up something in order to get scalability,
whether it was flexibility, data integrity or millions of dollars. However, there is now a proper solution – distributed SQL.
In the past, relational databases were limited to replication for read scaling and scaling up (i.e., moving to bigger and
bigger servers) for write and storage scaling. Thus the need for sharding, hardware appliances or NoSQL.
Distributed SQL databases are relational databases engineered with a distributed architecture to scale out on
commodity hardware, on premises or in the cloud. However, there are no trade-offs. They perform ACID transactions
and support standard SQL like standard relational databases, but they can do it at a much greater scale.
MariaDB Xpand is a distributed SQL database built for scalability and performance. It can handle everything from
thousands of transactions per second to millions – all while maintaining sub-millisecond latency. In addition, MariaDB
Xpand implements the MySQL protocol and supports standard MariaDB and MySQL connectors, making it easy to add
scalability to applications running on MariaDB and MySQL without rewriting them.
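For illustration, the sketch below connects to Xpand with MariaDB Connector/Python, one of the standard connectors mentioned above. The host name, credentials, schema and column names are placeholders, not part of this paper; any Xpand node, or a MaxScale proxy in front of the cluster, could be the target.

    # Minimal connectivity sketch using MariaDB Connector/Python.
    # Host, credentials and schema are hypothetical placeholders.
    import mariadb

    conn = mariadb.connect(
        host="xpand.example.com",   # any Xpand node, or a MaxScale proxy
        port=3306,
        user="app_user",
        password="app_password",
        database="bookstore",
    )
    cur = conn.cursor()
    cur.execute("SELECT id, title FROM tbl_books WHERE id = ?", (1,))
    print(cur.fetchone())
    conn.close()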
This white paper describes the architecture of MariaDB Xpand, and explains how it scales out while at the same time
maintaining data integrity and strong consistency with ACID transactions.
Properties

Logical

In a distributed SQL database, no database instance is a complete database itself. When an application connects to a
database instance within a distributed SQL database, that database instance represents all of the database instances
combined (i.e., a logical database). As a result, applications can connect to any database instance to read and write
data.
Distributed
All of the database instances within a replicated database maintain a full copy of the data and execute queries on their
own, independent of any other database instance, whereas each database instance within a distributed SQL database
maintains a subset of the data and has other database instances participate in query execution when needed.
The tables within a distributed SQL database are divided into groups of rows with different groups stored on different
database instances. As a result, the rows within a table are spread evenly across all database instances. However, any
database instance can read or write rows stored on any other. If the results of a query contain rows stored on multiple
database instances, query execution will be expanded to include them.
Transactional
Like replicated databases, distributed SQL databases use ACID transactions, and like queries, transactions can span
multiple database instances. If a write modifies rows stored on multiple database instances, those database instances
will participate in a single distributed transaction – either all of the rows are modified, or none of them are.
Available
Replicated databases provide high availability with automatic failover, whereby if the primary database instance fails,
one of the replicas is promoted to be the new primary. Applications may be unable to write data until the promotion is
complete.
Distributed SQL databases provide continuous availability because there is no primary – every database instance is
capable of writing data (and does). In order to ensure all data remains available after a database instance has failed,
distributed SQL databases store copies of rows on multiple database instances. If a database instance fails, its data
remains available through copies of its rows stored on other database instances.
Consistent

Replicated databases can serve stale data because replicas are updated asynchronously and may lag behind the primary.
This is not a concern with distributed SQL databases, as writes to rows stored on multiple nodes are performed
synchronously within transactions, ensuring strong consistency and preventing stale reads.
Elastic
Adding replicas to replicated databases scales reads, but it’s not necessarily easy or fast – and it often can’t be done
on demand. In many cases, new replicas must be created from backups and caught up via replication before they can
be used. In terms of writes and storage, it is necessary to scale up or down by creating a new database instance on a
bigger or smaller server – and often using the same approach as adding a new replica (i.e., creating a new database
instance from a backup).
Distributed SQL databases are intended to be scaled on demand, allowing nodes to be added or removed in production
without disrupting applications. When nodes are added or removed, distributed SQL databases will automatically
rebalance the data in order to maintain an even distribution of data, and in turn, an even distribution of the workload
itself. This makes distributed SQL databases particularly effective for applications with volatile workloads or significant
but temporary peaks (e.g., e-commerce and Black Friday).
In addition, distributed SQL databases are designed to be scaled out and back. Rather than having to upgrade or
downgrade servers, a time-consuming and disruptive process, nodes can be added or removed on demand in order to
increase or decrease capacity on the fly – and without incurring any downtime.
Xpand architecture

Slices

Xpand divides every table into slices, with each table's rows spread across them. In the following example, the books
table (tbl_books) in a three-node cluster has three slices by default, with each slice stored on a separate node.
Figure 1
Rows are mapped to slices using a distribution key. By default, it is the first column of the primary key. Optionally, it
can be the first n columns of the primary key if it is a composite key, with n ranging from one to the total number of
columns in the primary key.
The distribution key is hashed into a 64-bit number using a consistent hashing algorithm, with each slice assigned a
range (i.e., min/max). In order to determine which slice a row is stored in, the hash of its distribution key is compared
against the range of each slice.
The example below shows min/max values for three slices, and for simplicity, assumes the maximum value of an
unsigned short.

Slice   Min      Max
1       0        21845
2       21846    43690
3       43691    65535
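To make the mapping concrete, here is a simplified sketch in Python of the lookup described above. Xpand's actual hash function is internal to the database; a generic hash truncated to the 16-bit example space is used here purely for illustration.

    # Sketch: map a distribution key to a slice by hashing it and comparing
    # the result against each slice's min/max range (values from the table above).
    import hashlib

    SLICES = [
        (1, 0, 21845),        # (slice, min, max)
        (2, 21846, 43690),
        (3, 43691, 65535),
    ]

    def hash_key(distribution_key) -> int:
        # Stand-in for Xpand's internal consistent hash, reduced to 16 bits
        # so it lines up with the unsigned-short example.
        digest = hashlib.sha256(repr(distribution_key).encode()).digest()
        return int.from_bytes(digest[:2], "big")

    def slice_for(distribution_key) -> int:
        h = hash_key(distribution_key)
        for slice_no, lo, hi in SLICES:
            if lo <= h <= hi:
                return slice_no
        raise ValueError("hash outside all slice ranges")

    print(slice_for(1))   # the row with primary key id=1 maps to exactly one slice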
Continuing the example above, different rows within the books table are stored in separate slices (tbl_books slice 1,
slice 2 and slice 3) based on the hash of their primary key (id).
Indexes

In the following example, a secondary index on the author column is created for the books table (idx_author). It is
divided into three slices by default, with each slice stored on a separate node.
Figure 2
Like table rows, index entries are mapped to slices using a distribution key and consistent hashing. By default, the
distribution key is the first column of the index. Optionally, it can be the first n columns of a composite index, with n
ranging from one to the total number of columns in the index.
Continuing the example above, different entries within the index are stored in separate slices based on the hash of
their column(s).
idx_author, slice 1
Iain M. Banks        1
Alastair Reynolds    7, 8

idx_author, slice 2
William Gibson       2
R. E. Stearns        9

idx_author, slice 3
Peter F. Hamilton    3, 4, 5
Cixin Liu            6
Replicas

For fault tolerance, Xpand stores multiple replicas of each slice on different nodes. In the following example, there are
two replicas of the books table and author index slices.
Figure 3
Caching
Xpand, like any other database, caches as much data as possible in memory. However, unlike many others, it uses a
distributed cache – effectively combining the memory of multiple nodes to create a single cache with no duplicate
entries. When slices have multiple replicas, one of them is designated the ranking replica. This replica is used for reads
while the others are there for high availability. And because nodes only cache their ranking replicas (of which there is
only one), there are no duplicates – maximizing the amount of data the entire cluster can cache.
In the following example, ranking replicas are cached in memory while all replicas, including ranking replicas, are stored
on disk. If needed, simply add nodes to ensure all ranking replicas can be stored in memory for low-latency reads.
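The effect of caching only ranking replicas can be sketched as follows; the slice placement below is illustrative rather than taken from the figure.

    # Sketch: each node caches only the slices for which it holds the ranking
    # replica (rr), so the cluster-wide cache contains each slice exactly once.
    placement = {
        # slice: (ranking replica node, other replica nodes)
        1: ("node1", ["node2"]),
        2: ("node2", ["node3"]),
        3: ("node3", ["node1"]),
    }

    cache_plan = {}
    for slice_no, (rr_node, _others) in placement.items():
        cache_plan.setdefault(rr_node, []).append(slice_no)

    print(cache_plan)   # {'node1': [1], 'node2': [2], 'node3': [3]}
    # Memory is pooled rather than mirrored: adding a node adds cache capacity.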
Query path
Queries can be routed to any node within a cluster because every node is aware of how the data is distributed: any
node can determine which nodes need to participate in a query and forward the query (or part of it) to them.
Reads
Primary key
If a table is queried by its primary key, the Xpand node that receives the query hashes the primary key and determines
which node has the table slice the row is stored in. The receiving node then forwards the query to this node and returns
the results.
MariaDB MaxScale is an advanced database proxy. It abstracts away the database infrastructure, making it look like
a standalone database regardless of whether there are 3 nodes or 100. In the example below, MaxScale routes the
query to Node 2. Node 2 then determines the row with id=1 is stored in slice 3 based on its hash value (60000), with
replicas of slice 3 stored on Node 1 and Node 3.
tbl_books slices
Min      Max      Slice   Nodes
0        21845    1       1 (rr), 2
21846    43690    2       2 (rr), 3
43691    65535    3       3 (rr), 1
Node 2 will then forward the query to Node 3 because it has the ranking replica, and return the results to MaxScale
(which returns the results to the client).
Figure 5
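A sketch of that routing decision, using the example's slice map and the hash value 60000 for id=1; the node names are illustrative.

    # Sketch: pick the node holding the ranking replica for a hashed primary key.
    SLICE_MAP = [
        # (min, max, slice, ranking replica node, other replica nodes)
        (0,     21845, 1, "node1", ["node2"]),
        (21846, 43690, 2, "node2", ["node3"]),
        (43691, 65535, 3, "node3", ["node1"]),
    ]

    def route_read(hash_value: int) -> str:
        for lo, hi, _slice_no, rr_node, _others in SLICE_MAP:
            if lo <= hash_value <= hi:
                return rr_node
        raise ValueError("hash outside all slice ranges")

    # id=1 hashes to 60000, which falls in slice 3, so the receiving node
    # (Node 2 in the example) forwards the query to Node 3.
    print(route_read(60000))   # -> node3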
Secondary indexes
If a table is queried using a secondary index, the Xpand node that receives the query hashes the index key and
determines which node has the index slice the key is stored in.
MaxScale forwards the query to Node 2. Node 2 then determines the index key ‘Alastair Reynolds’ is stored in Slice 1
based on its hash value (5000), with replicas of slice 1 stored on Node 1 and Node 2.
idx_author slices
Min      Max      Slice   Nodes
0        21845    1       1 (rr), 2
Node 2 then looks up the primary keys of matching rows by checking Slice 1 on Node 1 because it is the ranking replica.
idx_author, slice 1
Alastair Reynolds    7, 8
Finally, Node 2 determines rows with id=7 and id=8 are stored in Slice 2 and Slice 3 based on their hashes, and then
forwards the query to Node 2 (itself) and Node 3 because they have the ranking replicas of Slice 2 and Slice 3.
Figure 6
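The two-step lookup can be sketched as follows. The hash values for ids 7 and 8 are invented solely so that they land in slices 2 and 3, matching the example.

    # Sketch: resolve the secondary index entry first, then fan the row lookups
    # out to the nodes holding each row's ranking replica.
    IDX_AUTHOR_SLICE_1 = {"Alastair Reynolds": [7, 8]}   # ranking replica contents

    ROW_HASHES = {7: 40000, 8: 50000}                    # illustrative hash values
    TBL_SLICES = [
        (0,     21845, 1, "node1"),                      # (min, max, slice, rr node)
        (21846, 43690, 2, "node2"),
        (43691, 65535, 3, "node3"),
    ]

    def rr_node_for(hash_value: int) -> str:
        for lo, hi, _slice_no, node in TBL_SLICES:
            if lo <= hash_value <= hi:
                return node
        raise ValueError("hash outside all slice ranges")

    # Step 1: the index slice's ranking replica yields the matching primary keys.
    primary_keys = IDX_AUTHOR_SLICE_1["Alastair Reynolds"]

    # Step 2: forward the row reads to each row's ranking replica.
    targets = {pk: rr_node_for(ROW_HASHES[pk]) for pk in primary_keys}
    print(targets)   # {7: 'node2', 8: 'node3'} -- matches the example above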
Writes

Writes use consistent hashing in the same way reads do: the node that receives the write hashes the primary key (or
index key) to determine which node(s) it should be forwarded to.
MaxScale forwards the query to Node 2. Node 2 then determines the row with id=9 is stored in Slice 3 based on its hash
value (64000), with replicas of Slice 3 stored on Node 1 and Node 3.
tbl_books slices
Min      Max      Slice   Nodes
0        21845    1       1 (rr), 2
21846    43690    2       2 (rr), 3
43691    65535    3       3 (rr), 1
Node 2 then forwards the query to all nodes with replicas of Slice 3, updating the row on Node 1 and Node 3 at the
same time.
Figure 7
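A sketch of the fan-out: unlike a read, which goes only to the ranking replica, a write is applied to every replica of the slice. The slice map repeats the example's layout, and the node names are illustrative.

    # Sketch: a write is forwarded to all replicas of the target slice.
    SLICE_MAP = [
        # (min, max, slice, replica nodes -- the first is the ranking replica)
        (0,     21845, 1, ["node1", "node2"]),
        (21846, 43690, 2, ["node2", "node3"]),
        (43691, 65535, 3, ["node3", "node1"]),
    ]

    def replicas_for(hash_value: int) -> list[str]:
        for lo, hi, _slice_no, nodes in SLICE_MAP:
            if lo <= hash_value <= hi:
                return nodes
        raise ValueError("hash outside all slice ranges")

    # id=9 hashes to 64000 in the example, so the update is applied on both
    # replicas of slice 3 (Node 3 and Node 1) before it is acknowledged.
    print(replicas_for(64000))   # -> ['node3', 'node1']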
Transactions

When a node receives a transaction, it becomes the transaction coordinator for it. The transaction is assigned a
transaction id (xid), and each subsequent statement within the transaction is assigned an invocation id (iid). When
rows are modified, write locks are obtained for them and new MVCC entries are created. The write locks do not block
reads, only other writes. If a write fails because it cannot obtain a write lock, the transaction will be rolled back.
To commit the transaction, the transaction coordinator initiates the PREPARE phase. Next, during the ACCEPT phase,
it selects three random nodes to persist the transaction state and waits for a consensus using Paxos – ensuring
the transaction can be completed even if the transaction coordinator fails. Finally, during the COMMIT phase, all
participating nodes persist their MVCC entries and remove any write locks.
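The flow can be pictured with a schematic sketch. It is not Xpand's internal implementation: locks, Paxos and durability are reduced to print statements so that only the sequence of phases is visible, and the sample UPDATE statement and its column are hypothetical.

    # Schematic sketch of the commit flow: writes take locks and create MVCC
    # entries, then PREPARE -> ACCEPT (consensus) -> COMMIT.
    import random

    class Node:
        def __init__(self, name):
            self.name = name

        def apply(self, xid, iid, statement):
            print(f"{self.name}: xid={xid} iid={iid} lock row, add MVCC entry for {statement!r}")

        def prepare(self, xid):
            print(f"{self.name}: PREPARE xid={xid}")

        def accept(self, xid):
            print(f"{self.name}: ACCEPT xid={xid} (persist transaction state)")

        def commit(self, xid):
            print(f"{self.name}: COMMIT xid={xid} (persist MVCC entries, release write locks)")

    cluster = [Node(f"node{i}") for i in range(1, 4)]
    xid = 1001                                   # assigned by the coordinator
    iid = 1                                      # one statement in this transaction
    participants = [cluster[0], cluster[2]]      # nodes holding replicas of the modified slice

    for node in participants:                    # the write itself
        node.apply(xid, iid, "UPDATE tbl_books SET format = 'ebook' WHERE id = 9")

    for node in participants:                    # PREPARE phase
        node.prepare(xid)
    for node in random.sample(cluster, 3):       # ACCEPT phase: three nodes persist the
        node.accept(xid)                         # outcome and reach consensus via Paxos
    for node in participants:                    # COMMIT phase
        node.commit(xid)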
In terms of consistency, Xpand defaults to SNAPSHOT ISOLATION. In terms of ANSI SQL, it is similar to REPEATABLE READ
but does not exhibit phantom reads, because all reads within a transaction use the data committed prior to the start of
the transaction.
High availability
Node failure
If a node fails, the cluster will automatically detect it and perform a group change. After the group change, the
rebalancer will automatically replace the lost ranking replicas on the failed node by promoting replicas on the remaining
nodes to be ranking replicas.
In addition, the rebalancer will replace any lost replicas on the failed node by using the remaining ranking replicas to
create new ones.
Node 3 has the ranking replica for Slice 3. If Node 3 fails, the replica of Slice 3 on Node 1 will be promoted to ranking
replica.
Figure 8
Node 3 has a replica of Slice 2. While the ranking replica remains on Node 2, there is only one replica of it now. The
rebalancer will automatically use it to create a new replica on Node 1, and thereby restore full fault tolerance for Slice 2.
In addition, while the replica of Slice 3 on Node 1 has been promoted to ranking replica, there is only one replica of it
now. The rebalancer will automatically use it to create another on Node 2 and thereby restore full fault tolerance of
Slice 3 as well.
When the rebalancing is complete, there will be two replicas of all slices, with every node storing the same amount of
data.
Figure 9
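The rebalancer's response can be sketched as a two-step pass over the slice placement (promotion, then reprotection); the placement below mirrors the example, and node names are illustrative.

    # Sketch: after a node failure, promote a surviving replica to ranking
    # replica where needed, then re-create the lost copy on a remaining node.
    def handle_failure(placement, all_nodes, failed):
        survivors_pool = [n for n in all_nodes if n != failed]
        for slice_no, replicas in placement.items():
            surviving = [n for n in replicas if n != failed]
            if len(surviving) == len(replicas):
                continue                          # no replica of this slice was lost
            # Promotion: the first surviving replica becomes the ranking replica.
            # Reprotection: copy it to a surviving node that lacks this slice.
            candidates = [n for n in survivors_pool if n not in surviving]
            placement[slice_no] = surviving + candidates[:1]
        return placement

    placement = {
        # slice: [ranking replica node, other replica node]
        1: ["node1", "node2"],
        2: ["node2", "node3"],
        3: ["node3", "node1"],
    }
    print(handle_failure(placement, ["node1", "node2", "node3"], "node3"))
    # {1: ['node1', 'node2'], 2: ['node2', 'node1'], 3: ['node1', 'node2']}
    # Slice 3's replica on node1 is promoted, and both affected slices get a
    # new second replica -- matching the end state described above.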
Zone failure
Xpand can be deployed across multiple zones in the cloud, or multiple racks on premises. When it is, it will
automatically store replicas on nodes in separate zones in order to maintain availability even if an entire zone fails. The
default number of replicas, 2, ensures each replica will be stored in a different zone. If the number of replicas is set to 3,
and there are three zones, Xpand will ensure there is a replica in every zone.
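A sketch of the placement constraint, using the six-node, three-zone layout from the example that follows; zone and node names are illustrative, and load balancing within a zone is omitted.

    # Sketch: pick replica locations so that no two replicas of a slice share a zone.
    ZONES = {
        "zoneA": ["node1", "node2"],
        "zoneB": ["node3", "node4"],
        "zoneC": ["node5", "node6"],
    }

    def place_replicas(slice_no: int, num_replicas: int) -> list[str]:
        chosen = []
        for zone_nodes in ZONES.values():
            if len(chosen) == num_replicas:
                break
            # Rotate within the zone so slices do not all pile onto one node;
            # the real rebalancer also balances by load, which is omitted here.
            chosen.append(zone_nodes[slice_no % len(zone_nodes)])
        return chosen

    print(place_replicas(1, 2))   # two replicas land in two different zones
    print(place_replicas(1, 3))   # three replicas, one per zone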
If six nodes are deployed across three zones, there are two nodes per zone. By default, there will be six slices
with one slice per node. However, the two replicas of each slice are stored not just on two different nodes but in two
different zones.
If Zone A fails, Xpand will automatically promote replicas in Zone B to replace the ranking replicas lost when Zone A
failed (P1 in Node 1 and P2 in Node 2). It will then use them to create additional replicas in Zone C in order to restore
fault tolerance. Finally, it will use the ranking replicas in Zone C (P5 in Node 5 and P6 in Node 6) to create additional
replicas in Zone B in order to restore fault tolerance for them as well. When the rebalancing is finished, there will be
two replicas of every slice, and they will be evenly distributed across the remaining zones.
Figure 11
Scalability
Xpand scales out with near linear performance. It allows nodes to be added or removed at any time, and continuously
monitors node resource utilization and cluster workload distribution in order to optimize performance.
This is all handled by the rebalancer, a critical component of Xpand. It is responsible for data and workload distribution.
In addition to promoting replicas and recreating lost replicas when a node and/or zone fails, the rebalancer
automatically makes changes to scale and maximize performance.
When nodes are added to a cluster, the rebalancer automatically moves one or more slices from existing nodes to the
new nodes in order to ensure an even distribution of data. If two nodes (Node 4 and Node 5) are added to the three-
node cluster below, it will automatically move Slice 4 on Node 1 to Node 4 and Slice 5 on Node 2 to Node 5. After the
rebalancing is complete, each of the five nodes will have one slice.
Figure 12
While not shown above, the same would be true if there were multiple replicas of slices. If there were two, every node
would have two slices after the rebalancing is complete, with one of them being a ranking replica for reads and the
other a replica for high availability.
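A greedy sketch of that redistribution (one replica per slice, as in the example); the rebalancer may choose different slices to move than this sketch does, but the end state, one slice per node, is the same.

    # Sketch: move slices from over-full nodes to the emptiest nodes until the
    # distribution is even.
    def rebalance(placement, all_nodes):
        target = len(placement) / len(all_nodes)
        counts = {n: 0 for n in all_nodes}
        for node in placement.values():
            counts[node] += 1
        for slice_no, node in sorted(placement.items()):
            if counts[node] > target:
                dest = min(all_nodes, key=lambda n: counts[n])
                if counts[dest] < target:
                    counts[node] -= 1
                    counts[dest] += 1
                    placement[slice_no] = dest    # move this slice to a new node
        return placement

    before = {1: "node1", 2: "node2", 3: "node3", 4: "node1", 5: "node2"}
    after = rebalance(before, ["node1", "node2", "node3", "node4", "node5"])
    print(after)
    # {1: 'node4', 2: 'node5', 3: 'node3', 4: 'node1', 5: 'node2'}
    # Every node ends up with exactly one slice.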
It’s possible for a slice, and thus a node, to end up doing much more work than the others. In fact, it’s possible for a
node to end up with multiple slices whose data is being accessed more frequently than those on other nodes. These
types of hotspots can lead to inefficient resource utilization and suboptimal performance.
Xpand detects workload hotspots by continuously monitoring the resource utilization of each node. When a hotspot is
detected, Xpand can automatically remove it by redirecting reads to other nodes – without having to rebalance the data.
Every slice has a ranking replica, the one used for reads. If a node is overutilized because of a busy ranking replica,
Xpand can promote another replica of that slice on a different node to be the ranking replica, effectively rerouting its
reads away from the overutilized node.
Ideally, each node in the three-node cluster below would handle about 33% of the workload, but despite an even
distribution of data, Node 1 is handling 50% of the workload while Node 2 is handling only 10%. Xpand can simply
change which replica is the ranking replica for one or more slices in order to even out the workload distribution.
Slice 1 on Node 2 is promoted to ranking replica in order to transfer some of the workload from Node 1, which is
overutilized, to Node 2, which is underutilized. In addition, Slice 2 on Node 3 and Slice 6 on Node 1 are both promoted
to ranking replicas. After the replicas have been reranked, every node still has two ranking replicas, but the workload is
now distributed 30/35/35 – a far more even distribution.
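A greedy re-ranking sketch in the same spirit; the per-slice read loads and replica placement are invented, but they start at the 50/10/40 split described above and end close to the even split in the example, with every node still holding two ranking replicas.

    # Sketch: shift read load by changing which replica of a slice is the
    # ranking replica (rr), without moving any data.
    slices = {
        # slice: replicas, current ranking replica, share of total reads (%)
        1: {"replicas": ["node1", "node2"], "rr": "node1", "load": 25},
        2: {"replicas": ["node1", "node3"], "rr": "node1", "load": 25},
        3: {"replicas": ["node2", "node3"], "rr": "node2", "load": 10},
        4: {"replicas": ["node2", "node3"], "rr": "node3", "load": 20},
        5: {"replicas": ["node1", "node3"], "rr": "node3", "load": 10},
        6: {"replicas": ["node2", "node3"], "rr": "node3", "load": 10},
    }

    def node_load(slices):
        load = {}
        for s in slices.values():
            load[s["rr"]] = load.get(s["rr"], 0) + s["load"]
        return load

    def rerank_once(slices):
        load = node_load(slices)
        hot = max(load, key=load.get)
        for s in sorted(slices.values(), key=lambda s: s["load"]):
            if s["rr"] != hot:
                continue
            cool = min((n for n in s["replicas"] if n != hot),
                       key=lambda n: load.get(n, 0))
            if load.get(cool, 0) + s["load"] < load[hot]:
                s["rr"] = cool            # promote the replica on the cooler node
                return True
        return False

    print(node_load(slices))              # {'node1': 50, 'node2': 10, 'node3': 40}
    while rerank_once(slices):
        pass
    print(node_load(slices))              # {'node2': 35, 'node1': 35, 'node3': 30}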
Conclusion

With the introduction of distributed SQL, businesses no longer have to turn to NoSQL databases such as
Apache Cassandra, giving up data integrity, easy data modeling and powerful querying in order to scale
further. Nor do they have to spend millions of dollars on a hardware appliance such as Oracle Exadata or get
stuck building and maintaining their own in-house sharding solution, both diverting budget and engineering
away from application development.
There are few distributed SQL databases available today because turning a standard relational database into
a distributed database is a hard problem, and most of them are relatively new. However, Xpand is proven,
robust and mature. It has been used in production to power mission-critical applications for years, including
Samsung Cloud and the several hundred million Samsung phones connected to it.
Xpand implements the MySQL protocol and is compatible with standard MariaDB and MySQL connectors,
making it easy to migrate MySQL applications to Xpand. In addition, Xpand can be used as a storage engine
with MariaDB, providing DBAs with an easy way to add unlimited scalability to existing MariaDB deployments.
Free trial
Download the 45-day free trial and deploy an Xpand cluster with unlimited nodes and cores.
https://mariadb.com/downloads/#mariadb_platform-mariadb_xpand
SkySQL
Sign up for SkySQL and get a $500 credit to deploy a distributed SQL database within minutes on AWS or GCP, using
any instance type and up to 18 nodes.
https://cloud.mariadb.com/skysql