
DISTRIBUTED SQL: THE ARCHITECTURE BEHIND MARIADB XPAND
April 2021

WHITEPAPER
TABLE OF CONTENTS

INTRODUCTION

PROPERTIES
  LOGICAL
  DISTRIBUTED
  TRANSACTIONAL
  AVAILABLE
  CONSISTENT
  ELASTIC

XPAND ARCHITECTURE
  SLICES
  INDEXES
  REPLICAS
  CACHING
  QUERY PATH
  TRANSACTIONS
  HIGH AVAILABILITY
  SCALABILITY

CONCLUSION



INTRODUCTION
Businesses, startups and Fortune 500 companies alike, are facing scalability challenges as
most, if not all, customer engagement now takes place online via web, mobile and Internet of Things
(IoT) applications. The scalability needed by businesses such as Google and Facebook was once
considered unimaginable. Now it is fast becoming a requirement for any business that makes
it possible for customers to do things online – whether that's placing a grocery order, watching a movie,
playing games, meeting with coworkers or buying stock.

Early on, these scalability challenges were addressed by manually sharding MariaDB and MySQL databases, resulting in
custom solutions created for specific applications, services and queries. While sharding adds scalability, it is complex,
brittle, difficult to maintain and places limitations on how data can be queried. It may not be a problem for businesses
with virtually unlimited engineering resources, but for everyone else, it’s neither preferred nor practical.

Then there were multimillion-dollar hardware appliances like Oracle Exadata, a simpler alternative if cost was not a
factor. And finally, there was the emergence of NoSQL databases like Apache Cassandra. They provided scalability,
but at the cost of data integrity, consistency and a standard query language.

All of these approaches were workarounds that required businesses to give up something in order to get scalability,
whether it was flexibility, data integrity or millions of dollars. However, there is now a proper solution – distributed SQL.
In the past, relational databases were limited to replication for read scaling and scaling up (i.e., moving to bigger and
bigger servers) for write and storage scaling. Thus the need for sharding, hardware appliances or NoSQL.

Distributed SQL databases are relational databases engineered with a distributed architecture to scale out on
commodity hardware, on premises or in the cloud. Unlike the workarounds above, however, they require no trade-offs.
They perform ACID transactions and support standard SQL like standard relational databases, but they can do it at a
much greater scale.

MariaDB Xpand is a distributed SQL database built for scalability and performance. It can handle everything from
thousands of transactions per second to millions – all while maintaining sub-millisecond latency. In addition, MariaDB
Xpand implements the MySQL protocol and supports standard MariaDB and MySQL connectors, making it easy to add
scalability to applications running on MariaDB and MySQL without rewriting them.

This white paper describes the architecture of MariaDB Xpand, and explains how it scales out while at the same time
maintaining data integrity and strong consistency with ACID transactions.



PROPERTIES
Logical
In a replicated database, primary/replica or multi-primary, every database instance is itself a complete database
capable of running on its own (i.e., a physical database). When an application connects to a database instance within
a replicated database deployment, it handles all application requests by itself, independent of any other database
instance.

In a distributed SQL database, no database instance is a complete database itself. When an application connects to a
database instance within a distributed SQL database, that database instance represents all of the database instances
combined (i.e., a logical database). As a result, applications can connect to any database instance to read and write
data.

Distributed
All of the database instances within a replicated database maintain a full copy of the data and execute queries on their
own, independent of any other database instance, whereas each database instance within a distributed SQL database
maintains a subset of the data and has other database instances participate in query execution when needed.

The tables within a distributed SQL database are divided into groups of rows with different groups stored on different
database instances. As a result, the rows within a table are spread evenly across all database instances. However, any
database instance can read or write rows stored on any other. If the results of a query contain rows stored on multiple
database instances, query execution will be expanded to include them.

Transactional
Like replicated databases, distributed SQL databases use ACID transactions, and like queries, transactions can span
multiple database instances. If a write modifies rows stored on multiple database instances, those database instances
will participate in a single distributed transaction – either all of the rows are modified, or none of them are.

Available
Replicated databases provide high availability with automatic failover, whereby if the primary database instance fails,
one of the replicas is promoted to be the new primary. Applications may be unable to write data until the promotion is
complete.

Distributed SQL databases provide continuous availability because there is no primary – every database instance is
capable of writing data (and does). In order to ensure all data remains available after a database instance has failed,
distributed SQL databases store copies of rows on multiple database instances. If a database instance fails, its data
remains available through copies of its rows stored on other database instances.



Consistent
While standalone databases provide strong consistency, replicated databases can suffer from replication lag. If
asynchronous or semi-synchronous replication is used, at least one of the replicas will have received the latest
transaction, but the others may not. Even if synchronous replication is used, writes are replicated within a transaction
but may not be applied on the replicas before the transaction commits. Regardless of how replication is configured,
stale reads are possible, if not likely.

This is not a concern with distributed SQL databases as writes to rows stored on multiple nodes are performed
synchronously within transactions, ensuring strong consistency and preventing stale reads.

Elastic
Adding replicas to replicated databases scales reads, but it’s not necessarily easy or fast – and it often can’t be done
on demand. In many cases, new replicas must be created from backups and caught up via replication before they can
be used. In terms of writes and storage, it is necessary to scale up or down by creating a new database instance on a
bigger or smaller server – and often using the same approach as adding a new replica (i.e., creating a new database
instance from a backup).

Distributed SQL databases are intended to be scaled on demand, allowing nodes to be added or removed in production
without disrupting applications. When nodes are added or removed, distributed SQL databases will automatically
rebalance the data in order to maintain an even distribution of data, and in turn, an even distribution of the workload
itself. This makes distributed SQL databases particularly effective for applications with volatile workloads or significant
but temporary peaks (e.g., e-commerce and Black Friday).

In addition, distributed SQL databases are designed to be scaled out and back. Rather than having to upgrade or
downgrade servers, a time-consuming and disruptive process, nodes can be added or removed on demand in order to
increase or decrease capacity on the fly – and without incurring any downtime.



XPAND ARCHITECTURE
Slices
Xpand automatically divides tables into groups of rows called slices. The number of slices is equal to the number of
nodes by default, with each slice stored on a separate node. When the size of a slice exceeds the defined threshold,
8GB by default, it is split into two smaller slices.

In the following example, the books table (tbl_books) in a three-node cluster has three slices by default, with each
slice stored on a separate node.

Figure 1

Rows are mapped to slices using a distribution key. By default, it is the first column of the primary key. Optionally, it
can be the first n columns of the primary key if it is a composite key, with n ranging from one to the total number of
columns in the primary key.

The distribution key is hashed into a 64-bit number using a consistent hashing algorithm, with each slice assigned a
range (i.e., min/max). To determine which slice a row is stored in, the hash of its distribution key is compared
against the range of each slice.

The example below shows min/max values for three slices and, for simplicity, assumes a 16-bit hash space with a
maximum value of 65535 (that of an unsigned short).

• If the distribution key of a row is hashed to 100, it would be stored in Slice 1

• If the distribution key of a row is hashed to 22000, it would be stored in Slice 2

• If the distribution key of a row is hashed to 50000, it would be stored in Slice 3



Slice   Min hash   Max hash
1       0          21845
2       21846      43690
3       43691      65535
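
To make this concrete, below is a minimal Python sketch of the range lookup. The 16-bit hash space and the hash16 helper are simplifications to match the example; Xpand itself hashes the distribution key into a 64-bit space with its own consistent hashing algorithm, so the actual hash values differ.

import hashlib

# Simplified 16-bit hash space to match the example; Xpand uses 64 bits.
# (min_hash, max_hash, slice_id) ranges from the table above.
SLICE_RANGES = [(0, 21845, 1), (21846, 43690, 2), (43691, 65535, 3)]

def hash16(distribution_key) -> int:
    """Hash the distribution key into the 16-bit space (illustrative only)."""
    digest = hashlib.sha256(str(distribution_key).encode()).digest()
    return int.from_bytes(digest[:2], "big")  # 0..65535

def slice_for(distribution_key) -> int:
    """Compare the hash of the distribution key against each slice's range."""
    h = hash16(distribution_key)
    for lo, hi, slice_id in SLICE_RANGES:
        if lo <= h <= hi:
            return slice_id
    raise ValueError("hash outside all slice ranges")

print(slice_for(1))  # which slice the row with id=1 lands in under this toy hash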

Continuing the example above, different rows within the books table are stored in separate slices based on the hash of
their primary key (id).

tbl_books, slice 1

id   name                        author             type
1    Against a Dark Background   Iain M. Banks      Hardcover
5    Salvation                   Peter F. Hamilton  Hardcover
9    Barbary Station             R. E. Stearns      Hardcover

tbl_books, slice 2

id   name                        author             type
2    The Peripheral              William Gibson     Hardcover
3    The Abyss Beyond Dreams     Peter F. Hamilton  Paperback
7    On the Steel Breeze         Alastair Reynolds  Paperback

tbl_books, slice 3

id   name                        author             type
4    The Reality Dysfunction     Peter F. Hamilton  Hardcover
6    The Three-Body Problem      Cixin Liu          Hardcover
8    Shadow Captain              Alastair Reynolds  Hardcover



Indexes
Unlike other distributed SQL databases, Xpand automatically divides secondary indexes into slices as well – and like
tables, the default number of slices is equal to the number of nodes, with each slice stored on a separate node.

In the following example, a secondary index on the author column is created for the books table (idx_author). It is
divided into three slices by default, with each slice stored on a separate node.

Figure 2

Like table rows, index entries are mapped to slices using a distribution key and consistent hashing. By default, the
distribution key is the first column of the index. Optionally, it can be the first n columns of a composite index, with n
ranging from one to the total number of columns in the index.

Continuing the example above, different entries within the index are stored in separate slices based on the hash of
their column(s).

idx_author, slice 1

Key (columns=author)    Value (PK, columns=id)
Iain M. Banks           1
Alastair Reynolds       7, 8

idx_author, slice 2

Key (columns=author)    Value (PK, columns=id)
William Gibson          2
R. E. Stearns           9

idx_author, slice 3

Key (columns=author)    Value (PK, columns=id)
Peter F. Hamilton       3, 4, 5
Cixin Liu               6



Replicas
Xpand can store multiple copies of table and index slices. They are referred to as replicas. By default, slices have two
replicas. However, a slice can have as few as one replica or as many replicas as there are nodes (i.e., one replica per
node). Further, different tables can have different numbers of replicas, as can tables and their secondary indexes.

In the following example, there are two replicas of the books table and author index slices.

Figure 3

Caching
Xpand, like any other database, caches as much data as possible in memory. However, unlike many others, it uses a
distributed cache – effectively combining the memory of multiple nodes to create a single cache with no duplicate
entries. When slices have multiple replicas, one of them is designated the ranking replica. This replica is used for reads
while the others are there for high availability. And because nodes only cache their ranking replicas (of which there is
only one), there are no duplicates – maximizing the amount of data the entire cluster can cache.

In the following example, ranking replicas are cached in memory while all replicas, including ranking replicas, are stored
on disk. If needed, nodes can simply be added to ensure all ranking replicas fit in memory for low-latency reads.



Figure 4
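
The effect can be sketched in a few lines of Python: because every slice has exactly one ranking replica, each node caches a disjoint set of slices, and the cluster cache is simply the sum of the per-node caches. The placement below is hypothetical.

# slice -> nodes holding a replica (hypothetical three-node placement)
replicas = {1: {1, 2}, 2: {2, 3}, 3: {3, 1}}
# slice -> node holding the ranking replica (one per slice, by definition)
ranking = {1: 1, 2: 2, 3: 3}

def cached_slices(node_id):
    """A node caches only the slices whose ranking replica it holds."""
    return sorted(s for s, n in ranking.items() if n == node_id)

for node in (1, 2, 3):
    print(node, cached_slices(node))
# Every slice is cached on exactly one node, so there are no duplicate
# cache entries and adding nodes grows the effective cache size.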

Query path
Queries can be routed to any node within a cluster because every node is aware of how the data is distributed, and can
determine which nodes need to participate in a query and can forward the query (or part of it) to them.

Reads

Primary key

If a table is queried by its primary key, the Xpand node that receives the query hashes the primary key and determines
which node has the table slice the row is stored in. The receiving node then forwards the query to this node and returns
the results.

Example: Query a table using a primary key


SELECT * FROM tbl_books WHERE id=1;

MariaDB MaxScale is an advanced database proxy. It abstracts away the database infrastructure, making it look like
a standalone database regardless of whether there are 3 nodes or 100. In the example below, MaxScale routes the
query to Node 2. Node 2 then determines the row with id=1 is stored in Slice 3 based on its hash value (60000), with
replicas of Slice 3 stored on Node 1 and Node 3.

hash(1) = 60000 = Slice 3 = Node 1, Node 3



tbl_books slices

Min hash   Max hash   Slice   Nodes
0          21845      1       1 (rr), 2
21846      43690      2       2 (rr), 3
43691      65535      3       3 (rr), 1

(rr = ranking replica)

Node 2 will then forward the query to Node 3 because it has the ranking replica, and return the results to MaxScale
(which returns the results to the client).

Figure 5
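
The routing decision amounts to a range lookup over a table like the one above. The sketch below is illustrative, not Xpand's internal representation; rr_node is the node holding the ranking replica, which serves the read.

# (min_hash, max_hash, slice_id, ranking_node, other_replica_nodes)
SLICES = [
    (0, 21845, 1, 1, [2]),
    (21846, 43690, 2, 2, [3]),
    (43691, 65535, 3, 3, [1]),
]

def route_read(key_hash: int) -> int:
    """Return the node a point read is forwarded to (the ranking replica)."""
    for lo, hi, _slice_id, rr_node, _others in SLICES:
        if lo <= key_hash <= hi:
            return rr_node
    raise ValueError("hash outside all slice ranges")

# hash(1) = 60000 falls in Slice 3, whose ranking replica is on Node 3,
# so Node 2 forwards the query there.
assert route_read(60000) == 3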

Secondary indexes

If a table is queried using a secondary index, the Xpand node that receives the query hashes the index key and
determines which node has the index slice the key is stored in.

Example: Query a table using a secondary index


SELECT * FROM tbl_books WHERE author='Alastair Reynolds';

MaxScale forwards the query to Node 2. Node 2 then determines the index key ‘Alastair Reynolds’ is stored in Slice 1
based on its hash value (5000), with replicas of slice 1 stored on Node 1 and Node 2.

hash('Alastair Reynolds') = 5000 = Slice 1 = Node 1, Node 2



idx_author slices

Min hash   Max hash   Slice   Nodes
0          21845      1       1 (rr), 2
21846      43690      2       2 (rr), 3
43691      65535      3       3 (rr), 1

Node 2 then looks up the primary keys of matching rows by checking Slice 1 on Node 1 because it is the ranking replica.

idx_author, slice 1

Key (columns=author)    Value (PK, columns=id)
Alastair Reynolds       7, 8

Finally, Node 2 determines rows with id=7 and id=8 are stored in Slice 2 and Slice 3 based on their hashes, and then
forwards the query to Node 2 (itself) and Node 3 because they have the ranking replicas of Slice 2 and Slice 3.

Figure 6
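
Below is a sketch of the two-step lookup: hash the index key to find (and read) the index slice, then hash each returned primary key to find the table slices to fan out to. The hash values for ids 7 and 8 are hypothetical but consistent with the figure (Slice 2 and Slice 3).

HASH_RANGES = [(0, 21845, 1), (21846, 43690, 2), (43691, 65535, 3)]
IDX_RANKING = {1: 1, 2: 2, 3: 3}  # index slice -> node with ranking replica
TBL_RANKING = {1: 1, 2: 2, 3: 3}  # table slice -> node with ranking replica
IDX_SLICE_1 = {"Alastair Reynolds": [7, 8]}  # entries stored on index slice 1

def find_slice(h):
    return next(s for lo, hi, s in HASH_RANGES if lo <= h <= hi)

# Step 1: hash('Alastair Reynolds') = 5000 -> index slice 1, read on Node 1.
idx_slice = find_slice(5000)
idx_node = IDX_RANKING[idx_slice]
pks = IDX_SLICE_1["Alastair Reynolds"]

# Step 2: hash each primary key to find its table slice and the node with
# that slice's ranking replica (hashes for ids 7 and 8 are made up here).
fanout = {pk: TBL_RANKING[find_slice(h)] for pk, h in zip(pks, [30000, 60000])}
print(idx_node, fanout)  # 1 {7: 2, 8: 3} -> fan out to Node 2 and Node 3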



Writes
Xpand performs writes spanning multiple nodes within distributed transactions, whether implicit or explicit. Within
these transactions, all replicas are written in parallel in order to maximize performance and provide strong
consistency.

In the same way reads use consistent hashing to determine which node(s) a query by primary key or secondary index
should be forwarded to, so do writes.

Example: Update a table using a primary key


UPDATE tbl_books SET num_pages = 435 WHERE id = 9;

MaxScale forwards the query to Node 2. Node 2 then determines the row with id=9 is stored in Slice 3 based on its hash
value (64000), with replicas of Slice 3 stored on Node 1 and Node 3.

hash(9) = 64000 = Slice 3 = Node 1, Node 3

tbl_books slices

Min hash   Max hash   Slice   Nodes
0          21845      1       1 (rr), 2
21846      43690      2       2 (rr), 3
43691      65535      3       3 (rr), 1

Node 2 then forwards the query to all nodes with replicas of Slice 3, updating the row on Node 1 and Node 3 at the
same time.

Figure 7
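
Conceptually, the receiving node fans the write out to every replica of the slice at once, along the lines of the sketch below. The thread pool and the node_apply stub stand in for Xpand's internal messaging.

from concurrent.futures import ThreadPoolExecutor

REPLICA_NODES = {3: [1, 3]}  # Slice 3 has replicas on Node 1 and Node 3

def node_apply(node_id: int, statement: str) -> str:
    """Stand-in for applying the statement to one replica."""
    return f"node {node_id}: applied {statement!r}"

def write_slice(slice_id: int, statement: str) -> list[str]:
    nodes = REPLICA_NODES[slice_id]
    # Every replica is written in parallel within the transaction, so all
    # copies of the row are updated before the write is acknowledged.
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        return list(pool.map(lambda n: node_apply(n, statement), nodes))

print(write_slice(3, "UPDATE tbl_books SET num_pages = 435 WHERE id = 9"))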



Transactions
Xpand implements distributed transactions with snapshot isolation using a combination of three-phase commit (3PC),
consensus via Paxos, two-phase locking (2PL) and multi-version concurrency control (MVCC).

When a node receives a transaction, it becomes the transaction coordinator for it. The transaction is assigned a
transaction id (xid), and each subsequent statement within the transaction is assigned an invocation id (iid). When
rows are modified, write locks are obtained for them and new MVCC entries are created. The write locks do not block
reads, only other writes. If a write fails because it cannot obtain a write lock, the transaction will be rolled back.

To commit the transaction, the transaction coordinator initiates the PREPARE phase. Next, during the ACCEPT phase,
it selects three random nodes to persist the transaction state and waits for a consensus using Paxos – ensuring
the transaction can be completed even if the transaction coordinator fails. Finally, during the COMMIT phase, all
participating nodes persist their MVCC entries and remove any write locks.
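
The commit sequence can be sketched as follows. Everything here is illustrative: the Participant class, the two-out-of-three acceptor quorum and the function names are stand-ins, not Xpand's internal API.

import random

class Participant:
    """Stand-in for a node holding replicas modified by the transaction."""
    def __init__(self, name: str):
        self.name = name
    def prepare(self) -> bool:
        return True  # locks held, writes validated and ready to commit
    def commit(self) -> None:
        print(f"{self.name}: MVCC entries persisted, write locks released")

def persist_state(node: str) -> int:
    return 1  # acceptor durably stored the transaction state

def commit_transaction(participants, all_nodes):
    # PREPARE: every participating node confirms it can commit.
    if not all(p.prepare() for p in participants):
        return "rolled back"
    # ACCEPT: persist the transaction state on three random nodes and wait
    # for a Paxos-style majority, so the outcome survives a coordinator crash.
    acceptors = random.sample(all_nodes, 3)
    if sum(persist_state(n) for n in acceptors) < 2:
        return "rolled back"
    # COMMIT: participants persist their MVCC entries and drop write locks.
    for p in participants:
        p.commit()
    return "committed"

print(commit_transaction([Participant("node1"), Participant("node3")],
                         ["node1", "node2", "node3", "node4"]))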

In terms of consistency, as mentioned previously, Xpand defaults to SNAPSHOT ISOLATION. In ANSI SQL terms, it is
similar to REPEATABLE READ but does not exhibit phantom reads, because all reads within a transaction use the last
data committed prior to the start of the transaction.

High availability
Node failure
If a node fails, the cluster will automatically detect it and perform a group change. After the group change, the
rebalancer will automatically replace the lost ranking replicas on the failed node by promoting replicas on the remaining
nodes to be ranking replicas.

In addition, the rebalancer will replace any lost replicas on the failed node by using the remaining ranking replicas to
create new ones.

Example #1: Promoting replicas upon node failure

Node 3 has the ranking replica for Slice 3. If Node 3 fails, the replica of Slice 3 on Node 1 will be promoted to ranking
replica.

Figure 8



Example #2: Recreating lost replicas upon node failure

Node 3 also had a replica of Slice 2. While the ranking replica remains on Node 2, there is now only one replica of the
slice. The rebalancer will automatically use it to create a new replica on Node 1, restoring full fault tolerance for Slice 2.

In addition, while the replica of Slice 3 on Node 1 has been promoted to ranking replica, there is now only one replica
of that slice as well. The rebalancer will automatically use it to create another on Node 2, restoring full fault tolerance
for Slice 3 too.

When the rebalancing is complete, there will be two replicas of every slice, with every node storing the same amount of
data.

Figure 9
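
Taken together, the two examples amount to a two-step recovery procedure. The sketch below mirrors the placement in the figures; the data structures are hypothetical.

# slice -> nodes holding a replica; slice -> node with the ranking replica
replicas = {1: {1, 2}, 2: {2, 3}, 3: {3, 1}}
ranking = {1: 1, 2: 2, 3: 3}

def handle_node_failure(failed: int, live_nodes: list[int]) -> None:
    for slice_id, nodes in replicas.items():
        nodes.discard(failed)
        # Step 1: if the ranking replica was lost, promote a survivor.
        if ranking[slice_id] == failed:
            ranking[slice_id] = next(iter(nodes))
        # Step 2: re-create the lost replica from a survivor on a node
        # that does not already hold a copy of this slice.
        if len(nodes) < 2:
            nodes.add(next(n for n in live_nodes if n not in nodes))

handle_node_failure(3, live_nodes=[1, 2])
print(replicas, ranking)
# Slice 3's replica on Node 1 is promoted, and new replicas of Slice 2
# (on Node 1) and Slice 3 (on Node 2) restore two copies of every slice.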

Zone failure
Xpand can be deployed across multiple zones in the cloud, or multiple racks on premises. When it is, it will
automatically store replicas on nodes in separate zones in order to maintain availability even if an entire zone fails. The
default number of replicas, 2, ensures each replica will be stored in a different zone. If the number of replicas is set to 3,
and there are three zones, Xpand will ensure there is a replica in every zone.

Example #1: Replica in different zones

If six nodes are deployed across three zones, there are two nodes per zone. By default, there will be six slices
with one slice per node. However, instead of a slice's replicas merely being on two different nodes, they will be in two
different zones.



Figure 10
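
A minimal sketch of zone-aware placement: each of a slice's replicas is drawn from a different zone, with the starting zone rotated per slice so data stays evenly distributed. The zone-to-node mapping and the rotation policy are illustrative, not Xpand's actual placement algorithm.

ZONES = {"A": [1, 2], "B": [3, 4], "C": [5, 6]}  # zone -> nodes (hypothetical)

def place_replicas(slice_id: int, n_replicas: int) -> list[int]:
    """Pick each replica's node from a different zone, rotating by slice."""
    zone_names = sorted(ZONES)
    start = slice_id % len(zone_names)
    chosen = [zone_names[(start + i) % len(zone_names)] for i in range(n_replicas)]
    return [ZONES[z][slice_id % len(ZONES[z])] for z in chosen]

for s in range(6):
    print(s, place_replicas(s, 2))
# Two replicas per slice, always in two different zones, and each of the
# six nodes ends up holding exactly two replicas.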

Example #2: Re-creating lost replicas upon zone failure

If Zone A fails, Xpand will automatically promote replicas in Zone B to replace the ranking replicas lost when Zone A
failed (P1 in Node 1 and P2 in Node 2). It will then use them to create additional replicas in Zone C in order to restore
fault tolerance. Finally, it will use the ranking replicas in Zone C (P5 in Node 5 and P6 in Node 6) to create additional
replicas in Zone B in order to restore fault tolerance for them as well. When the rebalancing is finished, there will be
two replicas of every slice, and they will be evenly distributed across the remaining zones.

Figure 11

Scalability
Xpand scales out with near linear performance. It allows nodes to be added or removed at any time, and continuously
monitors node resource utilization and cluster workload distribution in order to optimize performance.

This is all handled by the rebalancer, a critical component of Xpand. It is responsible for data and workload distribution.
In addition to promoting replicas and recreating lost replicas when a node and/or zone fails, the rebalancer
automatically makes changes to scale and maximize performance.



If the size of a slice grows beyond its threshold, the rebalancer will automatically split the slice into two. If a node grows
to store more data than the others, the rebalancer will automatically move one or more of its slices to other nodes. In
addition, as shown below, the rebalancer ensures an even distribution of data when nodes are added or removed, as
well as an even distribution of the workload.
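
Both actions can be sketched as below. Only the 8GB split threshold is a documented default; the move policy and the slice sizes are illustrative.

SPLIT_THRESHOLD_GB = 8

def rebalance(slices: dict, nodes: list[int]) -> dict:
    """slices maps slice_id -> (node_id, size_gb)."""
    # Split any slice that has outgrown the threshold into two halves
    # (each half covering half of the original hash range).
    for sid, (node, size) in list(slices.items()):
        if size > SPLIT_THRESHOLD_GB:
            slices[sid] = (node, size / 2)
            slices[f"{sid}b"] = (node, size / 2)
    # Move a slice from the most-loaded node to the least-loaded node
    # when the imbalance is large (naive policy, for illustration only).
    def load(n):
        return sum(size for node, size in slices.values() if node == n)
    heavy, light = max(nodes, key=load), min(nodes, key=load)
    if load(heavy) - load(light) > SPLIT_THRESHOLD_GB:
        sid = next(s for s, (n, _) in slices.items() if n == heavy)
        slices[sid] = (light, slices[sid][1])
    return slices

print(rebalance({1: (1, 14), 2: (2, 4), 3: (3, 4)}, [1, 2, 3]))
# Slice 1 (14GB) is split into two 7GB halves, and one half is moved off
# the overloaded node.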

Example #1: Adding a node

When nodes are added to a cluster, the rebalancer automatically moves one or more slices from existing nodes to the
new nodes in order to ensure an even distribution of data. If two nodes (Node 4 and Node 5) are added to the three-
node cluster below, it will automatically move Slice 4 on Node 1 to Node 4 and Slice 5 on Node 2 to Node 5. After the
rebalancing is complete, each of the five nodes will have one slice.

Figure 12

While not shown above, the same would be true if there were multiple replicas of slices. If there were two, every node
would have two slices after the rebalancing is complete, with one of them being a ranking replica for reads and the
other a replica for high availability.

Example #2: Workload management

It’s possible for a slice, and thus a node, to end up doing much more work than the others. In fact, it’s possible for a
node to end up with multiple slices whose data is being accessed more frequently than those on other nodes. These
types of hotspots can lead to inefficient resource utilization and suboptimal performance.

Xpand detects workload hotspots by continuously monitoring the resource utilization of each node. If a hotspot is
detected, Xpand can automatically remove it by redirecting reads to other nodes – without having to rebalance the
data. Every slice has a ranking replica, and it is the one used for reads. If a node is overutilized because of a busy
ranking replica, Xpand can promote a replica of that slice on a different node to be the new ranking replica, effectively
rerouting reads away from the overutilized node.

Ideally, each node in the three-node cluster below would be handling about 33% of the workload, but despite an
even distribution of data, Node 1 is handling 50% of the workload while Node 2 is handling only 10%. Xpand can simply
change which replica is the ranking replica for one or more slices in order to even out the workload distribution.



Figure 13

Slice 1 on Node 2 is promoted to ranking replica in order to transfer some of the workload from Node 1, which is
overutilized, to Node 2, which is underutilized. In addition, Slice 2 on Node 3 and Slice 6 on Node 1 are both promoted
to ranking replicas. After the replicas have been reranked, every node has two ranking replicas and the workload is
distributed 30/35/35 – a far more even distribution.

          Node 1           Node 2           Node 3
          Before   After   Before   After   Before   After
Slice 1   30%      0%      0%       30%     N/A      N/A
Slice 2   N/A      N/A     5%       0%      0%       5%
Slice 3   0%       0%      N/A      N/A     30%      30%
Slice 4   20%      20%     0%       0%      N/A      N/A
Slice 5   N/A      N/A     5%       5%      0%       0%
Slice 6   0%       10%     N/A      N/A     10%      0%
Total     50%      30%     10%      35%     40%      35%
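
One way to arrive at such a reranking is a greedy search: repeatedly apply the single ranking-replica change that most reduces the spread between the busiest and least-busy nodes. The sketch below starts from the "before" column of the table; the policy is illustrative, not Xpand's, and a greedy search may settle on a slightly different (but still far more even) distribution than the example.

slice_load = {1: 30, 2: 5, 3: 30, 4: 20, 5: 5, 6: 10}  # % of total workload
replicas = {1: [1, 2], 2: [2, 3], 3: [3, 1],
            4: [1, 2], 5: [2, 3], 6: [3, 1]}           # slice -> replica nodes
ranking = {1: 1, 2: 2, 3: 3, 4: 1, 5: 2, 6: 3}         # before: 50/10/40
NODES = [1, 2, 3]

def node_load(n):
    return sum(slice_load[s] for s, rr in ranking.items() if rr == n)

def spread():
    loads = [node_load(n) for n in NODES]
    return max(loads) - min(loads)

def best_move():
    """Try every possible rerank; keep the one that shrinks the spread most."""
    best, best_spread = None, spread()
    for s, nodes in replicas.items():
        for target in nodes:
            if target == ranking[s]:
                continue
            old, ranking[s] = ranking[s], target  # try the move
            if spread() < best_spread:
                best, best_spread = (s, target), spread()
            ranking[s] = old                      # undo it
    return best

while (move := best_move()):
    ranking[move[0]] = move[1]

print({n: node_load(n) for n in NODES})  # e.g. {1: 30, 2: 30, 3: 40}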



CONCLUSION
Distributed SQL removes the last remaining limitation of standard relational databases: scalability. Distributed
SQL databases retain standard SQL, ACID transactions and strong consistency – all the characteristics
businesses have long relied upon to support mission-critical applications – while adding the scalability
necessary to support those whose throughput and latency requirements exceed what standard relational
databases can deliver.

With the introduction of distributed SQL, businesses no longer have to turn to NoSQL databases such as
Apache Cassandra, giving up data integrity, easy data modeling and powerful querying in order to scale
further. Nor do they have to spend millions of dollars on a hardware appliance such as Oracle Exadata, or get
stuck building and maintaining their own in-house sharding solution – both of which divert budget and
engineering away from application development.

There are few distributed SQL databases available today because turning a standard relational database into
a distributed database is a hard problem, and most of them are relatively new. However, Xpand is proven,
robust and mature. It has been used in production to power mission-critical applications for years, including
Samsung Cloud and the several hundred million Samsung phones connected to it.

Xpand implements the MySQL protocol and is compatible with standard MariaDB and MySQL connectors,
making it easy to migrate MySQL applications to Xpand. In addition, Xpand can be used as a storage engine
with MariaDB, providing DBAs with an easy way to add unlimited scalability to existing MariaDB deployments.

Get started today


Get an Xpand cluster up and running in minutes.

Free trial

Download the 45-day free trial and deploy an Xpand cluster with unlimited nodes and cores.

https://mariadb.com/downloads/#mariadb_platform-mariadb_xpand

SkySQL

Sign up for SkySQL and get a $500 credit to deploy a distributed SQL database within minutes to AWS or GCP, using
any instance type and up to 18 nodes.

https://cloud.mariadb.com/skysql

