CloudJump II: Optimizing Cloud Databases for Shared Storage
The CloudJump framework [12] aims to shed light on how compute-storage disaggregation can effectively enhance cloud databases' scalability, resilience, and efficiency by relying only on standard cloud storage services. To explore the evolving landscape of cloud-native databases, this study navigates through the design and implementation intricacies of the shared-storage architecture in the CloudJump framework. By focusing on the shared storage model integral to cloud database systems with a leader-follower pattern, we address the challenges of ensuring data consistency across multiple compute nodes—a cornerstone for achieving scalability in cloud environments. The initiation of Multi-Version Data (MVD) technology emerges as a key effort to enhance node consistency within the shared storage framework. MVD's role in enabling effective version management and facilitating precise data page construction by compute nodes is discussed, highlighting its potential to streamline recovery processes and improve data persistence strategies without the necessity for a specially tailored storage layer. This approach not only seeks to simplify operational complexities and reduce costs, but also reflects a thoughtful exploration of solutions to the challenges of cloud storage. Through industrial practice in the cloud product PolarDB, we provide insights into the real-world applications and benefits of MVD technology, such as improved node decoupling and faster failover mechanisms, thereby contributing to the discourse on advancing cloud-native database systems. The effectiveness of CloudJump II is validated through extensive experimental evaluation and results from production deployment.

1 INTRODUCTION

The transition to cloud-native architectures represents a pivotal moment in database technology, heralding an era that has reshaped the synergy between computing and storage [5, 33, 34, 49]. Central to this evolution is the principle of compute-storage disaggregation [7, 17, 23, 59, 61], a shift that redefines the core architecture of cloud databases. This disaggregation delineates the database system into distinct compute (responsible for query and transaction processing) and storage (managing log and data page persistence) layers, both of which can scale independently. Complementing this shift is the adoption of shared storage, a foundational component of cloud-native databases that embodies the shared-everything paradigm [10, 15, 36]. In contrast to the shared-nothing systems exemplified by Google Spanner [13], where each compute node manages isolated storage, shared storage consolidates storage into a unified, interconnected service accessible by all database nodes [1, 21, 53]. Figure 1 illustrates this architecture, depicting a shared storage subsystem integrated with a leader-follower compute layer [2, 17], including one read-write (RW) node and multiple read-only (RO) nodes, all accessing a shared network of storage nodes. By eliminating cross-node duplication and transmission of data pages, shared-storage architectures meet the growing demand for systems that offer scalability, elasticity, and resource efficiency. Their capability to optimize resource use and support dynamic workloads highlights their significance in the cloud-native database landscape.

Figure 1: Cloud-native databases based on shared storage. (The figure shows a compute layer in a leader-follower pattern, with one RW node and multiple RO nodes, connected via network I/O to a shared storage layer.)

To construct an industrial cloud-native database system following the shared-storage architecture depicted in Figure 1, frameworks exemplified by Amazon Aurora [50] have further tailored the storage layer into dedicated page services that allow redo logs to be applied to data pages accessed by compute nodes in a centralized manner. However, this approach hinders the storage layer from exploiting those standardized and continuously improved cloud storage services across various cloud vendors, including IBM Spectrum Scale (GPFS) [45], Amazon AWS EFS [4], Microsoft Azure Files [8], Google Cloud Filestore [21], and CephFS [55]. As such, we propose an exploration of constructing a shared storage layer for cloud databases using standardized cloud storage, based on the CloudJump [12] framework. This framework enhances the flexibility and scalability of storage solutions, better meeting the dynamic demands of cloud-native applications. By relying on standard components, CloudJump facilitates the creation of a high-quality cloud-native database service across various cloud platforms, thereby circumventing vendor lock-in [40]. It has become a cornerstone for a myriad of applications in orchestrating cloud resources.

The adoption of shared storage for multiple database nodes brings its own set of challenges, particularly in maintaining data consistency—a concern that stands at the forefront of cloud-native databases [31, 52]. To elaborate, when multiple compute nodes rely on a single shared storage system, there exists an inherent delay in the propagation of updates among them. Writes accepted by the RW node take effect immediately, subsequently necessitating synchronization with RO nodes. Owing to the communal nature of the storage, discrepancies between RO nodes and the shared storage can
arise, such as situations where the in-memory index of a lagged RO node references an unbuffered data page in shared storage that has been modified by the RW node. Traditionally, this challenge is mitigated through strategies that bind RW and RO nodes together, such as constraining RW write operations to facilitate RO synchronization, or by temporarily suspending RO services until the synchronization is achieved, thereby precluding responses to queries in the absence of guaranteed consistency. The discussion thus pivots on a delicate equilibrium between sacrificing performance and sacrificing availability.

In this paper, we chart the course of CloudJump's journey through the evolution of its storage layer, presenting an industrial solution that deftly navigates the complexities of cloud storage. Our approach is characterized by its utilitarian simplicity, harnessing the ubiquity and economy of standard cloud storage services. This advances the assertion that cloud databases can seamlessly integrate extensive storage capabilities, unwavering reliability, economic efficiency, and scalable flexibility. In CloudJump, the core challenge of resolving the data consistency issue involves ensuring consistent in-memory data across numerous compute nodes, necessitating a detailed exploration of the storage subsystem's pivotal role. The key contradiction lies in the fact that standardized cloud storage services (e.g., POSIX-compliant shared file systems [6, 10, 24]) are not specifically designed to coordinate their data streaming from multiple computing nodes. This results in leader nodes potentially updating shared storage data ahead of lagging follower nodes. Such discrepancies can lead to conflicts with the in-memory data of the followers. A viable resolution entails enabling shared storage to concurrently provide multiple valid data versions, thereby allowing asynchronous nodes to access data pages consistent with their current state. Unfortunately, standard cloud storage services do not inherently provide such a capability on their own.

To achieve this goal, we propose Multi-Version Data (MVD) technology within a leader-follower architecture, offering a robust mechanism for maintaining node consistency in shared storage environments. MVD enables storage engines to manage versioning with precision, supporting compute nodes in constructing accurate data pages. This approach aligns with Multi-Version Concurrency Control (MVCC) [58] practices, enhancing recovery efficiency and optimizing data persistence. By reducing node interdependencies and operational overhead, MVD provides a practical solution to node consistency issues. It serves as a foundational model for compute nodes, balancing structural consistency with access to data versions. Furthermore, this technology decouples node architecture, enabling faster recovery and more effective data persistence strategies without reliance on specialized intermediary layers.

As a critical subsystem within the PolarDB service focusing on storage decoupling and standardization, CloudJump highlights its practical deployment in production environments rather than prototype research. Integrated within the PolarDB storage engine, it supports over a million CPU database instances, serving more than 10,000 users across 80 global availability zones and managing over 20,000 database clusters. The deployment of these technologies has been validated through extensive testing and real-world applications. Essentially, MVD constructs a shared storage layer for database nodes under the leader-follower pattern upon a number of standard shared storage service instances, offering page-granular access with multi-version support. This approach aligns with the physical construction of the majority of database storage engines, demonstrating strong universality [38]. The integration of MVD in PolarDB ensures full InnoDB compatibility, underscoring its transformative potential within the realm of shared storage technology. The pursuit of shared storage architectures in cloud-native databases extends beyond a mere technical endeavor; it represents a strategic response to the digital economy's escalating demands for high availability, rigorous quality of service assurances, and unparalleled elasticity, while emphasizing cost-effectiveness.

Reflecting on our prior work initiated by "CloudJump I" [12], we delineated methodologies for integrating cloud storage within the paradigm of cloud-native databases. This initiative underscored the imperative differentiation between cloud storage modalities and localized SSD storage, illuminating critical concerns such as pronounced I/O latency, suboptimal bandwidth utilization, and the essential reconfiguration of traditional database architectures tailored for cloud ecosystems. Progressing beyond these foundational insights, in this paper, CloudJump II investigates the intricacies of databases predicated on shared storage frameworks.

Industrial practice has demonstrated the effectiveness of the approach, where MVD reduced out-of-memory (OOM) crashes in RO nodes by approximately 40%, accelerated failure recovery, and minimized downtime, thereby enhancing system availability. These improvements have bolstered our reputation and reduced compensation costs associated with service failures. This effort highlights our ongoing commitment to enhancing cloud-native database efficiency through a unified shared storage model, moving away from traditional architectures where each compute node pairs with independent storage resources.

Our main contributions are summarized as follows:

• The paper identifies a crucial challenge in the realm of cloud-native databases: exploring the construction of shared storage layers using standardized cloud storage services. This allows for the integration of standard cloud storage with shared-storage architecture in cloud-native databases, leveraging their respective advantages. We present a comprehensive solution that harmonizes the competing demands of performance, availability, and consistency within cloud databases, representing a significant step forward in leveraging cloud storage to enhance database scalability and interoperability.

• The paper proposes MVD technology, a novel method that adeptly resolves issues of data consistency across multiple compute nodes. MVD allows for intricate version management upon standardized cloud storage, enabling accurate data page construction and optimal data persistence. This methodological innovation serves as a cornerstone for balancing performance with consistency, showcasing a comprehensive solution to the complexities of shared storage systems.

• The practical deployment of MVD technology within PolarDB, an industrial product by Alibaba Cloud, evidences the method's effectiveness. This deployment demonstrates significant improvements in node decoupling and failover speeds, corroborated by empirical results from extensive testing and real-world application. The implementation demonstrates both theoretical progress and practical advantages for cloud-native databases, underscoring the paper's contributions in enhancing shared storage databases' performance and reliability.
Figure 3: An example of data inconsistency due to the shared storage architecture. (Panels: (a) RO/RW initial state; (b) RW update and flush pages; (c) RO state thereafter. Each panel shows B+-tree leaf pages, Page 6 through Page 9, on shared storage, around RW inserting key 90 and RO searching for key 97.)
better scalability. Consequently, it is evident that the update of in-memory data on RO nodes depends on asynchronously catching up with the redo log, while the update of external data pages relies on the write-back from the Buffer Pool of RW nodes, which cannot inherently maintain consistency. This leads to the core technical concern discussed in this paper: the data consistency of RO nodes.

To further illustrate the situation, consider a B+-tree as an example shown in Figure 3, where it is assumed that the index's non-leaf nodes corresponding to data pages are in the local Buffer Pool, whereas the leaf nodes are not cached. Figure 3(a) represents an initial state of a B+-tree index on a RO that has synchronized with RW. At this moment, as shown in Figure 3(b), RW accepts an operation Insert-90, which triggers node splitting, updates Page-5, modifies Page-8, and creates Page-9 on the shared storage. Although such insertion would result in RW writing WAL in shared storage and broadcasting the latest LSN to all RO nodes, one RO may not have had the chance to process these corresponding redo logs before receiving a query, such as Search-97 shown in Figure 3(c). This triggers a catastrophic inconsistency, where Page-5 in RO's local Buffer Pool remains in its old version before the split caused by Insert-90. Thus, RO attempts to traverse Page-8, but since Page-8 is not in its Buffer Pool, the RO node fetches this data page from shared storage. However, by this time, Page-8 has been modified and written back by RW, clearly making it impossible to find the expected record, as the index viewed by RO is in an illegitimate and confused state. It is crucial to note that in such inconsistent states, RO's response to Search-97, which is "not found", is not an acceptable "lag" but an error. This is distinct from the situation where, after RW inserts 90, RO regards Key-90 as nonexistent for a period. Therefore, such inconsistencies must be completely avoided.

2.4 Current Technical Challenges

In database systems on cloud environments, representative indexing structures such as LSM-trees and B+-trees present distinct advantages and challenges, particularly in the context of maintaining data consistency and version management across multiple nodes. LSM-trees, by virtue of their intrinsic mechanism for version management and their capacity to limit the scope of compaction, provide an advantageous framework for shared storage nodes to access required data versions. This stands in contrast to B+-trees, which inherently lack support for multiple versions since their design permits the overwriting of existing data with new versions, thereby necessitating the adoption of alternative strategies to ensure consistency across nodes. To mitigate the issues related to data consistency, as delineated in Section 2.3, CloudJump formerly imposed a dual physical replication constraint as shown in Figure 4. For a dirty page residing in the RW Buffer Pool awaiting flush, the difference between the dirty page and the persisted version that can be seen by all RO nodes in shared storage is represented by a series of redo logs. This can be denoted as an interval of log sequences [oldest_modification_lsn, newest_modification_lsn].

Figure 4: Technical challenges to force data consistency. (The figure shows the RW redo log sequence with its flush list of dirty pages spanning [oldest_modification_lsn, newest_modification_lsn], and the RO redo log parse buffer spanning [oldest_applied_lsn, newest_applied_lsn]. Under Constraint 1, delayed dirty page flushes congest the waiting list and can overflow the RW buffer; under Constraint 2, accumulated redo logs awaiting application inflate RO memory usage.)

Constraint 1 (Restricting RW flush dirty pages): When RW flushes dirty pages from its Buffer Pool into the shared storage, it must ensure that the newest_modification_lsn of the page to be flushed is not greater than the minimum applied LSN among all RO nodes, in order to prevent any RO from fetching a future and overly forward page.

Constraint 2 (Augmenting RO memory): Upon reading a page, RO must process all related redo logs within its log parse buffer to apply the corresponding modifications to the page retrieved from shared storage, ensuring accurate updates to reflect its latest state.

The aforementioned dual constraint can effectively address the issue of inconsistencies between RO in-memory data and data pages read from shared storage. However, this approach is not optimal, as these constraints introduce deficits in performance and flexibility:

• Stringent RW-RO Coupling: The reliance of RO on the RW's data engenders a bottleneck, significantly constraining the RW's capacity for write operations. This necessitates synchronization across all RO nodes to preserve data integrity, adversely affecting write throughput and complicating the enhancement of scalability and redundancy mechanisms. Consequently, the RW's performance is directly influenced by the efficacy of the slowest RO node.

• Constrained RO Latency: The close interdependence limits the permissible lag of RO behind RW, since excessive delays in RO could result in rapidly expanding and accumulating redo logs that must be managed and applied.
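To make the dual constraint concrete, the sketch below shows how an RW flush decision and an RO page read could be gated on the LSN intervals described above. It is a minimal illustration under assumed types (Page, RoNode, RedoRecord); it is not the PolarDB or InnoDB implementation.

```cpp
// Illustrative sketch of the dual constraint in Figure 4 (not actual PolarDB code).
// Page, RoNode, and RedoRecord are hypothetical stand-ins for engine structures.
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

using Lsn = uint64_t;

struct RedoRecord { Lsn lsn; /* page delta payload omitted */ };

struct Page {
  uint64_t page_id;
  Lsn oldest_modification_lsn;  // first change not yet persisted
  Lsn newest_modification_lsn;  // last change to the in-memory copy
};

struct RoNode { Lsn applied_lsn; };  // redo applied so far by this RO node

// Constraint 1: RW may flush a dirty page only if every RO node has applied
// redo past the page's newest modification; otherwise an RO could fetch a
// page that is "in the future" relative to its own in-memory state.
bool rw_may_flush(const Page& dirty, const std::vector<RoNode>& ro_nodes) {
  Lsn min_applied = std::numeric_limits<Lsn>::max();
  for (const RoNode& ro : ro_nodes)
    min_applied = std::min(min_applied, ro.applied_lsn);
  return dirty.newest_modification_lsn <= min_applied;
}

// Constraint 2: an RO reading a page from shared storage must replay every
// buffered redo record for that page up to its own applied LSN before the
// page can be served (the replay of the record body is omitted here).
void ro_read_fixup(Page& fetched, Lsn ro_applied_lsn,
                   const std::vector<RedoRecord>& page_redo) {
  for (const RedoRecord& rec : page_redo) {
    if (rec.lsn > fetched.newest_modification_lsn && rec.lsn <= ro_applied_lsn) {
      fetched.newest_modification_lsn = rec.lsn;  // mark record as applied
    }
  }
}
```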
Comprehensive and efficient data recovery mechanisms. The integrity of redo logs ensures database recovery to its pre-failure state via log replay, preserving data consistency. MVD's support for single-page recovery allows for the swift, on-demand restoration of any page version online, bypassing full log replay. This capability greatly enhances recovery speed and accuracy, shortening system recovery times and ensuring uninterrupted business operations and data accessibility in cloud-native settings.

These traits within the proposed shared storage framework highlight our commitment to advancing cloud-native database technologies. By addressing shared storage challenges, i.e., data consistency, optimized IO operations, and swift recovery processes, we position CloudJump as a robust, scalable solution for cloud-based databases.

3.3 Essential Data Structure—Log Index

In the architectural design presented in Section 3.1, a pivotal process involves retrieving all absent redo logs for a specific page upon request. Consequently, a mechanism designed to categorize redo logs by page is necessary, denominated the log index.

3.3.1 Log Index Generation. In the leader-follower pattern, MVD adopts RW to facilitate the generation of log indexes. Within contemporary database architectures, such as MySQL 8.0, there have been considerable advancements in the methodology of log generation, through the use of multi-threading and lock-free updates, enhancing the efficiency of redo log creation. The process of flushing redo logs to persistent storage is crucial for confirming transactions and flushing dirty pages, significantly impacting database performance. Moreover, the creation of redo logs via mini-transactions [20] (mtr) results in a dispersed record distribution across pages, which can lead to performance degradation and increased metadata volume due to the necessity of maintaining a log index.

To address these challenges, an asynchronous log index generation method is implemented. This method retains the standard procedure for redo log writing and leverages the redo cache's ability to hold the latest segment of redo logs in memory temporarily. An asynchronous parse thread then reads these logs, parses them, and generates log indexes. When a substantial number of log index entries accumulates, they are flushed to persistent storage.

The primary challenge involves synchronizing the rate of log index generation with the rapid production of redo logs without overusing CPU and IO resources. Since redo logs are produced sequentially, maintaining a comprehensive log index is infeasible due to variable space demands for page-specific indexes, complicating space management and necessitating complex allocation and recycling strategies, which increase space and IO usage. A segmented sorting (batch) approach is thus employed for the batch generation of log index entries, as illustrated in Figure 6. The size of each segment is constrained by available memory for sorting and the maximum tolerable RO delay. Smaller segments reduce the efficiency of merging log index entries and their retrieval speed, whereas larger segments increase memory demand, delay log index creation, and escalate the IO required per operation.

Figure 6: An example of log index generation. (Redo records are accumulated and grouped by page into 100MB batches; ib_parsedata stores the per-page ranges, e.g., ranges for Page 3, Page 6, and Page 9 across batch x-1 and batch x, while a HashMap in ib_parsemeta maps each page to the offset of its latest range in batch x.)

There is considerable flexibility in balancing these elements. The In-memory Redo Hash integral to RO indicates that newly generated log index entries are not immediately needed, allowing for some delay in log index creation. Depending on the operational requirements, a delay ranging from 500MB to 1GB in accessing the log index by RO is permissible. With respect to memory utilization for sorting and the volume of IO during log index generation, a practical approach in a multi-version system is to set batch sizes to 100MB of records. Each batch covers distinct page ranges. The asynchronous parse thread reads and parses redo logs from the redo cache, organizing them by page into sorted ranges in memory. A range includes a header denoting the page and redo log location details. After accumulating 100MB of redo logs or when changing redo files, an actual file write (ib_parsedata) is executed. Ranges are written sequentially to ib_parsedata, with each range updating its last position in ib_parsemeta upon writing and linking to the previous range for the same page in the current file. ib_parsemeta is only flushed to storage at the end of each file cycle, ensuring log index accessibility only after ib_parsemeta has been stored, thereby maintaining a file-size lag for accessing the log index.

3.3.2 Log Index Access. During operation, RO continuously scrutinizes redo logs from shared storage, parsing and updating existing pages in the Buffer Pool as well as various in-memory states. Throughout this process, a memory-based log index organized by page, referred to as the In-memory Redo Hash in Figure 5, is concurrently generated. This hash map associates page IDs with their corresponding redo operations. Previously, memory expansion in this segment could lead to self-inflicted system termination. The introduction of MVD facilitates the use of a more compact redo hash resident in memory. When space becomes scarce, RO can immediately expunge the oldest entries within the redo hash. If a user request necessitates accessing a new page and the LSN of the in-place page is assessed as outdated, it requires not only the In-memory Redo Hash but also logs from the Persistent Redo Hash. This process involves fetching the needed redo records from storage through the log index. By arranging redo records by page, the log applier can systematically apply these records to bring the page to its intended version. Thus, the implementation of the log index frees RO from the complications of memory expansion, effectively resolving Constraint 2, and ensures the continuous maintenance of an optimal apply LSN, indirectly resolving Constraint 1.
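The following sketch models, under simplifying assumptions, one log-index batch from Sections 3.3.1 and 3.3.2: parsed redo records are grouped by page into ranges (standing in for the ib_parsedata payload), a per-page map (standing in for ib_parsemeta) records the offset of each page's latest range together with a back-link to its previous range, and a lookup walks this chain to collect the redo needed for a requested page. The in-memory containers and layout are illustrative, not the actual file formats.

```cpp
// Simplified in-memory model of log index generation and lookup.
// The roles of ib_parsedata/ib_parsemeta are mimicked with containers;
// the real on-disk encoding is not reproduced here.
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

using Lsn = uint64_t;
using PageId = uint64_t;

struct RedoRecord { PageId page_id; Lsn lsn; std::vector<uint8_t> body; };

// A "range" groups all redo records for one page within one batch and
// remembers where the previous range for the same page was written.
struct Range {
  PageId page_id;
  int64_t prev_range_offset;        // -1 if this is the first range for the page
  std::vector<RedoRecord> records;  // sorted by LSN
};

class LogIndexBatch {
 public:
  // Accumulate and group parsed redo records by page (as in Figure 6).
  void add(const RedoRecord& rec) { by_page_[rec.page_id].push_back(rec); }

  // Flush the batch: ranges are appended to the data file (here a vector),
  // and the per-page map is updated to point at each page's latest range.
  void flush(std::vector<Range>& parsedata,
             std::unordered_map<PageId, int64_t>& parsemeta) {
    for (auto& [page, recs] : by_page_) {
      Range r;
      r.page_id = page;
      auto it = parsemeta.find(page);
      r.prev_range_offset = (it == parsemeta.end()) ? -1 : it->second;
      r.records = std::move(recs);
      parsemeta[page] = static_cast<int64_t>(parsedata.size());
      parsedata.push_back(std::move(r));
    }
    by_page_.clear();
  }

 private:
  std::map<PageId, std::vector<RedoRecord>> by_page_;  // grouped by page id
};

// Lookup: follow the per-page chain of ranges backwards to collect all redo
// records needed to bring a fetched page to the target LSN. The caller then
// applies the collected records in LSN order.
std::vector<RedoRecord> collect_redo(
    PageId page, Lsn up_to, const std::vector<Range>& parsedata,
    const std::unordered_map<PageId, int64_t>& parsemeta) {
  std::vector<RedoRecord> out;
  auto it = parsemeta.find(page);
  int64_t off = (it == parsemeta.end()) ? -1 : it->second;
  while (off >= 0) {
    const Range& r = parsedata[static_cast<size_t>(off)];
    for (const RedoRecord& rec : r.records)
      if (rec.lsn <= up_to) out.push_back(rec);
    off = r.prev_range_offset;
  }
  return out;
}
```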
3.3.3 Crash Safety Guarantees. In databases, crash safety mechanisms are primarily ensured through WAL or constraints on the order of storage writes. The log index serves as an index for redo operations, and to avoid redundancy and for simplicity, a disk write order constraint is utilized for multi-version concurrency control. The process involves two files, ib_parsedata and ib_parsemeta, during disk writes, adhering strictly to a sequence where file data (ib_parsedata) is persisted before its metadata (ib_parsemeta); crash consistency of the log index then follows from this write ordering and its comparatively simple file structure.

3.3.4 Resource Consumption and Performance. The generation and persistence of the log index on RW are optimized for minimal impact on typical workloads, involving the following aspects of resource consumption:

• CPU usage for parsing redo logs in the redo cache and organizing them by page: This consumption is proportional to the volume of user-generated redo writes, with tests indicating an additional CPU overhead of approximately 3%-5%.

• Memory allocation for batch sorting, ib_parsedata, and ib_parsemeta: For a 100MB LSN batch, the estimated memory allocation is 200MB for the sorting batch and ib_parsedata, with ib_parsemeta structures requiring around 50MB. Overall, this is relatively minimal and manageable.

• IO resources consumed during the storage write operations of ib_parsedata and ib_parsemeta: ib_parsedata adopts an append-only approach and achieves high IO efficiency with 100MB batch accumulation; for ib_parsemeta, only 50MB is needed per 1GB of redo logs. This method ensures that the generation speed of the log index can keep pace with redo production.

4 DESIGN AND IMPLEMENTATION DETAILS

MVD's architecture adheres to the Copy-On-Write (CoW) principle to manage shared storage updates without in-place modifications, using a log index mechanism. This allows RO nodes to dynamically reconstruct pages using base pages and redo logs instead of replicating pages with each modification, thus maintaining logical versioning with conventional storage. Given that most database kernels operate at the page level, this design is widely applicable. Additionally, CloudJump is deeply rooted in commercial products, particularly focusing on MySQL's InnoDB engine, where significant engineering has improved the B+-trees. Upcoming subsections will explore these technical enhancements in detail, emphasizing optimized logical page assembly and management.

4.1 Write Elision

In the architecture of WAL-based database engines, modifications to a page typically generate a concise redo record, rendering the page "dirty" within the Buffer Pool. Owing to the Buffer Pool's finite capacity, the lack of available space necessitates a strategy like Least Recently Used (LRU) for page eviction selection. Should an eviction-targeted page be dirty, it requires disk writing first, initiating a page-sized write IO operation. In scenarios where the data volume substantially exceeds the Buffer Pool's capacity, this event becomes commonplace: each page, once loaded into the Buffer Pool and minimally modified, faces rapid eviction and disk writing. This sequence of minor modifications resulting in significant write IO operations leads to IO resource wastage and may evolve into a database performance bottleneck, indicative of an IO-Bound situation. The innovation of Write Elision, facilitated by MVD, presents a novel solution by omitting the dirty page flush process upon eviction, thereby eliminating page-sized IO operations. Subsequent access to such a page deems it a "corrupt page." Necessary log records are retrieved via page indexing, enabling change application. This methodology, by sidestepping extensive page IO, markedly improves database performance in IO-Bound conditions. The procedure is depicted in Figure 7 (a zoom-in view of Figure 5's bottom left corner), encompassing the following stages:

(1) Operationally, a continuous Sequence Redo Hash is preserved in memory, derived from the redo log's flush activity and the log index generation's async parser buffer, facilitating recent redo log access by page.

(2) Pages designated for Buffer Pool flushing, influenced by policies such as LRU, undergo a selection process via a multi-version write elision strategy. This procedure evaluates various factors, including current user load, the degree of modifications on the dirty page, and existing memory utilization.

(3) Pages selected for write elision have their respective redo logs compiled from the Sequence Redo Hash by page ID into the Page Redo Hash, and are then incorporated into the Flush Pool, exempting them from the current flush cycle. Pages not chosen adhere to conventional flush protocols.

(4) Upon subsequent access post-IO completion, the corresponding Page Redo Hash is extracted from the Flush Pool, with the relevant redo logs applied.

(5) Pages within the Flush Pool are propelled to persisted storage upon fulfilling flush prerequisites, whether through dirty flushing or via periodic inspections by a write elision flusher. Thereafter, they are expunged from the Page Redo Hash.

Figure 7: The schematic process of write elision. (The figure shows the Log Writer and the Async Parser feeding a Sequence Redo Hash over shared storage, with per-page redo, e.g., redo of Page X and Page Y, collected into a Page Redo Hash.)

Specifically, the Sequence Redo Hash handles two principal types of data: conventional redo data and log index data, the latter derived from asynchronous parsing. MVD aims to reduce non-essential activities by efficiently transferring redo data from the Redo Buffer, which the log writer flushes, utilizing the unified architecture of redo logs to avoid extra contention costs. Meanwhile, Log Index data are generated from a buffer copy prior to flushing by the asynchronous parser, and incorporate information from the standard operational memory cache. The memory designated for write elision is limited, necessitating evolution in the Sequence Redo Hash to eliminate obsolete data, proposing two management strategies for surplus requests. One strategy uses the log index to locate corresponding redo files, a process often deemed inefficient due to the resultant increase in IO operations, potentially negating or worsening the reductions in page IO provided by write elision. Alternatively, flushing pages from the Flush Pool before discarding historical data could impair the efficacy of delaying dirty page flushing. The fundamental assumption of write elision rests on the efficiency derived from consolidating multiple IO requests for the same page within the Flush Pool, concurrently managing the cache load with pages outside the Flush Pool to prevent undue occupancy.
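The sketch below condenses the staged procedure above into an elision decision: a dirty page chosen for eviction either follows the normal flush path or, if it qualifies, has its recent redo moved from the Sequence Redo Hash into a per-page Page Redo Hash and is parked in the Flush Pool without a page-sized write. The structure names mirror the text, but the qualification checks and the threshold value are assumptions for illustration; the concrete criteria used in practice are discussed below.

```cpp
// Illustrative write elision decision (Section 4.1), not actual InnoDB code.
// Threshold values and structure layouts are assumptions for the sketch.
#include <cstdint>
#include <unordered_map>
#include <vector>

using Lsn = uint64_t;
using PageId = uint64_t;

struct RedoRecord { Lsn lsn; std::vector<uint8_t> body; };

struct DirtyPage {
  PageId page_id;
  Lsn oldest_modification_lsn;
  Lsn newest_modification_lsn;
  size_t redo_bytes;  // volume of redo accumulated for this page
};

struct SequenceRedoHash {
  Lsn lower_limit_lsn = 0;  // oldest redo still held in memory
  std::unordered_map<PageId, std::vector<RedoRecord>> by_page;
};

struct FlushPool {
  // page id -> redo records to re-apply when the page is next accessed
  std::unordered_map<PageId, std::vector<RedoRecord>> page_redo_hash;
};

// Stage (2): decide whether a page selected for eviction may skip its flush.
bool may_elide(const DirtyPage& p, const SequenceRedoHash& seq,
               Lsn async_parser_lsn, size_t redo_byte_threshold /* e.g. 255 */) {
  return p.redo_bytes <= redo_byte_threshold &&            // lightly modified
         p.newest_modification_lsn <= async_parser_lsn &&  // already indexed
         p.oldest_modification_lsn >= seq.lower_limit_lsn; // redo still cached
}

// Stage (3): move the page's redo from the Sequence Redo Hash into the
// Flush Pool's Page Redo Hash and evict the frame without a page-sized write.
void elide_flush(const DirtyPage& p, SequenceRedoHash& seq, FlushPool& pool) {
  auto it = seq.by_page.find(p.page_id);
  if (it != seq.by_page.end()) {
    pool.page_redo_hash[p.page_id] = it->second;  // keep redo for later replay
  }
  // Stage (4) replays page_redo_hash on the next access; stage (5) eventually
  // flushes the page and erases its entry from the Page Redo Hash.
}
```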
Write elision is a technique developed to enhance the management of dirty pages in database systems by reducing unnecessary I/O operations and improving performance. This method involves precise criteria to determine when a page should be flushed, thus reducing the buildup of small redo logs that contribute to page dirtying. A critical component of this methodology is tracking the amount of redo data per page, typically monitored with mtr and a set byte threshold of 255. This approach ensures that performance remains unaffected in environments not limited by I/O. It includes evaluating various metrics, such as the dirty page ratio in the Buffer Pool, to ensure optimal performance under different conditions. Pages with a significantly outdated oldest_modification_lsn are prioritized for flushing. These pages are likely to be flushed by background operations if modifications are postponed, negating the benefits of delayed flushing. Additionally, the implementation constraints require that a page's newest_modification_lsn remain below the position of the asynchronous parser and that the oldest_modification_lsn stay ahead of the lower limit of the Sequence Redo Hash. Resource constraints also apply, such as when memory allocation fails to secure necessary resources. Adhering to these principles allows the write elision strategy to efficiently manage databases while ensuring high performance across complex operational environments.

4.2 Instant Recovery

Fault recovery is a crucial feature of database systems, designed to restore committed data that was not yet persisted before interruptions, by using redo logs. This function is vital not only for recovering from system failures but also supports various administrative tasks over the product's lifespan, especially when significant configuration changes require a database restart. The speed of fault recovery is paramount because it affects how quickly users can access the database and their overall experience. For instance, in MySQL, the undo phase occurs quietly after service restart, avoiding any extension to the discussed startup time. The real factors contributing to downtime are setting up necessary data structures and allocating memory. However, the most time-intensive part involves performing and applying redo operations.

4.2.1 Primary Recovery Process. The recovery process generally follows these steps. Initially, the current checkpoint position is read from the ib_checkpoint file. Starting from this checkpoint, a sequential scan of redo logs is performed until the last complete mtr is found. The redo of the last incomplete mtr is then truncated, ensuring the atomicity of the mtr. Throughout this scan, all encountered redo logs are parsed and maintained in an in-memory hash map, sorted by page. The scan either concludes or triggers an exception of Page Apply if the hash map consumes excessive memory. This involves using the maintained redo records in the hash map to replay onto page contents, thus obtaining the latest page version. The process encompasses reading, parsing, and applying all active redo. The unpredictability in timing mainly stems from:

• Volume of redo: The time to read, parse, and process transactions correlates with the active redo log volume. The Buffer Pool aggregates pages, leading to multi-gigabyte checkpoints. Excessive load, IO blocking, or software bugs may exacerbate delays in checkpoint completion, thereby extending the time consumption.

• Page IO amplification: Pages are modified across different redo segments following the access sequence. Exceeding the Buffer Pool's capacity necessitates frequent swaps to shared storage, compounded by the memory usage of extra hash maps during recovery, further exacerbating IO amplification.

• Underutilization of storage characteristics: IO operations, particularly with logs and page recycling, incur higher latency on distributed storage systems compared to local disks, requiring more concurrent reads to offset delays.

In practical deployments, instances have experienced prolonged redo phases, resulting in extended periods of unavailability. The capability of single-page recovery in MVD enables the deferment of the time-consuming redo phase until after service provision. By leveraging the page-oriented nature of redo logs and the high throughput of distributed shared storage, it is possible to significantly reduce downtime and potentially accelerate overall completion time in the background. The modified process involves:

(1) Starting the scan from the position where the log index is already generated, rather than from the checkpoint, as the log index typically remains in a relatively closer position, unrelated to the length of active redo logs.

(2) Completing the log index during the scan without maintaining the previous In-memory Redo Hash.

(3) Scanning ib_parsemeta to identify which pages are involved in post-checkpoint redo logs, marking these as "register pages". Instead of applying pages synchronously during startup, the application to pages is postponed until after the instance becomes operational.

(4) Subsequently, a background thread batch-processes the application recovery for pages.

Moreover, if a user accesses any page from the register pages beforehand, all necessary redo logs must be applied to this page before it is returned. Whether for foreground or background application, this process requires accessing all necessary redo logs for the page through the log index, similar to the RO's log index reading process previously mentioned, with the restored page being removed from the register pages. This workflow demonstrates how instant recovery can significantly advance the timing of instance recovery and service provision from completing an extensive Redo Apply to just scanning a minimal segment of redo logs.
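A condensed sketch of the modified startup path described above: instead of replaying all active redo before accepting traffic, startup registers the affected pages from ib_parsemeta and lets foreground accesses or a background pass apply redo per page afterwards. The helper functions and the registry structure are illustrative stand-ins for the engine's internals, not the actual implementation.

```cpp
// Illustrative instant-recovery flow (Section 4.2): register pages at startup,
// apply redo per page on demand or in the background. Helpers are stubs.
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <vector>

using Lsn = uint64_t;
using PageId = uint64_t;

struct Page { PageId id; Lsn page_lsn; };

// Hypothetical helpers backed by the log index (see Section 3.3).
std::vector<PageId> pages_touched_since_checkpoint() { return {}; }  // stub: scan ib_parsemeta
void apply_all_redo_for_page(Page&) {}                               // stub: replay via log index

class InstantRecovery {
 public:
  // Startup: register pages instead of applying them; service can open now.
  void on_startup() {
    std::lock_guard<std::mutex> g(mu_);
    for (PageId id : pages_touched_since_checkpoint()) register_pages_.insert(id);
  }

  // Foreground: a user access to a registered page triggers on-demand replay,
  // after which the page is removed from the register set.
  void before_serving(Page& page) {
    std::lock_guard<std::mutex> g(mu_);
    if (register_pages_.erase(page.id) > 0) apply_all_redo_for_page(page);
  }

  // Background: batch-apply the remaining registered pages after startup.
  void background_pass(std::vector<Page>& buffer_pool_view) {
    for (Page& p : buffer_pool_view) before_serving(p);
  }

 private:
  std::mutex mu_;
  std::unordered_set<PageId> register_pages_;
};
```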
4.2.2 Segment-oriented Recovery. Instant recovery utilizes a background process that applies pages while mitigating IO amplification issues. Nevertheless, this process introduces additional redo read IO for log indexes and redo logs due to frequent, fine-grained access. Such slow IO can impede the speed of page recovery in the background, negatively impacting normal user operations over time. To tackle this issue, a common method is to enhance redo logs and log indexes with memory caches. However, the page restoration process might involve a broad array of redo logs, resulting in insufficient differentiation of data "hotness"—an essential factor for effective cache use. An alternative, multi-version instant recovery, avoids direct page restoration to their latest version during background recovery. Instead, it utilizes a segment-oriented recovery strategy alongside caching. Here, the memory cache orderly retains segments of redo logs and their corresponding log index entries, from older to newer, within a set memory limit. Following this, background threads restore all pages within these segments, repeating the cycle until all targeted segments are fully restored. Due to this approach, pages in the Buffer Pool may reflect intermediate states, which can result in three potential outcomes:

(1) Subsequent recovery segments continue the restoration process, applying redo logs in sequence based on the current state.

(2) User request access necessitates the preemptive application of all subsequent redo logs, accepting potential delays.

(3) Before further access, pages are flushed or evicted, after which they can be reloaded and accessed anew, either continuing with the recovery process or being accessed by users.

In segment-oriented recovery, background application threads initiate the process by loading essential memory segments and redo logs from ib_parsedata into a Shared Recv Cache. This cache is then available to all threads that need log index entries for page recovery, covering both user and background threads. These threads prioritize retrieving data from the Shared Recv Cache for log access. This strategy significantly reduces the I/O requirements during the restoration of background pages, thus speeding up the recovery process. Additionally, as log segments are processed, the system's checkpoint progresses, addressing the problem of checkpoint stagnation during prolonged recovery operations. We detail in Table 1 the principal differences between the instant recovery facilitated by MVD, relevant to this section, and traditional recovery methods.

Table 1: Comparison of Recovery Stages

Stage | Without Instant Recovery | With Instant Recovery
Redo Scans | All active redo logs; single-threaded | Only after the log index position; fetch register pages from ib_parsemeta
Redo Parse | All active redo logs; single-threaded |
Redo Apply | Synchronous; in redo sequence; limited parallelism | Asynchronous background; by segments; intra-segment concurrency
When available? | After Redo Apply | After Redo Scans

4.2.3 One-Pass Restore. In IO-bound scenarios, the restoration process, following the order of redo log access, can lead to the repetitive loading of the same page into the Buffer Pool. This process—applying a small redo segment, writing to disk, and swapping out—induces significant IO amplification and inefficiency, constrained by IO capacity. Mitigating this issue requires addressing IO amplification for individual pages effectively.

The capability offered by multi-versioning to obtain a specific version of a page on demand naturally solves this problem. The core idea behind One-Pass Restore is to perform restoration according to the sequence of pages rather than the sequence of redo log access. As the name One-Pass implies, each page undergoes just one read IO and one write IO in the process, applying all the necessary redo logs. During the One-Pass process for a page, it is essential to access all redo content for that page through multi-version log indexes. Given that redo logs for different pages are interwoven in continuous redo files, and considering the segmented sorting characteristic of log indexes, it is crucial to avoid additional IO amplification from accessing redo logs or log indexes. To this end, a log merging strategy has been implemented, focusing on:

(1) Log index merging: By extending the log index format to support storing redo content directly, in addition to the location information of redo logs, this approach eliminates the overhead of random access of redo logs after obtaining the log index.

(2) Intra-file segment merging: As mentioned, an individual ib_parsedata may contain multiple segments, which are ordered by page within each segment but not connected between segments. The first step of One-Pass Restore is to merge these segments within a file to achieve overall order. This is accomplished through direct parsing of redo logs, which supports backup and restoration of historical instances, including those without multi-versioning.

(3) Inter-file log index multi-way merging: Before restoring pages, a multi-way merging of all log index files is conducted to sequentially obtain all redo logs for each page.

This comprehensive strategy effectively mitigates IO amplification issues in IO-bound scenarios, ensuring efficient and streamlined data recovery processes.
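To illustrate the inter-file multi-way merge at the heart of One-Pass Restore, the sketch below merges several log-index streams, each assumed to be already sorted by page id (and by LSN within a page) and to carry redo content inline per the log-index-merging step, and restores each page with a single read and a single write. The stream layout and the page IO/apply helpers are assumptions, not the actual file format.

```cpp
// Illustrative multi-way merge for One-Pass Restore (Section 4.2.3).
// Each input stream stands for one merged log index file, sorted by (page, LSN).
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

using Lsn = uint64_t;
using PageId = uint64_t;

struct IndexEntry { PageId page_id; Lsn lsn; std::vector<uint8_t> redo; };

struct HeapItem {
  PageId page_id; Lsn lsn; size_t stream; size_t pos;
  bool operator>(const HeapItem& o) const {
    return page_id != o.page_id ? page_id > o.page_id : lsn > o.lsn;
  }
};

// Stubs for page IO and redo replay; one read and one write per page overall.
void read_page(PageId) {}
void apply_redo(PageId, const IndexEntry&) {}
void write_page(PageId) {}

void one_pass_restore(const std::vector<std::vector<IndexEntry>>& streams) {
  std::priority_queue<HeapItem, std::vector<HeapItem>, std::greater<HeapItem>> heap;
  for (size_t s = 0; s < streams.size(); ++s)
    if (!streams[s].empty())
      heap.push({streams[s][0].page_id, streams[s][0].lsn, s, 0});

  bool open = false;
  PageId current = 0;
  while (!heap.empty()) {
    HeapItem top = heap.top();
    heap.pop();
    const IndexEntry& e = streams[top.stream][top.pos];
    if (!open || e.page_id != current) {   // moving on to the next page
      if (open) write_page(current);       // single write IO for the page
      current = e.page_id;
      read_page(current);                  // single read IO for the page
      open = true;
    }
    apply_redo(current, e);                // replay in (page, LSN) order
    size_t next = top.pos + 1;
    if (next < streams[top.stream].size())
      heap.push({streams[top.stream][next].page_id,
                 streams[top.stream][next].lsn, top.stream, next});
  }
  if (open) write_page(current);
}
```

Because the merged entries arrive sorted by (page, LSN), all redo for one page is consumed consecutively, which is exactly what bounds the per-page IO to a single pass.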
4.3 Elastic Scaling of RO Node

In high-stress scenarios, the expedited scaling of RO nodes is commonly pursued to enhance horizontal read scalability through rapid augmentation of RO instances. Traditionally, integrating a new RO node necessitated initiating a synchronization process from the data flush in the RW node's Buffer Pool. To incorporate a new RO node while mitigating the risk of failure due to inadequate memory space of the redo cache, an enforced checkpoint to the most recent position was triggered upon the RO node's addition. This checkpoint primarily depended on the Buffer Pool's flushing of the oldest dirty pages, a procedure limited by the positions of other RO nodes and thus subject to unpredictable durations.

Furthermore, under substantial write loads, this methodology could intensify the likelihood of sequential RO node failures. With multiple RO nodes within a cluster, the integration of a new RO node or the reboot of a previously failed one awaiting registration (during the checkpoint process) would restrict its applied LSN position. This restriction could precipitate In-memory Redo Hash expansion or even out-of-memory (OOM) failures in other RO nodes tracking the write LSN closely, engendering a cycle of recurrent crashes.

The deployment of a multi-version log index alters this paradigm. Upon connection to the RW node, a new RO node is no longer compelled to delay for a flush; rather, it can initiate a replication relationship from the position dictated by the log index. Any missing redo logs can be sourced from the log index as necessary. Through a solitary log index flush, the goal is to synchronize this position closely with the RW node's current write LSN, fostering the premise that multi-version RO nodes can begin replication from a position proximate to the RW node's latest write activity.

4.4 Rapid Backtrack

Point-in-time restoration initiates a new database instance. Despite optimizations like One-Pass Restore, this may still require up to several tens of minutes from start to finish. During this period, the new instance cannot handle user requests. Such restoration times are problematic: in urgent scenarios, such as a need for rapid data rollback due to user error, extended downtime is untenable and can halt user operations. Additionally, the presence of an additional instance incurs increased costs. Thus, we aim to offer backtrack capabilities to support rapid restoration of the current instance. Backtrack prioritizes service resumption, delaying the time-consuming page processing

Table 2: Evaluated Benchmarking Cases

Benchmark | Scenario | Dataset Size
SysBench | Sample OLTP | 30GB/300GB
TPC-C | Standard OLTP | 30GB/300GB
Hotspots | Live Commerce customer | 100GB
Multi-Index Table | SaaS customer | 100GB

Figure 11: Results of RO online scaling. (Panels compare Vanilla and MVD under IO-bound low and IO-bound high workloads; the upper plots report performance on a x10^5 scale with relative differences annotated, the lower plots report Memory Usage (GBs), both against Time elapsed since RO registered (Seconds).)

5.6 Rapid Backtrack

Finally, we assessed MVD's efficacy in database instance backtrack, with findings in Figure 12. We craft an experiment for retrospective database navigation, using redo log data volume as a measure of the temporal span for backtrack. Vanilla methods lack a forward traversal capability over persisted states: historical version recovery requires the presumption of a full backup in an earlier state than the target, followed by restoration and redo log traversal. Therefore, we assume the existence of a full backup earlier than the "present" by 5GB of redo logs. For a -5GB backtrack, the vanilla method reconstructs the instance based on this backup; for -3GB, it applies 2GB of redo logs after restoring the backup. Conversely, MVD employs a rapid backtrack approach (Section 4.4), significantly reducing retrospection time to 44% of conventional full backup reconstruction for -5GB. For non-exact backup matches, as in the -3GB and -1GB cases, redo log application increases overhead. MVD's backtrack time remains relatively unaffected due to the inherited capability for early startup from instant recovery, coupled with One-Pass Restore, thus circumventing the performance inefficiencies associated with repetitive page swapping during the sequential log application phase from the Buffer Pool. Moreover, the presence of multi-version page management enables the direct retrieval of an earlier version of a page, as opposed to its reconstruction.

Figure 12: Results of backtrack performance. (Time consumption (Seconds) of Vanilla versus MVD for backtrack distances of -5GB, -3GB, -1GB, and 0 of redo logs; annotated reductions include -75.5% and -56.2%.)

Compute-storage disaggregation separates storage from computational resources, enabling dynamic storage scaling without affecting computing capacity. Stemming from this transition, numerous successful and innovative database designs have emerged, e.g., Aurora [50], Cornus [23], Hailstorm [7], Socrates [5], Taurus [17]. Storage disaggregation promotes advanced data management policies, including tiering and automated lifecycle management [9, 22, 25, 28], and strengthens fault tolerance and disaster recovery through flexible replication and backup strategies [19, 27, 57]. It addresses challenges such as increasing data volumes, the necessity for high availability, and the demand for cost-effective scalability, facilitating efficient management of diverse workloads through optimized data placement and access.

Shared Storage Architectures. In distributed systems, shared storage architecture serves as the cornerstone of cloud computing, and work on shared storage can trace back to RAID [41]. With the rapid expansion of data and advancements in network and storage technologies, techniques such as fractional repetition coding [29, 42, 44, 48] have enhanced the system's repair capabilities and adaptability. This has spurred a shift from localized storage solutions to sophisticated distributed and parallel frameworks, addressing challenges in data consistency, security, and error management. Such evolution has led to the emergence of seminal works, including GPFS [45], GFS [21], Ceph [1], Dynamo [16], and PolarFS [10].

Data Consistency and Synchronization. Ensuring data consistency and synchronization in distributed systems, particularly within primary-secondary architectures, mandates the amalgamation of diverse methodologies, including multi-version concurrency control [32, 47, 62], replication protocols [3, 43], and consensus algorithms, notably Paxos [11] and Raft [39]. This body of research underscores the ongoing development in distributed data management aimed at optimizing consistency, availability, and resilience.

7 CONCLUSION

The journey of Alibaba Cloud's CloudJump framework through the intricate landscape of compute-storage disaggregation elucidates a path toward redefining cloud database scalability, resilience, and efficiency. By leveraging standard cloud storage services to foster a shared-storage architecture, this study underscores the pivotal role of Multi-Version Data (MVD) technology in addressing the paramount challenge of data consistency across compute nodes. MVD enhances operational simplicity, node consistency, recovery, and data persistence without a custom storage layer. The practical application and benefits of MVD technology, as demonstrated in the PolarDB product, signify a leap toward optimizing cloud-native database systems, ensuring their adaptability, high availability, and cost-efficiency in the face of evolving digital demands. This paper contributes to the ongoing discourse in the cloud database domain by offering a nuanced understanding of shared storage solutions and their impact on cloud-native databases.
REFERENCES

Based on DMVCC. In HiPC. IEEE Computer Society, 142–151.
[48] Natalia Silberstein and Tuvi Etzion. 2015. Optimal Fractional Repetition Codes Based on Graphs and Designs. IEEE Trans. Inf. Theory 61, 8 (2015), 4164–4180.
[49] Junjay Tan, Thanaa M. Ghanem, Matthew Perron, Xiangyao Yu, Michael Stonebraker, David J. DeWitt, Marco Serafini, Ashraf Aboulnaga, and Tim Kraska. 2019. Choosing A Cloud DBMS: Architectures and Tradeoffs. Proc. VLDB Endow. 12, 12 (2019), 2170–2182.
[50] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In SIGMOD Conference. ACM, 1041–1052.
[51] Hoang Tam Vo, Sheng Wang, Divyakant Agrawal, Gang Chen, and Beng Chin Ooi. 2012. LogBase: A Scalable Log-structured Database System in the Cloud. Proc. VLDB Endow. 5, 10 (2012), 1004–1015.
[52] Hiroshi Wada, Alan D. Fekete, Liang Zhao, Kevin Lee, and Anna Liu. 2011. Data Consistency Properties and the Trade-offs in Commercial Cloud Storage: the Consumers' Perspective. In CIDR. [Link], 134–143.
[53] Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Randy H. Katz, and Ion Stoica. 2012. Cake: enabling high-level SLOs on shared storage systems. In SoCC. ACM, 14.
[54] Jianguo Wang and Qizhen Zhang. 2023. Disaggregated Database Systems. In SIGMOD Conference Companion. ACM, 37–44.
[55] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. 2006. Ceph: A Scalable, High-Performance Distributed File System. In 7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6-8, Seattle, WA, USA, Brian N. Bershad and Jeffrey C. Mogul (Eds.). USENIX Association, 307–320. [Link]
[56] Christian Winter, Jana Giceva, Thomas Neumann, and Alfons Kemper. 2022. On-Demand State Separation for Cloud Data Warehousing. Proc. VLDB Endow. 15, 11 (2022), 2966–2979.
[57] Timothy Wood, H. Andrés Lagar-Cavilla, K. K. Ramakrishnan, Prashant J. Shenoy, and Jacobus E. van der Merwe. 2011. PipeCloud: using causality to overcome speed-of-light delays in cloud-based disaster recovery. In SoCC. ACM, 17.
[58] Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo. 2017. An Empirical Evaluation of In-Memory Multi-Version Concurrency Control. Proc. VLDB Endow. 10, 7 (2017), 781–792.
[59] Qizhen Zhang, Yifan Cai, Sebastian Angel, Vincent Liu, Ang Chen, and Boon Thau Loo. 2020. Rethinking Data Management Systems for Disaggregated Data Centers. In CIDR. [Link].
[60] Qizhen Zhang, Yifan Cai, Xinyi Chen, Sebastian Angel, Ang Chen, Vincent Liu, and Boon Thau Loo. 2020. Understanding the Effect of Data Center Resource Disaggregation on Production DBMSs. Proc. VLDB Endow. 13, 9 (2020), 1568–1581.
[61] Zhanhao Zhao, Hexiang Pan, Gang Chen, Xiaoyong Du, Wei Lu, and Beng Chin Ooi. 2023. VeriTxn: Verifiable Transactions for Cloud-Native Databases with Storage Disaggregation. Proc. ACM Manag. Data 1, 4 (2023), 270:1–270:27.
[62] Yue Zhuge, Hector Garcia-Molina, and Janet L. Wiener. 1997. Multiple View Consistency for Data Warehousing. In ICDE. IEEE Computer Society, 289–300.