CloudJump II: Optimizing Cloud Databases for Shared Storage
The CloudJump framework [12] aims to shed light on how compute-storage disaggregation can effectively enhance cloud databases' scalability, resilience, and efficiency by relying only on standard cloud storage services. To explore the evolving landscape of cloud-native databases, this study navigates through the design and implementation intricacies of the shared-storage architecture in the CloudJump framework. By focusing on the shared storage model integral to cloud database systems with a leader-follower pattern, we address the challenges of ensuring data consistency across multiple compute nodes—a cornerstone for achieving scalability in cloud environments. The initiation of Multi-Version Data (MVD) technology emerges as a key effort to enhance node consistency within the shared storage framework. MVD's role in enabling effective version management and facilitating precise data page construction by compute nodes is discussed, highlighting its potential to streamline recovery processes and improve data persistence strategies without the necessity for a specially tailored storage layer. This approach not only seeks to simplify operational complexities and reduce costs, but also reflects a thoughtful exploration of solutions to the challenges of cloud storage. Through industrial practice in the cloud product PolarDB, we provide insights into the real-world applications and benefits of MVD technology, such as improved node decoupling and faster failover mechanisms, thereby contributing to the discourse on advancing cloud-native database systems. The effectiveness of CloudJump II is validated through extensive experimental evaluation and results from production deployment.

1 INTRODUCTION

The transition to cloud-native architectures represents a pivotal moment in database technology, heralding an era that has reshaped the synergy between computing and storage [5, 33, 34, 49]. Central to this evolution is the principle of compute-storage disaggregation [7, 17, 23, 59, 61], a shift that redefines the core architecture of cloud databases. This disaggregation delineates the database system into distinct compute (responsible for query and transaction processing) and storage (managing log and data page persistence) layers, both of which can scale independently. Complementing this shift is the adoption of shared storage, a foundational component of cloud-native databases that embodies the shared-everything paradigm [10, 15, 36]. In contrast to the shared-nothing systems exemplified by Google Spanner [13], where each compute node manages isolated storage, shared storage consolidates storage into a unified, interconnected service accessible by all database nodes [1, 21, 53]. Figure 1 illustrates this architecture, depicting a shared storage subsystem integrated with a leader-follower compute layer [2, 17], including one read-write (RW) node and multiple read-only (RO) nodes, all accessing a shared network of storage nodes. By eliminating cross-node duplication and transmission of data pages, shared-storage architectures meet the growing demand for systems that offer scalability, elasticity, and resource efficiency. Their capability to optimize resource use and support dynamic workloads highlights their significance in the cloud-native database landscape.

Figure 1: Cloud-native databases based on shared storage. (The figure shows a compute layer in a leader-follower pattern, with one RW node and multiple RO nodes, connected via network I/O to a shared storage layer.)

To construct an industrial cloud-native database system following the shared-storage architecture depicted in Figure 1, frameworks exemplified by Amazon Aurora [50] have further tailored the storage layer into dedicated page services that allow redo logs to be applied to data pages accessed by compute nodes in a centralized manner. However, this approach hinders the storage layer from exploiting those standardized and continuously improved cloud storage services across various cloud vendors, including IBM Spectrum Scale (GPFS) [45], Amazon AWS EFS [4], Microsoft Azure Files [8], Google Cloud Filestore [21], and CephFS [55]. As such, we propose an exploration of constructing a shared storage layer for cloud databases using standardized cloud storage, based on the CloudJump [12] framework. This framework enhances the flexibility and scalability of storage solutions, better meeting the dynamic demands of cloud-native applications. By relying on standard components, CloudJump facilitates the creation of a high-quality cloud-native database service across various cloud platforms, thereby circumventing vendor lock-in [40]. It has become a cornerstone for a myriad of applications in orchestrating cloud resources.

The adoption of shared storage for multiple database nodes brings its own set of challenges, particularly in maintaining data consistency—a concern that stands at the forefront of cloud-native databases [31, 52]. To elaborate, when multiple compute nodes rely on a single shared storage system, there exists an inherent delay in the propagation of updates among them. Writes accepted by the RW node take effect immediately, subsequently necessitating synchronization with RO nodes. Owing to the communal nature of the storage, discrepancies between RO nodes and the shared storage can
arise, such as situations where the in-memory index of a lagged RO node references an unbuffered data page in shared storage that has been modified by the RW node. Traditionally, this challenge is mitigated through strategies that bind RW and RO nodes together, such as constraining RW write operations to facilitate RO synchronization, or by temporarily suspending RO services until the synchronization is achieved, thereby precluding responses to queries in the absence of guaranteed consistency. The discussion thus pivots on a delicate equilibrium between sacrificing performance and sacrificing availability.

In this paper, we chart the course of CloudJump's journey through the evolution of its storage layer, presenting an industrial solution that deftly navigates the complexities of cloud storage. Our approach is characterized by its utilitarian simplicity, harnessing the ubiquity and economy of standard cloud storage services. This advances the assertion that cloud databases can seamlessly integrate extensive storage capabilities, unwavering reliability, economic efficiency, and scalable flexibility. In CloudJump, the core challenge of resolving the data consistency issue involves ensuring consistent in-memory data across numerous compute nodes, necessitating a detailed exploration of the storage subsystem's pivotal role. The key contradiction lies in the fact that standardized cloud storage services (e.g., POSIX-compliant shared file systems [6, 10, 24]) are not specifically designed to coordinate their data streaming from multiple computing nodes. This results in leader nodes potentially updating shared storage data ahead of lagging follower nodes. Such discrepancies can lead to conflicts with the in-memory data of the followers. A viable resolution entails enabling shared storage to concurrently provide multiple valid data versions, thereby allowing asynchronous nodes to access data pages consistent with their current state. Unfortunately, standard cloud storage services do not inherently provide such a capability on their own.

To achieve this goal, we propose Multi-Version Data (MVD) technology within a leader-follower architecture, offering a robust mechanism for maintaining node consistency in shared storage environments. MVD enables storage engines to manage versioning with precision, supporting compute nodes in constructing accurate data pages. This approach aligns with Multi-Version Concurrency Control (MVCC) [58] practices, enhancing recovery efficiency and optimizing data persistence. By reducing node interdependencies and operational overhead, MVD provides a practical solution to node consistency issues. It serves as a foundational model for compute nodes, balancing structural consistency with access to data versions. Furthermore, this technology decouples node architecture, enabling faster recovery and more effective data persistence strategies without reliance on specialized intermediary layers.

As a critical subsystem within the PolarDB service focusing on storage decoupling and standardization, CloudJump highlights its practical deployment in production environments rather than prototype research. Integrated within the PolarDB storage engine, it supports over a million CPU database instances, serving more than 10,000 users across 80 global availability zones and managing over 20,000 database clusters. The deployment of these technologies has been validated through extensive testing and real-world applications. Essentially, MVD constructs a shared storage layer for database nodes under the leader-follower pattern upon a number of standard shared storage service instances, offering page-granular access with multi-version support. This approach aligns with the physical construction of the majority of database storage engines, demonstrating strong universality [38]. The integration of MVD in PolarDB ensures full InnoDB compatibility, underscoring its transformative potential within the realm of shared storage technology. The pursuit of shared storage architectures in cloud-native databases extends beyond a mere technical endeavor; it represents a strategic response to the digital economy's escalating demands for high availability, rigorous quality of service assurances, and unparalleled elasticity, while emphasizing cost-effectiveness.

Reflecting on our prior work initiated by "CloudJump I" [12], we delineated methodologies for integrating cloud storage within the paradigm of cloud-native databases. This initiative underscored the imperative differentiation between cloud storage modalities and localized SSD storage, illuminating critical concerns such as pronounced I/O latency, suboptimal bandwidth utilization, and the essential reconfiguration of traditional database architectures tailored for cloud ecosystems. Progressing beyond these foundational insights, in this paper, CloudJump II investigates the intricacies of databases predicated on shared storage frameworks.

Industrial practice has demonstrated the effectiveness of the approach, where MVD reduced out-of-memory (OOM) crashes in RO nodes by approximately 40%, accelerated failure recovery, and minimized downtime, thereby enhancing system availability. These improvements have bolstered our reputation and reduced compensation costs associated with service failures. This effort highlights our ongoing commitment to enhancing cloud-native database efficiency through a unified shared storage model, moving away from traditional architectures where each compute node pairs with independent storage resources.

Our main contributions are summarized as follows:

• The paper identifies a crucial challenge in the realm of cloud-native databases: exploring the construction of shared storage layers using standardized cloud storage services. This allows for the integration of standard cloud storage with shared-storage architecture in cloud-native databases, leveraging their respective advantages. We present a comprehensive solution that harmonizes the competing demands of performance, availability, and consistency within cloud databases, representing a significant step forward in leveraging cloud storage to enhance database scalability and interoperability.

• The paper proposes MVD technology, a novel method that adeptly resolves issues of data consistency across multiple compute nodes. MVD allows for intricate version management upon standardized cloud storage, enabling accurate data page construction and optimal data persistence. This methodological innovation serves as a cornerstone for balancing performance with consistency, showcasing a comprehensive solution to the complexities of shared storage systems.

• The practical deployment of MVD technology within PolarDB, an industrial product by Alibaba Cloud, evidences the method's effectiveness. This deployment demonstrates significant improvements in node decoupling and failover speeds, corroborated by empirical results from extensive testing and real-world application. The implementation demonstrates both theoretical progress and practical advantages for cloud-native databases, underscoring the paper's contributions in enhancing shared storage databases' performance and reliability.
Figure 3: An example of data inconsistency due to the shared storage architecture. (Panels: (a) RO/RW initial state; (b) RW update and flush pages; (c) RO state thereafter. Each panel shows B+-tree leaf pages, Page 6 through Page 9, on shared storage, around RW inserting key 90 and RO searching for key 97.)
better scalability. Consequently, it is evident that the update of in-memory data on RO nodes depends on asynchronously catching up with the redo log, while the update of external data pages relies on the write-back from the Buffer Pool of RW nodes, which cannot inherently maintain consistency. This leads to the core technical concern discussed in this paper: the data consistency of RO nodes.

To further illustrate the situation, consider a B+-tree as an example shown in Figure 3, where it is assumed that the index's non-leaf nodes corresponding to data pages are in the local Buffer Pool, whereas the leaf nodes are not cached. Figure 3(a) represents an initial state of a B+-tree index on a RO that has synchronized with RW. At this moment, as shown in Figure 3(b), RW accepts an operation Insert-90, which triggers node splitting, updates Page-5, modifies Page-8, and creates Page-9 on the shared storage. Although such insertion would result in RW writing WAL in shared storage and broadcasting the latest LSN to all RO nodes, one RO may not have had the chance to process these corresponding redo logs before receiving a query, such as Search-97 shown in Figure 3(c). This triggers a catastrophic inconsistency, where Page-5 in RO's local Buffer Pool remains in its old version before the split caused by Insert-90. Thus, RO attempts to traverse Page-8, but since Page-8 is not in its Buffer Pool, the RO node fetches this data page from shared storage. However, by this time, Page-8 has been modified and written back by RW, clearly making it impossible to find the expected record, as the index viewed by RO is in an illegitimate and confused state. It is crucial to note that in such inconsistent states, RO's response to Search-97, which is "not found", is not an acceptable "lag" but an error. This is distinct from the situation where, after RW inserts 90, RO regards Key-90 as nonexistent for a period. Therefore, such inconsistencies must be completely avoided.

2.4 Current Technical Challenges

In database systems on cloud environments, representative indexing structures such as LSM-trees and B+-trees present distinct advantages and challenges, particularly in the context of maintaining data consistency and version management across multiple nodes. LSM-trees, by virtue of their intrinsic mechanism for version management and their capacity to limit the scope of compaction, provide an advantageous framework for shared storage nodes to access required data versions. This stands in contrast to B+-trees, which inherently lack support for multiple versions since their design permits the overwriting of existing data with new versions, thereby necessitating the adoption of alternative strategies to ensure consistency across nodes. To mitigate the issues related to data consistency, as delineated in Section 2.3, CloudJump formerly imposed a dual physical replication constraint as shown in Figure 4. For a dirty page residing in the RW Buffer Pool awaiting flush, the difference between the dirty page and the persisted version that can be seen by all RO nodes in shared storage is represented by a series of redo logs. This can be denoted as an interval of log sequences [oldest_modification_lsn, newest_modification_lsn].

Figure 4: Technical challenges to force data consistency. (The figure shows the RW redo log sequence with its flush list of dirty pages spanning [oldest_modification_lsn, newest_modification_lsn], and the RO redo log parse buffer spanning [oldest_applied_lsn, newest_applied_lsn]. Under Constraint 1, delayed dirty page flushes congest the waiting list and can overflow the RW buffer; under Constraint 2, accumulated redo logs awaiting application inflate RO memory usage.)

Constraint 1 (Restricting RW flush dirty pages): When RW flushes dirty pages from its Buffer Pool into the shared storage, it must ensure that the newest_modification_lsn of the page to be flushed is not greater than the minimum applied LSN among all RO nodes, in order to prevent any RO from fetching a future and overly forward page.

Constraint 2 (Augmenting RO memory): Upon reading a page, RO must process all related redo logs within its log parse buffer to apply the corresponding modifications to the page retrieved from shared storage, ensuring accurate updates to reflect its latest state.

The aforementioned dual constraint can effectively address the issue of inconsistencies between RO in-memory data and data pages read from shared storage. However, this approach is not optimal, as these constraints introduce deficits in performance and flexibility:

• Stringent RW-RO Coupling: The reliance of RO on the RW's data engenders a bottleneck, significantly constraining the RW's capacity for write operations. This necessitates synchronization across all RO nodes to preserve data integrity, adversely affecting write throughput and complicating the enhancement of scalability and redundancy mechanisms. Consequently, the RW's performance is directly influenced by the efficacy of the slowest RO node.

• Constrained RO Latency: The close interdependence limits the permissible lag of RO behind RW, since excessive delays in RO could result in rapidly expanding and accumulating redo logs that must be managed and applied.
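To make the dual constraint concrete, the sketch below shows how an RW flush decision and an RO page read could be gated on the LSN intervals described above. It is a minimal illustration under assumed types (Page, RoNode, RedoRecord); it is not the PolarDB or InnoDB implementation.

```cpp
// Illustrative sketch of the dual constraint in Figure 4 (not actual PolarDB code).
// Page, RoNode, and RedoRecord are hypothetical stand-ins for engine structures.
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

using Lsn = uint64_t;

struct RedoRecord { Lsn lsn; /* page delta payload omitted */ };

struct Page {
  uint64_t page_id;
  Lsn oldest_modification_lsn;  // first change not yet persisted
  Lsn newest_modification_lsn;  // last change to the in-memory copy
};

struct RoNode { Lsn applied_lsn; };  // redo applied so far by this RO node

// Constraint 1: RW may flush a dirty page only if every RO node has applied
// redo past the page's newest modification; otherwise an RO could fetch a
// page that is "in the future" relative to its own in-memory state.
bool rw_may_flush(const Page& dirty, const std::vector<RoNode>& ro_nodes) {
  Lsn min_applied = std::numeric_limits<Lsn>::max();
  for (const RoNode& ro : ro_nodes)
    min_applied = std::min(min_applied, ro.applied_lsn);
  return dirty.newest_modification_lsn <= min_applied;
}

// Constraint 2: an RO reading a page from shared storage must replay every
// buffered redo record for that page up to its own applied LSN before the
// page can be served (the replay of the record body is omitted here).
void ro_read_fixup(Page& fetched, Lsn ro_applied_lsn,
                   const std::vector<RedoRecord>& page_redo) {
  for (const RedoRecord& rec : page_redo) {
    if (rec.lsn > fetched.newest_modification_lsn && rec.lsn <= ro_applied_lsn) {
      fetched.newest_modification_lsn = rec.lsn;  // mark record as applied
    }
  }
}
```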
Comprehensive and efficient data recovery mechanisms. The integrity of redo logs ensures database recovery to its pre-failure state via log replay, preserving data consistency. MVD's support for single-page recovery allows for the swift, on-demand restoration of any page version online, bypassing full log replay. This capability greatly enhances recovery speed and accuracy, shortening system recovery times and ensuring uninterrupted business operations and data accessibility in cloud-native settings.

These traits within the proposed shared storage framework highlight our commitment to advancing cloud-native database technologies. By addressing shared storage challenges, i.e., data consistency, optimized IO operations, and swift recovery processes, we position CloudJump as a robust, scalable solution for cloud-based databases.

3.3 Essential Data Structure—Log Index

In the architectural design presented in Section 3.1, a pivotal process involves retrieving all absent redo logs for a specific page upon request. Consequently, a mechanism designed to categorize redo logs by page is necessary, denominated the log index.

3.3.1 Log Index Generation. In the leader-follower pattern, MVD adopts RW to facilitate the generation of log indexes. Within contemporary database architectures, such as MySQL 8.0, there have been considerable advancements in the methodology of log generation, through the use of multi-threading and lock-free updates, enhancing the efficiency of redo log creation. The process of flushing redo logs to persistent storage is crucial for confirming transactions and flushing dirty pages, significantly impacting database performance. Moreover, the creation of redo logs via mini-transactions [20] (mtr) results in a dispersed record distribution across pages, which can lead to performance degradation and increased metadata volume due to the necessity of maintaining a log index.

To address these challenges, an asynchronous log index generation method is implemented. This method retains the standard procedure for redo log writing and leverages the redo cache's ability to hold the latest segment of redo logs in memory temporarily. An asynchronous parse thread then reads these logs, parses them, and generates log indexes. When a substantial number of log index entries accumulates, they are flushed to persistent storage.

The primary challenge involves synchronizing the rate of log index generation with the rapid production of redo logs without overusing CPU and IO resources. Since redo logs are produced sequentially, maintaining a comprehensive log index is infeasible due to variable space demands for page-specific indexes, complicating space management and necessitating complex allocation and recycling strategies, which increase space and IO usage. A segmented sorting (batch) approach is thus employed for the batch generation of log index entries, as illustrated in Figure 6. The size of each segment is constrained by available memory for sorting and the maximum tolerable RO delay. Smaller segments reduce the efficiency of merging log index entries and their retrieval speed, whereas larger segments increase memory demand, delay log index creation, and escalate the IO required per operation.

Figure 6: An example of log index generation. (Redo records are accumulated and grouped by page into 100MB batches; ib_parsedata stores the per-page ranges, e.g., ranges for Page 3, Page 6, and Page 9 across batch x-1 and batch x, while a HashMap in ib_parsemeta maps each page to the offset of its latest range in batch x.)

There is considerable flexibility in balancing these elements. The In-memory Redo Hash integral to RO indicates that newly generated log index entries are not immediately needed, allowing for some delay in log index creation. Depending on the operational requirements, a delay ranging from 500MB to 1GB in accessing the log index by RO is permissible. With respect to memory utilization for sorting and the volume of IO during log index generation, a practical approach in a multi-version system is to set batch sizes to 100MB of records. Each batch covers distinct page ranges. The asynchronous parse thread reads and parses redo logs from the redo cache, organizing them by page into sorted ranges in memory. A range includes a header denoting the page and redo log location details. After accumulating 100MB of redo logs or when changing redo files, an actual file write (ib_parsedata) is executed. Ranges are written sequentially to ib_parsedata, with each range updating its last position in ib_parsemeta upon writing and linking to the previous range for the same page in the current file. ib_parsemeta is only flushed to storage at the end of each file cycle, ensuring log index accessibility only after ib_parsemeta has been stored, thereby maintaining a file-size lag for accessing the log index.

3.3.2 Log Index Access. During operation, RO continuously scrutinizes redo logs from shared storage, parsing and updating existing pages in the Buffer Pool as well as various in-memory states. Throughout this process, a memory-based log index organized by page, referred to as the In-memory Redo Hash in Figure 5, is concurrently generated. This hash map associates page IDs with their corresponding redo operations. Previously, memory expansion in this segment could lead to self-inflicted system termination. The introduction of MVD facilitates the use of a more compact redo hash resident in memory. When space becomes scarce, RO can immediately expunge the oldest entries within the redo hash. If a user request necessitates accessing a new page and the LSN of the in-place page is assessed as outdated, it requires not only the In-memory Redo Hash but also logs from the Persistent Redo Hash. This process involves fetching the needed redo records from storage through the log index. By arranging redo records by page, the log applier can systematically apply these records to bring the page to its intended version. Thus, the implementation of the log index frees RO from the complications of memory expansion, effectively resolving Constraint 2, and ensures the continuous maintenance of an optimal apply LSN, indirectly resolving Constraint 1.
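The following sketch models, under simplifying assumptions, one log-index batch from Sections 3.3.1 and 3.3.2: parsed redo records are grouped by page into ranges (standing in for the ib_parsedata payload), a per-page map (standing in for ib_parsemeta) records the offset of each page's latest range together with a back-link to its previous range, and a lookup walks this chain to collect the redo needed for a requested page. The in-memory containers and layout are illustrative, not the actual file formats.

```cpp
// Simplified in-memory model of log index generation and lookup.
// The roles of ib_parsedata/ib_parsemeta are mimicked with containers;
// the real on-disk encoding is not reproduced here.
#include <cstdint>
#include <map>
#include <unordered_map>
#include <vector>

using Lsn = uint64_t;
using PageId = uint64_t;

struct RedoRecord { PageId page_id; Lsn lsn; std::vector<uint8_t> body; };

// A "range" groups all redo records for one page within one batch and
// remembers where the previous range for the same page was written.
struct Range {
  PageId page_id;
  int64_t prev_range_offset;        // -1 if this is the first range for the page
  std::vector<RedoRecord> records;  // sorted by LSN
};

class LogIndexBatch {
 public:
  // Accumulate and group parsed redo records by page (as in Figure 6).
  void add(const RedoRecord& rec) { by_page_[rec.page_id].push_back(rec); }

  // Flush the batch: ranges are appended to the data file (here a vector),
  // and the per-page map is updated to point at each page's latest range.
  void flush(std::vector<Range>& parsedata,
             std::unordered_map<PageId, int64_t>& parsemeta) {
    for (auto& [page, recs] : by_page_) {
      Range r;
      r.page_id = page;
      auto it = parsemeta.find(page);
      r.prev_range_offset = (it == parsemeta.end()) ? -1 : it->second;
      r.records = std::move(recs);
      parsemeta[page] = static_cast<int64_t>(parsedata.size());
      parsedata.push_back(std::move(r));
    }
    by_page_.clear();
  }

 private:
  std::map<PageId, std::vector<RedoRecord>> by_page_;  // grouped by page id
};

// Lookup: follow the per-page chain of ranges backwards to collect all redo
// records needed to bring a fetched page to the target LSN. The caller then
// applies the collected records in LSN order.
std::vector<RedoRecord> collect_redo(
    PageId page, Lsn up_to, const std::vector<Range>& parsedata,
    const std::unordered_map<PageId, int64_t>& parsemeta) {
  std::vector<RedoRecord> out;
  auto it = parsemeta.find(page);
  int64_t off = (it == parsemeta.end()) ? -1 : it->second;
  while (off >= 0) {
    const Range& r = parsedata[static_cast<size_t>(off)];
    for (const RedoRecord& rec : r.records)
      if (rec.lsn <= up_to) out.push_back(rec);
    off = r.prev_range_offset;
  }
  return out;
}
```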
3.3.3 Crash Safety Guarantees. In databases, crash safety mechanisms are primarily ensured through WAL or constraints on the order of storage writes. The log index serves as an index for redo operations, and to avoid redundancy and for simplicity, a disk write order constraint is utilized for multi-version concurrency control. The process involves two files, ib_parsedata and ib_parsemeta, during disk writes, adhering strictly to a sequence where file data (ib_parsedata) is persisted before its metadata (ib_parsemeta); crash consistency of the log index then follows from this write ordering and its comparatively simple file structure.

3.3.4 Resource Consumption and Performance. The generation and persistence of the log index on RW are optimized for minimal impact on typical workloads, involving the following aspects of resource consumption:

• CPU usage for parsing redo logs in the redo cache and organizing them by page: This consumption is proportional to the volume of user-generated redo writes, with tests indicating an additional CPU overhead of approximately 3%-5%.

• Memory allocation for batch sorting, ib_parsedata, and ib_parsemeta: For a 100MB LSN batch, the estimated memory allocation is 200MB for the sorting batch and ib_parsedata, with ib_parsemeta structures requiring around 50MB. Overall, this is relatively minimal and manageable.

• IO resources consumed during the storage write operations of ib_parsedata and ib_parsemeta: ib_parsedata adopts an append-only approach and achieves high IO efficiency with 100MB batch accumulation; for ib_parsemeta, only 50MB is needed per 1GB of redo logs. This method ensures that the generation speed of the log index can keep pace with redo production.

4 DESIGN AND IMPLEMENTATION DETAILS

MVD's architecture adheres to the Copy-On-Write (CoW) principle to manage shared storage updates without in-place modifications, using a log index mechanism. This allows RO nodes to dynamically reconstruct pages using base pages and redo logs instead of replicating pages with each modification, thus maintaining logical versioning with conventional storage. Given that most database kernels operate at the page level, this design is widely applicable. Additionally, CloudJump is deeply rooted in commercial products, particularly focusing on MySQL's InnoDB engine, where significant engineering has improved the B+-trees. Upcoming subsections will explore these technical enhancements in detail, emphasizing optimized logical page assembly and management.

4.1 Write Elision

In the architecture of WAL-based database engines, modifications to a page typically generate a concise redo record, rendering the page "dirty" within the Buffer Pool. Owing to the Buffer Pool's finite capacity, the lack of available space necessitates a strategy like Least Recently Used (LRU) for page eviction selection. Should an eviction-targeted page be dirty, it requires disk writing first, initiating a page-sized write IO operation. In scenarios where the data volume substantially exceeds the Buffer Pool's capacity, this event becomes commonplace: each page, once loaded into the Buffer Pool and minimally modified, faces rapid eviction and disk writing. This sequence of minor modifications resulting in significant write IO operations leads to IO resource wastage and may evolve into a database performance bottleneck, indicative of an IO-Bound situation. The innovation of Write Elision, facilitated by MVD, presents a novel solution by omitting the dirty page flush process upon eviction, thereby eliminating page-sized IO operations. Subsequent access to such a page deems it a "corrupt page." Necessary log records are retrieved via page indexing, enabling change application. This methodology, by sidestepping extensive page IO, markedly improves database performance in IO-Bound conditions. The procedure is depicted in Figure 7 (a zoom-in view of Figure 5's bottom left corner), encompassing the following stages:

(1) Operationally, a continuous Sequence Redo Hash is preserved in memory, derived from the redo log's flush activity and the log index generation's async parser buffer, facilitating recent redo log access by page.

(2) Pages designated for Buffer Pool flushing, influenced by policies such as LRU, undergo a selection process via a multi-version write elision strategy. This procedure evaluates various factors, including current user load, the degree of modifications on the dirty page, and existing memory utilization.

(3) Pages selected for write elision have their respective redo logs compiled from the Sequence Redo Hash by page ID into the Page Redo Hash, and are then incorporated into the Flush Pool, exempting them from the current flush cycle. Pages not chosen adhere to conventional flush protocols.

(4) Upon subsequent access post-IO completion, the corresponding Page Redo Hash is extracted from the Flush Pool, with the relevant redo logs applied.

(5) Pages within the Flush Pool are propelled to persisted storage upon fulfilling flush prerequisites, whether through dirty flushing or via periodic inspections by a write elision flusher. Thereafter, they are expunged from the Page Redo Hash.

Figure 7: The schematic process of write elision. (The figure shows the Log Writer and the Async Parser feeding a Sequence Redo Hash over shared storage, with per-page redo, e.g., redo of Page X and Page Y, collected into a Page Redo Hash.)

Specifically, the Sequence Redo Hash handles two principal types of data: conventional redo data and log index data, the latter derived from asynchronous parsing. MVD aims to reduce non-essential activities by efficiently transferring redo data from the Redo Buffer, which the log writer flushes, utilizing the unified architecture of redo logs to avoid extra contention costs. Meanwhile, Log Index data are generated from a buffer copy prior to flushing by the asynchronous parser, and incorporate information from the standard operational memory cache. The memory designated for write elision is limited, necessitating evolution in the Sequence Redo Hash to eliminate obsolete data, proposing two management strategies for surplus requests. One strategy uses the log index to locate corresponding redo files, a process often deemed inefficient due to the resultant increase in IO operations, potentially negating or worsening the reductions in page IO provided by write elision. Alternatively, flushing pages from the Flush Pool before discarding historical data could impair the efficacy of delaying dirty page flushing. The fundamental assumption of write elision rests on the efficiency derived from consolidating multiple IO requests for the same page within the Flush Pool, concurrently managing the cache load with pages outside the Flush Pool to prevent undue occupancy.
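The sketch below condenses the staged procedure above into an elision decision: a dirty page chosen for eviction either follows the normal flush path or, if it qualifies, has its recent redo moved from the Sequence Redo Hash into a per-page Page Redo Hash and is parked in the Flush Pool without a page-sized write. The structure names mirror the text, but the qualification checks and the threshold value are assumptions for illustration; the concrete criteria used in practice are discussed below.

```cpp
// Illustrative write elision decision (Section 4.1), not actual InnoDB code.
// Threshold values and structure layouts are assumptions for the sketch.
#include <cstdint>
#include <unordered_map>
#include <vector>

using Lsn = uint64_t;
using PageId = uint64_t;

struct RedoRecord { Lsn lsn; std::vector<uint8_t> body; };

struct DirtyPage {
  PageId page_id;
  Lsn oldest_modification_lsn;
  Lsn newest_modification_lsn;
  size_t redo_bytes;  // volume of redo accumulated for this page
};

struct SequenceRedoHash {
  Lsn lower_limit_lsn = 0;  // oldest redo still held in memory
  std::unordered_map<PageId, std::vector<RedoRecord>> by_page;
};

struct FlushPool {
  // page id -> redo records to re-apply when the page is next accessed
  std::unordered_map<PageId, std::vector<RedoRecord>> page_redo_hash;
};

// Stage (2): decide whether a page selected for eviction may skip its flush.
bool may_elide(const DirtyPage& p, const SequenceRedoHash& seq,
               Lsn async_parser_lsn, size_t redo_byte_threshold /* e.g. 255 */) {
  return p.redo_bytes <= redo_byte_threshold &&            // lightly modified
         p.newest_modification_lsn <= async_parser_lsn &&  // already indexed
         p.oldest_modification_lsn >= seq.lower_limit_lsn; // redo still cached
}

// Stage (3): move the page's redo from the Sequence Redo Hash into the
// Flush Pool's Page Redo Hash and evict the frame without a page-sized write.
void elide_flush(const DirtyPage& p, SequenceRedoHash& seq, FlushPool& pool) {
  auto it = seq.by_page.find(p.page_id);
  if (it != seq.by_page.end()) {
    pool.page_redo_hash[p.page_id] = it->second;  // keep redo for later replay
  }
  // Stage (4) replays page_redo_hash on the next access; stage (5) eventually
  // flushes the page and erases its entry from the Page Redo Hash.
}
```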
Write elision is a technique developed to enhance the management of dirty pages in database systems by reducing unnecessary I/O operations and improving performance. This method involves precise criteria to determine when a page should be flushed, thus reducing the buildup of small redo logs that contribute to page dirtying. A critical component of this methodology is tracking the amount of redo data per page, typically monitored with mtr and a set byte threshold of 255. This approach ensures that performance remains unaffected in environments not limited by I/O. It includes evaluating various metrics, such as the dirty page ratio in the Buffer Pool, to ensure optimal performance under different conditions. Pages with a significantly outdated oldest_modification_lsn are prioritized for flushing. These pages are likely to be flushed by background operations if modifications are postponed, negating the benefits of delayed flushing. Additionally, the implementation constraints require that a page's newest_modification_lsn remain below the position of the asynchronous parser and that the oldest_modification_lsn stay ahead of the lower limit of the Sequence Redo Hash. Resource constraints also apply, such as when memory allocation fails to secure necessary resources. Adhering to these principles allows the write elision strategy to efficiently manage databases while ensuring high performance across complex operational environments.

4.2 Instant Recovery

Fault recovery is a crucial feature of database systems, designed to restore committed data that was not yet persisted before interruptions, by using redo logs. This function is vital not only for recovering from system failures but also supports various administrative tasks over the product's lifespan, especially when significant configuration changes require a database restart. The speed of fault recovery is paramount because it affects how quickly users can access the database and their overall experience. For instance, in MySQL, the undo phase occurs quietly after service restart, avoiding any extension to the discussed startup time. The real factors contributing to downtime are setting up necessary data structures and allocating memory. However, the most time-intensive part involves performing and applying redo operations.

4.2.1 Primary Recovery Process. The recovery process generally follows these steps. Initially, the current checkpoint position is read from the ib_checkpoint file. Starting from this checkpoint, a sequential scan of redo logs is performed until the last complete mtr is found. The redo of the last incomplete mtr is then truncated, ensuring the atomicity of the mtr. Throughout this scan, all encountered redo logs are parsed and maintained in an in-memory hash map, sorted by page. The scan either concludes or triggers an exception of Page Apply if the hash map consumes excessive memory. This involves using the maintained redo records in the hash map to replay onto page contents, thus obtaining the latest page version. The process encompasses reading, parsing, and applying all active redo. The unpredictability in timing mainly stems from:

• Volume of redo: The time to read, parse, and process transactions correlates with the active redo log volume. The Buffer Pool aggregates pages, leading to multi-gigabyte checkpoints. Excessive load, IO blocking, or software bugs may exacerbate delays in checkpoint completion, thereby extending the time consumption.

• Page IO amplification: Pages are modified across different redo segments following the access sequence. Exceeding the Buffer Pool's capacity necessitates frequent swaps to shared storage, compounded by the memory usage of extra hash maps during recovery, further exacerbating IO amplification.

• Underutilization of storage characteristics: IO operations, particularly with logs and page recycling, incur higher latency on distributed storage systems compared to local disks, requiring more concurrent reads to offset delays.

In practical deployments, instances have experienced prolonged redo phases, resulting in extended periods of unavailability. The capability of single-page recovery in MVD enables the deferment of the time-consuming redo phase until after service provision. By leveraging the page-oriented nature of redo logs and the high throughput of distributed shared storage, it is possible to significantly reduce downtime and potentially accelerate overall completion time in the background. The modified process involves:

(1) Starting the scan from the position where the log index is already generated, rather than from the checkpoint, as the log index typically remains in a relatively closer position, unrelated to the length of active redo logs.

(2) Completing the log index during the scan without maintaining the previous In-memory Redo Hash.

(3) Scanning ib_parsemeta to identify which pages are involved in post-checkpoint redo logs, marking these as "register pages". Instead of applying pages synchronously during startup, the application to pages is postponed until after the instance becomes operational.

(4) Subsequently, a background thread batch-processes the application recovery for pages.

Moreover, if a user accesses any page from the register pages beforehand, all necessary redo logs must be applied to this page before it is returned. Whether for foreground or background application, this process requires accessing all necessary redo logs for the page through the log index, similar to the RO's log index reading process previously mentioned, with the restored page being removed from the register pages. This workflow demonstrates how instant recovery can significantly advance the timing of instance recovery and service provision from completing an extensive Redo Apply to just scanning a minimal segment of redo logs.
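A condensed sketch of the modified startup path described above: instead of replaying all active redo before accepting traffic, startup registers the affected pages from ib_parsemeta and lets foreground accesses or a background pass apply redo per page afterwards. The helper functions and the registry structure are illustrative stand-ins for the engine's internals, not the actual implementation.

```cpp
// Illustrative instant-recovery flow (Section 4.2): register pages at startup,
// apply redo per page on demand or in the background. Helpers are stubs.
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <vector>

using Lsn = uint64_t;
using PageId = uint64_t;

struct Page { PageId id; Lsn page_lsn; };

// Hypothetical helpers backed by the log index (see Section 3.3).
std::vector<PageId> pages_touched_since_checkpoint() { return {}; }  // stub: scan ib_parsemeta
void apply_all_redo_for_page(Page&) {}                               // stub: replay via log index

class InstantRecovery {
 public:
  // Startup: register pages instead of applying them; service can open now.
  void on_startup() {
    std::lock_guard<std::mutex> g(mu_);
    for (PageId id : pages_touched_since_checkpoint()) register_pages_.insert(id);
  }

  // Foreground: a user access to a registered page triggers on-demand replay,
  // after which the page is removed from the register set.
  void before_serving(Page& page) {
    std::lock_guard<std::mutex> g(mu_);
    if (register_pages_.erase(page.id) > 0) apply_all_redo_for_page(page);
  }

  // Background: batch-apply the remaining registered pages after startup.
  void background_pass(std::vector<Page>& buffer_pool_view) {
    for (Page& p : buffer_pool_view) before_serving(p);
  }

 private:
  std::mutex mu_;
  std::unordered_set<PageId> register_pages_;
};
```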
4.2.2 Segment-oriented Recovery. Instant recovery utilizes a background process that applies pages while mitigating IO amplification issues. Nevertheless, this process introduces additional redo read IO for log indexes and redo logs due to frequent, fine-grained access. Such slow IO can impede the speed of page recovery in the background, negatively impacting normal user operations over time. To tackle this issue, a common method is to enhance redo logs and log indexes with memory caches. However, the page restoration process might involve a broad array of redo logs, resulting in insufficient differentiation of data "hotness"—an essential factor for effective cache use. An alternative, multi-version instant recovery, avoids direct page restoration to their latest version during background recovery. Instead, it utilizes a segment-oriented recovery strategy alongside caching. Here, the memory cache orderly retains segments of redo logs and their corresponding log index entries, from older to newer, within a set memory limit. Following this, background threads restore all pages within these segments, repeating the cycle until all targeted segments are fully restored. Due to this approach, pages in the Buffer Pool may reflect intermediate states, which can result in three potential outcomes:

(1) Subsequent recovery segments continue the restoration process, applying redo logs in sequence based on the current state.

(2) User request access necessitates the preemptive application of all subsequent redo logs, accepting potential delays.

(3) Before further access, pages are flushed or evicted, after which they can be reloaded and accessed anew, either continuing with the recovery process or being accessed by users.

In segment-oriented recovery, background application threads initiate the process by loading essential memory segments and redo logs from ib_parsedata into a Shared Recv Cache. This cache is then available to all threads that need log index entries for page recovery, covering both user and background threads. These threads prioritize retrieving data from the Shared Recv Cache for log access. This strategy significantly reduces the I/O requirements during the restoration of background pages, thus speeding up the recovery process. Additionally, as log segments are processed, the system's checkpoint progresses, addressing the problem of checkpoint stagnation during prolonged recovery operations. We detail in Table 1 the principal differences between the instant recovery facilitated by MVD, relevant to this section, and traditional recovery methods.

Table 1: Comparison of Recovery Stages

Stage | Without Instant Recovery | With Instant Recovery
Redo Scans | All active redo logs; single-threaded | Only after the log index position; fetch register pages from ib_parsemeta
Redo Parse | All active redo logs; single-threaded |
Redo Apply | Synchronous; in redo sequence; limited parallelism | Asynchronous background; by segments; intra-segment concurrency
When available? | After Redo Apply | After Redo Scans

4.2.3 One-Pass Restore. In IO-bound scenarios, the restoration process, following the order of redo log access, can lead to the repetitive loading of the same page into the Buffer Pool. This process—applying a small redo segment, writing to disk, and swapping out—induces significant IO amplification and inefficiency, constrained by IO capacity. Mitigating this issue requires addressing IO amplification for individual pages effectively.

The capability offered by multi-versioning to obtain a specific version of a page on demand naturally solves this problem. The core idea behind One-Pass Restore is to perform restoration according to the sequence of pages rather than the sequence of redo log access. As the name One-Pass implies, each page undergoes just one read IO and one write IO in the process, applying all the necessary redo logs. During the One-Pass process for a page, it is essential to access all redo content for that page through multi-version log indexes. Given that redo logs for different pages are interwoven in continuous redo files, and considering the segmented sorting characteristic of log indexes, it is crucial to avoid additional IO amplification from accessing redo logs or log indexes. To this end, a log merging strategy has been implemented, focusing on:

(1) Log index merging: By extending the log index format to support storing redo content directly, in addition to the location information of redo logs, this approach eliminates the overhead of random access of redo logs after obtaining the log index.

(2) Intra-file segment merging: As mentioned, an individual ib_parsedata may contain multiple segments, which are ordered by page within each segment but not connected between segments. The first step of One-Pass Restore is to merge these segments within a file to achieve overall order. This is accomplished through direct parsing of redo logs, which supports backup and restoration of historical instances, including those without multi-versioning.

(3) Inter-file log index multi-way merging: Before restoring pages, a multi-way merging of all log index files is conducted to sequentially obtain all redo logs for each page.

This comprehensive strategy effectively mitigates IO amplification issues in IO-bound scenarios, ensuring efficient and streamlined data recovery processes.
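To illustrate the inter-file multi-way merge at the heart of One-Pass Restore, the sketch below merges several log-index streams, each assumed to be already sorted by page id (and by LSN within a page) and to carry redo content inline per the log-index-merging step, and restores each page with a single read and a single write. The stream layout and the page IO/apply helpers are assumptions, not the actual file format.

```cpp
// Illustrative multi-way merge for One-Pass Restore (Section 4.2.3).
// Each input stream stands for one merged log index file, sorted by (page, LSN).
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

using Lsn = uint64_t;
using PageId = uint64_t;

struct IndexEntry { PageId page_id; Lsn lsn; std::vector<uint8_t> redo; };

struct HeapItem {
  PageId page_id; Lsn lsn; size_t stream; size_t pos;
  bool operator>(const HeapItem& o) const {
    return page_id != o.page_id ? page_id > o.page_id : lsn > o.lsn;
  }
};

// Stubs for page IO and redo replay; one read and one write per page overall.
void read_page(PageId) {}
void apply_redo(PageId, const IndexEntry&) {}
void write_page(PageId) {}

void one_pass_restore(const std::vector<std::vector<IndexEntry>>& streams) {
  std::priority_queue<HeapItem, std::vector<HeapItem>, std::greater<HeapItem>> heap;
  for (size_t s = 0; s < streams.size(); ++s)
    if (!streams[s].empty())
      heap.push({streams[s][0].page_id, streams[s][0].lsn, s, 0});

  bool open = false;
  PageId current = 0;
  while (!heap.empty()) {
    HeapItem top = heap.top();
    heap.pop();
    const IndexEntry& e = streams[top.stream][top.pos];
    if (!open || e.page_id != current) {   // moving on to the next page
      if (open) write_page(current);       // single write IO for the page
      current = e.page_id;
      read_page(current);                  // single read IO for the page
      open = true;
    }
    apply_redo(current, e);                // replay in (page, LSN) order
    size_t next = top.pos + 1;
    if (next < streams[top.stream].size())
      heap.push({streams[top.stream][next].page_id,
                 streams[top.stream][next].lsn, top.stream, next});
  }
  if (open) write_page(current);
}
```

Because the merged entries arrive sorted by (page, LSN), all redo for one page is consumed consecutively, which is exactly what bounds the per-page IO to a single pass.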
4.3 Elastic Scaling of RO Node

In high-stress scenarios, the expedited scaling of RO nodes is commonly pursued to enhance horizontal read scalability through rapid augmentation of RO instances. Traditionally, integrating a new RO node necessitated initiating a synchronization process from the data flush in the RW node's Buffer Pool. To incorporate a new RO node while mitigating the risk of failure due to inadequate memory space of the redo cache, an enforced checkpoint to the most recent position was triggered upon the RO node's addition. This checkpoint primarily depended on the Buffer Pool's flushing of the oldest dirty pages, a procedure limited by the positions of other RO nodes and thus subject to unpredictable durations.

Furthermore, under substantial write loads, this methodology could intensify the likelihood of sequential RO node failures. With multiple RO nodes within a cluster, the integration of a new RO node or the reboot of a previously failed one awaiting registration (during the checkpoint process) would restrict its applied LSN position. This restriction could precipitate In-memory Redo Hash expansion or even out-of-memory (OOM) failures in other RO nodes tracking the write LSN closely, engendering a cycle of recurrent crashes.

The deployment of a multi-version log index alters this paradigm. Upon connection to the RW node, a new RO node is no longer compelled to delay for a flush; rather, it can initiate a replication relationship from the position dictated by the log index. Any missing redo logs can be sourced from the log index as necessary. Through a solitary log index flush, the goal is to synchronize this position closely with the RW node's current write LSN, fostering the premise that multi-version RO nodes can begin replication from a position proximate to the RW node's latest write activity.

4.4 Rapid Backtrack

Point-in-time restoration initiates a new database instance. Despite optimizations like One-Pass Restore, this may still require up to several tens of minutes from start to finish. During this period, the new instance cannot handle user requests. Such restoration times are problematic: in urgent scenarios, such as a need for rapid data rollback due to user error, extended downtime is untenable and can halt user operations. Additionally, the presence of an additional instance incurs increased costs. Thus, we aim to offer backtrack capabilities to support rapid restoration of the current instance. Backtrack prioritizes service resumption, delaying the time-consuming page processing

Table 2: Evaluated Benchmarking Cases

Benchmark | Scenario | Dataset Size
SysBench | Sample OLTP | 30GB/300GB
TPC-C | Standard OLTP | 30GB/300GB
Hotspots | Live Commerce customer | 100GB
Multi-Index Table | SaaS customer | 100GB

Figure 11: Results of RO online scaling. (Panels compare Vanilla and MVD under IO-bound low and IO-bound high workloads; the upper plots report performance on a x10^5 scale with relative differences annotated, the lower plots report Memory Usage (GBs), both against Time elapsed since RO registered (Seconds).)

5.6 Rapid Backtrack

Finally, we assessed MVD's efficacy in database instance backtrack, with findings in Figure 12. We craft an experiment for retrospective database navigation, using redo log data volume as a measure of the temporal span for backtrack. Vanilla methods lack a forward traversal capability over persisted states: historical version recovery requires the presumption of a full backup in an earlier state than the target, followed by restoration and redo log traversal. Therefore, we assume the existence of a full backup earlier than the "present" by 5GB of redo logs. For a -5GB backtrack, the vanilla method reconstructs the instance based on this backup; for -3GB, it applies 2GB of redo logs after restoring the backup. Conversely, MVD employs a rapid backtrack approach (Section 4.4), significantly reducing retrospection time to 44% of conventional full backup reconstruction for -5GB. For non-exact backup matches, as in the -3GB and -1GB cases, redo log application increases overhead. MVD's backtrack time remains relatively unaffected due to the inherited capability for early startup from instant recovery, coupled with One-Pass Restore, thus circumventing the performance inefficiencies associated with repetitive page swapping during the sequential log application phase from the Buffer Pool. Moreover, the presence of multi-version page management enables the direct retrieval of an earlier version of a page, as opposed to its reconstruction.

Figure 12: Results of backtrack performance. (Time consumption (Seconds) of Vanilla versus MVD for backtrack distances of -5GB, -3GB, -1GB, and 0 of redo logs; annotated reductions include -75.5% and -56.2%.)

Compute-storage disaggregation separates storage from computational resources, enabling dynamic storage scaling without affecting computing capacity. Stemming from this transition, numerous successful and innovative database designs have emerged, e.g., Aurora [50], Cornus [23], Hailstorm [7], Socrates [5], Taurus [17]. Storage disaggregation promotes advanced data management policies, including tiering and automated lifecycle management [9, 22, 25, 28], and strengthens fault tolerance and disaster recovery through flexible replication and backup strategies [19, 27, 57]. It addresses challenges such as increasing data volumes, the necessity for high availability, and the demand for cost-effective scalability, facilitating efficient management of diverse workloads through optimized data placement and access.

Shared Storage Architectures. In distributed systems, shared storage architecture serves as the cornerstone of cloud computing, and work on shared storage can trace back to RAID [41]. With the rapid expansion of data and advancements in network and storage technologies, techniques such as fractional repetition coding [29, 42, 44, 48] have enhanced the system's repair capabilities and adaptability. This has spurred a shift from localized storage solutions to sophisticated distributed and parallel frameworks, addressing challenges in data consistency, security, and error management. Such evolution has led to the emergence of seminal works, including GPFS [45], GFS [21], Ceph [1], Dynamo [16], and PolarFS [10].

Data Consistency and Synchronization. Ensuring data consistency and synchronization in distributed systems, particularly within primary-secondary architectures, mandates the amalgamation of diverse methodologies, including multi-version concurrency control [32, 47, 62], replication protocols [3, 43], and consensus algorithms, notably Paxos [11] and Raft [39]. This body of research underscores the ongoing development in distributed data management aimed at optimizing consistency, availability, and resilience.

7 CONCLUSION

The journey of Alibaba Cloud's CloudJump framework through the intricate landscape of compute-storage disaggregation elucidates a path toward redefining cloud database scalability, resilience, and efficiency. By leveraging standard cloud storage services to foster a shared-storage architecture, this study underscores the pivotal role of Multi-Version Data (MVD) technology in addressing the paramount challenge of data consistency across compute nodes. MVD enhances operational simplicity, node consistency, recovery, and data persistence without a custom storage layer. The practical application and benefits of MVD technology, as demonstrated in the PolarDB product, signify a leap toward optimizing cloud-native database systems, ensuring their adaptability, high availability, and cost-efficiency in the face of evolving digital demands. This paper contributes to the ongoing discourse in the cloud database domain by offering a nuanced understanding of shared storage solutions and their impact on cloud-native databases.
REFERENCES

Based on DMVCC. In HiPC. IEEE Computer Society, 142–151.
[48] Natalia Silberstein and Tuvi Etzion. 2015. Optimal Fractional Repetition Codes Based on Graphs and Designs. IEEE Trans. Inf. Theory 61, 8 (2015), 4164–4180.
[49] Junjay Tan, Thanaa M. Ghanem, Matthew Perron, Xiangyao Yu, Michael Stonebraker, David J. DeWitt, Marco Serafini, Ashraf Aboulnaga, and Tim Kraska. 2019. Choosing A Cloud DBMS: Architectures and Tradeoffs. Proc. VLDB Endow. 12, 12 (2019), 2170–2182.
[50] Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. 2017. Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. In SIGMOD Conference. ACM, 1041–1052.
[51] Hoang Tam Vo, Sheng Wang, Divyakant Agrawal, Gang Chen, and Beng Chin Ooi. 2012. LogBase: A Scalable Log-structured Database System in the Cloud. Proc. VLDB Endow. 5, 10 (2012), 1004–1015.
[52] Hiroshi Wada, Alan D. Fekete, Liang Zhao, Kevin Lee, and Anna Liu. 2011. Data Consistency Properties and the Trade-offs in Commercial Cloud Storage: the Consumers' Perspective. In CIDR. [Link], 134–143.
[53] Andrew Wang, Shivaram Venkataraman, Sara Alspaugh, Randy H. Katz, and Ion Stoica. 2012. Cake: enabling high-level SLOs on shared storage systems. In SoCC. ACM, 14.
[54] Jianguo Wang and Qizhen Zhang. 2023. Disaggregated Database Systems. In SIGMOD Conference Companion. ACM, 37–44.
[55] Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, and Carlos Maltzahn. 2006. Ceph: A Scalable, High-Performance Distributed File System. In 7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6-8, Seattle, WA, USA, Brian N. Bershad and Jeffrey C. Mogul (Eds.). USENIX Association, 307–320. [Link]
[56] Christian Winter, Jana Giceva, Thomas Neumann, and Alfons Kemper. 2022. On-Demand State Separation for Cloud Data Warehousing. Proc. VLDB Endow. 15, 11 (2022), 2966–2979.
[57] Timothy Wood, H. Andrés Lagar-Cavilla, K. K. Ramakrishnan, Prashant J. Shenoy, and Jacobus E. van der Merwe. 2011. PipeCloud: using causality to overcome speed-of-light delays in cloud-based disaster recovery. In SoCC. ACM, 17.
[58] Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo. 2017. An Empirical Evaluation of In-Memory Multi-Version Concurrency Control. Proc. VLDB Endow. 10, 7 (2017), 781–792.
[59] Qizhen Zhang, Yifan Cai, Sebastian Angel, Vincent Liu, Ang Chen, and Boon Thau Loo. 2020. Rethinking Data Management Systems for Disaggregated Data Centers. In CIDR. [Link].
[60] Qizhen Zhang, Yifan Cai, Xinyi Chen, Sebastian Angel, Ang Chen, Vincent Liu, and Boon Thau Loo. 2020. Understanding the Effect of Data Center Resource Disaggregation on Production DBMSs. Proc. VLDB Endow. 13, 9 (2020), 1568–1581.
[61] Zhanhao Zhao, Hexiang Pan, Gang Chen, Xiaoyong Du, Wei Lu, and Beng Chin Ooi. 2023. VeriTxn: Verifiable Transactions for Cloud-Native Databases with Storage Disaggregation. Proc. ACM Manag. Data 1, 4 (2023), 270:1–270:27.
[62] Yue Zhuge, Hector Garcia-Molina, and Janet L. Wiener. 1997. Multiple View Consistency for Data Warehousing. In ICDE. IEEE Computer Society, 289–300.