Extracts all objects from Sui using RPC calls.
Data is extracted by loading checkpoints. Each checkpoint contains multiple transactions. Each transaction may contain object updates - new objects, deleted objects, or updated objects. We leverage Tokio to concurrently extract this data in multiple threads.
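As a rough sketch of that fan-out, the flow looks like the following; the helper functions and response types are hypothetical stand-ins for the app's real RPC plumbing, not its actual code:

```rust
use tokio::task::JoinSet;

// Hypothetical stand-ins for the real RPC response types.
struct Checkpoint { transaction_digests: Vec<String> }
struct Transaction { changed_object_ids: Vec<String> }

// Stand-ins for the real RPC calls.
async fn fetch_checkpoint(_seq: u64) -> Checkpoint {
    Checkpoint { transaction_digests: vec!["digest-1".into(), "digest-2".into()] }
}
async fn fetch_transaction(_digest: &str) -> Transaction {
    Transaction { changed_object_ids: vec!["0xabc".into()] }
}
async fn fetch_object(object_id: &str) {
    println!("fetched object {object_id}");
}

// For one checkpoint, fan out over its transactions and their object changes.
async fn process_checkpoint(seq: u64) {
    let checkpoint = fetch_checkpoint(seq).await;
    let mut tasks = JoinSet::new();
    for digest in checkpoint.transaction_digests {
        tasks.spawn(async move {
            let tx = fetch_transaction(&digest).await;
            for object_id in tx.changed_object_ids {
                fetch_object(&object_id).await;
            }
        });
    }
    // Wait for every transaction task in this checkpoint to finish.
    while tasks.join_next().await.is_some() {}
}

#[tokio::main]
async fn main() {
    process_checkpoint(42).await;
}
```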
Upon initialization, the app does several things:
- Connect to MongoDB and determine the last checkpoint it has read. If this is the first run, it will begin at the newest checkpoint and work toward the oldest checkpoint.
- Connect to one or more RPC servers, each of which is used round-robin for RPC calls (see the sketch after this list).
- Create Tokio threads to handle concurrent RPC calls for each checkpoint.
- Create a RocksDB database file, which maintains state for the Tokio threads. Because each checkpoint contains multiple transactions, and each transaction contains multiple object changes, multiple Tokio threads work on each checkpoint; RocksDB is the persistent storage layer that lets them determine which checkpoints and transactions have already been processed.
- Create a log file (if configured to do so). Otherwise, logs go to stdout.
- Create Tokio channels and workers. Each step of the pipeline relies on Tokio channels to manage data and facilitate backpressure.
- Create Pulsar topics. Each thread has its own Pulsar topic. The thread will only write to Pulsar when it fails to extract data via RPC.
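The round-robin endpoint selection mentioned above can be as simple as an atomic counter over the configured URLs; this is a sketch, not the app's actual implementation:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Rotates through a fixed list of RPC endpoints.
struct RpcPool {
    endpoints: Vec<String>,
    next: AtomicUsize,
}

impl RpcPool {
    fn new(endpoints: Vec<String>) -> Self {
        Self { endpoints, next: AtomicUsize::new(0) }
    }

    /// Returns the next endpoint in round-robin order; safe to call from many tasks.
    fn next_endpoint(&self) -> &str {
        let i = self.next.fetch_add(1, Ordering::Relaxed) % self.endpoints.len();
        &self.endpoints[i]
    }
}

fn main() {
    // Hypothetical endpoint URLs; the real list would come from config.yaml.
    let pool = RpcPool::new(vec![
        "https://rpc-1.example.com".into(),
        "https://rpc-2.example.com".into(),
    ]);
    for _ in 0..4 {
        println!("{}", pool.next_endpoint());
    }
}
```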
Using RPC calls, the app polls for new checkpoints, fetches all transactions for each checkpoint, and then fetches all object changes for each transaction. A distinct Tokio channel called the 'control channel' streams metadata about each stage of the data processing pipeline, ensuring integrity (that all transaction blocks and objects are ingested). It also tracks ingest latency.
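A minimal sketch of how such a control channel might be wired, using a bounded `tokio::sync::mpsc` channel (the event variants here are illustrative, not the app's actual metadata types):

```rust
use std::time::Instant;
use tokio::sync::mpsc;

// Illustrative pipeline events; the real control metadata will differ.
#[derive(Debug)]
enum ControlEvent {
    CheckpointStarted { seq: u64 },
    TransactionIngested { digest: String },
    ObjectIngested { object_id: String, started: Instant },
}

#[tokio::main]
async fn main() {
    // A bounded channel also provides backpressure: senders wait when the buffer is full.
    let (tx, mut rx) = mpsc::channel::<ControlEvent>(1024);

    // A worker reports progress on the control channel.
    let reporter = tokio::spawn(async move {
        tx.send(ControlEvent::CheckpointStarted { seq: 42 }).await.ok();
        tx.send(ControlEvent::TransactionIngested { digest: "digest-1".into() }).await.ok();
        tx.send(ControlEvent::ObjectIngested {
            object_id: "0xabc".into(),
            started: Instant::now(),
        })
        .await
        .ok();
        // Dropping `tx` here closes the channel and ends the consumer loop below.
    });

    // A single consumer audits completeness and measures ingest latency.
    while let Some(event) = rx.recv().await {
        match event {
            ControlEvent::ObjectIngested { object_id, started } => {
                println!("object {} ingested in {:?}", object_id, started.elapsed());
            }
            other => println!("{:?}", other),
        }
    }
    reporter.await.ok();
}
```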
The app runs in two modes:
- livescan: This default mode polls for the latest checkpoint and gradually traverses backward.
- backfill: This mode is enabled automatically when livescan lags behind by a certain number of checkpoints; the threshold is configured as `backfillthreshold` (see the sketch below). Backfill opens many more Tokio threads for RPC calls and MongoDB CRUD operations, which can cause crashes if the system cannot handle the load. Several options are available to help tune the overall throughput.
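A toy illustration of that threshold check; the checkpoint numbers are made up, and in the real app the latest checkpoint and indexing progress come from RPC and RocksDB:

```rust
// Which ingest mode to run, based on how far behind the indexer is.
#[derive(Debug, PartialEq)]
enum Mode { Livescan, Backfill }

fn choose_mode(latest_checkpoint: u64, last_indexed: u64, backfill_threshold: u64) -> Mode {
    if latest_checkpoint.saturating_sub(last_indexed) > backfill_threshold {
        Mode::Backfill
    } else {
        Mode::Livescan
    }
}

fn main() {
    // Lagging by 500 checkpoints with a threshold of 100 triggers backfill.
    assert_eq!(choose_mode(1_000_500, 1_000_000, 100), Mode::Backfill);
    // Lagging by only 10 checkpoints stays in livescan.
    assert_eq!(choose_mode(1_000_010, 1_000_000, 100), Mode::Livescan);
    println!("mode checks passed");
}
```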
The app automatically detects when there are several checkpoints in the backlog to index. This effectively triggers a backfill operation, which launches more ingest threads. Because multiple workers could be ingesting the same data, the app maintains a list of in-flight object IDs as a key/value store in RocksDB. While transaction blocks are scanned, a RocksDB query is issued for each object ID, and the object is skipped if there is a match. This ensures that only one Tokio worker handles a given object change and avoids duplicate CRUD operations to downstream systems.
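The in-flight check could look roughly like this sketch using the `rocksdb` crate; the key prefix and helper name are assumptions, and a real implementation would need to make the check-and-set atomic across workers:

```rust
use rocksdb::DB;

/// Returns true if this worker claimed the object ID, false if another
/// worker is already handling it. The "in-flight:" key prefix is illustrative.
fn try_claim(db: &DB, object_id: &str) -> Result<bool, rocksdb::Error> {
    let key = format!("in-flight:{object_id}");
    if db.get(key.as_bytes())?.is_some() {
        // Another Tokio worker already claimed this object change; skip it.
        return Ok(false);
    }
    db.put(key.as_bytes(), b"1")?;
    Ok(true)
}

fn main() -> Result<(), rocksdb::Error> {
    let db = DB::open_default("/tmp/indexer-inflight")?;
    for object_id in ["0xabc", "0xabc", "0xdef"] {
        if try_claim(&db, object_id)? {
            println!("processing {object_id}");
        } else {
            println!("skipping {object_id}: already in flight");
        }
    }
    Ok(())
}
```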
Livescan is initialized in the `spawn_livescan()` function, which spawns one or more Tokio workers based on the configuration option `livescan.workers.checkpoint`. When an un-indexed checkpoint is polled, the app paginates through all transaction blocks contained in the checkpoint. Because transaction blocks only contain the object IDs of affected objects, the livescan workers issue additional RPC calls to fetch the complete object data.
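The pagination step can be sketched as a cursor-driven loop; `fetch_tx_block_page` and its page shape below are hypothetical stand-ins for the real paginated RPC call:

```rust
// Hypothetical page shape for a paginated RPC response.
struct TxBlockPage {
    digests: Vec<String>,
    next_cursor: Option<String>,
}

// Stand-in for the real paginated RPC call; returns one page then stops.
async fn fetch_tx_block_page(_checkpoint: u64, cursor: Option<String>) -> TxBlockPage {
    match cursor {
        None => TxBlockPage { digests: vec!["digest-1".into()], next_cursor: None },
        Some(_) => TxBlockPage { digests: vec![], next_cursor: None },
    }
}

// Walk every transaction block in a checkpoint, page by page.
async fn scan_checkpoint(checkpoint: u64) -> Vec<String> {
    let mut all_digests = Vec::new();
    let mut cursor = None;
    loop {
        let page = fetch_tx_block_page(checkpoint, cursor).await;
        all_digests.extend(page.digests);
        match page.next_cursor {
            Some(next) => cursor = Some(next),
            None => break,
        }
    }
    all_digests
}

#[tokio::main]
async fn main() {
    println!("{:?}", scan_checkpoint(42).await);
}
```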
Sui Object updates are published to MongoDB. If MongoDB cannot handle the incoming operations, the app will crash and you will need to detune the pipeline settings. With an overly aggressive configuration, the Rust code extracts data faster than MongoDB can keep up; if you see seemingly random crashes, this is a likely cause.
Sui Object data that is loaded into MongoDB with the Sui Object Indexer is accessible via a GraphQL API. You may also query MongoDB directly, if you so choose. All fields in the objects - including nested fields - are accessible via GraphQL. Unlike the Sui Core RPC and Indexing APIs, which only store the BCS representation of Sui objects, this lets you filter, sort, and run other queries against the fields inside your Sui objects.
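As a sketch of how a client might hit this API over HTTP: the host and `/graphql` path below are assumptions, port 8000 matches the port opened in the setup steps, and `{ __typename }` is a schema-agnostic placeholder query (real queries against your object fields live in `example-queries`):

```rust
use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let client = reqwest::Client::new();
    // A placeholder query that any GraphQL server can answer.
    let body = json!({ "query": "{ __typename }" });
    let response = client
        .post("http://localhost:8000/graphql") // assumed endpoint path
        .json(&body)
        .send()
        .await?
        .text()
        .await?;
    println!("{response}");
    Ok(())
}
```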
- Located in the `server` directory of the repo.
- Example queries are located in the `example-queries` folder of the repo.
- We strongly recommend creating indices on critical fields used in your queries to improve MongoDB performance and reduce cost. Examples are included in `example-queries`.
- Cost reduction and query speed can be achieved by narrowing down the number of objects you load into MongoDB via the Sui Object Indexer. For example, if you are only working with data from one or a handful of Sui Move Packages, you can configure the indexer to exclusively load those items. This is documented in `config.yaml` in the `main` directory.
Default values are configured in `config.yaml` and can be overridden with environment variables. All values are well documented inside `config.yaml`.
Generally, any config value in `config.yaml` can be overridden with an environment variable. Nested values are separated by `_`, and environment variables must start with `APP_` to be considered.
Example: `APP_MONGO_URI=mongodb+srv://username:[email protected]/?retryWrites=true&w=majority`
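For instance, under this scheme the `livescan.workers.checkpoint` option mentioned above would presumably map to `APP_LIVESCAN_WORKERS_CHECKPOINT`.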
- Cost reduction and query speed can be achieved by narrowing down the number of objects you load into MongoDB via the Sui Object Indexer. For example, if you are only working with data from one or a handful of Sui Move Packages, you can configure the indexer to exclusively load those items. This is documented in `config.yaml`.
- You may alternatively blacklist package IDs rather than whitelist.
- MongoDB - We suggest MongoDB Atlas, but you may manage your own open source database instead.
- InfluxDB - We suggest InfluxDB Cloud, but you may manage your own open source database instead.
- Pulsar - We suggest StreamNative Cloud, but you may manage your own open source cluster instead.
- Clone repo into directory of your choice.
- Configure `config.yaml`. All parameters are documented inside the file.
- Ensure `pulsar-credentials.json` is located in the correct directory. The default is `/opt/pulsar-credentials.json`. If using StreamNative Cloud, you can generate and download this file in the web UI.
- Create the MongoDB user and database. These must match the parameters in `config.yaml`. This is done in the UI in MongoDB Atlas. If you are using MongoSH, there are example commands in `example-queries/mongodb/mongosh-setup-examples`.
- Create the InfluxDB bucket and token. This is done in the web UI whether you are using InfluxDB Cloud or a self-managed, open-source database.
- Open the port for the GraphQL web service: `sudo ufw allow 8000`
- Install the indexer application: move to the `main` directory of the repo and run `cargo install --path .`. This will create a new executable called `indexer`. This can take a long time to build the Sui dependencies.
- Start the indexer in the background: `setsid indexer`
- Install the GraphQL webserver application: move to the `server` directory of the repo and run `cargo install --path .`.
- Start the GraphQL webserver in the background: `setsid`
- If you are only running the indexer and webserver, a fairly small server will suffice (4 GB RAM + 4 CPU cores). You may need a larger server to facilitate a large number of queries or aggressive indexing settings.
- If you used `setsid` to launch the apps, you can use `htop` to see the process info.
- Logs are located at `/var/log/indexer.log`
- Fly is a hosting platform that automatically builds, deploys, and runs Docker containers. You can also manage secrets like API keys and database credentials in Fly.io. We have been able to host each Rust app on Fly for around $25 / month.
- Note that you will still need to use external services for the items in the Prerequisites section of this document.
- Both the `main` (Indexer) and `server` (GraphQL API) directories of the repo contain an example Fly.io configuration file. The instance sizes and other configuration options are a great starting point for both services. You can find the most up-to-date instructions on the Fly.io website: https://fly.io/docs/apps/launch/