Skip to content

pg-spot-ops/pg-spot-operator

Repository files navigation

Release Docker Pulls Tests passing

PG Spot Operator [Community Edition]

Think of it as one-liner RDS, but at a fraction of the cost! Typical savings of running self-managed EC2 Spot instances are around 5x compared to RDS.

Obviously not meant for all projects as a general RDS replacement, as utilizing Spot instances means one can be interrupted by AWS at any time, and it takes a few minutes to restore the state.

On the other hand - Spot eviction rates are insanely good for the price! The average frequency of interruption is only around 5% per month according to AWS data, meaning - one can expect to run a few months uninterrupted, i.e. still in the 99.9+% uptime range!

Based on concepts familiar from the Kubernetes world - user describes a desired state (min. hardware specs, Postgres version, extensions, admin user password etc) and there's a reconciliation loop of sorts.

Quickstart

Let's say we're a Data Scientists (sexiest job of 21st century, remember?) and need to perform some advanced ad-hoc exploration/analytics on a medium-size dataset of a few hundred GB. Sadly all the available development DBs are not much better than our shiny new MacBook Pro - seems our data exploration quest might not exactly be a lot of fun...

Wait, what about tapping into the power of cloud instead? Let's just spin up a private high-end analytics DB for an as-low-as-it-gets cost!

Just in case let's check the pricing beforehand, though - in most cases it will be much better than 3 year Reserved Instances!

# Step 0 - install the pg-spot-operator package via pip/pipx:
pipx install pg-spot-operator

# Resolve user requirements to actual EC2 instance types and show the cheapest instances.
# Note that in this example we only consider EU regions, to get OK latencies, globally one could save even more
pg_spot_operator --check-price \
  --region='eu-' --ram-min=128 \
  --storage-min=500 --storage-type=local

Resolving HW requirements to actual instance types / prices using --selection-strategy=balanced ...
Looking for the top 3 cheapest regions for given HW reqs within: ['eu-central-1', 'eu-central-2', 'eu-north-1', 'eu-south-1', 'eu-south-2', 'eu-west-1', 'eu-west-2', 'eu-west-3']
Top 3 cheapest regions pricing info for selection strategy 'balanced':
===== REGION eu-south-2 =====
Instance type selected for region eu-south-2: gr6.4xlarge (arm)
Main specs - vCPU: 16, RAM: 128 GB, instance storage: 600 GB nvme
Current monthly Spot price for gr6.4xlarge in region eu-south-2: $155.7
Current Spot vs Ondemand discount rate: -86.7% ($155.7 vs $1167.7), approx. 12x to non-HA RDS
Current expected monthly eviction rate range: 5-10%
...

Ok seems eu-south-2 is best for us currently with some incredible pricing, as hinted in the log output - a full work day on a very powerful instance will cost us a mere $1.7 - less than a cup of coffee!

For actually launching any AWS instances we of course need a working CLI (~/.aws/credentials) or have some privileged enough access key and secret available or some transparent "assume role" based scheme set up on the operator host + minimally SSH port 22 access from the operator node to the used Security Group (defaults to the "default" SG). More info on Security Groups and ports can be found in the Security section.

# In --connstr-output-only mode we can land right into `psql`!
psql $(pg_spot_operator --region=eu-south-2 --ram-min=128 \
  --storage-min=500 --storage-type=local \
  --instance-name=analytics --connstr-output-only \
  --admin-user=pgspotops --admin-password=topsecret123
)

2024-11-19 11:47:32,362 INFO Processing manifest for instance 'analytics' set via CLI / ENV ...
...
2024-11-19 11:47:40,778 INFO Launching a new spot instance of type gr6.4xlarge in region eu-south-2 ...
2024-11-19 11:47:54,250 INFO OK - aws VM i-07058e08fae07e50d registered for 'instance' analytics (ip_public = 51.92.44.224 , ip_private = 172.31.43.230)
2024-11-19 11:48:04,251 INFO Applying Postgres tuning profile 'default' to given hardware ...
...
2024-11-19 11:49:58,620 INFO Instance analytics setup completed

psql (17.1 (Ubuntu 17.1-1.pgdg24.04+1), server 16.5 (Debian 16.5-1.pgdg120+1))
SSL connection (protocol: TLSv1.3, cipher: TLS_AES_256_GCM_SHA384, compression: off, ALPN: none)
Type "help" for help.

postgres=# show shared_buffers ;
 shared_buffers
----------------
 26214MB
(1 row)

postgres=# \i my_dataset.sql

postgres=# SELECT ...

PS for more important or long-term purposes you would go with the default --storage-type=network, i.e. EBS, but for our work day or even week it's highly unlikely that the instance will get interrupted, and we rather want speed.

Also note that the instance is tuned according to the hardware already!

Wow, that task went smooth, other people's computers can be really useful sometimes...OK time to call it a day and shut down the instance ...

pg_spot_operator --region=eu-south-2 --instance-name=analytics --teardown

2024-11-19 11:48:04,251 INFO Destroying cloud resources if any for instance analytics ...
...
2024-11-19 11:48:04,251 INFO OK - cloud resources for instance analytics cleaned-up

PS If you don't yet have a safe AWS playground / credentials - start with a bit of Terraform here and feed the output into --aws-vpc-id, --aws-access-key-id and --aws-secret-access-key CLI params.

General idea

  • The user:
    • Specifies a few key parameters like region (required), and optionally some minimum hardware specs - CPU, RAM, target instance families / generations, or a list of suitable instance types explicitly and for local storage also the min. storage size, and maybe also the Postgres version (v15+ supported, defaults to latest stable) and some addon extensions. User input can come in 3 forms:
      • CLI/Env parameters a la --region, --instance-name, --ram-min, --assign-public-ip. Note that in CLI mode not all features can be configured and some common choices are made for the user
      • A YAML manifest as literal text via --manifest/PGSO_MANIFEST
      • A YAML manifest file via --manifest-path/PGSO_MANIFEST_PATH. To get an idea of all possible options/features one could take a look at an example manifest here
    • Specifies AWS credentials if no default AWS CLI (~/.aws/credentials) set up or if using Docker
    • Optionally specifies a callback (executable file) to do something with the resulting Postgres connect string (just displayed by default), or just runs in --connstr-output-only mode to be pipe-friendly
  • The operator:
    • Finds the cheapest Spot instance with OK eviction rate (the default "balanced" --selection-strategy) for given HW requirements and launches a VM
    • Runs Ansible to set up Postgres
    • Keeps checking the VM health every minute (configurable via --main-loop-interval-s) and if eviction detected, launches a new VM, re-mounts the data volume or does a PITR restore from S3 (if --storage-type=local + S3 creds set) and resurrects Postgres

Usage

There are a lot of parameters one can specify, to shape the look of hardware and Postgres instance. Common usage though might look something like below. Note that by default we're in "daemon mode" - checking continuously for the instance health and re-building if needed.

Via Docker

An example Postgres v16 instance with a 1d lifetime and some extensions enabled:

docker run --name pg1 -e PGSO_INSTANCE_NAME=pg1 -e PGSO_REGION=eu-north-1 \
  -e PGSO_EXPIRATION_DATE=$(date --utc --date="+1 day" +%Y-%m-%d) \
  -e PGSO_STORAGE_MIN=100 -e PGSO_STORAGE_TYPE=local -e PGSO_CPU_MIN=2 \
  -e PGSO_EXTENSIONS=vector,pg_stat_statements -e PGSO_OS_EXTRA_PACKAGES=postgresql-16-pgvector \
  -e PGSO_SSH_KEYS="$(cat ~/.ssh/id_rsa.pub)" -e PGSO_POSTGRES_VERSION=16 \
  -e PGSO_AWS_ACCESS_KEY_ID="$(grep -m1 aws_access_key_id ~/.aws/credentials | sed 's/aws_access_key_id = //')" \
  -e PGSO_AWS_SECRET_ACCESS_KEY="$(grep -m1 aws_secret_access_key ~/.aws/credentials | sed 's/aws_secret_access_key = //')" \
  pgspotops/pg-spot-operator:latest

PS The SSH key is optional, to be able to access the cloud VM directly from your workstations for potential troubleshooting etc.

Via Python

# PS Assuming local AWS CLI is configured!
pipx install pg-spot-operator
# Let's run a world-global price check for some in-memory analytics to get some insane $$ value
# PS in default --storage-type=network mode we though still pay normal price for the EBS volumes
pg_spot_operator --check-price --ram-min=256
...
Main specs - vCPU: 32, RAM: 256 GB, instance storage: EBS only
Current monthly Spot price for r6a.8xlarge in region ap-south-2: $95.0
Current Spot vs Ondemand discount rate: -88.5% ($95.0 vs $823.7), approx. 14x to non-HA RDS
Current expected monthly eviction rate range: <5%

pg_spot_operator --ram-min=256 --region=ap-south-2 --instance-name=play
# Note that when installing from PyPI, on first real setup run the Ansible setup files are downloaded
# from Github into ~/.pg-spot-operator/ansible. If this seems too fishy, one can also pre-download.

Integrating with user applications

Although the Community Edition is designed for more light use, one can use it also to power real applications (given they cope with the possible service interruptions of course) by either providing a "setup finished" callback hook or just running in special pipe-friendly --connstr-output-only mode. More details here.

PS Real usage assumes that the engine is kept running on a single node only by the user, as there's by design no global synchronization / consensus store to keep things simple.

Enterprise Edition

Although the Community Edition works and is free to use also for all non-compete business purposes, it's taking the simplest approach to persistent Spot instances really, so that some aspects of the solution are "best-efforty" and one could do much more to ensure better uptimes and usability.

If you'd be interested in massive cost saving also for more critical Postgres databases, please register your email address via this form https://tinyurl.com/pgspotops to get notified once the upcoming Enterprise Edition is released.

Most import features of the Enterprise Edition:

  • HA / multi-node setups
  • GCP and Azure Spot instances support
  • Advanced eviction rate heuristics for better uptimes
  • Volume auto-growth
  • Temporary fallbacks to regular non-spot VMs once uptime budget burned
  • Major version upgrades
  • Stop / sleep schedules for even more savings
  • Better integration with typical DevOps flows
  • More security, e.g. certificate access
  • A CLI for ad-hoc DBA operations

Sustainable Open Source / VC info

As crazy as it might sound, we believe that such a solution (in a more polished form) would be a great addition to the Postgres ecosystem and also commercially viable, so we're going to give it a try. To speed up the development though, we'd also be interested in VC dollars - thus feel free to reach out to [email protected] if you happen to possess some and find the niche interesting.

Project status

Working Beta

  • Manifest API not yet fully fixed (not relevant though when using Docker or the CLI).
  • No guarantees on internal configuration being kept backwards compatible - thus might need to clean up ~/.pg-spot-operator if getting weird errors after a version update.

Other topics