🪠 Data Integration
'Drop-in' Kafka Streams State Store implementation that persists data to Apache Cassandra / ScyllaDB
A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.
High-performance, low-footprint SQL database written in C++. Process millions of rows per second from Kafka, Pulsar, or ClickHouse, and seamlessly write results back. Supports powerful features lik…
Possibly the fastest DataFrame-agnostic quality check library in town.
Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.
Kubernetes Operator for StarRocks
An open protocol for secure data sharing
The Metadata Platform for your Data and AI Stack
Kubernetes-native platform to run massively parallel data/streaming jobs
SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.
Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.
Open, Multi-modal Catalog for Data & AI
Synthetic data generators for tabular and time-series data
Headless TypeScript ORM with a head. Runs on Node, Bun and Deno. Lives on the Edge and yes, it's a JavaScript ORM too 😅
Apache Polaris, the interoperable, open source catalog for Apache Iceberg
FastStream is a powerful and easy-to-use Python framework for building asynchronous services interacting with event streams such as Apache Kafka, RabbitMQ, NATS and Redis.
Database Markup Language (DBML), designed to define and document database structures
Copy to/from Parquet in S3 or Azure Blob Storage from within PostgreSQL
Fastest library to load data from DB to DataFrames in Rust and Python
Modern and easy to use SQL client for MySQL, Postgres, SQLite, SQL Server, and more. Linux, macOS, and Windows.
Swiftly build and enhance your Kafka Streams applications.
This is a repo with links to everything you'd ever want to learn about data engineering
Readyset is a MySQL and Postgres wire-compatible caching layer that sits in front of existing databases to speed up queries and horizontally scale read throughput. Under the hood, Readyset caches t…
Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
Zero-ETL, infinite possibilities. Live query APIs, code & more with SQL. No DB required.
Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.