chgl's Stars

🪠 Data Integration

120 repositories

'Drop-in' Kafka Streams State Store implementation that persists data to Apache Cassandra / ScyllaDB

Java 24 5 Updated Feb 12, 2025

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

Python 2,040 93 Updated Sep 21, 2024

High-performance, low-footprint SQL database written in C++. Process millions of rows per second from Kafka, Pulsar, or ClickHouse, and seamlessly write results back. Supports powerful features lik…

C++ 1,653 74 Updated Feb 14, 2025

Possibly the fastest DataFrame-agnostic quality check library in town.

Python 181 21 Updated Feb 11, 2025
Java 179 80 Updated Feb 14, 2025

Apache Paimon is a lake format that enables building a Realtime Lakehouse Architecture with Flink and Spark for both streaming and batch operations.

Java 2,638 1,026 Updated Feb 17, 2025

Kubernetes Operator for StarRocks

Go 145 71 Updated Feb 14, 2025

An open protocol for secure data sharing

Scala 804 182 Updated Jan 30, 2025

The Metadata Platform for your Data and AI Stack

Java 10,294 3,045 Updated Feb 17, 2025

Kubernetes-native platform to run massively parallel data/streaming jobs

Go 1,790 124 Updated Feb 17, 2025

SeaTunnel is a next-generation super high-performance, distributed, massive data integration tool.

Java 8,282 1,902 Updated Feb 17, 2025

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics

C++ 14,999 3,621 Updated Feb 17, 2025

Python ETL framework for stream processing, real-time analytics, LLM pipelines, and RAG.

Python 13,638 304 Updated Feb 17, 2025

Open, Multi-modal Catalog for Data & AI

Python 2,658 436 Updated Feb 13, 2025

Synthetic data generators for tabular and time-series data

Jupyter Notebook 1,498 246 Updated Feb 14, 2025

Headless TypeScript ORM with a head. Runs on Node, Bun and Deno. Lives on the Edge and yes, it's a JavaScript ORM too 😅

TypeScript 26,318 766 Updated Feb 17, 2025

Apache OpenDAL: One Layer, All Storage.

Rust 3,776 524 Updated Feb 17, 2025

Apache Polaris, the interoperable, open source catalog for Apache Iceberg

Java 1,339 179 Updated Feb 16, 2025

FastStream is a powerful and easy-to-use Python framework for building asynchronous services interacting with event streams such as Apache Kafka, RabbitMQ, NATS and Redis.

Python 3,462 187 Updated Feb 10, 2025

Database Markup Language (DBML), designed to define and document database structures

JavaScript 3,016 179 Updated Feb 17, 2025

Copy to/from Parquet in S3 or Azure Blob Storage from within PostgreSQL

Rust 427 15 Updated Jan 31, 2025

Fastest library to load data from DB to DataFrames in Rust and Python

Rust 2,118 165 Updated Feb 14, 2025

Modern and easy to use SQL client for MySQL, Postgres, SQLite, SQL Server, and more. Linux, MacOS, and Windows.

TypeScript 17,421 1,133 Updated Feb 15, 2025

Swiftly build and enhance your Kafka Streams applications.

Java 108 21 Updated Feb 15, 2025

This is a repo with links to everything you'd ever want to learn about data engineering

Jupyter Notebook 26,595 5,441 Updated Jan 6, 2025

Readyset is a MySQL and Postgres wire-compatible caching layer that sits in front of existing databases to speed up queries and horizontally scale read throughput. Under the hood, ReadySet caches t…

Rust 4,816 136 Updated Feb 17, 2025

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.

Go 6,008 690 Updated Feb 17, 2025

Zero-ETL, infinite possibilities. Live query APIs, code & more with SQL. No DB required.

Go 7,180 285 Updated Feb 11, 2025

Apache Amoro (incubating) is a Lakehouse management system built on open data lake formats.

Java 917 310 Updated Feb 10, 2025

Postgres read replica optimized for analytics

Go 1,273 27 Updated Feb 14, 2025