1 unstable release

Uses new Rust 2024

0.1.0 Feb 8, 2026


MIT license


license: mit
library_name: custom
pipeline_tag: feature-extraction
tags:

  • sentence-embeddings
  • text-embeddings
  • rust

auto-g-embed

Local semantic embedding pipeline with a Rust-native runtime.

What this repo provides

  • Contrastive dataset preparation (prepare_contrastive)
  • Rust-native embedder training (train_rust_embedder)
  • Runtime embedding APIs and examples
  • Optional ONNX/SentenceTransformer path in training/

Quick start

cargo test

./training/run_pipeline.sh \
  --profile kaggle_questions_million \
  --source-csv data/kaggle/one-million-reddit-questions.csv

Run the Rust embedding example:

cargo run --example rust_embed -- \
  artifacts/model/rust-embedder \
  "A quick test sentence for semantic embeddings."

Model artifacts

Published model artifacts are available on Hugging Face:

Project layout

  • src/: library modules and binaries
  • examples/: runnable embedding demos
  • tests/: integration/performance tests
  • training/: pipeline scripts and dataset adapters

Development checks

cargo fmt --all -- --check
cargo clippy --all-targets --all-features -- -D warnings
cargo test

Community Benchmark

Run the reproducible benchmark CLI:

cargo run --release --bin community_benchmark -- \
  --output artifacts/benchmarks/latest.json

The output includes throughput, latency percentiles (p50/p95/p99), retrieval quality metrics, and environment metadata for publishing. Methodology and reporting guidance: BENCHMARKS.md.
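As a rough illustration of how throughput and latency percentiles like these can be derived from per-call timings, here is a minimal sketch. It is not the `community_benchmark` implementation: the `percentile` and `measure` helpers are hypothetical names, the nearest-rank percentile method is an assumption, and the workload is a stand-in for a real embedding call. BENCHMARKS.md remains the reference for the actual methodology.

```rust
use std::time::Instant;

/// Nearest-rank percentile over an ascending-sorted slice of latencies (microseconds).
/// Hypothetical helper; the real benchmark may interpolate differently.
fn percentile(sorted_us: &[f64], pct: f64) -> Option<f64> {
    if sorted_us.is_empty() || !(0.0..=100.0).contains(&pct) {
        return None;
    }
    // Nearest-rank: rank = ceil(pct / 100 * n), clamped to [1, n].
    let n = sorted_us.len();
    let rank = ((pct / 100.0) * n as f64).ceil() as usize;
    Some(sorted_us[rank.clamp(1, n) - 1])
}

/// Times `work` once per input and returns
/// (items per second, per-item latencies in microseconds, sorted ascending).
fn measure<T>(inputs: &[T], mut work: impl FnMut(&T)) -> (f64, Vec<f64>) {
    let mut latencies_us = Vec::with_capacity(inputs.len());
    let start = Instant::now();
    for input in inputs {
        let t0 = Instant::now();
        work(input);
        latencies_us.push(t0.elapsed().as_secs_f64() * 1e6);
    }
    let total_secs = start.elapsed().as_secs_f64();
    latencies_us.sort_by(|a, b| a.total_cmp(b));
    let throughput = if total_secs > 0.0 {
        inputs.len() as f64 / total_secs
    } else {
        0.0
    };
    (throughput, latencies_us)
}

fn main() {
    // Stand-in workload: a real run would call the embedder here instead.
    let inputs: Vec<String> = (0..1000).map(|i| format!("sentence {i}")).collect();
    let (throughput, latencies) = measure(&inputs, |s| {
        std::hint::black_box(s.bytes().map(u64::from).sum::<u64>());
    });
    println!("embeds_per_second: {throughput:.2}");
    for p in [50.0, 95.0, 99.0] {
        println!("p{p:.0}_us: {:.2}", percentile(&latencies, p).unwrap());
    }
}
```

Running a separate warmup pass before timing, as the `--warmup-count` flag suggests the real CLI does, keeps cold-cache effects out of the tail percentiles.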

Latest benchmark (M4 Max, February 8, 2026):

cargo run --release --bin community_benchmark -- \
  --eval-count 500 --warmup-count 100 --query-count 32 \
  --output artifacts/benchmarks/smoke.json

Results:

  • embeds_per_second: 219595.18
  • p50_us: 3.88
  • p95_us: 6.54
  • p99_us: 6.71
  • top1_accuracy: 0.9375
  • separation: 0.2886
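
To make the two quality metrics concrete, the sketch below computes them under assumed definitions: `top1_accuracy` as the fraction of queries whose highest-cosine document is the expected one, and `separation` as the mean similarity of matching pairs minus the mean similarity of non-matching pairs. These definitions, the `retrieval_metrics` name, and the "query i matches document i" layout are assumptions for illustration; BENCHMARKS.md is the reference for what the CLI actually reports. Under this reading, a `top1_accuracy` of 0.9375 with `--query-count 32` would correspond to 30 of 32 queries.

```rust
/// Cosine similarity; returns 0.0 when either vector has zero norm.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 {
        0.0
    } else {
        dot / (na * nb)
    }
}

/// Assumed setup: query `i` should retrieve document `i`.
/// Returns (top-1 accuracy, mean matching similarity - mean non-matching similarity).
fn retrieval_metrics(queries: &[Vec<f32>], docs: &[Vec<f32>]) -> (f32, f32) {
    assert_eq!(queries.len(), docs.len(), "one expected document per query");
    let mut hits = 0usize;
    let (mut pos_sum, mut neg_sum, mut neg_count) = (0.0f32, 0.0f32, 0usize);
    for (i, q) in queries.iter().enumerate() {
        // Track the best-scoring document index for this query.
        let mut best = (0usize, f32::NEG_INFINITY);
        for (j, d) in docs.iter().enumerate() {
            let s = cosine(q, d);
            if s > best.1 {
                best = (j, s);
            }
            if i == j {
                pos_sum += s;
            } else {
                neg_sum += s;
                neg_count += 1;
            }
        }
        if best.0 == i {
            hits += 1;
        }
    }
    let n = queries.len().max(1) as f32;
    let separation = pos_sum / n - neg_sum / neg_count.max(1) as f32;
    (hits as f32 / n, separation)
}

fn main() {
    // Toy 2-dimensional embeddings, purely illustrative.
    let queries = vec![vec![1.0, 0.0], vec![0.0, 1.0]];
    let docs = vec![vec![0.9, 0.1], vec![0.2, 0.8]];
    let (top1, separation) = retrieval_metrics(&queries, &docs);
    println!("top1_accuracy: {top1:.4}");
    println!("separation: {separation:.4}");
}
```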

Comparison Chart

Model                                   embeds_per_second  p50_us  p95_us  p99_us  top1_accuracy  separation
auto-g-embed (local smoke run)          219595.18          3.88    6.54    6.71    0.9375         0.2886
Llama-3.2-NV-EmbedQA-1B-v2              140.7              7000    8000    N/R     N/R            N/R
Llama-3.2-NeMo-Retriever-300M-Embed-V1  126.0              8000    8300    N/R     N/R            N/R
NV-EmbedQA-E5-v5                        196.3              5100    5400    N/R     N/R            N/R
NV-EmbedQA-Mistral7B-v2                 67.9               14600   15400   N/R     N/R            N/R
SwiftEmbed (paper)                      50000              1120    N/R     N/R     N/R            N/R
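
One way to sanity-check rows like these is to compare each throughput figure with the per-embedding time it would imply if requests ran strictly one at a time (1,000,000 / embeds_per_second microseconds). The sketch below does that arithmetic for the figures in the table above; `implied_mean_us` is a hypothetical helper and the sequential, unbatched assumption is ours, not a statement about how any of these systems were measured. Where the implied time is far below the reported p50, the throughput was presumably obtained with batching or concurrency, so the rows are not strictly like-for-like.

```rust
/// Mean microseconds per embedding implied by a throughput figure,
/// assuming strictly sequential, unbatched execution (an assumption).
fn implied_mean_us(embeds_per_second: f64) -> Option<f64> {
    if embeds_per_second > 0.0 {
        Some(1_000_000.0 / embeds_per_second)
    } else {
        None
    }
}

fn main() {
    // (model, embeds_per_second, reported p50_us), copied from the comparison table.
    let rows = [
        ("auto-g-embed (local smoke run)", 219595.18, 3.88),
        ("Llama-3.2-NV-EmbedQA-1B-v2", 140.7, 7000.0),
        ("Llama-3.2-NeMo-Retriever-300M-Embed-V1", 126.0, 8000.0),
        ("NV-EmbedQA-E5-v5", 196.3, 5100.0),
        ("NV-EmbedQA-Mistral7B-v2", 67.9, 14600.0),
        ("SwiftEmbed (paper)", 50000.0, 1120.0),
    ];
    for (model, eps, p50_us) in rows {
        let implied = implied_mean_us(eps).expect("throughput must be positive");
        println!("{model}: implied {implied:.2} us/embed, reported p50 {p50_us} us");
    }
}
```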

Notes:

Additional docs

  • Training and pipeline details: training/README.md
  • Test data notes: test-data/README.md
