LangExtract (Rust)

A Rust library for extracting structured, source-grounded information from unstructured text using LLMs. Every extraction is mapped back to exact character offsets in the original document, enabling verification and interactive highlighting.

The core workflow: you provide a few examples of what to extract; the library builds a few-shot prompt, chunks large documents, sends the chunks to an LLM in parallel, parses the structured JSON responses, aligns each extracted value back to its character position in the source text, then deduplicates and aggregates the results.
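
For illustration, the parsing step consumes model output shaped like an array of class/text pairs (this shape mirrors the examples.json format shown later; the exact wire format is an internal detail of the library):

[
  {"extraction_class": "person", "extraction_text": "Alice Smith"},
  {"extraction_class": "age", "extraction_text": "25"}
]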

Key Features

  • High-performance async processing with configurable concurrency via buffer_unordered
  • Multiple provider support — OpenAI, Ollama, and custom HTTP APIs
  • Character-level alignment — exact match first, with a fuzzy word-overlap fallback (see the sketch after this list)
  • Validation and type coercion — schema validation, raw data preservation, automatic type detection
  • Visualization — export to interactive HTML, Markdown, JSON, and CSV
  • Multi-pass extraction — improved recall through targeted reprocessing of low-yield chunks
  • Semantic chunking — intelligent text splitting via semchunk-rs with sentence boundary awareness
  • Memory efficient — zero-copy document sharing via Arc, pre-computed tokenization
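
A minimal sketch of the exact-then-fuzzy alignment idea from the feature list (illustrative only: the function shape, window scoring, and the 0.6 overlap threshold are assumptions for this sketch, not the internals of alignment.rs):

// Sketch of exact-then-fuzzy alignment; offsets are byte offsets
// (identical to character offsets for ASCII text).
fn align(source: &str, extracted: &str) -> Option<(usize, usize)> {
    // Pass 1: an exact substring match gives precise offsets.
    if let Some(start) = source.find(extracted) {
        return Some((start, start + extracted.len()));
    }

    // Pass 2: fuzzy fallback. Record each source word with its offset...
    let target: Vec<&str> = extracted.split_whitespace().collect();
    if target.is_empty() {
        return None;
    }
    let mut words: Vec<(usize, &str)> = Vec::new();
    let mut cursor = 0;
    for w in source.split_whitespace() {
        let pos = cursor + source[cursor..].find(w)?;
        words.push((pos, w));
        cursor = pos + w.len();
    }

    // ...then score every window of the same word count by word overlap.
    let n = target.len();
    let mut best: Option<(usize, usize, f64)> = None;
    for win in words.windows(n) {
        let hits = win.iter().filter(|(_, w)| target.contains(w)).count();
        let score = hits as f64 / n as f64;
        if best.map_or(true, |(_, _, s)| score > s) {
            let end = win[n - 1].0 + win[n - 1].1.len();
            best = Some((win[0].0, end, score));
        }
    }
    // Accept the fuzzy span only when most of its words overlap.
    best.filter(|&(_, _, score)| score >= 0.6)
        .map(|(start, end, _)| (start, end))
}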

Quick Start

CLI Installation

From source (requires Rust):

cargo install langextract-rust --features cli

From repository:

git clone https://github.com/modularflow/langextract-rust
cd langextract-rust
cargo install --path . --features cli

CLI Usage

# Initialize configuration
lx-rs init --provider ollama

# Extract from text
lx-rs extract "John Doe is 30 years old" --prompt "Extract names and ages" --provider ollama

# Process files with HTML export
lx-rs extract document.txt --examples examples.json --export html --provider ollama

# Test provider connectivity
lx-rs test --provider ollama

# List available providers
lx-rs providers

Library Usage

Add to your Cargo.toml:

[dependencies]
langextract-rust = "0.5"

Basic example:

use langextract_rust::{
    extract, ExtractConfig, FormatType,
    ExampleData, Extraction,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let examples = vec![
        ExampleData::new(
            "John Doe is 30 years old and works as a doctor".to_string(),
            vec![
                Extraction::new("person".to_string(), "John Doe".to_string()),
                Extraction::new("age".to_string(), "30".to_string()),
                Extraction::new("profession".to_string(), "doctor".to_string()),
            ],
        )
    ];

    let config = ExtractConfig {
        model_id: "mistral".to_string(),
        model_url: Some("http://localhost:11434".to_string()),
        temperature: 0.3,
        max_char_buffer: 8000,
        max_workers: 6,
        ..Default::default()
    };

    let result = extract(
        "Alice Smith is 25 years old and works as a doctor. Bob Johnson is 35 and is an engineer.",
        Some("Extract person names, ages, and professions from the text"),
        &examples,
        config,
    ).await?;

    println!("Extracted {} items", result.extraction_count());

    if let Some(extractions) = &result.extractions {
        for e in extractions {
            println!("  [{}] '{}' at {:?}",
                e.extraction_class,
                e.extraction_text,
                e.char_interval,
            );
        }
    }

    Ok(())
}
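
The char_interval on each extraction records where the aligned value sits in the original input; this source grounding is what enables the verification and interactive highlighting described above.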

CLI Reference

Extract Command

# From file with options
lx-rs extract document.txt \
  --examples patterns.json \
  --provider openai \
  --model gpt-4o \
  --max-chars 12000 \
  --workers 10 \
  --batch-size 6 \
  --temperature 0.1 \
  --multipass \
  --passes 3 \
  --export html \
  --show-intervals \
  --verbose

# From URL
lx-rs extract "https://example.com/article.html" \
  --prompt "Extract key facts" \
  --provider openai

Configuration Commands

lx-rs init --provider ollama          # Initialize config
lx-rs init --provider openai --force  # Overwrite existing
lx-rs test --provider ollama          # Test connectivity
lx-rs providers                       # List providers

Configuration Files

examples.json

[
  {
    "text": "Dr. Sarah Johnson works at Mayo Clinic in Rochester, MN",
    "extractions": [
      {"extraction_class": "person", "extraction_text": "Dr. Sarah Johnson"},
      {"extraction_class": "organization", "extraction_text": "Mayo Clinic"},
      {"extraction_class": "location", "extraction_text": "Rochester, MN"}
    ]
  }
]
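
For library use, the same file format can be loaded with serde. A sketch, assuming serde and serde_json as dependencies and using the mirror structs below (the crate may expose its own deserialization; only ExampleData::new and Extraction::new are taken from the examples above):

use serde::Deserialize;
use langextract_rust::{ExampleData, Extraction};

// Mirror structs for the examples.json format above (illustrative).
#[derive(Deserialize)]
struct ExampleFile {
    text: String,
    extractions: Vec<ExtractionEntry>,
}

#[derive(Deserialize)]
struct ExtractionEntry {
    extraction_class: String,
    extraction_text: String,
}

fn load_examples(path: &str) -> Result<Vec<ExampleData>, Box<dyn std::error::Error>> {
    let raw: Vec<ExampleFile> = serde_json::from_str(&std::fs::read_to_string(path)?)?;
    Ok(raw
        .into_iter()
        .map(|ex| {
            let extractions = ex
                .extractions
                .into_iter()
                .map(|e| Extraction::new(e.extraction_class, e.extraction_text))
                .collect();
            ExampleData::new(ex.text, extractions)
        })
        .collect())
}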

.env

OPENAI_API_KEY=your_openai_key_here
OLLAMA_BASE_URL=http://localhost:11434

Supported Providers

Provider  Models                              Notes
OpenAI    gpt-4o, gpt-4o-mini, gpt-3.5-turbo  Via async-openai, feature-gated (--features openai)
Ollama    mistral, llama2, codellama, qwen    Local inference via HTTP to /api/generate
Custom    Any OpenAI-compatible API           For vLLM, LiteLLM, and other compatible endpoints
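
vLLM and LiteLLM both expose the OpenAI wire protocol (vLLM serves it at http://localhost:8000/v1 by default), so using them typically amounts to pointing the base URL, e.g. model_url, at that endpoint.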

Provider Setup

# OpenAI
export OPENAI_API_KEY="your-key-here"

# Ollama (local)
ollama serve
ollama pull mistral

Configuration

The ExtractConfig struct controls extraction behavior:

let config = ExtractConfig {
    model_id: "mistral".to_string(),
    temperature: 0.3,              // Lower = more consistent output
    max_char_buffer: 8000,         // Characters per chunk
    batch_length: 6,               // Chunks per batch
    max_workers: 8,                // Concurrent workers
    enable_multipass: false,       // Enable multi-pass extraction
    multipass_max_passes: 2,       // Max passes when multipass is enabled (default: 2)
    debug: false,                  // Debug output and raw file saving
    ..Default::default()
};

Tuning Guidelines

  • max_workers: 6-12 for parallel throughput
  • batch_length: 4-8 for optimal batching
  • max_char_buffer: 6000-12000 characters per chunk
  • temperature: 0.1-0.3 for consistent extraction
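
For a sense of scale: a 96,000-character document with max_char_buffer = 8000 splits into roughly 12 chunks (semantic chunking shifts the exact boundaries); batch_length = 6 groups those into two batches, and max_workers caps how many chunk requests are in flight concurrently.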

Advanced Features

Validation and Type Coercion

use langextract_rust::ValidationConfig;

let validation_config = ValidationConfig {
    enable_schema_validation: true,
    enable_type_coercion: true,    // "$1,234" -> 1234.0, "95%" -> 0.95
    require_all_fields: false,
    save_raw_outputs: false,
    ..Default::default()
};

Supported coercion types: integers, floats, booleans, currencies, percentages, emails, phone numbers, dates, URLs.
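
As a rough sketch of what two of these coercions involve (illustrative helper functions, not the crate's resolver code):

// Illustrative coercion helpers; the real resolver handles many more formats.
fn coerce_currency(raw: &str) -> Option<f64> {
    // "$1,234.50" -> 1234.5: strip the symbol and thousands separators.
    raw.trim()
        .trim_start_matches(&['$', '€', '£'])
        .replace(',', "")
        .parse()
        .ok()
}

fn coerce_percentage(raw: &str) -> Option<f64> {
    // "95%" -> 0.95: parse the number and scale into the unit interval.
    raw.trim()
        .strip_suffix('%')?
        .trim()
        .parse::<f64>()
        .ok()
        .map(|v| v / 100.0)
}

// Matching the examples above:
//   coerce_currency("$1,234")  == Some(1234.0)
//   coerce_percentage("95%")   == Some(0.95)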

Visualization

use langextract_rust::visualization::{export_document, ExportConfig, ExportFormat};

let config = ExportConfig {
    format: ExportFormat::Html,
    title: Some("Analysis".to_string()),
    highlight_extractions: true,
    show_char_intervals: true,
    ..Default::default()
};

let html = export_document(&annotated_doc, &config)?;
std::fs::write("analysis.html", html)?;

Provider Configuration

use langextract_rust::providers::ProviderConfig;

let openai = ProviderConfig::openai("gpt-4o-mini", Some(api_key));
let ollama = ProviderConfig::ollama("mistral", Some("http://localhost:11434".to_string()));

Error Handling

use langextract_rust::LangExtractError;

match extract(/* ... */).await {
    Ok(result) => println!("{} extractions", result.extraction_count()),
    Err(LangExtractError::ConfigurationError(msg)) => {
        eprintln!("Configuration: {}", msg);
    }
    Err(LangExtractError::InferenceError { message, provider, .. }) => {
        eprintln!("Inference ({}): {}", provider.unwrap_or("unknown".into()), message);
    }
    Err(LangExtractError::NetworkError(e)) => {
        eprintln!("Network: {}", e);
    }
    Err(e) => eprintln!("Error: {}", e),
}
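
When you don't need to branch on specific variants, LangExtractError also propagates cleanly with the ? operator into a Box<dyn std::error::Error> return type, as the Quick Start example does.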

Architecture

Text Input -> extract() -> Prompting -> Chunking -> LLM Inference -> Parsing -> Alignment -> Aggregation -> Result

Key modules:

  • annotation.rs — orchestrates the chunk-infer-parse-align loop
  • chunking.rs — semantic and token-based text splitting
  • alignment.rs — exact + fuzzy character offset mapping
  • resolver.rs — JSON parsing, repair, and type coercion
  • multipass.rs — multi-pass extraction with quality scoring
  • pipeline.rs — multi-step extraction with dependency resolution
  • visualization.rs — HTML, Markdown, CSV, JSON export

See SPEC.md for the complete technical specification.

Testing

cargo test --lib

Documentation

  • SPEC.md — Technical specification, architecture, known issues, and fix priorities

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Acknowledgments

This is a Rust port of Google's langextract Python library.

@misc{langextract,
  title={langextract},
  author={Google Research Team},
  year={2024},
  publisher={GitHub},
  url={https://github.com/google/langextract}
}
