LangExtract (Rust)

A Rust library for extracting structured, source-grounded information from unstructured text using LLMs. Every extraction is mapped back to exact character offsets in the original document, enabling verification and interactive highlighting.

The core workflow: you provide a few examples of what to extract; the library builds a few-shot prompt, chunks large documents, sends the chunks to an LLM in parallel, parses the structured JSON responses, aligns each extracted value back to its character position in the source text, then deduplicates and aggregates the results.
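
For illustration, the parsing step consumes model output shaped like an array of class/text pairs (this shape mirrors the examples.json format shown later; the exact wire format is an internal detail of the library):

[
  {"extraction_class": "person", "extraction_text": "Alice Smith"},
  {"extraction_class": "age", "extraction_text": "25"}
]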

Key Features

  • High-performance async processing with configurable concurrency via buffer_unordered
  • Multiple provider support — OpenAI, Ollama, and custom HTTP APIs
  • Character-level alignment — exact match first, with a fuzzy word-overlap fallback (see the sketch after this list)
  • Validation and type coercion — schema validation, raw data preservation, automatic type detection
  • Visualization — export to interactive HTML, Markdown, JSON, and CSV
  • Multi-pass extraction — improved recall through targeted reprocessing of low-yield chunks
  • Semantic chunking — intelligent text splitting via semchunk-rs with sentence boundary awareness
  • Memory efficient — zero-copy document sharing via Arc, pre-computed tokenization
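
A minimal sketch of the exact-then-fuzzy alignment idea from the feature list (illustrative only: the function shape, window scoring, and the 0.6 overlap threshold are assumptions for this sketch, not the internals of alignment.rs):

// Sketch of exact-then-fuzzy alignment; offsets are byte offsets
// (identical to character offsets for ASCII text).
fn align(source: &str, extracted: &str) -> Option<(usize, usize)> {
    // Pass 1: an exact substring match gives precise offsets.
    if let Some(start) = source.find(extracted) {
        return Some((start, start + extracted.len()));
    }

    // Pass 2: fuzzy fallback. Record each source word with its offset...
    let target: Vec<&str> = extracted.split_whitespace().collect();
    if target.is_empty() {
        return None;
    }
    let mut words: Vec<(usize, &str)> = Vec::new();
    let mut cursor = 0;
    for w in source.split_whitespace() {
        let pos = cursor + source[cursor..].find(w)?;
        words.push((pos, w));
        cursor = pos + w.len();
    }

    // ...then score every window of the same word count by word overlap.
    let n = target.len();
    let mut best: Option<(usize, usize, f64)> = None;
    for win in words.windows(n) {
        let hits = win.iter().filter(|(_, w)| target.contains(w)).count();
        let score = hits as f64 / n as f64;
        if best.map_or(true, |(_, _, s)| score > s) {
            let end = win[n - 1].0 + win[n - 1].1.len();
            best = Some((win[0].0, end, score));
        }
    }
    // Accept the fuzzy span only when most of its words overlap.
    best.filter(|&(_, _, score)| score >= 0.6)
        .map(|(start, end, _)| (start, end))
}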

Quick Start

CLI Installation

From source (requires Rust):

cargo install langextract-rust --features cli

From repository:

git clone https://github.com/modularflow/langextract-rust
cd langextract-rust
cargo install --path . --features cli

CLI Usage

# Initialize configuration
lx-rs init --provider ollama

# Extract from text
lx-rs extract "John Doe is 30 years old" --prompt "Extract names and ages" --provider ollama

# Process files with HTML export
lx-rs extract document.txt --examples examples.json --export html --provider ollama

# Test provider connectivity
lx-rs test --provider ollama

# List available providers
lx-rs providers

Library Usage

Add to your Cargo.toml:

[dependencies]
langextract-rust = "0.5"

Basic example:

use langextract_rust::{
    extract, ExtractConfig, FormatType,
    ExampleData, Extraction,
};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let examples = vec![
        ExampleData::new(
            "John Doe is 30 years old and works as a doctor".to_string(),
            vec![
                Extraction::new("person".to_string(), "John Doe".to_string()),
                Extraction::new("age".to_string(), "30".to_string()),
                Extraction::new("profession".to_string(), "doctor".to_string()),
            ],
        )
    ];

    let config = ExtractConfig {
        model_id: "mistral".to_string(),
        model_url: Some("http://localhost:11434".to_string()),
        temperature: 0.3,
        max_char_buffer: 8000,
        max_workers: 6,
        ..Default::default()
    };

    let result = extract(
        "Alice Smith is 25 years old and works as a doctor. Bob Johnson is 35 and is an engineer.",
        Some("Extract person names, ages, and professions from the text"),
        &examples,
        config,
    ).await?;

    println!("Extracted {} items", result.extraction_count());

    if let Some(extractions) = &result.extractions {
        for e in extractions {
            println!("  [{}] '{}' at {:?}",
                e.extraction_class,
                e.extraction_text,
                e.char_interval,
            );
        }
    }

    Ok(())
}
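
The char_interval on each extraction records where the aligned value sits in the original input; this source grounding is what enables the verification and interactive highlighting described above.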

CLI Reference

Extract Command

# From file with options
lx-rs extract document.txt \
  --examples patterns.json \
  --provider openai \
  --model gpt-4o \
  --max-chars 12000 \
  --workers 10 \
  --batch-size 6 \
  --temperature 0.1 \
  --multipass \
  --passes 3 \
  --export html \
  --show-intervals \
  --verbose

# From URL
lx-rs extract "https://example.com/article.html" \
  --prompt "Extract key facts" \
  --provider openai

Configuration Commands

lx-rs init --provider ollama          # Initialize config
lx-rs init --provider openai --force  # Overwrite existing
lx-rs test --provider ollama          # Test connectivity
lx-rs providers                       # List providers

Configuration Files

examples.json

[
  {
    "text": "Dr. Sarah Johnson works at Mayo Clinic in Rochester, MN",
    "extractions": [
      {"extraction_class": "person", "extraction_text": "Dr. Sarah Johnson"},
      {"extraction_class": "organization", "extraction_text": "Mayo Clinic"},
      {"extraction_class": "location", "extraction_text": "Rochester, MN"}
    ]
  }
]
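
For library use, the same file format can be loaded with serde. A sketch, assuming serde and serde_json as dependencies and using the mirror structs below (the crate may expose its own deserialization; only ExampleData::new and Extraction::new are taken from the examples above):

use serde::Deserialize;
use langextract_rust::{ExampleData, Extraction};

// Mirror structs for the examples.json format above (illustrative).
#[derive(Deserialize)]
struct ExampleFile {
    text: String,
    extractions: Vec<ExtractionEntry>,
}

#[derive(Deserialize)]
struct ExtractionEntry {
    extraction_class: String,
    extraction_text: String,
}

fn load_examples(path: &str) -> Result<Vec<ExampleData>, Box<dyn std::error::Error>> {
    let raw: Vec<ExampleFile> = serde_json::from_str(&std::fs::read_to_string(path)?)?;
    Ok(raw
        .into_iter()
        .map(|ex| {
            let extractions = ex
                .extractions
                .into_iter()
                .map(|e| Extraction::new(e.extraction_class, e.extraction_text))
                .collect();
            ExampleData::new(ex.text, extractions)
        })
        .collect())
}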

.env

OPENAI_API_KEY=your_openai_key_here
OLLAMA_BASE_URL=http://localhost:11434

Supported Providers

Provider  Models                              Notes
OpenAI    gpt-4o, gpt-4o-mini, gpt-3.5-turbo  Via async-openai, feature-gated (--features openai)
Ollama    mistral, llama2, codellama, qwen    Local inference via HTTP to /api/generate
Custom    Any OpenAI-compatible API           For vLLM, LiteLLM, and other compatible endpoints
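
vLLM and LiteLLM both expose the OpenAI wire protocol (vLLM serves it at http://localhost:8000/v1 by default), so using them typically amounts to pointing the base URL, e.g. model_url, at that endpoint.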

Provider Setup

# OpenAI
export OPENAI_API_KEY="your-key-here"

# Ollama (local)
ollama serve
ollama pull mistral

Configuration

The ExtractConfig struct controls extraction behavior:

let config = ExtractConfig {
    model_id: "mistral".to_string(),
    temperature: 0.3,              // Lower = more consistent output
    max_char_buffer: 8000,         // Characters per chunk
    batch_length: 6,               // Chunks per batch
    max_workers: 8,                // Concurrent workers
    enable_multipass: false,       // Enable multi-pass extraction
    multipass_max_passes: 2,       // Max passes when multipass is enabled (default: 2)
    debug: false,                  // Debug output and raw file saving
    ..Default::default()
};

Tuning Guidelines

  • max_workers: 6-12 for parallel throughput
  • batch_length: 4-8 for optimal batching
  • max_char_buffer: 6000-12000 characters per chunk
  • temperature: 0.1-0.3 for consistent extraction
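
For a sense of scale: a 96,000-character document with max_char_buffer = 8000 splits into roughly 12 chunks (semantic chunking shifts the exact boundaries); batch_length = 6 groups those into two batches, and max_workers caps how many chunk requests are in flight concurrently.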

Advanced Features

Validation and Type Coercion

use langextract_rust::ValidationConfig;

let validation_config = ValidationConfig {
    enable_schema_validation: true,
    enable_type_coercion: true,    // "$1,234" -> 1234.0, "95%" -> 0.95
    require_all_fields: false,
    save_raw_outputs: false,
    ..Default::default()
};

Supported coercion types: integers, floats, booleans, currencies, percentages, emails, phone numbers, dates, URLs.
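
As a rough sketch of what two of these coercions involve (illustrative helper functions, not the crate's resolver code):

// Illustrative coercion helpers; the real resolver handles many more formats.
fn coerce_currency(raw: &str) -> Option<f64> {
    // "$1,234.50" -> 1234.5: strip the symbol and thousands separators.
    raw.trim()
        .trim_start_matches(&['$', '€', '£'])
        .replace(',', "")
        .parse()
        .ok()
}

fn coerce_percentage(raw: &str) -> Option<f64> {
    // "95%" -> 0.95: parse the number and scale into the unit interval.
    raw.trim()
        .strip_suffix('%')?
        .trim()
        .parse::<f64>()
        .ok()
        .map(|v| v / 100.0)
}

// Matching the examples above:
//   coerce_currency("$1,234")  == Some(1234.0)
//   coerce_percentage("95%")   == Some(0.95)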

Visualization

use langextract_rust::visualization::{export_document, ExportConfig, ExportFormat};

let config = ExportConfig {
    format: ExportFormat::Html,
    title: Some("Analysis".to_string()),
    highlight_extractions: true,
    show_char_intervals: true,
    ..Default::default()
};

let html = export_document(&annotated_doc, &config)?;
std::fs::write("analysis.html", html)?;

Provider Configuration

use langextract_rust::providers::ProviderConfig;

let openai = ProviderConfig::openai("gpt-4o-mini", Some(api_key));
let ollama = ProviderConfig::ollama("mistral", Some("http://localhost:11434".to_string()));

Error Handling

use langextract_rust::LangExtractError;

match extract(/* ... */).await {
    Ok(result) => println!("{} extractions", result.extraction_count()),
    Err(LangExtractError::ConfigurationError(msg)) => {
        eprintln!("Configuration: {}", msg);
    }
    Err(LangExtractError::InferenceError { message, provider, .. }) => {
        eprintln!("Inference ({}): {}", provider.unwrap_or("unknown".into()), message);
    }
    Err(LangExtractError::NetworkError(e)) => {
        eprintln!("Network: {}", e);
    }
    Err(e) => eprintln!("Error: {}", e),
}
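
When you don't need to branch on specific variants, LangExtractError also propagates cleanly with the ? operator into a Box<dyn std::error::Error> return type, as the Quick Start example does.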

Architecture

Text Input -> extract() -> Prompting -> Chunking -> LLM Inference -> Parsing -> Alignment -> Aggregation -> Result

Key modules:

  • annotation.rs — orchestrates the chunk-infer-parse-align loop
  • chunking.rs — semantic and token-based text splitting
  • alignment.rs — exact + fuzzy character offset mapping
  • resolver.rs — JSON parsing, repair, and type coercion
  • multipass.rs — multi-pass extraction with quality scoring
  • pipeline.rs — multi-step extraction with dependency resolution
  • visualization.rs — HTML, Markdown, CSV, JSON export

See SPEC.md for the complete technical specification.

Testing

cargo test --lib

Documentation

  • SPEC.md — Technical specification, architecture, known issues, and fix priorities

License

Licensed under the Apache License, Version 2.0. See LICENSE for details.

Acknowledgments

This is a Rust port of Google's langextract Python library.

@misc{langextract,
  title={langextract},
  author={Google Research Team},
  year={2024},
  publisher={GitHub},
  url={https://github.com/google/langextract}
}
