2 unstable releases

new 0.6.0 Mar 10, 2026
0.5.1 Mar 6, 2026

MIT/Apache

langmail

Email preprocessing for LLMs. Fast, typed, Rust-powered.

Emails are messy — nested MIME parts, quoted reply chains, HTML cruft, signatures, forwarded headers. LLMs don't need any of that. langmail strips it all away and gives you clean, structured text optimized for language model consumption.

import { preprocessString } from "langmail";

const result = preprocessString(rawEmail);

console.log(result.body);
// → "Hi Alice! Great to hear from you."
// (no quoted replies, no signature, no HTML noise)

console.log(result.from);
// → { name: "Bob", email: "bob@example.com" }

Why langmail?

  • Built for LLMs — minimizes token waste by stripping quoted replies, signatures, and HTML noise
  • Fast — Rust core with zero-copy parsing via mail-parser
  • Typed — full TypeScript definitions, every field documented
  • Multilingual — detects quote patterns in English, German, French, and Spanish
  • One function: preprocess() does everything; options are available when you need them

Install

npm install langmail

Requires Node.js 18 or later.

Prebuilt native binaries are provided for Linux (x64, arm64), macOS (x64, arm64), and Windows (x64), so no Rust toolchain is needed.

Usage

Basic

import { readFileSync } from "fs";
import { preprocess, preprocessString } from "langmail";

// From a raw .eml file (read as a Buffer)
const raw = readFileSync("message.eml");
const result = preprocess(raw);

// Or from a string (e.g. the raw message decoded from a Gmail API response)
const rawEmailString = raw.toString("utf8");
const resultFromString = preprocessString(rawEmailString);

With options

import { preprocessWithOptions } from "langmail";

const result = preprocessWithOptions(raw, {
  stripQuotes: true, // Remove quoted replies (default: true)
  stripSignature: true, // Remove email signatures (default: true)
  maxBodyLength: 4000, // Truncate body to N chars (default: 0 = no limit)
});
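
If you size prompts in tokens rather than characters, you can derive maxBodyLength from a token budget. The sketch below is only an illustration: the optionsForTokenBudget helper and the PreprocessOptionsSketch type are hypothetical names, and the 4 characters per token figure is a rough rule of thumb for English text, not something langmail defines.

```typescript
// Hypothetical local mirror of the three options shown above; not langmail's exported type.
interface PreprocessOptionsSketch {
  stripQuotes?: boolean;
  stripSignature?: boolean;
  maxBodyLength?: number;
}

// Hypothetical helper: converts a token budget into a character limit.
// charsPerToken = 4 is an assumed heuristic; tune it for your tokenizer and language.
function optionsForTokenBudget(maxTokens: number, charsPerToken = 4): PreprocessOptionsSketch {
  if (maxTokens <= 0) {
    // A limit of 0 means "no limit" per the option's documentation, so reject it explicitly.
    throw new RangeError("maxTokens must be positive");
  }
  return {
    stripQuotes: true,
    stripSignature: true,
    maxBodyLength: Math.floor(maxTokens * charsPerToken),
  };
}

const budgetOptions = optionsForTokenBudget(1000);
console.log(budgetOptions.maxBodyLength);
// → 4000
```

The resulting object can then be passed as the second argument to preprocessWithOptions.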

Format for LLM prompts

toLlmContext converts a ProcessedEmail into a compact, deterministic plain-text block ready to paste into an LLM prompt:

import { preprocess, toLlmContext } from "langmail";

const result = preprocess(raw);
console.log(toLlmContext(result));
// FROM: Bob <bob@example.com>
// TO: Alice <alice@example.com>
// SUBJECT: Re: Project update
// DATE: 2024-01-15T10:30:00Z
// CONTENT:
// Hi Alice! Great to hear from you.

Missing fields (no from, empty to, etc.) are simply omitted. The CONTENT: line is always present.
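
To make the layout concrete, here is a minimal sketch that reproduces the documented format by hand. It is not langmail's implementation: formatForPrompt, formatAddress, and the local types are hypothetical, and two details are assumptions (multiple recipients joined with ", ", and an address without a name rendered as the bare email).

```typescript
// Local copies of the fields this sketch needs, based on the documented ProcessedEmail shape.
interface EmailAddress {
  name?: string;
  email: string;
}

interface EmailLike {
  body: string;
  subject?: string;
  from?: EmailAddress;
  to: EmailAddress[];
  date?: string;
}

// Hypothetical helper: "Name <email>", or just the email when no name is set (assumption).
function formatAddress(addr: EmailAddress): string {
  return addr.name ? `${addr.name} <${addr.email}>` : addr.email;
}

// Hypothetical re-creation of the documented layout; missing fields are skipped.
function formatForPrompt(email: EmailLike): string {
  const lines: string[] = [];
  if (email.from) lines.push(`FROM: ${formatAddress(email.from)}`);
  if (email.to.length > 0) lines.push(`TO: ${email.to.map(formatAddress).join(", ")}`);
  if (email.subject) lines.push(`SUBJECT: ${email.subject}`);
  if (email.date) lines.push(`DATE: ${email.date}`);
  lines.push("CONTENT:"); // always present, even when the body is empty
  lines.push(email.body);
  return lines.join("\n");
}

const example = formatForPrompt({
  body: "Hi Alice! Great to hear from you.",
  subject: "Re: Project update",
  from: { name: "Bob", email: "bob@example.com" },
  to: [{ name: "Alice", email: "alice@example.com" }],
  date: "2024-01-15T10:30:00Z",
});
console.log(example);
```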

Output structure

interface ProcessedEmail {
  body: string; // Clean text, ready for your LLM
  subject?: string;
  from?: { name?: string; email: string };
  to: { name?: string; email: string }[];
  cc: { name?: string; email: string }[];
  date?: string; // ISO 8601
  rfcMessageId?: string; // RFC 5322 Message-ID header
  inReplyTo?: string[]; // Threading
  references?: string[]; // Threading
  signature?: string; // Extracted signature (if found)
  rawBodyLength: number; // Before cleaning
  cleanBodyLength: number; // After cleaning
}
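
The two length fields make it easy to see how much text was removed. A small sketch, assuming both fields count the same unit; reductionRatio and LengthStats are hypothetical names, not part of langmail:

```typescript
// Only the two length fields from the documented ProcessedEmail shape are needed here.
interface LengthStats {
  rawBodyLength: number;
  cleanBodyLength: number;
}

// Hypothetical helper: fraction of the original body removed by cleaning (0 to 1).
function reductionRatio(stats: LengthStats): number {
  if (stats.rawBodyLength <= 0) return 0; // avoid dividing by zero on empty bodies
  return 1 - stats.cleanBodyLength / stats.rawBodyLength;
}

const ratio = reductionRatio({ rawBodyLength: 400, cleanBodyLength: 100 });
console.log(`${(ratio * 100).toFixed(0)}% of the body was removed`);
// → "75% of the body was removed"
```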

Error handling

preprocess, preprocessWithOptions, and preprocessString throw if the input cannot be parsed as a valid RFC 5322 message:

try {
  const result = preprocess(raw);
} catch (err) {
  // err.message === "Failed to parse email message"
}

toLlmContext never throws.
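
When processing a batch, you may prefer a result value over an exception so that one bad message does not abort the loop. The wrapper below is a sketch of that pattern. tryParse and ParseResult are hypothetical names, and fakeParse is a stand-in so the example runs without langmail installed; in real code you would pass preprocess or preprocessString instead.

```typescript
// Result type for callers that prefer values over exceptions.
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

// Hypothetical wrapper: `parse` stands in for preprocess, preprocessString, etc.
function tryParse<I, T>(parse: (input: I) => T, input: I): ParseResult<T> {
  try {
    return { ok: true, value: parse(input) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}

// Stand-in parser (assumption): throws on empty input, like a failed parse would.
function fakeParse(input: string): { body: string } {
  if (input.trim() === "") throw new Error("Failed to parse email message");
  return { body: input };
}

const good = tryParse(fakeParse, "Hello");
const bad = tryParse(fakeParse, "");
console.log(good.ok, bad.ok);
// → true false
```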

What it does

| Step | Before | After |
| --- | --- | --- |
| MIME parsing | Raw RFC 5322 bytes | Structured parts |
| HTML → text | `<p>Hello <b>world</b></p>` | `Hello world` |
| Quote stripping | Gmail/Outlook/Apple Mail quoted replies | Just the new message |
| Signature removal | `-- \nJohn Doe\nCEO, Acme Corp\n555-0123` | Body without signature |
| Whitespace cleanup | Excessive blank lines, trailing spaces | Clean, normalized text |
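
The last two steps are easy to picture in isolation. The sketch below is a simplified TypeScript illustration, not langmail's Rust implementation: splitSignature and normalizeWhitespace are hypothetical names, only the conventional "-- " delimiter line is recognized, and LF line endings are assumed.

```typescript
// Splits a body at the conventional "-- " signature delimiter line.
// Also accepts "--" because the trailing space is often stripped in transit.
function splitSignature(body: string): { body: string; signature?: string } {
  const lines = body.split("\n");
  const idx = lines.findIndex((line) => line === "-- " || line === "--");
  if (idx === -1) return { body };
  return {
    body: lines.slice(0, idx).join("\n"),
    signature: lines.slice(idx + 1).join("\n"),
  };
}

// Trims trailing spaces and collapses runs of blank lines into a single blank line.
function normalizeWhitespace(text: string): string {
  return text
    .split("\n")
    .map((line) => line.trimEnd())
    .join("\n")
    .replace(/\n{3,}/g, "\n\n")
    .trim();
}

const input = "Hi Alice!   \n\n\n\nSee you soon.\n-- \nJohn Doe\nCEO, Acme Corp\n555-0123";
const parts = splitSignature(input);
const cleaned = normalizeWhitespace(parts.body);
console.log(JSON.stringify(cleaned));
// → "Hi Alice!\n\nSee you soon."
```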

Supported quote patterns

  • Gmail: On <date>, <name> <email> wrote:
  • Outlook: -----Original Message----- and From: ... Sent: ...
  • Apple Mail: On <date>, at <time>, <name> wrote:
  • Forwarded: -------- Forwarded Message --------
  • German: Am <date> schrieb <name>:
  • French: Le <date>, <name> a écrit :
  • Spanish: El <date>, <name> escribió:
  • Generic: > prefixed quote lines
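
To show how header patterns like these can be detected, here is a rough sketch using regular expressions. The expressions are illustrative approximations written for this example and are much looser than a production matcher would be; langmail's actual matching happens in its Rust core and may differ. QUOTE_HEADER_PATTERNS, findQuoteStart, and keepNewText are hypothetical names.

```typescript
// Illustrative patterns only (assumptions), one per documented style.
const QUOTE_HEADER_PATTERNS: RegExp[] = [
  /^On .+ wrote:$/, // Gmail and Apple Mail (English)
  /^-----\s*Original Message\s*-----$/, // Outlook
  /^-+\s*Forwarded Message\s*-+$/, // Forwarded
  /^Am .+ schrieb .+:$/, // German
  /^Le .+ a écrit\s*:$/, // French (space before the colon)
  /^El .+ escribió:$/, // Spanish
  /^>/, // Generic quoted line
];

// Returns the index of the first line that looks like a quote header, or -1.
function findQuoteStart(body: string): number {
  const lines = body.split(/\r?\n/);
  return lines.findIndex((line) =>
    QUOTE_HEADER_PATTERNS.some((pattern) => pattern.test(line.trim()))
  );
}

// Keeps only the text that precedes the first detected quote header.
function keepNewText(body: string): string {
  const lines = body.split(/\r?\n/);
  const start = findQuoteStart(body);
  return (start === -1 ? lines : lines.slice(0, start)).join("\n").trimEnd();
}

const reply =
  "Sounds good.\n\nOn Mon, Jan 15, 2024, Bob <bob@example.com> wrote:\n> Can we meet?";
console.log(keepNewText(reply));
// → "Sounds good."
```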

Performance

langmail uses mail-parser under the hood — a zero-copy Rust MIME parser with no external dependencies. The preprocessing pipeline adds minimal overhead on top of the parse step.

Typical throughput on a modern machine: 10,000+ emails/second for plain text messages.
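
Throughput depends heavily on message size and hardware, so it is worth measuring on your own corpus. The helper below is a minimal timing sketch, not a rigorous benchmark. measureThroughput is a hypothetical name and the stand-in function only trims strings; pass preprocessString (or preprocess with Buffers) to measure langmail itself.

```typescript
// Hypothetical benchmark helper: `fn` stands in for preprocess or preprocessString.
// Returns processed items per second across all rounds.
function measureThroughput<I>(fn: (input: I) => unknown, inputs: I[], rounds = 5): number {
  const start = Date.now();
  for (let r = 0; r < rounds; r++) {
    for (const input of inputs) fn(input);
  }
  // Date.now() has millisecond resolution, so guard against a zero-length interval.
  const elapsedMs = Math.max(Date.now() - start, 1);
  return (inputs.length * rounds * 1000) / elapsedMs;
}

// Stand-in workload (assumption) so the sketch runs without langmail installed.
const samples: string[] = Array.from({ length: 1000 }, (_, i) => `  message ${i}  `);
const perSecond = measureThroughput((s: string) => s.trim().length, samples);
console.log(`${Math.round(perSecond)} items/second`);
```

For stable numbers, use a larger corpus and discard the first round so that warm-up does not skew the result.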

License

MIT OR Apache-2.0


Built by the team behind Marbles. If you need the full pipeline — email ingestion, AI classification, routing, and response generation — check us out.
