langmail

Email preprocessing for LLMs. Fast, typed, Rust-powered.

Emails are messy — nested MIME parts, quoted reply chains, HTML cruft, signatures, forwarded headers. LLMs don't need any of that. langmail strips it all away and gives you clean, structured Markdown optimized for language model consumption.

Install

npm install langmail

Requires Node.js 18 or later. Prebuilt native binaries are included — no Rust toolchain needed.

Quick Start

import { preprocess, preprocessString, toLlmContext } from "langmail";
import { readFileSync } from "fs";

// From a raw .eml file
const raw = readFileSync("message.eml");
const email = preprocess(raw);

// Or from a string holding the raw RFC 5322 message
// (e.g. a Gmail API "raw" payload after base64url decoding)
const fromString = preprocessString(rawEmailString);

console.log(email.body);
// → "Hi Alice! Great to hear from you."

console.log(email.from);
// → { name: "Bob", email: "bob@example.com" }

// Format for an LLM prompt
console.log(toLlmContext(email));
// FROM: Bob <bob@example.com>
// TO: Alice <alice@example.com>
// SUBJECT: Re: Project update
// DATE: 2024-01-15T10:30:00Z
// CONTENT:
// Hi Alice! Great to hear from you.

API Reference

preprocess(raw)

Parse and preprocess a raw email from a Buffer.

import { preprocess } from "langmail";
import { readFileSync } from "fs";

const raw = readFileSync("message.eml");
const email = preprocess(raw);

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| raw | Buffer | Raw email bytes (RFC 5322 / EML) |

Returns: ProcessedEmail

Throws: If the input cannot be parsed as a valid RFC 5322 message.
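
Since preprocess throws on unparseable input, a batch pipeline may prefer a result value over an exception. The sketch below shows one way to do that. The names tryParse, ParseResult and fakeParse are hypothetical and not part of langmail; the parser is passed in as a parameter so the example runs on its own.

```typescript
// Hypothetical helper, not part of langmail: wraps any throwing parser
// (such as preprocess) and returns a tagged result instead of throwing.
type ParseResult<T> =
  | { ok: true; value: T }
  | { ok: false; error: string };

function tryParse<I, T>(parse: (input: I) => T, input: I): ParseResult<T> {
  try {
    return { ok: true, value: parse(input) };
  } catch (err) {
    // Normalize whatever was thrown into a readable message.
    const message = err instanceof Error ? err.message : String(err);
    return { ok: false, error: message };
  }
}

// Stand-in parser used only to demonstrate the wrapper's behavior.
function fakeParse(input: string): { body: string } {
  if (input.indexOf(":") === -1) throw new Error("no headers found");
  return { body: input };
}

const good = tryParse(fakeParse, "Subject: hi\r\n\r\nHello");
const bad = tryParse(fakeParse, "not an email");
```

With langmail installed, the same wrapper would presumably be called as tryParse(preprocess, raw), letting a loop over many messages skip the ones that fail to parse.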


preprocessString(raw)

Convenience wrapper that accepts a string instead of a Buffer.

import { preprocessString } from "langmail";

const email = preprocessString(rawEmailString);

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| raw | string | Raw email as string |

Returns: ProcessedEmail

Throws: If the input cannot be parsed as a valid RFC 5322 message.


preprocessWithOptions(raw, options)

Preprocess with custom options to control quote stripping, signature removal, and body length.

import { preprocessWithOptions } from "langmail";

const email = preprocessWithOptions(raw, {
  stripQuotes: true,      // Remove quoted replies (default: true)
  stripSignature: true,   // Remove email signatures (default: true)
  maxBodyLength: 4000,    // Truncate body to N chars (default: 0 = no limit)
});

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| raw | Buffer | Raw email bytes |
| options | PreprocessOptions | Preprocessing options |

Returns: ProcessedEmail

Throws: If the input cannot be parsed as a valid RFC 5322 message.


toLlmContext(email)

Format a ProcessedEmail as a deterministic plain-text block for LLM prompts. Missing fields are omitted; the CONTENT: line is always present.

import { preprocess, toLlmContext } from "langmail";

const email = preprocess(raw);
console.log(toLlmContext(email));
// FROM: Bob <bob@example.com>
// TO: Alice <alice@example.com>
// SUBJECT: Re: Project update
// DATE: 2024-01-15T10:30:00Z
// CONTENT:
// Hi Alice! Great to hear from you.

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| email | ProcessedEmail | A preprocessed email |

Returns: string

Never throws.
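
The layout described above (optional header lines, then a CONTENT: line that is always present) can be pictured with a small standalone sketch. This is an illustrative re-creation, not the library's implementation. The names renderContext and formatAddress are hypothetical, and details such as joining several recipients with a comma or printing a nameless address as the bare email are assumptions.

```typescript
interface Address {
  name?: string;
  email: string;
}

// Only the fields this sketch needs from ProcessedEmail.
interface EmailHeaderView {
  body: string;
  subject?: string;
  from?: Address;
  to: Address[];
  date?: string;
}

// Assumption: an address without a display name is printed as the bare email.
function formatAddress(a: Address): string {
  return a.name ? `${a.name} <${a.email}>` : a.email;
}

// Illustrative only: emits a header line per present field, then CONTENT:.
function renderContext(email: EmailHeaderView): string {
  const lines: string[] = [];
  if (email.from) lines.push(`FROM: ${formatAddress(email.from)}`);
  if (email.to.length > 0) {
    lines.push(`TO: ${email.to.map(formatAddress).join(", ")}`);
  }
  if (email.subject) lines.push(`SUBJECT: ${email.subject}`);
  if (email.date) lines.push(`DATE: ${email.date}`);
  lines.push("CONTENT:"); // always present, even when every header is missing
  lines.push(email.body);
  return lines.join("\n");
}

const rendered = renderContext({
  body: "Hi Alice! Great to hear from you.",
  subject: "Re: Project update",
  from: { name: "Bob", email: "bob@example.com" },
  to: [{ name: "Alice", email: "alice@example.com" }],
  date: "2024-01-15T10:30:00Z",
});
```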


toLlmContextWithOptions(email, options)

Same as toLlmContext but accepts options to control rendering. Use renderMode: "ThreadHistory" to include quoted reply history as a chronological transcript.

import { preprocess, toLlmContextWithOptions } from "langmail";

const email = preprocess(raw);

// Default: only the latest message
console.log(toLlmContextWithOptions(email, { renderMode: "LatestOnly" }));

// Include thread history
console.log(toLlmContextWithOptions(email, { renderMode: "ThreadHistory" }));
// FROM: Bob <bob@example.com>
// SUBJECT: Re: Project update
// CONTENT:
// Hi Alice! Great to hear from you.
//
// THREAD HISTORY (oldest first):
// ---
// FROM: Alice <alice@example.com>
// DATE: 2024-01-14T09:00:00Z
// Alice's original message here...
// ---

Parameters:

| Name | Type | Description |
| --- | --- | --- |
| email | ProcessedEmail | A preprocessed email |
| options | LlmContextOptions | Rendering options |

Returns: string

Never throws.
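
Choosing between the two render modes is left to the caller. One possible policy, sketched below, includes the thread history only when quoted messages exist and the combined text fits a character budget. The function chooseRenderMode, the type names and the budget value are hypothetical and not part of langmail.

```typescript
type RenderModeName = "LatestOnly" | "ThreadHistory";

// Only the fields this sketch needs from ProcessedEmail.
interface ThreadAwareEmail {
  body: string;
  threadMessages: { body: string }[];
}

// Hypothetical policy: fall back to LatestOnly when there is no history
// or when the latest message plus history would exceed maxChars.
function chooseRenderMode(email: ThreadAwareEmail, maxChars: number): RenderModeName {
  if (email.threadMessages.length === 0) return "LatestOnly";
  const historyChars = email.threadMessages.reduce(
    (sum, message) => sum + message.body.length,
    0,
  );
  return email.body.length + historyChars <= maxChars
    ? "ThreadHistory"
    : "LatestOnly";
}

const sampleEmail: ThreadAwareEmail = {
  body: "hi",
  threadMessages: [{ body: "earlier" }],
};
const mode = chooseRenderMode(sampleEmail, 100);
```

JavaScript callers could pass the returned string as renderMode directly. TypeScript callers may need to map it to the RenderMode enum first.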

Output Structure

ProcessedEmail

interface ProcessedEmail {
  body: string;                    // Clean Markdown, ready for your LLM
  subject?: string;
  from?: Address;
  to: Address[];
  cc: Address[];
  date?: string;                   // ISO 8601
  rfcMessageId?: string;           // RFC 5322 Message-ID header
  inReplyTo?: string[];            // In-Reply-To header (threading)
  references?: string[];           // References header (threading)
  signature?: string;              // Extracted signature, if found
  rawBodyLength: number;           // Body length before cleaning
  cleanBodyLength: number;         // Body length after cleaning
  primaryCta?: CallToAction;       // Primary call-to-action from HTML body
  threadMessages: ThreadMessage[]; // Quoted replies, oldest first
}
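
Two of the field groups above lend themselves to small derived values: the length pair shows how much the cleaning removed, and the threading headers identify the message being replied to. The helpers below are a sketch with hypothetical names (reductionRatio, parentId). The fallback to the last References entry follows the usual RFC 5322 convention that the direct parent comes last, and is not something langmail itself documents.

```typescript
// Fraction of the body removed by cleaning, in the range 0..1.
function reductionRatio(e: { rawBodyLength: number; cleanBodyLength: number }): number {
  if (e.rawBodyLength === 0) return 0; // avoid dividing by zero on empty bodies
  return 1 - e.cleanBodyLength / e.rawBodyLength;
}

// Message-ID of the direct parent, if the headers reveal one.
function parentId(e: { inReplyTo?: string[]; references?: string[] }): string | undefined {
  if (e.inReplyTo && e.inReplyTo.length > 0) return e.inReplyTo[0];
  if (e.references && e.references.length > 0) {
    // By convention the last References entry is the immediate parent.
    return e.references[e.references.length - 1];
  }
  return undefined;
}

const ratio = reductionRatio({ rawBodyLength: 1000, cleanBodyLength: 250 });
const parent = parentId({ references: ["<root@example.com>", "<prev@example.com>"] });
```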

Address

interface Address {
  name?: string;  // Display name (e.g. "Alice")
  email: string;  // Email address (e.g. "alice@example.com")
}

CallToAction

interface CallToAction {
  url: string;        // The URL the action points to
  text: string;       // Human-readable label
  confidence: number; // Score between 0.0 and 1.0
}
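
Because confidence is a score rather than a guarantee, callers will likely want a cutoff before acting on a call-to-action. A minimal sketch follows. The helper confidentCta and the 0.8 default are hypothetical choices, not values recommended by langmail.

```typescript
interface CallToAction {
  url: string;
  text: string;
  confidence: number;
}

// Returns the CTA only when it exists and clears the threshold.
function confidentCta(
  cta: CallToAction | undefined,
  minConfidence: number = 0.8, // assumed cutoff; tune it for your data
): CallToAction | undefined {
  return cta && cta.confidence >= minConfidence ? cta : undefined;
}

const strong = confidentCta({ url: "https://example.com/verify", text: "Verify email", confidence: 0.93 });
const weak = confidentCta({ url: "https://example.com/blog", text: "Read more", confidence: 0.41 });
```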

ThreadMessage

interface ThreadMessage {
  sender: string;     // Sender attribution (e.g. "Max <max@example.com>")
  timestamp?: string; // ISO 8601, if parseable from the attribution
  body: string;       // Message body (cleaned, no nested quotes)
}
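
The sender field is a single attribution string, so code that wants a structured Address has to split it. The sketch below assumes the attribution looks like the example in the comment above ("Name <email>") or is a bare address. Real attributions may take other shapes, in which case the hypothetical parseSender helper returns undefined.

```typescript
interface Address {
  name?: string;
  email: string;
}

// Hypothetical helper: handles "Name <email>" and bare "email" only.
function parseSender(sender: string): Address | undefined {
  const match = sender.match(/^\s*(.*?)\s*<([^<>\s]+@[^<>\s]+)>\s*$/);
  if (match) {
    const [, rawName = "", email = ""] = match;
    const name = rawName.replace(/^"|"$/g, ""); // drop surrounding quotes
    return name ? { name, email } : { email };
  }
  const bare = sender.trim();
  return /^[^\s<>@]+@[^\s<>@]+$/.test(bare) ? { email: bare } : undefined;
}

const parsedSender = parseSender("Max <max@example.com>");
```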

PreprocessOptions

interface PreprocessOptions {
  stripQuotes?: boolean;    // Remove quoted replies (default: true)
  stripSignature?: boolean; // Remove email signatures (default: true)
  maxBodyLength?: number;   // Max body chars, 0 = no limit (default: 0)
}
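
The documented defaults and the "0 = no limit" rule can be made concrete with a standalone sketch. The names resolveOptions and applyMaxLength are hypothetical. Truncating by UTF-16 code units with slice is an assumption, since the README does not say how the library counts characters.

```typescript
interface PreprocessOptions {
  stripQuotes?: boolean;
  stripSignature?: boolean;
  maxBodyLength?: number;
}

// Defaults as documented in the interface comments.
const DEFAULT_OPTIONS: Required<PreprocessOptions> = {
  stripQuotes: true,
  stripSignature: true,
  maxBodyLength: 0,
};

// Fills in any option the caller left out.
function resolveOptions(options: PreprocessOptions = {}): Required<PreprocessOptions> {
  return { ...DEFAULT_OPTIONS, ...options };
}

// Illustrative truncation: 0 (or less) means no limit.
function applyMaxLength(body: string, maxBodyLength: number): string {
  if (maxBodyLength <= 0) return body;
  return body.length > maxBodyLength ? body.slice(0, maxBodyLength) : body;
}

const resolved = resolveOptions({ maxBodyLength: 4000 });
```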

LlmContextOptions / RenderMode

interface LlmContextOptions {
  renderMode?: RenderMode; // Default: "LatestOnly"
}

// TypeScript enum — JS users pass the string literals directly ("LatestOnly" or "ThreadHistory")
const enum RenderMode {
  /** Only the latest message — all quoted content stripped. */
  LatestOnly = "LatestOnly",
  /** Chronological transcript of quoted replies below the main content. */
  ThreadHistory = "ThreadHistory",
}

Features

  • MIME parsing: handles nested multipart messages, attachments, and encoded headers
  • HTML to Markdown: converts HTML email bodies to clean Markdown, preserving links, headings, and structure
  • Quote stripping: detects and removes quoted replies from Gmail, Outlook, and Apple Mail, forwarded messages, and lines prefixed with >; supports English, German, French, and Spanish
  • Signature removal: strips signatures and keeps them in the signature field; detection uses the -- delimiter and heuristics
  • CTA extraction: extracts the primary call-to-action from HTML emails via JSON-LD (potentialAction) or heuristic link scoring; filters out unsubscribe, privacy, and logo links
  • Thread history: extracts quoted reply blocks into a structured ThreadMessage[] (oldest first); render with toLlmContextWithOptions(email, { renderMode: "ThreadHistory" })
  • Whitespace cleanup: normalizes excessive blank lines and trailing spaces
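
To give a feel for two of the steps listed above, here is a deliberately simplified sketch of delimiter-based signature splitting and removal of lines prefixed with >. It is not langmail's algorithm, which also covers client-specific reply headers, several languages, and heuristics. The names splitSignature and dropQuotedLines are hypothetical.

```typescript
// Splits at the first line that is exactly "--" or "-- " (the trailing
// space of the conventional delimiter is often lost in transit).
function splitSignature(body: string): { body: string; signature?: string } {
  const lines = body.split(/\r?\n/);
  const index = lines.findIndex((line) => line === "-- " || line === "--");
  if (index === -1) return { body };
  return {
    body: lines.slice(0, index).join("\n").replace(/\s+$/, ""),
    signature: lines.slice(index + 1).join("\n").trim(),
  };
}

// Drops every line whose first non-space character is ">".
function dropQuotedLines(body: string): string {
  return body
    .split(/\r?\n/)
    .filter((line) => !/^\s*>/.test(line))
    .join("\n");
}

const split = splitSignature("Thanks!\n\n-- \nBob\nAcme Inc.");
const unquoted = dropQuotedLines("Sounds good.\n> old line\n> another");
```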

Performance

langmail uses mail-parser under the hood — a zero-copy Rust MIME parser. The preprocessing pipeline adds minimal overhead on top of the parse step.

Typical throughput on a modern machine: 10,000+ emails/second for plain text messages.
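
Throughput depends heavily on hardware and message shape, so it is worth measuring on your own corpus. The harness below is a generic sketch. The function measureThroughput is hypothetical, and the work callback stands in for a call such as preprocess(raw) so that the example runs without langmail installed.

```typescript
// Runs `work` once per input and returns items processed per second.
function measureThroughput<T>(inputs: T[], work: (input: T) => unknown): number {
  const start = Date.now();
  for (const input of inputs) work(input);
  // Clamp to 1 ms so very fast runs do not divide by zero.
  const elapsedMs = Math.max(1, Date.now() - start);
  return (inputs.length / elapsedMs) * 1000;
}

// Synthetic plain-text messages; replace with real .eml contents.
const samples = Array.from(
  { length: 1000 },
  (_, i) => `Subject: test ${i}\r\n\r\nbody`,
);

// Placeholder workload; swap in the real preprocessing call.
const perSecond = measureThroughput(samples, (sample) => sample.length);
```

Date.now() only has millisecond resolution, so use a batch large enough to run for at least a few hundred milliseconds before trusting the number.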

License

MIT OR Apache-2.0


Built by the team behind Marbles.
