Skip to content

percy-raskova/marxists.org-rag-db

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

45 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Marxists Internet Archive RAG Pipeline

Author/Lead Developer: Persephone Raskova Project Philosophy: Non-sectarian preservation of Marxist history Current Scale: 50GB (optimized from 200GB - see ADR-001)

⚠️ OPTIMIZATION UPDATE: Corpus strategically reduced from 200GB to 50GB for feasibility.

Key Documents:

Complete enterprise-scale pipeline for converting the 200GB Marxists Internet Archive into a queryable RAG system on Google Cloud Platform.

πŸ”₯ What This Does

Converts 200GB corpus (5-10M documents) of Marxist theory into:

  • Clean markdown with preserved metadata (GCS storage)
  • Semantically chunked text with Parquet storage
  • Vector embeddings via Runpod GPU rental ($40-60 total!)
  • Weaviate vector database (handles 10M+ vectors)
  • Enterprise-scale, cloud-hosted knowledge base

Perfect for:

  • Material analysis research
  • Class composition studies
  • Theoretical framework development
  • Building AI tools for organizing work

πŸ“‹ Prerequisites

System Requirements

  • Python 3.9+
  • Google Cloud Platform account
  • Terraform 1.5+
  • 200GB corpus (torrent or mirror)
  • Runpod.io account with $100 credits

Cloud Infrastructure Required

  • GCP Project with billing enabled
  • GCS Buckets for storage (see TERRAFORM.md)
  • GKE Cluster for Weaviate (3+ nodes)
  • Runpod GPU rental (RTX 4090 recommended)

Install Dependencies

# Core Python packages
pip install -r requirements.txt --break-system-packages

# Weaviate client (primary for 200GB scale)
pip install weaviate-client --break-system-packages

# Google Cloud SDK
curl https://sdk.cloud.google.com | bash
gcloud auth application-default login

# Terraform for infrastructure
brew install terraform  # macOS
# or see TERRAFORM.md for other platforms

# For embeddings (Runpod, not Ollama!)
pip install sentence-transformers torch

πŸš€ Quick Start (200GB Scale)

Step 0: Deploy Cloud Infrastructure

# Deploy GCP infrastructure with Terraform
cd terraform/environments/prod
terraform init
terraform apply

# See TERRAFORM.md for complete setup

Step 1: Get the 200GB Archive

Option A: Internet Archive Torrent (200GB complete archive)

# Torrent: https://archive.org/details/dump_www-marxists-org
# Upload to GCS after download:
gsutil -m cp -r dump_www-marxists-org/* gs://mia-raw-torrent/

Option B: GitHub Mirror (HTML only, smaller)

git clone https://github.com/emijrp/www.marxists.org.git
gsutil -m cp -r www.marxists.org/* gs://mia-raw-torrent/

Step 2: Download MIA Metadata

python mia_processor.py --download-json

This fetches:

  • authors.json - All 850+ authors with links to their works
  • sections.json - Subject/topic organization
  • periodicals.json - Revolutionary publications archive

Step 3: Process Archive

# Process Internet Archive dump
python mia_processor.py --process-archive ~/Downloads/dump_www-marxists-org/

# Or process GitHub mirror
python mia_processor.py --process-archive ~/marxists-mirror/

# Custom output location
python mia_processor.py \
    --process-archive ~/Downloads/dump_www-marxists-org/ \
    --output ~/my-rag-data/

What this does:

  • Converts HTML β†’ Markdown
  • Converts PDF β†’ Markdown
  • Filters to English content only
  • Preserves metadata (author, title, date, source URL)
  • Removes navigation/boilerplate
  • Generates content hashes for deduplication

Output structure:

~/marxists-processed/
β”œβ”€β”€ markdown/          # Converted documents with frontmatter
β”œβ”€β”€ metadata/          # JSON metadata for each document
β”œβ”€β”€ json/              # Downloaded MIA metadata
└── processing_report.json

Processing time:

  • HTML: ~1-2 hours for 126k pages
  • PDFs: ~3-6 hours for 38k documents (OCR intensive)
  • Total: ~4-8 hours depending on hardware

Step 4: Ingest to Vector DB

Chroma (easiest, local-only):

python rag_ingest.py \
    --db chroma \
    --markdown-dir ~/marxists-processed/markdown/ \
    --persist-dir ./mia_vectordb/

Qdrant (better performance, local or cloud):

# Local Qdrant
python rag_ingest.py \
    --db qdrant \
    --markdown-dir ~/marxists-processed/markdown/ \
    --persist-dir ./mia_vectordb/

# Remote Qdrant
python rag_ingest.py \
    --db qdrant \
    --markdown-dir ~/marxists-processed/markdown/ \
    --qdrant-url http://localhost:6333

Chunking strategies:

  • --strategy semantic (default) - Respects paragraph boundaries, good for theory
  • --strategy section - Chunks by headers, preserves document structure
  • --strategy token - Fixed token count, predictable but may split mid-thought

Chunk size:

  • --chunk-size 512 (default) - Good for most LLMs
  • --chunk-size 1024 - For models with larger context windows
  • --chunk-size 256 - More granular retrieval

Embedding models (via Ollama):

  • nomic-embed-text (default) - 768d, balanced
  • mxbai-embed-large - 1024d, higher quality
  • all-minilm - 384d, faster/smaller

πŸ“Š Statistics & Confidence

Processing Pipeline Confidence: 85%

Limitations:

  • PDF OCR quality varies (pre-1990s works may have errors)
  • Some nested HTML structures may lose formatting
  • Non-English detection is heuristic-based (~5% false positives)
  • Author extraction from paths is ~70% accurate

RAG Ingestion Confidence: 90%

Known issues:

  • Very long works (>50k words) may need manual chunking review
  • Mathematical notation in PDFs often garbled
  • Some diagrams/images lost in text conversion

πŸ” Querying Your RAG

Chroma Example

import chromadb
from chromadb.config import Settings

client = chromadb.Client(Settings(
    persist_directory="./mia_vectordb/",
    anonymized_telemetry=False
))

collection = client.get_collection("marxist_theory")

# Query
results = collection.query(
    query_texts=["What is the theory of surplus value?"],
    n_results=5
)

for doc, metadata in zip(results['documents'][0], results['metadatas'][0]):
    print(f"\n--- {metadata['title']} by {metadata['author']} ---")
    print(doc[:500])

Qdrant Example

from qdrant_client import QdrantClient
import requests

client = QdrantClient(path="./mia_vectordb/")

# Get query embedding
response = requests.post(
    'http://localhost:11434/api/embeddings',
    json={
        "model": "nomic-embed-text",
        "prompt": "What is the theory of surplus value?"
    }
)
query_embedding = response.json()['embedding']

# Search
results = client.search(
    collection_name="marxist_theory",
    query_vector=query_embedding,
    limit=5
)

for result in results:
    print(f"\n--- {result.payload['title']} ---")
    print(result.payload['content'][:500])

πŸ› οΈ Integration with PercyBrain

To integrate with your Zettelkasten via MCP:

# In your MCP server tool definitions
def search_marxist_theory(query: str, n_results: int = 5):
    """Search the Marxist theory corpus"""
    # Use Chroma/Qdrant client here
    results = collection.query(query_texts=[query], n_results=n_results)
    return format_results(results)

Add to your mcp_config.json:

{
  "mcpServers": {
    "marxist-theory": {
      "command": "python",
      "args": ["/path/to/marxist_theory_mcp.py"],
      "env": {
        "VECTOR_DB_PATH": "./mia_vectordb/"
      }
    }
  }
}

πŸ“ Chunking Strategy Rationale

Why semantic chunking for theory?

Marxist texts have specific rhetorical structure:

  1. Thesis statement
  2. Historical/material evidence
  3. Dialectical synthesis
  4. Practical implications

Semantic chunking preserves these argumentative units better than arbitrary token counts.

Example from Capital Vol. 1:

BAD (token-based):
Chunk 1: "The value of labour-power is determined, as in the case of every other commodity, by the labour-time necessary for the production, and consequently also the reproduction, of this special article. So far as it has value, it represents no more than a definite quantity of the average labour of society incorporated in it. Labour-power exists only as a capacity, or power of the living individual. Its production consequently pre-supposes his"

Chunk 2: "existence. Given the individual, the production of labour-power consists in his reproduction of himself or his maintenance. For his maintenance he requires a given quantity of the means of subsistence."

GOOD (semantic):
Chunk 1: "The value of labour-power is determined, as in the case of every other commodity, by the labour-time necessary for the production, and consequently also the reproduction, of this special article. So far as it has value, it represents no more than a definite quantity of the average labour of society incorporated in it. Labour-power exists only as a capacity, or power of the living individual. Its production consequently pre-supposes his existence. Given the individual, the production of labour-power consists in his reproduction of himself or his maintenance. For his maintenance he requires a given quantity of the means of subsistence."

🎯 Use Cases

1. Material Analysis Assistant

def analyze_locality(zip_code: str):
    # Get census data for area
    census_data = fetch_census_api(zip_code)
    
    # Query relevant theory
    theory_context = search_marxist_theory(
        f"class composition {census_data['dominant_industry']} workers"
    )
    
    # Synthesize
    return generate_analysis(census_data, theory_context)

2. Theoretical Framework Search

# Find relevant frameworks for a specific organizing question
results = search_marxist_theory(
    "organizing lumpenproletariat declassed workers",
    n_results=10
)

# Returns: George Jackson, Mao on lumpen class, Fanon, etc.

3. Historical Precedent Lookup

# What did revolutionaries say about X situation?
results = search_marxist_theory(
    "strike tactics railroad workers organizing",
    n_results=20
)

# Cross-reference with your own SICA notes

πŸ” Security & Privacy

No data leaves your machine:

  • Vector DB is local
  • Embeddings generated locally via Ollama
  • No API calls to OpenAI/Anthropic needed
  • Full control over data access

OPSEC considerations:

  • Store on encrypted volume (LUKS)
  • Keep backups on separate encrypted drives
  • Vector DB files contain full text - protect accordingly
  • Query logs (if you add them) should be secured

πŸ› Troubleshooting

"Import Error: No module named X"

pip install <module> --break-system-packages

"Ollama connection refused"

# Start Ollama service
ollama serve

# Or check if already running
ps aux | grep ollama

"PDF extraction is slow"

PDF processing is CPU-intensive. Options:

  • Use --skip-pdfs flag (add it to script)
  • Process in batches
  • Use nice to lower priority: nice -n 19 python mia_processor.py ...

"Out of memory during ingestion"

Reduce batch size in rag_ingest.py:

# Around line 200, add batching:
for i in range(0, len(all_chunks), 100):
    batch = all_chunks[i:i+100]
    self.ingest_chroma(batch)

"Non-English content slipping through"

Edit is_english_content() in mia_processor.py to add more language filters.

πŸ“ˆ Future Enhancements

Phase 2 ideas:

  • Temporal metadata extraction (decade, revolutionary period)
  • Automatic concept linking (references between works)
  • Subject taxonomy from MIA sections.json
  • Cross-reference with contemporary sources
  • Integration with local news scrapers
  • Automated theory β†’ practice mapping

MCP tool ideas:

  • synthesize_from_theory(situation: str) - Apply theory to current events
  • find_precedents(organizing_context: str) - Historical examples
  • critique_analysis(text: str) - Theoretical critique of a take

🀝 Contributing

This is DIY infrastructure for the people. Fork it, hack it, share it.

Improvements needed:

  • Better author extraction
  • Footnote preservation
  • Cross-document reference detection
  • Multi-language support

πŸ“š Related Projects

⚠️ Copyright Notice

MIA content licensing varies:

  • Public domain works (Marx, Engels, Lenin, etc.) - freely usable
  • Translations may have copyright
  • Contemporary authors - check individual licenses
  • MIA-created material: Creative Commons Attribution-ShareAlike 2.0

This pipeline is for personal research and education. Respect volunteer labor that built MIA.

πŸ“ž Questions?

This is experimental infrastructure. Expect bugs. Fix them and share.

Built for organizing, not profit.


"The philosophers have only interpreted the world, in various ways. The point, however, is to change it." - Marx, Theses on Feuerbach

About

retrieval augmented database project for marxists.org

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors 3

  •  
  •  
  •