Author/Lead Developer: Persephone Raskova
Project Philosophy: Non-sectarian preservation of Marxist history
Current Scale: 50GB (optimized from 200GB - see ADR-001)

⚠️ OPTIMIZATION UPDATE: Corpus strategically reduced from 200GB to 50GB for feasibility.

Key Documents:
- ARCHITECTURE.md - Complete system architecture
- RUNPOD.md - GPU rental strategy ($40-60 total!)
- BOUNDARIES.md - Parallel development coordination
- START_HERE.md - Quick start guide
Complete enterprise-scale pipeline for converting the Marxists Internet Archive (200GB raw, reduced to a 50GB working corpus - see ADR-001) into a queryable RAG system on Google Cloud Platform.
Converts the corpus (5-10M documents) of Marxist theory into:
- Clean markdown with preserved metadata (GCS storage)
- Semantically chunked text with Parquet storage
- Vector embeddings via Runpod GPU rental ($40-60 total!)
- Weaviate vector database (handles 10M+ vectors)
- Enterprise-scale, cloud-hosted knowledge base
Perfect for:
- Material analysis research
- Class composition studies
- Theoretical framework development
- Building AI tools for organizing work
- Python 3.9+
- Google Cloud Platform account
- Terraform 1.5+
- 200GB corpus (torrent or mirror)
- Runpod.io account with $100 credits
- GCP Project with billing enabled
- GCS Buckets for storage (see TERRAFORM.md)
- GKE Cluster for Weaviate (3+ nodes)
- Runpod GPU rental (RTX 4090 recommended)
# Core Python packages
pip install -r requirements.txt --break-system-packages
# Weaviate client (primary for 200GB scale)
pip install weaviate-client --break-system-packages
# Google Cloud SDK
curl https://sdk.cloud.google.com | bash
gcloud auth application-default login
# Terraform for infrastructure
brew install terraform # macOS
# or see TERRAFORM.md for other platforms
# For embeddings (Runpod, not Ollama!)
pip install sentence-transformers torch

# Deploy GCP infrastructure with Terraform
cd terraform/environments/prod
terraform init
terraform apply
# See TERRAFORM.md for complete setup

Option A: Internet Archive Torrent (200GB complete archive)
# Torrent: https://archive.org/details/dump_www-marxists-org
# Upload to GCS after download:
gsutil -m cp -r dump_www-marxists-org/* gs://mia-raw-torrent/

Option B: GitHub Mirror (HTML only, smaller)
git clone https://github.com/emijrp/www.marxists.org.git
gsutil -m cp -r www.marxists.org/* gs://mia-raw-torrent/

python mia_processor.py --download-json

This fetches:
- authors.json - All 850+ authors with links to their works
- sections.json - Subject/topic organization
- periodicals.json - Revolutionary publications archive
# Process Internet Archive dump
python mia_processor.py --process-archive ~/Downloads/dump_www-marxists-org/
# Or process GitHub mirror
python mia_processor.py --process-archive ~/marxists-mirror/
# Custom output location
python mia_processor.py \
    --process-archive ~/Downloads/dump_www-marxists-org/ \
    --output ~/my-rag-data/

What this does:
- Converts HTML → Markdown
- Converts PDF → Markdown
- Filters to English content only
- Preserves metadata (author, title, date, source URL)
- Removes navigation/boilerplate
- Generates content hashes for deduplication
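The deduplication step can be sketched like this (a minimal illustration, not the actual mia_processor.py internals; function names here are hypothetical):

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable hash of normalized text, used to spot duplicate documents."""
    # Normalize whitespace and case so trivially reformatted copies collide
    normalized = " ".join(text.split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs: list[str]) -> list[str]:
    """Keep only the first occurrence of each distinct document."""
    seen, unique = set(), []
    for doc in docs:
        h = content_hash(doc)
        if h not in seen:
            seen.add(h)
            unique.append(doc)
    return unique
```

Hashing normalized text (rather than raw bytes) catches the common MIA case of the same work mirrored with different whitespace or markup residue.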
Output structure:
~/marxists-processed/
├── markdown/    # Converted documents with frontmatter
├── metadata/    # JSON metadata for each document
├── json/        # Downloaded MIA metadata
└── processing_report.json
Processing time:
- HTML: ~1-2 hours for 126k pages
- PDFs: ~3-6 hours for 38k documents (OCR intensive)
- Total: ~4-8 hours depending on hardware
Chroma (easiest, local-only):
python rag_ingest.py \
    --db chroma \
    --markdown-dir ~/marxists-processed/markdown/ \
    --persist-dir ./mia_vectordb/

Qdrant (better performance, local or cloud):
# Local Qdrant
python rag_ingest.py \
    --db qdrant \
    --markdown-dir ~/marxists-processed/markdown/ \
    --persist-dir ./mia_vectordb/
# Remote Qdrant
python rag_ingest.py \
    --db qdrant \
    --markdown-dir ~/marxists-processed/markdown/ \
    --qdrant-url http://localhost:6333

Chunking strategies:
- --strategy semantic (default) - Respects paragraph boundaries, good for theory
- --strategy section - Chunks by headers, preserves document structure
- --strategy token - Fixed token count, predictable but may split mid-thought
Chunk size:
- --chunk-size 512 (default) - Good for most LLMs
- --chunk-size 1024 - For models with larger context windows
- --chunk-size 256 - More granular retrieval
Embedding models (via Ollama for local runs; full-corpus embedding uses Runpod GPUs - see RUNPOD.md):
- nomic-embed-text (default) - 768d, balanced
- mxbai-embed-large - 1024d, higher quality
- all-minilm - 384d, faster/smaller
Processing Pipeline Confidence: 85%
Limitations:
- PDF OCR quality varies (pre-1990s works may have errors)
- Some nested HTML structures may lose formatting
- Non-English detection is heuristic-based (~5% false positives)
- Author extraction from paths is ~70% accurate
RAG Ingestion Confidence: 90%
Known issues:
- Very long works (>50k words) may need manual chunking review
- Mathematical notation in PDFs often garbled
- Some diagrams/images lost in text conversion
import chromadb
from chromadb.config import Settings

# Recent Chroma versions persist via PersistentClient;
# Settings(persist_directory=...) alone no longer persists data
client = chromadb.PersistentClient(
    path="./mia_vectordb/",
    settings=Settings(anonymized_telemetry=False),
)
collection = client.get_collection("marxist_theory")
# Query
results = collection.query(
    query_texts=["What is the theory of surplus value?"],
    n_results=5
)
for doc, metadata in zip(results['documents'][0], results['metadatas'][0]):
    print(f"\n--- {metadata['title']} by {metadata['author']} ---")
    print(doc[:500])

from qdrant_client import QdrantClient
import requests
client = QdrantClient(path="./mia_vectordb/")
# Get query embedding
response = requests.post(
    'http://localhost:11434/api/embeddings',
    json={
        "model": "nomic-embed-text",
        "prompt": "What is the theory of surplus value?"
    }
)
query_embedding = response.json()['embedding']
# Search
results = client.search(
    collection_name="marxist_theory",
    query_vector=query_embedding,
    limit=5
)
for result in results:
    print(f"\n--- {result.payload['title']} ---")
    print(result.payload['content'][:500])

To integrate with your Zettelkasten via MCP:
# In your MCP server tool definitions
def search_marxist_theory(query: str, n_results: int = 5):
    """Search the Marxist theory corpus"""
    # Use Chroma/Qdrant client here
    results = collection.query(query_texts=[query], n_results=n_results)
    return format_results(results)

Add to your mcp_config.json:
{
  "mcpServers": {
    "marxist-theory": {
      "command": "python",
      "args": ["/path/to/marxist_theory_mcp.py"],
      "env": {
        "VECTOR_DB_PATH": "./mia_vectordb/"
      }
    }
  }
}

Why semantic chunking for theory?
Marxist texts have specific rhetorical structure:
- Thesis statement
- Historical/material evidence
- Dialectical synthesis
- Practical implications
Semantic chunking preserves these argumentative units better than arbitrary token counts.
Example from Capital Vol. 1:
BAD (token-based):
Chunk 1: "The value of labour-power is determined, as in the case of every other commodity, by the labour-time necessary for the production, and consequently also the reproduction, of this special article. So far as it has value, it represents no more than a definite quantity of the average labour of society incorporated in it. Labour-power exists only as a capacity, or power of the living individual. Its production consequently pre-supposes his"
Chunk 2: "existence. Given the individual, the production of labour-power consists in his reproduction of himself or his maintenance. For his maintenance he requires a given quantity of the means of subsistence."
GOOD (semantic):
Chunk 1: "The value of labour-power is determined, as in the case of every other commodity, by the labour-time necessary for the production, and consequently also the reproduction, of this special article. So far as it has value, it represents no more than a definite quantity of the average labour of society incorporated in it. Labour-power exists only as a capacity, or power of the living individual. Its production consequently pre-supposes his existence. Given the individual, the production of labour-power consists in his reproduction of himself or his maintenance. For his maintenance he requires a given quantity of the means of subsistence."
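The semantic strategy above can be sketched as a greedy paragraph packer (an illustrative sketch, not the rag_ingest.py implementation, which also tracks token counts and metadata):

```python
def semantic_chunks(text: str, max_chars: int = 2000) -> list[str]:
    """Greedily pack whole paragraphs into chunks, never splitting one."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        # Start a new chunk if adding this paragraph would overflow the budget
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Because chunk boundaries only ever fall between paragraphs, a complete argumentative unit like the labour-power passage above stays in one chunk instead of being cut mid-sentence.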
def analyze_locality(zip_code: str):
    # fetch_census_api and generate_analysis are placeholder hooks:
    # wire them to your own census client and LLM of choice
    census_data = fetch_census_api(zip_code)
    # Query relevant theory
    theory_context = search_marxist_theory(
        f"class composition {census_data['dominant_industry']} workers"
    )
    # Synthesize theory with local material conditions
    return generate_analysis(census_data, theory_context)

# Find relevant frameworks for a specific organizing question
results = search_marxist_theory(
    "organizing lumpenproletariat declassed workers",
    n_results=10
)
# Returns: George Jackson, Mao on lumpen class, Fanon, etc.

# What did revolutionaries say about X situation?
results = search_marxist_theory(
    "strike tactics railroad workers organizing",
    n_results=20
)
# Cross-reference with your own SICA notes

No data leaves your machine:
- Vector DB is local
- Embeddings generated locally via Ollama
- No API calls to OpenAI/Anthropic needed
- Full control over data access
OPSEC considerations:
- Store on encrypted volume (LUKS)
- Keep backups on separate encrypted drives
- Vector DB files contain full text - protect accordingly
- Query logs (if you add them) should be secured
pip install <module> --break-system-packages

# Start Ollama service
ollama serve
# Or check if already running
ps aux | grep ollama

PDF processing is CPU-intensive. Options:
- Use --skip-pdfs flag (add it to script)
- Process in batches
- Use nice to lower priority: nice -n 19 python mia_processor.py ...
Reduce batch size in rag_ingest.py:
# Around line 200, add batching:
for i in range(0, len(all_chunks), 100):
    batch = all_chunks[i:i+100]
    self.ingest_chroma(batch)

Edit is_english_content() in mia_processor.py to add more language filters.
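A stopword-ratio heuristic is one simple way such a filter can work (an illustrative sketch under that assumption; the actual is_english_content() logic isn't shown here):

```python
# Common English function words; prose in other languages scores low on these
ENGLISH_STOPWORDS = {
    "the", "of", "and", "to", "in", "is", "that", "it", "for",
    "as", "with", "was", "on", "by", "are", "this", "be", "or",
}

def looks_english(text: str, threshold: float = 0.15) -> bool:
    """Return True if the share of English stopwords exceeds the threshold."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return False
    hits = sum(1 for w in words if w in ENGLISH_STOPWORDS)
    return hits / len(words) >= threshold
```

The same pattern extends to filtering out specific other languages: add a stopword set per language and reject documents that score higher on it than on English.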
Phase 2 ideas:
- Temporal metadata extraction (decade, revolutionary period)
- Automatic concept linking (references between works)
- Subject taxonomy from MIA sections.json
- Cross-reference with contemporary sources
- Integration with local news scrapers
- Automated theory → practice mapping
MCP tool ideas:
- synthesize_from_theory(situation: str) - Apply theory to current events
- find_precedents(organizing_context: str) - Historical examples
- critique_analysis(text: str) - Theoretical critique of a take
This is DIY infrastructure for the people. Fork it, hack it, share it.
Improvements needed:
- Better author extraction
- Footnote preservation
- Cross-document reference detection
- Multi-language support
MIA content licensing varies:
- Public domain works (Marx, Engels, Lenin, etc.) - freely usable
- Translations may have copyright
- Contemporary authors - check individual licenses
- MIA-created material: Creative Commons Attribution-ShareAlike 2.0
This pipeline is for personal research and education. Respect volunteer labor that built MIA.
This is experimental infrastructure. Expect bugs. Fix them and share.
Built for organizing, not profit.
"The philosophers have only interpreted the world, in various ways. The point, however, is to change it." - Marx, Theses on Feuerbach