The "Code Coverage" tool for RAG Knowledge Bases.
Automated detection of knowledge gaps, hallucination spots, and representation bias in Vector Databases.
In software engineering, we track Code Coverage to prevent bugs.
In AI engineering, we ship RAG (Retrieval Augmented Generation) systems without Semantic Coverage.
Engineers often don't know:
- Blind Spots: What are users asking that our Vector DB has zero context for?
- Data Drift: How is user intent shifting away from our indexed documentation over time?
- Hallucination Triggers: Which clusters of queries systematically yield low-confidence retrieval?
This tool provides semantic observability by projecting both Documents (Knowledge) and User Queries (Intent) into a shared latent space (using UMAP). It then uses density-based clustering (HDBSCAN) to identify "Red Zones"—areas of high user density but low document density.
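A minimal sketch of that idea, assuming the libraries named below (sentence-transformers, umap-learn, hdbscan); the sample texts and the 0.8 query-share cutoff are made up for illustration and are not the project's actual parameters:

```python
# Toy sketch of the core idea: embed documents and queries into one space,
# project with UMAP, cluster with HDBSCAN, then flag clusters that are
# dominated by queries but contain few or no documents ("Red Zones").
import numpy as np
import umap
import hdbscan
from sentence_transformers import SentenceTransformer

docs = ["How to rotate API keys", "Billing and invoices overview"]
queries = ["my api key leaked, what do I do?", "cancel my subscription", "sso login fails"]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs + queries)                # shared 384-dim space
points_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(points_2d)

is_query = np.array([False] * len(docs) + [True] * len(queries))
for cluster_id in set(labels) - {-1}:                    # -1 = HDBSCAN noise
    mask = labels == cluster_id
    if is_query[mask].mean() > 0.8:                      # mostly queries, few docs
        print(f"Potential Red Zone: cluster {cluster_id} with {mask.sum()} points")
```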
- Math Engine: Sentence-Transformers (SBERT), UMAP, HDBSCAN, Scikit-Learn
- Backend: FastAPI (async inference)
- Frontend: React + Vite, Plotly.js (Interactive Scatter Plots)
- Extensibility: Plugin architecture for Vector DBs
git clone https://github.com/aashirpersonal/semantic-coverage.git
cd semantic-coverage
# Backend Setup
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Frontend Setup
cd frontend
npm install

# Terminal 1: Backend
uvicorn app.main:app --reload
# Terminal 2: Frontend
npm run dev

Navigate to http://localhost:5173. Paste your JSON export of queries and documents. The system will auto-generate a "Gap Report" identifying missing topics.
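The exact schema is defined by the app; as a rough, hypothetical illustration, the export boils down to two lists of texts (field names below are assumptions, not the documented format):

```python
# Hypothetical shape of the export; "documents"/"queries" and the per-item
# fields are illustrative assumptions, not the app's documented schema.
import json

payload = {
    "documents": [
        {"id": "doc-1", "text": "How to rotate API keys"},
        {"id": "doc-2", "text": "Billing and invoices overview"},
    ],
    "queries": [
        {"id": "q-1", "text": "my api key leaked, what do I do?"},
        {"id": "q-2", "text": "cancel my subscription"},
    ],
}

with open("export.json", "w") as f:
    json.dump(payload, f, indent=2)
```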
semantic-coverage is designed to be database-agnostic. We support a plugin architecture for major Vector Stores:
from app.core.connectors import get_connector
# Connect to Pinecone
db = get_connector("pinecone", api_key="...", index_name="knowledge-base-v1")
docs = db.fetch_documents(limit=5000)
# Connect to ChromaDB
db = get_connector("chroma", collection_name="support_tickets")
docs = db.fetch_documents()
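Adding support for another store should come down to implementing the same small interface the built-in connectors expose. The sketch below is a guess at that contract; `BaseConnector` and its method signature are assumptions made for illustration, and the real extension point lives in `app/core/connectors`:

```python
# Hypothetical custom connector. BaseConnector and its method signature are
# assumptions for illustration; see app/core/connectors for the real contract.
import csv
from typing import List, Optional

class BaseConnector:
    """Assumed minimal contract: return raw document texts for analysis."""
    def fetch_documents(self, limit: Optional[int] = None) -> List[str]:
        raise NotImplementedError

class CSVConnector(BaseConnector):
    """Example plugin that reads documents from a CSV file with a 'text' column."""
    def __init__(self, path: str):
        self.path = path

    def fetch_documents(self, limit: Optional[int] = None) -> List[str]:
        with open(self.path, newline="", encoding="utf-8") as f:
            texts = [row["text"] for row in csv.DictReader(f)]
        return texts[:limit] if limit else texts
```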
The pipeline works as follows:
- Ingestion: Text is converted to 384-dim embeddings (all-MiniLM-L6-v2).
- Projection: High-dimensional vectors are reduced to 2D via UMAP.
- Clustering: User queries are clustered to find distinct "Topics."
- Gap Analysis: For each query cluster, we calculate the Centroid Distance to the nearest Document neighbor.
- Scoring: Clusters exceeding the distance threshold (0.7) are flagged as `blind_spot`.
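A rough sketch of the Gap Analysis and Scoring steps; cosine distance on the raw embeddings is an assumption here, and the project may instead measure distance in the 2D projection:

```python
# Sketch of gap scoring: for each query cluster, measure the distance from its
# centroid to the nearest document and flag clusters beyond the threshold.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_blind_spots(query_emb, query_labels, doc_emb, threshold=0.7):
    query_emb = np.asarray(query_emb)
    query_labels = np.asarray(query_labels)
    nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(doc_emb)
    blind_spots = []
    for cluster_id in set(query_labels.tolist()) - {-1}:     # -1 = unclustered noise
        centroid = query_emb[query_labels == cluster_id].mean(axis=0, keepdims=True)
        distance, _ = nn.kneighbors(centroid)                # nearest document distance
        if distance[0, 0] > threshold:
            blind_spots.append(cluster_id)
    return blind_spots
```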
License: MIT
