
🏳️ semantic-coverage

The "Code Coverage" tool for RAG Knowledge Bases.
Automated detection of knowledge gaps, hallucination spots, and representation bias in Vector Databases.


🛑 The Problem

In software engineering, we track Code Coverage to prevent bugs.
In AI engineering, we ship RAG (Retrieval-Augmented Generation) systems without tracking Semantic Coverage.

Engineers often don't know:

  1. Blind Spots: What are users asking that our Vector DB has zero context for?
  2. Data Drift: How is user intent shifting away from our indexed documentation over time?
  3. Hallucination Triggers: Which clusters of queries systematically yield low-confidence retrieval?

⚡ The Solution: semantic-coverage

This tool provides semantic observability by projecting both Documents (Knowledge) and User Queries (Intent) into a shared latent space (using UMAP). It then uses density-based clustering (HDBSCAN) to identify "Red Zones"—areas of high user density but low document density.
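The core idea fits in a few lines. The toy below is illustrative only, not the project's actual code: it uses scikit-learn's DBSCAN as a stand-in for HDBSCAN and random 2D points as a stand-in for UMAP projections.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Stand-ins for UMAP-projected embeddings (2D points).
docs = rng.normal(loc=(0, 0), scale=0.5, size=(200, 2))       # indexed knowledge
queries = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),         # covered intent
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),          # uncovered intent
])

# Cluster user queries into "topics" (DBSCAN standing in for HDBSCAN).
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(queries)

# A cluster is a "Red Zone" if its centroid is far from every document.
for label in set(labels) - {-1}:  # -1 is DBSCAN noise
    centroid = queries[labels == label].mean(axis=0)
    nearest_doc = np.linalg.norm(docs - centroid, axis=1).min()
    status = "red zone" if nearest_doc > 1.0 else "covered"
    print(f"cluster {label}: nearest doc at {nearest_doc:.2f} -> {status}")
```

The query blob at (5, 5) sits nowhere near the documents, so its cluster surfaces as a red zone; the blob at the origin overlaps the docs and does not.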

Dashboard Preview

🛠️ Tech Stack

  • Math Engine: Sentence-Transformers (SBERT), UMAP, HDBSCAN, Scikit-Learn
  • Backend: FastAPI (Async inference)
  • Frontend: React + Vite, Plotly.js (Interactive Scatter Plots)
  • Extensibility: Plugin architecture for Vector DBs

🚀 Quick Start

1. Installation

```bash
git clone https://github.com/aashirpersonal/semantic-coverage.git
cd semantic-coverage

# Backend Setup
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

# Frontend Setup
cd frontend
npm install
```

2. Run the Stack

```bash
# Terminal 1: Backend (from the repository root)
uvicorn app.main:app --reload

# Terminal 2: Frontend (from the frontend/ directory)
npm run dev
```

3. Usage

Navigate to http://localhost:5173. Paste your JSON export of queries and documents. The system will auto-generate a "Gap Report" identifying missing topics.
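The exact JSON schema isn't pinned down in this README; the field names below are illustrative guesses, so treat the in-app example as authoritative:

```json
{
  "documents": [
    {"id": "doc-1", "text": "How to reset your password ..."}
  ],
  "queries": [
    {"id": "q-1", "text": "my 2FA codes stopped working"}
  ]
}
```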

🔌 Enterprise Connectors

semantic-coverage is designed to be database-agnostic. We support a plugin architecture for major Vector Stores:

```python
from app.core.connectors import get_connector

# Connect to Pinecone
db = get_connector("pinecone", api_key="...", index_name="knowledge-base-v1")
docs = db.fetch_documents(limit=5000)

# Connect to ChromaDB
db = get_connector("chroma", collection_name="support_tickets")
docs = db.fetch_documents()
```
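A custom connector only needs to return documents in a common shape. Here is a minimal sketch of what a plugin could look like, assuming a `fetch_documents(limit=...)` contract like the calls above; the `Document` shape and the JSONL source are hypothetical, not part of the project's documented API:

```python
import json
from dataclasses import dataclass


@dataclass
class Document:
    id: str
    text: str


class JSONLConnector:
    """Hypothetical connector that reads documents from a JSONL file."""

    def __init__(self, path):
        self.path = path

    def fetch_documents(self, limit=None):
        docs = []
        with open(self.path) as f:
            for line in f:
                record = json.loads(line)
                docs.append(Document(id=record["id"], text=record["text"]))
                if limit is not None and len(docs) >= limit:
                    break
        return docs
```

Registering such a class under a name (so `get_connector("jsonl", path=...)` resolves it) would follow whatever mechanism `app.core.connectors` actually uses.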

🏗️ Architecture

  1. Ingestion: Text is converted to 384-dim embeddings (all-MiniLM-L6-v2).
  2. Projection: High-dimensional vectors are reduced to 2D via UMAP.
  3. Clustering: User queries are clustered to find distinct "Topics."
  4. Gap Analysis: For each query cluster, we calculate the Centroid Distance to the nearest Document neighbor.
  5. Scoring: Clusters exceeding the distance threshold (0.7) are flagged as blind_spot.
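Steps 4 and 5 reduce to a centroid-to-nearest-document distance check. A toy NumPy sketch (the 0.7 threshold comes from the text above; the data and function name are synthetic):

```python
import numpy as np

THRESHOLD = 0.7  # distance above which a query cluster is flagged


def score_clusters(query_vecs, cluster_labels, doc_vecs):
    """Flag query clusters whose centroid is far from every document."""
    report = {}
    for label in np.unique(cluster_labels):
        centroid = query_vecs[cluster_labels == label].mean(axis=0)
        # Distance from the cluster centroid to its nearest document.
        dist = np.linalg.norm(doc_vecs - centroid, axis=1).min()
        report[int(label)] = "blind_spot" if dist > THRESHOLD else "covered"
    return report


# Toy data: one cluster near the docs, one far away.
docs = np.array([[0.0, 0.0], [0.1, 0.1]])
queries = np.array([[0.05, 0.05], [0.0, 0.1], [2.0, 2.0], [2.1, 1.9]])
labels = np.array([0, 0, 1, 1])
print(score_clusters(queries, labels, docs))  # cluster 1 is a blind_spot
```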

📜 License

MIT

