The "Code Coverage" tool for RAG Knowledge Bases.
Automated detection of knowledge gaps, hallucination spots, and representation bias in Vector Databases.
In software engineering, we track Code Coverage to prevent bugs.
In AI engineering, we ship RAG (Retrieval Augmented Generation) systems without Semantic Coverage.
Engineers often don't know:
- Blind Spots: What are users asking that our Vector DB has zero context for?
- Data Drift: How is user intent shifting away from our indexed documentation over time?
- Hallucination Triggers: Which clusters of queries systematically yield low-confidence retrieval?
This tool provides semantic observability by projecting both Documents (Knowledge) and User Queries (Intent) into a shared latent space (using UMAP). It then uses density-based clustering (HDBSCAN) to identify "Red Zones"—areas of high user density but low document density.
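A minimal sketch of that idea, assuming the libraries named below (sentence-transformers, umap-learn, hdbscan); the sample texts and the 0.8 query-share cutoff are made up for illustration and are not the project's actual parameters:

```python
# Toy sketch of the core idea: embed documents and queries into one space,
# project with UMAP, cluster with HDBSCAN, then flag clusters that are
# dominated by queries but contain few or no documents ("Red Zones").
import numpy as np
import umap
import hdbscan
from sentence_transformers import SentenceTransformer

docs = ["How to rotate API keys", "Billing and invoices overview"]
queries = ["my api key leaked, what do I do?", "cancel my subscription", "sso login fails"]

model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(docs + queries)                # shared 384-dim space
points_2d = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
labels = hdbscan.HDBSCAN(min_cluster_size=2).fit_predict(points_2d)

is_query = np.array([False] * len(docs) + [True] * len(queries))
for cluster_id in set(labels) - {-1}:                    # -1 = HDBSCAN noise
    mask = labels == cluster_id
    if is_query[mask].mean() > 0.8:                      # mostly queries, few docs
        print(f"Potential Red Zone: cluster {cluster_id} with {mask.sum()} points")
```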
- Math Engine: Sentence-Transformers (SBERT), UMAP, HDBSCAN, Scikit-Learn
- Backend: FastAPI (async inference)
- Frontend: React + Vite, Plotly.js (Interactive Scatter Plots)
- Extensibility: Plugin architecture for Vector DBs
git clone https://github.com/aashirpersonal/semantic-coverage.git
cd semantic-coverage
# Backend Setup
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
# Frontend Setup
cd frontend
npm install

# Terminal 1: Backend
uvicorn app.main:app --reload
# Terminal 2: Frontend
npm run dev

Navigate to http://localhost:5173. Paste your JSON export of queries and documents. The system will auto-generate a "Gap Report" identifying missing topics.
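The exact schema is defined by the app; as a rough, hypothetical illustration, the export boils down to two lists of texts (field names below are assumptions, not the documented format):

```python
# Hypothetical shape of the export; "documents"/"queries" and the per-item
# fields are illustrative assumptions, not the app's documented schema.
import json

payload = {
    "documents": [
        {"id": "doc-1", "text": "How to rotate API keys"},
        {"id": "doc-2", "text": "Billing and invoices overview"},
    ],
    "queries": [
        {"id": "q-1", "text": "my api key leaked, what do I do?"},
        {"id": "q-2", "text": "cancel my subscription"},
    ],
}

with open("export.json", "w") as f:
    json.dump(payload, f, indent=2)
```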
semantic-coverage is designed to be database-agnostic. We support a plugin architecture for major Vector Stores:
from app.core.connectors import get_connector
# Connect to Pinecone
db = get_connector("pinecone", api_key="...", index_name="knowledge-base-v1")
docs = db.fetch_documents(limit=5000)
# Connect to ChromaDB
db = get_connector("chroma", collection_name="support_tickets")
docs = db.fetch_documents()
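Adding support for another store should come down to implementing the same small interface the built-in connectors expose. The sketch below is a guess at that contract; `BaseConnector` and its method signature are assumptions made for illustration, and the real extension point lives in `app/core/connectors`:

```python
# Hypothetical custom connector. BaseConnector and its method signature are
# assumptions for illustration; see app/core/connectors for the real contract.
import csv
from typing import List, Optional

class BaseConnector:
    """Assumed minimal contract: return raw document texts for analysis."""
    def fetch_documents(self, limit: Optional[int] = None) -> List[str]:
        raise NotImplementedError

class CSVConnector(BaseConnector):
    """Example plugin that reads documents from a CSV file with a 'text' column."""
    def __init__(self, path: str):
        self.path = path

    def fetch_documents(self, limit: Optional[int] = None) -> List[str]:
        with open(self.path, newline="", encoding="utf-8") as f:
            texts = [row["text"] for row in csv.DictReader(f)]
        return texts[:limit] if limit else texts
```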
The pipeline works as follows:
- Ingestion: Text is converted to 384-dim embeddings (all-MiniLM-L6-v2).
- Projection: High-dimensional vectors are reduced to 2D via UMAP.
- Clustering: User queries are clustered to find distinct "Topics."
- Gap Analysis: For each query cluster, we calculate the Centroid Distance to the nearest Document neighbor.
- Scoring: Clusters exceeding the distance threshold (0.7) are flagged as `blind_spot`.
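A rough sketch of the Gap Analysis and Scoring steps; cosine distance on the raw embeddings is an assumption here, and the project may instead measure distance in the 2D projection:

```python
# Sketch of gap scoring: for each query cluster, measure the distance from its
# centroid to the nearest document and flag clusters beyond the threshold.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_blind_spots(query_emb, query_labels, doc_emb, threshold=0.7):
    query_emb = np.asarray(query_emb)
    query_labels = np.asarray(query_labels)
    nn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(doc_emb)
    blind_spots = []
    for cluster_id in set(query_labels.tolist()) - {-1}:     # -1 = unclustered noise
        centroid = query_emb[query_labels == cluster_id].mean(axis=0, keepdims=True)
        distance, _ = nn.kneighbors(centroid)                # nearest document distance
        if distance[0, 0] > threshold:
            blind_spots.append(cluster_id)
    return blind_spots
```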
License: MIT
