Official code release for the ResearchRubrics project. This repository contains tools for evaluating AI-generated research documents against structured rubric criteria using Large Language Models (LLMs).
This codebase provides a complete pipeline for evaluating AI-generated research reports (in markdown format) against structured rubric criteria:
- Rubric-Based Evaluation: Use LLMs to evaluate whether markdown documents satisfy specific rubric criteria
- Batch Processing: Evaluate multiple research reports efficiently with concurrent processing
- Compliance Scoring: Calculate compliance scores based on weighted rubric evaluations
- Python 3.8 or higher
- pip package manager
- LiteLLM API key (for accessing Gemini 2.5 Pro)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd researchrubrics
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the dataset:

  ```bash
  mkdir -p data/researchrubrics
  huggingface-cli download ScaleAI/researchrubrics processed_data.jsonl --local-dir data/researchrubrics
  ```

- Configure API credentials:

  ```bash
  # Create .env file in project root
  echo "LITELLM_API_KEY=your_api_key_here" > .env
  ```

To run the full pipeline end to end:

```bash
# Navigate to project root
cd researchrubrics
# 1. Install dependencies
pip install -r requirements.txt
# 2. Download the dataset
mkdir -p data/researchrubrics
huggingface-cli download ScaleAI/researchrubrics processed_data.jsonl --local-dir data/researchrubrics
# 3. Configure API key
echo "LITELLM_API_KEY=your_api_key_here" > .env
# 4. Place markdown reports in agent_responses/ directory
# (Reports should be named with task IDs, e.g., 683a58c9a7e7fe4e7695846f.md)
# 5. Evaluate all reports
cd src/evaluate_rubrics
python evaluate_reports_batch.py
# 6. Calculate compliance scores
cd ../calculate_metrics
python calculate_compliance_score.py
```

The repository is organized as follows:

```
researchrubrics/
├── src/
│ ├── __init__.py
│ ├── evaluate_rubrics/ # LLM-based rubric evaluation
│ │ ├── evaluate_single_report.py # Single markdown evaluation
│ │ └── evaluate_reports_batch.py # Batch evaluation script
│ ├── calculate_metrics/ # Metrics computation
│ │ └── calculate_compliance_score.py
│ └── prompts/ # Evaluation prompt templates
│ ├── system_prompt.txt
│ ├── user_prompt.txt
│ ├── chunk_prompt_template.txt
│ └── synthesis_prompt_template.txt
├── data/
│ └── researchrubrics/ # Input data
│ ├── processed_data.jsonl # Rubrics and metadata (JSONL format)
│ └── README.md # Dataset documentation
├── agent_responses/ # Input: Markdown reports to evaluate
│ └── [task_id].md # One file per task
├── results/ # Evaluation results (JSONL format)
│ └── batch_evaluation_YYYYMMDD_HHMMSS.jsonl
├── cache/ # Cached markdown conversions
├── tests/
│ └── __init__.py
├── requirements.txt
├── .env # Your API key (DO NOT COMMIT)
├── .gitignore
├── setup.py
├── LICENSE
├── CITATION.bib
└── README.md
```
Evaluate a single markdown report against its rubrics:

```python
import asyncio
from pathlib import Path

from evaluate_single_report import evaluate_task_rubrics

async def main():
    # Evaluate a specific markdown file
    markdown_file = "agent_responses/683a58c9a7e7fe4e7695846f.md"
    results_df, compliance_score = await evaluate_task_rubrics(markdown_file)

    # Display results
    print(f"Compliance Score: {compliance_score:.2%}")
    print(f"Evaluated {len(results_df)} rubrics")
    print(f"Total cost: ${results_df['cost'].sum():.4f}")

asyncio.run(main())
```

Output: Returns a DataFrame with evaluation results and a compliance score.
Evaluate all markdown reports in the `agent_responses/` directory:

```bash
cd src/evaluate_rubrics
python evaluate_reports_batch.py
```

Features:
- Processes all `.md` files in `agent_responses/`
- Uses binary grading (Satisfied/Not Satisfied)
- Powered by Gemini 2.5 Pro via LiteLLM
- Concurrent processing (default: 20 concurrent requests)
- Automatic retry logic with exponential backoff

Output: Results saved to `results/batch_evaluation_YYYYMMDD_HHMMSS.jsonl`
Configuration Options:
- `model`: LLM model to use (default: "litellm_proxy/gemini/gemini-2.5-pro-preview-06-05")
- `max_concurrent`: Maximum number of concurrent API calls (default: 20)
- `agent_responses_dir`: Directory containing markdown files (default: "agent_responses/")
- `output_file`: Custom output file path (optional)
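For example, a run with non-default settings might look like the sketch below. The `run_batch_evaluation` entry point and its directory/output keyword arguments are assumptions for illustration; only `RubricEvaluator(model=..., max_concurrent=...)` appears elsewhere in this README, so check `evaluate_reports_batch.py` for the actual interface.

```python
# Hypothetical sketch of overriding the batch defaults; the function name and
# its directory/output arguments are illustrative, not the script's confirmed API.
import asyncio

from evaluate_reports_batch import run_batch_evaluation  # hypothetical entry point

asyncio.run(
    run_batch_evaluation(
        model="litellm_proxy/gemini/gemini-2.5-pro-preview-06-05",
        max_concurrent=10,                       # stay under your API rate limits
        agent_responses_dir="agent_responses/",  # where the .md reports live
        output_file="results/my_custom_run.jsonl",
    )
)
```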
Calculate compliance scores from evaluation results:

```bash
cd src/calculate_metrics
python calculate_compliance_score.py
```

Scoring:
- Binary grading: Satisfied = 1.0, Not Satisfied = 0.0
- Compliance Score = Σ(weight × score) / Σ(positive weights)
- Excludes negative-weight rubrics from denominator
Output: Displays compliance scores for each evaluated report
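As a concrete illustration of that formula, here is a minimal sketch (not the project's `calculate_compliance_score.py` itself) that computes a compliance score from (weight, score) pairs:

```python
# Minimal sketch of the compliance formula described above; the actual
# calculate_compliance_score.py may differ in details.
def compliance_score(rubric_results):
    """rubric_results: iterable of (weight, score) pairs with score in {0.0, 1.0}."""
    weighted_sum = sum(weight * score for weight, score in rubric_results)
    positive_weight_total = sum(weight for weight, _ in rubric_results if weight > 0)
    if positive_weight_total == 0:
        return 0.0
    return weighted_sum / positive_weight_total

# Example: two satisfied rubrics, one unsatisfied, one violated negative-weight rubric
print(compliance_score([(4.0, 1.0), (2.0, 1.0), (3.0, 0.0), (-2.0, 1.0)]))
# (4 + 2 + 0 - 2) / (4 + 2 + 3) ≈ 0.444
```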
The input data file data/researchrubrics/processed_data.jsonl should be downloaded from the ScaleAI/researchrubrics HuggingFace dataset and placed in the data/researchrubrics/ directory. The file contains one JSON object per line:
```json
{
  "prompt": "Task description...",
  "sample_id": "683a58c9a7e7fe4e7695846f",
  "domain": "AI & ML",
  "conceptual_breadth": "Moderate",
  "logical_nesting": "Intermediate",
  "exploration": "Medium",
  "rubrics": [
    {
      "criterion": "Rubric description...",
      "weight": 4.0,
      "axis": "Explicit Criteria"
    }
  ]
}
```

Fields:
- `prompt`: The research task/question
- `sample_id`: Unique identifier matching the markdown filename
- `domain`: Domain category
- `conceptual_breadth`, `logical_nesting`, `exploration`: Task complexity metrics
- `rubrics`: Array of evaluation criteria with weights and categories
Input markdown files should be named with their sample_id (e.g., 683a58c9a7e7fe4e7695846f.md) and contain the AI-generated research report to evaluate.
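For example, a quick sketch of pairing tasks with their reports (illustrative only, assuming the directory layout shown above):

```python
# Sketch: load the rubric tasks and look up the matching markdown report for each.
# Paths follow the layout shown above; this helper is illustrative only.
import json
from pathlib import Path

data_file = Path("data/researchrubrics/processed_data.jsonl")
responses_dir = Path("agent_responses")

tasks = [json.loads(line) for line in data_file.read_text().splitlines() if line.strip()]

for task in tasks:
    report_path = responses_dir / f"{task['sample_id']}.md"
    if report_path.exists():
        print(f"{task['sample_id']}: {len(task['rubrics'])} rubrics, report found")
    else:
        print(f"{task['sample_id']}: report missing in {responses_dir}/")
```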
Output JSONL file with one evaluation result per line:

```json
{
  "sample_id": "683a58c9a7e7fe4e7695846f",
  "rubric_title": "Rubric description...",
  "verdict": "Satisfied",
  "score": 1.0,
  "confidence": 0.95,
  "reasoning": "Detailed explanation...",
  "tokens_used": 4567,
  "cost": 0.0247,
  "success": true,
  "weight": 4.0
}
```
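For a quick look at a results file, something like this works (a sketch assuming pandas is installed; the filename is a placeholder for your actual run):

```python
# Sketch: summarize an evaluation run from its JSONL results file.
# Replace the placeholder filename with your actual batch_evaluation_*.jsonl.
import pandas as pd

results = pd.read_json("results/batch_evaluation_YYYYMMDD_HHMMSS.jsonl", lines=True)

print(results.groupby("sample_id")["score"].mean())  # unweighted mean score per report
print(f"Total cost: ${results['cost'].sum():.4f}")
print(f"Total tokens: {results['tokens_used'].sum():,}")
```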
echo "LITELLM_API_KEY=your_api_key_here" > .envRequired variables:
LITELLM_API_KEY=your_api_key_here
Optional variables (only needed if using a custom LiteLLM proxy):
API_BASE_URL=https://your-custom-endpoint.com
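To confirm the variables are picked up, a quick check like the following can help (a sketch assuming python-dotenv, a common choice for loading `.env` files; verify against requirements.txt):

```python
# Sketch: load .env and confirm the credentials are visible; assumes python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

print("LITELLM_API_KEY set:", bool(os.getenv("LITELLM_API_KEY")))
print("API_BASE_URL:", os.getenv("API_BASE_URL") or "(default endpoint)")
```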
The evaluation scripts use LiteLLM to access Gemini 2.5 Pro:
- Set `LITELLM_API_KEY` in your `.env` file
- Default model: `litellm_proxy/gemini/gemini-2.5-pro-preview-06-05`
- Custom base URL supported via the `API_BASE_URL` environment variable
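In isolation, a call through LiteLLM to the default model looks roughly like the sketch below (illustrative only; the evaluation scripts build their actual prompts from the templates in `src/prompts/`):

```python
# Sketch of a single LiteLLM call against the default model; illustrative only.
import os

import litellm

response = litellm.completion(
    model="litellm_proxy/gemini/gemini-2.5-pro-preview-06-05",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    api_key=os.environ["LITELLM_API_KEY"],
    api_base=os.getenv("API_BASE_URL"),  # set this if your proxy needs a custom endpoint
)
print(response.choices[0].message.content)
```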
In `evaluate_single_report.py`, you can customize:

```python
evaluator = RubricEvaluator(
    model="litellm_proxy/gemini/gemini-2.5-pro-preview-06-05",
    max_concurrent=20,  # Adjust based on rate limits
)
```

For large documents exceeding token limits, the evaluator automatically:
- Splits documents into manageable chunks (8000 tokens per chunk)
- Evaluates each chunk independently using chunk-specific prompts
- Synthesizes chunk findings into a final verdict
- Uses separate prompt templates for chunking and synthesis
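The chunking step can be pictured with a rough sketch like the one below (the repository's actual tokenizer, chunk prompts, and synthesis logic are not reproduced; tokens are crudely approximated as ~4 characters each):

```python
# Rough sketch of size-based chunking; the real evaluator's token counting and
# the prompts in src/prompts/ (chunk_prompt_template.txt, synthesis_prompt_template.txt)
# are not reproduced here.
from typing import List

def split_into_chunks(markdown_text: str, max_tokens: int = 8000) -> List[str]:
    max_chars = max_tokens * 4  # crude chars-per-token estimate
    paragraphs = markdown_text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```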
- Markdown Parsing: Internal caching to avoid redundant parsing
- The `cache/` directory is reserved for future use
Batch evaluation uses asynchronous processing with configurable concurrency:
```python
evaluator = RubricEvaluator(max_concurrent=20)
```

Adjust `max_concurrent` based on your API rate limits.
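Under the hood this is the usual semaphore-bounded asyncio pattern; a generic sketch (not the repository's exact implementation):

```python
# Generic sketch of semaphore-bounded concurrency; not the repository's exact code.
import asyncio

async def evaluate_all(rubric_jobs, max_concurrent: int = 20):
    """rubric_jobs: callables that each return a coroutine when invoked."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(job):
        async with semaphore:  # at most max_concurrent jobs in flight
            return await job()

    return await asyncio.gather(*(run_one(job) for job in rubric_jobs))
```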
Automatic retry with exponential backoff:
- Maximum 3 retries per request
- Exponential backoff: 2^retry_count seconds
- Handles rate limits and transient errors gracefully
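A minimal sketch of that retry pattern (generic, not the repository's error handling):

```python
# Minimal sketch of retry with exponential backoff (2**retry_count seconds,
# up to 3 retries); the repository's error classification is not reproduced.
import asyncio

async def call_with_retries(request_fn, max_retries: int = 3):
    for retry_count in range(max_retries + 1):
        try:
            return await request_fn()
        except Exception:
            if retry_count == max_retries:
                raise
            await asyncio.sleep(2 ** retry_count)  # 1s, 2s, 4s, ...
```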
- Classes: Satisfied, Not Satisfied
- Scoring: 1.0 for Satisfied, 0.0 for Not Satisfied
- Use Case: Clear pass/fail evaluation
- Prompts: Uses the prompt templates in the `src/prompts/` directory
- API Rate Limits: Reduce `max_concurrent` in the `RubricEvaluator` initialization:

  ```python
  evaluator = RubricEvaluator(max_concurrent=5)
  ```

- Missing Input Data: Ensure `data/researchrubrics/processed_data.jsonl` exists
- Missing Markdown Files: Check that markdown files exist in `agent_responses/` with matching `sample_id` names
- API Key Issues: Verify the `.env` file is in the project root with a correct `LITELLM_API_KEY`
The scripts use Python's logging module. Adjust the logging level:

```python
import logging

# Choose one of the following:
logging.basicConfig(level=logging.DEBUG)    # For detailed output
logging.basicConfig(level=logging.INFO)     # For standard output
logging.basicConfig(level=logging.WARNING)  # For minimal output
```

Typical performance metrics:
- Single Report Evaluation: ~10-30 seconds (depends on document length and rubric count)
- Batch Processing: Processes 20 reports concurrently (configurable)
- Token Usage: Varies by document length; typically 3,000-10,000 tokens per rubric evaluation
- Cost: Approximately $0.01-$0.05 per rubric evaluation (Gemini 2.5 Pro pricing)
If you use this code in your research, please cite:
```bibtex
@misc{sharma2025researchrubricsbenchmarkpromptsrubrics,
      title={ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents},
      author={Manasi Sharma and Chen Bo Calvin Zhang and Chaithanya Bandi and Clinton Wang and Ankit Aich and Huy Nghiem and Tahseen Rabbani and Ye Htet and Brian Jang and Sumana Basu and Aishwarya Balwani and Denis Peskoff and Marcos Ayestaran and Sean M. Hendryx and Brad Kenstler and Bing Liu},
      year={2025},
      eprint={2511.07685},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2511.07685}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please open an issue on GitHub.
We thank the contributors and reviewers for their valuable feedback.