Official code release for the ResearchRubrics project. This repository contains tools for evaluating AI-generated research documents against structured rubric criteria using Large Language Models (LLMs).
This codebase provides a complete pipeline for evaluating AI-generated research reports (in markdown format) against structured rubric criteria:
- Rubric-Based Evaluation: Use LLMs to evaluate whether markdown documents satisfy specific rubric criteria
- Batch Processing: Evaluate multiple research reports efficiently with concurrent processing
- Compliance Scoring: Calculate compliance scores based on weighted rubric evaluations
- Python 3.8 or higher
- pip package manager
- LiteLLM API key (for accessing Gemini 2.5 Pro)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd researchrubrics
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Download the dataset:

  ```bash
  mkdir -p data/researchrubrics
  huggingface-cli download ScaleAI/researchrubrics processed_data.jsonl --local-dir data/researchrubrics
  ```

- Configure API credentials:

  ```bash
  # Create .env file in project root
  echo "LITELLM_API_KEY=your_api_key_here" > .env
  ```

To run the full pipeline end to end:

```bash
# Navigate to project root
cd researchrubrics
# 1. Install dependencies
pip install -r requirements.txt
# 2. Download the dataset
mkdir -p data/researchrubrics
huggingface-cli download ScaleAI/researchrubrics processed_data.jsonl --local-dir data/researchrubrics
# 3. Configure API key
echo "LITELLM_API_KEY=your_api_key_here" > .env
# 4. Place markdown reports in agent_responses/ directory
# (Reports should be named with task IDs, e.g., 683a58c9a7e7fe4e7695846f.md)
# 5. Evaluate all reports
cd src/evaluate_rubrics
python evaluate_reports_batch.py
# 6. Calculate compliance scores
cd ../calculate_metrics
python calculate_compliance_score.py
```

The repository is organized as follows:

```
researchrubrics/
├── src/
│ ├── __init__.py
│ ├── evaluate_rubrics/ # LLM-based rubric evaluation
│ │ ├── evaluate_single_report.py # Single markdown evaluation
│ │ └── evaluate_reports_batch.py # Batch evaluation script
│ ├── calculate_metrics/ # Metrics computation
│ │ └── calculate_compliance_score.py
│ └── prompts/ # Evaluation prompt templates
│ ├── system_prompt.txt
│ ├── user_prompt.txt
│ ├── chunk_prompt_template.txt
│ └── synthesis_prompt_template.txt
├── data/
│ └── researchrubrics/ # Input data
│ ├── processed_data.jsonl # Rubrics and metadata (JSONL format)
│ └── README.md # Dataset documentation
├── agent_responses/ # Input: Markdown reports to evaluate
│ └── [task_id].md # One file per task
├── results/ # Evaluation results (JSONL format)
│ └── batch_evaluation_YYYYMMDD_HHMMSS.jsonl
├── cache/ # Cached markdown conversions
├── tests/
│ └── __init__.py
├── requirements.txt
├── .env # Your API key (DO NOT COMMIT)
├── .gitignore
├── setup.py
├── LICENSE
├── CITATION.bib
└── README.md
```
Evaluate a single markdown report against its rubrics:

```python
import asyncio
from pathlib import Path

from evaluate_single_report import evaluate_task_rubrics

async def main():
    # Evaluate a specific markdown file
    markdown_file = "agent_responses/683a58c9a7e7fe4e7695846f.md"
    results_df, compliance_score = await evaluate_task_rubrics(markdown_file)

    # Display results
    print(f"Compliance Score: {compliance_score:.2%}")
    print(f"Evaluated {len(results_df)} rubrics")
    print(f"Total cost: ${results_df['cost'].sum():.4f}")

asyncio.run(main())
```

Output: Returns a DataFrame with evaluation results and a compliance score.
Evaluate all markdown reports in the `agent_responses/` directory:

```bash
cd src/evaluate_rubrics
python evaluate_reports_batch.py
```

Features:
- Processes all `.md` files in `agent_responses/`
- Uses binary grading (Satisfied/Not Satisfied)
- Powered by Gemini 2.5 Pro via LiteLLM
- Concurrent processing (default: 20 concurrent requests)
- Automatic retry logic with exponential backoff

Output: Results saved to `results/batch_evaluation_YYYYMMDD_HHMMSS.jsonl`
Configuration Options:
- `model`: LLM model to use (default: "litellm_proxy/gemini/gemini-2.5-pro-preview-06-05")
- `max_concurrent`: Maximum number of concurrent API calls (default: 20)
- `agent_responses_dir`: Directory containing markdown files (default: "agent_responses/")
- `output_file`: Custom output file path (optional)
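For example, a run with non-default settings might look like the sketch below. The `run_batch_evaluation` entry point and its directory/output keyword arguments are assumptions for illustration; only `RubricEvaluator(model=..., max_concurrent=...)` appears elsewhere in this README, so check `evaluate_reports_batch.py` for the actual interface.

```python
# Hypothetical sketch of overriding the batch defaults; the function name and
# its directory/output arguments are illustrative, not the script's confirmed API.
import asyncio

from evaluate_reports_batch import run_batch_evaluation  # hypothetical entry point

asyncio.run(
    run_batch_evaluation(
        model="litellm_proxy/gemini/gemini-2.5-pro-preview-06-05",
        max_concurrent=10,                       # stay under your API rate limits
        agent_responses_dir="agent_responses/",  # where the .md reports live
        output_file="results/my_custom_run.jsonl",
    )
)
```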
Calculate compliance scores from evaluation results:

```bash
cd src/calculate_metrics
python calculate_compliance_score.py
```

Scoring:
- Binary grading: Satisfied = 1.0, Not Satisfied = 0.0
- Compliance Score = Σ(weight × score) / Σ(positive weights)
- Excludes negative-weight rubrics from denominator
Output: Displays compliance scores for each evaluated report
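As a concrete illustration of that formula, here is a minimal sketch (not the project's `calculate_compliance_score.py` itself) that computes a compliance score from (weight, score) pairs:

```python
# Minimal sketch of the compliance formula described above; the actual
# calculate_compliance_score.py may differ in details.
def compliance_score(rubric_results):
    """rubric_results: iterable of (weight, score) pairs with score in {0.0, 1.0}."""
    weighted_sum = sum(weight * score for weight, score in rubric_results)
    positive_weight_total = sum(weight for weight, _ in rubric_results if weight > 0)
    if positive_weight_total == 0:
        return 0.0
    return weighted_sum / positive_weight_total

# Example: two satisfied rubrics, one unsatisfied, one violated negative-weight rubric
print(compliance_score([(4.0, 1.0), (2.0, 1.0), (3.0, 0.0), (-2.0, 1.0)]))
# (4 + 2 + 0 - 2) / (4 + 2 + 3) ≈ 0.444
```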
The input data file data/researchrubrics/processed_data.jsonl should be downloaded from the ScaleAI/researchrubrics HuggingFace dataset and placed in the data/researchrubrics/ directory. The file contains one JSON object per line:
```json
{
  "prompt": "Task description...",
  "sample_id": "683a58c9a7e7fe4e7695846f",
  "domain": "AI & ML",
  "conceptual_breadth": "Moderate",
  "logical_nesting": "Intermediate",
  "exploration": "Medium",
  "rubrics": [
    {
      "criterion": "Rubric description...",
      "weight": 4.0,
      "axis": "Explicit Criteria"
    }
  ]
}
```

Fields:
- `prompt`: The research task/question
- `sample_id`: Unique identifier matching the markdown filename
- `domain`: Domain category
- `conceptual_breadth`, `logical_nesting`, `exploration`: Task complexity metrics
- `rubrics`: Array of evaluation criteria with weights and categories
Input markdown files should be named with their sample_id (e.g., 683a58c9a7e7fe4e7695846f.md) and contain the AI-generated research report to evaluate.
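For example, a quick sketch of pairing tasks with their reports (illustrative only, assuming the directory layout shown above):

```python
# Sketch: load the rubric tasks and look up the matching markdown report for each.
# Paths follow the layout shown above; this helper is illustrative only.
import json
from pathlib import Path

data_file = Path("data/researchrubrics/processed_data.jsonl")
responses_dir = Path("agent_responses")

tasks = [json.loads(line) for line in data_file.read_text().splitlines() if line.strip()]

for task in tasks:
    report_path = responses_dir / f"{task['sample_id']}.md"
    if report_path.exists():
        print(f"{task['sample_id']}: {len(task['rubrics'])} rubrics, report found")
    else:
        print(f"{task['sample_id']}: report missing in {responses_dir}/")
```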
Output JSONL file with one evaluation result per line:

```json
{
  "sample_id": "683a58c9a7e7fe4e7695846f",
  "rubric_title": "Rubric description...",
  "verdict": "Satisfied",
  "score": 1.0,
  "confidence": 0.95,
  "reasoning": "Detailed explanation...",
  "tokens_used": 4567,
  "cost": 0.0247,
  "success": true,
  "weight": 4.0
}
```
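For a quick look at a results file, something like this works (a sketch assuming pandas is installed; the filename is a placeholder for your actual run):

```python
# Sketch: summarize an evaluation run from its JSONL results file.
# Replace the placeholder filename with your actual batch_evaluation_*.jsonl.
import pandas as pd

results = pd.read_json("results/batch_evaluation_YYYYMMDD_HHMMSS.jsonl", lines=True)

print(results.groupby("sample_id")["score"].mean())  # unweighted mean score per report
print(f"Total cost: ${results['cost'].sum():.4f}")
print(f"Total tokens: {results['tokens_used'].sum():,}")
```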
echo "LITELLM_API_KEY=your_api_key_here" > .envRequired variables:
LITELLM_API_KEY=your_api_key_here
Optional variables (only needed if using a custom LiteLLM proxy):
API_BASE_URL=https://your-custom-endpoint.com
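To confirm the variables are picked up, a quick check like the following can help (a sketch assuming python-dotenv, a common choice for loading `.env` files; verify against requirements.txt):

```python
# Sketch: load .env and confirm the credentials are visible; assumes python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

print("LITELLM_API_KEY set:", bool(os.getenv("LITELLM_API_KEY")))
print("API_BASE_URL:", os.getenv("API_BASE_URL") or "(default endpoint)")
```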
The evaluation scripts use LiteLLM to access Gemini 2.5 Pro:
- Set `LITELLM_API_KEY` in your `.env` file
- Default model: `litellm_proxy/gemini/gemini-2.5-pro-preview-06-05`
- Custom base URL supported via the `API_BASE_URL` environment variable
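In isolation, a call through LiteLLM to the default model looks roughly like the sketch below (illustrative only; the evaluation scripts build their actual prompts from the templates in `src/prompts/`):

```python
# Sketch of a single LiteLLM call against the default model; illustrative only.
import os

import litellm

response = litellm.completion(
    model="litellm_proxy/gemini/gemini-2.5-pro-preview-06-05",
    messages=[{"role": "user", "content": "Reply with the single word: ready"}],
    api_key=os.environ["LITELLM_API_KEY"],
    api_base=os.getenv("API_BASE_URL"),  # set this if your proxy needs a custom endpoint
)
print(response.choices[0].message.content)
```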
In `evaluate_single_report.py`, you can customize:

```python
evaluator = RubricEvaluator(
    model="litellm_proxy/gemini/gemini-2.5-pro-preview-06-05",
    max_concurrent=20,  # Adjust based on rate limits
)
```

For large documents exceeding token limits, the evaluator automatically:
- Splits documents into manageable chunks (8000 tokens per chunk)
- Evaluates each chunk independently using chunk-specific prompts
- Synthesizes chunk findings into a final verdict
- Uses separate prompt templates for chunking and synthesis
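The chunking step can be pictured with a rough sketch like the one below (the repository's actual tokenizer, chunk prompts, and synthesis logic are not reproduced; tokens are crudely approximated as ~4 characters each):

```python
# Rough sketch of size-based chunking; the real evaluator's token counting and
# the prompts in src/prompts/ (chunk_prompt_template.txt, synthesis_prompt_template.txt)
# are not reproduced here.
from typing import List

def split_into_chunks(markdown_text: str, max_tokens: int = 8000) -> List[str]:
    max_chars = max_tokens * 4  # crude chars-per-token estimate
    paragraphs = markdown_text.split("\n\n")
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```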
- Markdown Parsing: Internal caching to avoid redundant parsing
- The `cache/` directory is reserved for future use
Batch evaluation uses asynchronous processing with configurable concurrency:
```python
evaluator = RubricEvaluator(max_concurrent=20)
```

Adjust `max_concurrent` based on your API rate limits.
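Under the hood this is the usual semaphore-bounded asyncio pattern; a generic sketch (not the repository's exact implementation):

```python
# Generic sketch of semaphore-bounded concurrency; not the repository's exact code.
import asyncio

async def evaluate_all(rubric_jobs, max_concurrent: int = 20):
    """rubric_jobs: callables that each return a coroutine when invoked."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run_one(job):
        async with semaphore:  # at most max_concurrent jobs in flight
            return await job()

    return await asyncio.gather(*(run_one(job) for job in rubric_jobs))
```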
Automatic retry with exponential backoff:
- Maximum 3 retries per request
- Exponential backoff: 2^retry_count seconds
- Handles rate limits and transient errors gracefully
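A minimal sketch of that retry pattern (generic, not the repository's error handling):

```python
# Minimal sketch of retry with exponential backoff (2**retry_count seconds,
# up to 3 retries); the repository's error classification is not reproduced.
import asyncio

async def call_with_retries(request_fn, max_retries: int = 3):
    for retry_count in range(max_retries + 1):
        try:
            return await request_fn()
        except Exception:
            if retry_count == max_retries:
                raise
            await asyncio.sleep(2 ** retry_count)  # 1s, 2s, 4s, ...
```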
- Classes: Satisfied, Not Satisfied
- Scoring: 1.0 for Satisfied, 0.0 for Not Satisfied
- Use Case: Clear pass/fail evaluation
- Prompts: Uses the prompt templates in the `src/prompts/` directory
- API Rate Limits: Reduce `max_concurrent` in the `RubricEvaluator` initialization:

  ```python
  evaluator = RubricEvaluator(max_concurrent=5)
  ```

- Missing Input Data: Ensure `data/researchrubrics/processed_data.jsonl` exists
- Missing Markdown Files: Check that markdown files exist in `agent_responses/` with matching `sample_id` names
- API Key Issues: Verify the `.env` file is in the project root with a correct `LITELLM_API_KEY`
The scripts use Python's logging module. Adjust the logging level:

```python
import logging

# Choose one of the following:
logging.basicConfig(level=logging.DEBUG)    # For detailed output
logging.basicConfig(level=logging.INFO)     # For standard output
logging.basicConfig(level=logging.WARNING)  # For minimal output
```

Typical performance metrics:
- Single Report Evaluation: ~10-30 seconds (depends on document length and rubric count)
- Batch Processing: Processes 20 reports concurrently (configurable)
- Token Usage: Varies by document length; typically 3,000-10,000 tokens per rubric evaluation
- Cost: Approximately $0.01-$0.05 per rubric evaluation (Gemini 2.5 Pro pricing)
If you use this code in your research, please cite:
```bibtex
@misc{sharma2025researchrubricsbenchmarkpromptsrubrics,
      title={ResearchRubrics: A Benchmark of Prompts and Rubrics For Evaluating Deep Research Agents},
      author={Manasi Sharma and Chen Bo Calvin Zhang and Chaithanya Bandi and Clinton Wang and Ankit Aich and Huy Nghiem and Tahseen Rabbani and Ye Htet and Brian Jang and Sumana Basu and Aishwarya Balwani and Denis Peskoff and Marcos Ayestaran and Sean M. Hendryx and Brad Kenstler and Bing Liu},
      year={2025},
      eprint={2511.07685},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2511.07685}
}
```

This project is licensed under the MIT License - see the LICENSE file for details.
For questions or issues, please open an issue on GitHub.
We thank the contributors and reviewers for their valuable feedback.