Stars
RewardBench: the first evaluation tool for reward models.
Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!
A data annotation toolbox that supports image, audio, and video data.
The Open-Source Data Annotation Platform
A high-quality, one-stop open-source data extraction tool for converting PDF to Markdown and JSON.
Evaluate your LLM's responses with Prometheus and GPT-4 💯
An open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks.
[ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues
HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance
An efficient, flexible, and full-featured toolkit for fine-tuning LLMs (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)
Official GitHub repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024]
A benchmark for emotional intelligence in large language models
Arena-Hard-Auto: An automatic LLM benchmark.
An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, Llama2, Qwen, GLM, Claude, etc.) over 100+ datasets.
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset