bittersweet1999

RewardBench: the first evaluation tool for reward models.

Python · 459 stars · 54 forks · Updated Dec 11, 2024

Official repository for "Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing". Your efficient and high-quality synthetic data generation pipeline!

Python · 524 stars · 55 forks · Updated Nov 5, 2024

A data annotation toolbox that supports image, audio, and video data.

Python · 904 stars · 87 forks · Updated Nov 25, 2024

The Open-Source Data Annotation Platform

TypeScript · 598 stars · 48 forks · Updated Nov 6, 2024

A high-quality, one-stop open-source data extraction tool for converting PDF to Markdown and JSON.

Python · 21,193 stars · 1,507 forks · Updated Dec 13, 2024

Evaluate your LLM's responses with Prometheus and GPT-4 💯

Python · 815 stars · 49 forks · Updated Nov 29, 2024

Open-source evaluation toolkit for large vision-language models (LVLMs), supporting 160+ VLMs and 50+ benchmarks.

Python · 1,498 stars · 209 forks · Updated Dec 13, 2024

[ACL 2024] MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

52 stars · 24 forks · Updated Jul 24, 2024

HuixiangDou: Overcoming Group Chat Scenarios with LLM-based Technical Assistance

Python · 1,547 stars · 128 forks · Updated Oct 29, 2024

An efficient, flexible, and full-featured toolkit for fine-tuning LLMs (InternLM2, Llama3, Phi3, Qwen, Mistral, ...).

Python · 4,064 stars · 320 forks · Updated Nov 8, 2024

Official GitHub repo for the paper "Compression Represents Intelligence Linearly" [COLM 2024].

Python · 127 stars · 6 forks · Updated Sep 20, 2024

A benchmark for emotional intelligence in large language models

Python · 206 stars · 18 forks · Updated Jul 26, 2024

Arena-Hard-Auto: An automatic LLM benchmark.

Jupyter Notebook · 681 stars · 80 forks · Updated Dec 14, 2024

An automatic evaluator for instruction-following language models. Human-validated, high-quality, cheap, and fast.

Jupyter Notebook · 1,562 stars · 245 forks · Updated Nov 11, 2024

OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) across 100+ datasets.

Python · 4,277 stars · 455 forks · Updated Dec 13, 2024

[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset

86 stars · 1 fork · Updated Jul 12, 2024