Our new research paper: Adding Error Bars to Evals. AI model evaluations don’t usually include statistics or uncertainty. We think they should. Read the blog post: https://lnkd.in/d2jKfpyT

When a new AI model is released, the accompanying model card typically reports a matrix of evaluation scores on a variety of standard evaluations, such as MMLU, GPQA, or the LSAT. But it’s unusual for these scores to include any indication of the uncertainty, or randomness, surrounding them. This omission makes it difficult to compare the evaluation scores of two models in a rigorous way.

“Randomness” in language model evaluations takes a couple of forms. A model’s output tokens may be nondeterministic, so re-evaluating the same model on the same evaluation may produce slightly different results each time. This randomness is known as measurement error. But there’s another form of randomness that’s no longer visible by the time an evaluation is performed: sampling error. Of all the possible questions one could ask about a topic, an evaluation includes some and leaves out others, and a different draw of questions would have produced a slightly different score.

In our research paper, we recommend techniques for reducing measurement error and properly quantifying sampling error in model evaluations. With a simple assumption in place—that evaluation questions were randomly drawn from some underlying distribution—we develop an analytic framework for model evaluations using statistical theory. Drawing on the science of experimental design, we make a series of recommendations for performing evaluations and reporting the results in a way that maximizes the amount of information conveyed.

Our paper makes five core recommendations. These recommendations will likely not surprise readers with a background in statistics or experimentation, but they are not yet standard in the world of model evaluations. Specifically, our paper recommends:

1. Computing standard errors using the Central Limit Theorem
2. Using clustered standard errors when questions are drawn in related groups
3. Reducing variance by resampling answers and by analyzing next-token probabilities
4. Using paired analysis when two models are tested on the same questions
5. Conducting power analysis to determine whether an evaluation can test a specific hypothesis

For mathematical details on the theory behind each recommendation, read the full research paper here: https://lnkd.in/dBrr9zFi
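To make recommendations 1, 2, and 4 concrete, here is a minimal sketch in Python with NumPy. The scores, the cluster structure, and the helper functions are invented for illustration under the paper's random-sampling assumption; this is not code from the paper itself.

```python
import numpy as np

# Hypothetical per-question scores for two models on the same 500-question eval:
# 1.0 = correct, 0.0 = incorrect. In practice these come from your eval harness.
rng = np.random.default_rng(0)
scores_a = rng.binomial(1, 0.78, size=500).astype(float)
scores_b = rng.binomial(1, 0.74, size=500).astype(float)

# (1) CLT-based standard error of the mean score: s / sqrt(n).
def standard_error(scores):
    return scores.std(ddof=1) / np.sqrt(len(scores))

print(f"Model A: {scores_a.mean():.3f} +/- {standard_error(scores_a):.3f} (SE)")

# (2) Cluster-robust standard error for when questions arrive in related
# groups (e.g. several questions about one reading passage). Scores within
# a cluster are correlated, so residuals are summed cluster by cluster.
def clustered_standard_error(scores, cluster_ids):
    resid = scores - scores.mean()
    cluster_sums = np.array(
        [resid[cluster_ids == c].sum() for c in np.unique(cluster_ids)]
    )
    return np.sqrt((cluster_sums ** 2).sum()) / len(scores)

clusters = np.repeat(np.arange(100), 5)  # hypothetical: 100 passages x 5 questions
print(f"Model A, clustered SE: {clustered_standard_error(scores_a, clusters):.3f}")

# (4) Paired analysis: both models answered the same questions, so take the
# per-question score difference and compute its standard error directly.
diffs = scores_a - scores_b
print(f"A - B: {diffs.mean():.3f} +/- {standard_error(diffs):.3f} (paired SE)")
```

On real evals the paired standard error is typically much smaller than what a comparison of two independent error bars would suggest, because shared per-question difficulty correlates the two models' scores and cancels in the difference; the synthetic data above is uncorrelated, so it won't show that shrinkage.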
Anthropic
Anthropic is an AI safety and research company working to build reliable, interpretable, and steerable AI systems.
About us
We're an AI research company that builds reliable, interpretable, and steerable AI systems. Our first product is Claude, an AI assistant for tasks at any scale. Our research interests span multiple areas including natural language, human feedback, scaling laws, reinforcement learning, code generation, and interpretability.
- Website
- https://www.anthropic.com/
- Industry
- Research Services
- Company size
- 501-1,000 employees
- Type
- Privately Held
Updates
- We’ve added a new prompt improver to the Anthropic Console. Take an existing prompt and Claude will automatically refine it with prompt engineering techniques like chain-of-thought reasoning. The prompt improver also makes it easy to adapt prompts originally written for other AI models to work better with Claude. Read more: https://lnkd.in/dx-5sp5P.
- Coinbase customers now get faster and more accurate support with Claude powering their chatbot, help center search, and customer service teams across 100+ countries: https://lnkd.in/gWCvNy2u
- Read how Asana uses Claude to help 150,000+ companies automate workflows and save countless hours on tasks: https://lnkd.in/dee_PFDr
- Claude 3.5 Haiku is now available on our API, Amazon Bedrock, and Google Cloud's Vertex AI. Haiku is fast and particularly strong at coding. It outperforms state-of-the-art models—including GPT-4o—on SWE-bench Verified, which measures how models solve real software issues. During final testing, Haiku surpassed Claude 3 Opus, our previous flagship model, on many benchmarks—at a fraction of the cost. As a result, we've increased pricing for Claude 3.5 Haiku to reflect its increase in intelligence: anthropic.com/claude/haiku. Claude 3 Haiku remains available for use cases that benefit from image input or its lower price point: https://lnkd.in/e9yNTtNp.
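As an illustration that isn't part of the announcement itself, a call to Claude 3.5 Haiku through the Anthropic Python SDK looks roughly like this; the dated model id is an assumption based on the aliases in use at the time, so check the models documentation for the current one.

```python
import anthropic

# Minimal sketch of calling Claude 3.5 Haiku via the Messages API.
client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-haiku-20241022",  # assumed dated alias; verify before use
    max_tokens=512,
    messages=[
        {
            "role": "user",
            "content": "Write a Python function that checks whether a string is a palindrome.",
        }
    ],
)
print(response.content[0].text)
```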
- Claude can now view images within a PDF, in addition to text. Enable the feature preview to get started: https://claude.ai/new?fp=1. This helps Claude 3.5 Sonnet more accurately understand complex documents, such as those laden with charts or graphics. The Anthropic API now also supports PDF inputs in beta: https://lnkd.in/emvau9Ez
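For the API beta, a request with a PDF attached looks roughly like the sketch below. The beta flag, the dated model id, and the file name are assumptions based on the documentation at the time of writing, so check the current docs before relying on them.

```python
import base64
import anthropic

# Read and base64-encode a local PDF (hypothetical file name).
with open("quarterly_report.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",  # assumed dated alias; verify before use
    betas=["pdfs-2024-09-25"],  # assumed PDF beta flag at the time of writing
    max_tokens=1024,
    messages=[{
        "role": "user",
        "content": [
            # The document block carries the PDF so Claude can read both
            # its text and the images on each page.
            {
                "type": "document",
                "source": {
                    "type": "base64",
                    "media_type": "application/pdf",
                    "data": pdf_data,
                },
            },
            {"type": "text", "text": "Summarize the charts in this document."},
        ],
    }],
)
print(message.content[0].text)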
- See how Claude helps Hebbia deliver AI-powered document analysis to top financial and legal institutions, turning thousands of pages into actionable insights: https://lnkd.in/e4DnFAxs
- You can now dictate messages to Claude on our iPhone, iPad, and Android apps. Download on Google Play: anthropic.com/android. Or on the Apple App Store: anthropic.com/ios.
- The Claude app is now available to download on Mac and Windows: claude.ai/download.
- Claude is now available on GitHub Copilot. Starting today, developers can select Claude 3.5 Sonnet in Visual Studio Code and GitHub.com. Access will roll out to all Copilot Chat users and organizations over the coming weeks. https://lnkd.in/eaJG3wwM