
OpenAI’s GPT-4 exhibits “human-level performance” on professional benchmarks

Multimodal AI model can process images and text, pass bar exams.

Benj Edwards

A colorful AI-generated image of a radiating silhouette. Credit: Ars Technica

On Tuesday, OpenAI announced GPT-4, a large multimodal model that can accept text and image inputs while returning text output that "exhibits human-level performance on various professional and academic benchmarks," according to OpenAI. Also on Tuesday, Microsoft announced that Bing Chat has been running on GPT-4 all along.

If it performs as claimed, GPT-4 potentially represents the opening of a new era in artificial intelligence. "It passes a simulated bar exam with a score around the top 10% of test takers," writes OpenAI in its announcement. "In contrast, GPT-3.5’s score was around the bottom 10%."

OpenAI plans to release GPT-4's text capability through ChatGPT and its commercial API, but with a waitlist at first. GPT-4 is currently available to subscribers of ChatGPT Plus. Also, the firm is testing GPT-4's image input capability with a single partner, Be My Eyes, a smartphone app for blind and low-vision users that will use the model to recognize a scene and describe it.

Along with the introductory website, OpenAI also released a technical report describing GPT-4's capabilities and a system card describing its limitations in detail.

A screenshot of GPT-4's introduction to ChatGPT Plus customers from March 14, 2023. Credit: Benj Edwards / Ars Technica

GPT stands for "generative pre-trained transformer," and GPT-4 is part of a series of foundational language models extending back to the original GPT in 2018. Following the original release, OpenAI announced GPT-2 in 2019 and GPT-3 in 2020. A further refinement called GPT-3.5 arrived in 2022. In November, OpenAI released ChatGPT, which at that time was a fine-tuned conversational model based on GPT-3.5.

AI models in the GPT series have been trained to predict the next token (a fragment of a word) in a sequence of tokens using a large body of text pulled largely from the Internet. During training, the neural network builds a statistical model that represents relationships between words and concepts. Over time, OpenAI has increased the size and complexity of each GPT model, which has generally improved, model over model, how closely its completions match what a human might write in the same scenario, although results vary by task.
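The core idea of next-token prediction can be shown with a toy counting model. The sketch below is purely illustrative: it builds a bigram frequency table from a tiny corpus and predicts the statistically most likely next token, which is the simplest possible version of the "statistical model of what follows what" described above. Real GPT models use neural networks over subword tokens, not word counts.

```python
from collections import Counter, defaultdict

# Tiny stand-in corpus; GPT models train on vast amounts of Internet text.
corpus = "the cat sat on the mat the cat ate".split()

# Count how often each token follows each preceding token (a bigram model).
follow_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follow_counts[prev][nxt] += 1

def predict_next(token: str) -> str:
    """Return the most frequent next token observed after `token`."""
    return follow_counts[token].most_common(1)[0][0]

print(predict_next("the"))  # "cat" follows "the" twice, "mat" only once
```

A transformer replaces the frequency table with learned parameters and conditions on a long window of prior tokens rather than just one, but the training objective, predicting what comes next, is the same.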

As far as tasks go, GPT-4's performance is notable. As with its predecessors, it can follow complex instructions in natural language and generate technical or creative works, but it can do so with more depth: It supports generating and processing up to 32,768 tokens (around 25,000 words of text), which allows for much longer content creation or document analysis than previous models.
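The 32,768-token limit implies a rough words-to-tokens ratio of about 1.3 (32,768 tokens ≈ 25,000 words, per the figures above). The sketch below uses that back-of-the-envelope ratio to estimate whether a document fits in the context window; it is an approximation for illustration, not an actual tokenizer, and the function name is my own.

```python
# Rough estimate of whether a document fits in GPT-4's 32,768-token
# context window, using the article's implied ratio of about 1.3
# tokens per English word. A real tokenizer would give exact counts.
CONTEXT_WINDOW = 32_768
TOKENS_PER_WORD = 32_768 / 25_000  # ≈ 1.31

def fits_in_context(text: str) -> bool:
    """Return True if the estimated token count fits in the window."""
    estimated_tokens = len(text.split()) * TOKENS_PER_WORD
    return estimated_tokens <= CONTEXT_WINDOW

print(fits_in_context("word " * 10_000))  # a 10,000-word document fits
print(fits_in_context("word " * 30_000))  # a 30,000-word document does not
```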

While analyzing GPT-4's capabilities, OpenAI made the model take tests like the Uniform Bar Exam, the Law School Admission Test (LSAT), the Graduate Record Examination (GRE) Quantitative, and various AP subject tests. On many of the tasks, it scored at a human level. That means if GPT-4 were a person being judged solely on test-taking ability, it could get into law school—and likely many universities as well.

As for its multimodal capabilities (which are still limited to a research preview), GPT-4 can analyze the content of multiple images and make sense of them, such as understanding a multi-image-sequence joke or extracting information from a diagram. Microsoft and Google have both been experimenting with similar multimodal capabilities recently. In particular, Microsoft thinks that a multimodal approach will be necessary to achieve what AI researchers call "artificial general intelligence," or AI that performs general tasks at the level of a human.

Riley Goodside, staff prompt engineer at Scale AI, referenced "AGI" in a tweet while examining GPT-4's multimodal capabilities, and OpenAI employee Andrej Karpathy expressed surprise that GPT-4 could solve a test he proposed in 2012 about an AI vision model understanding why an image is funny.

OpenAI has stated that its goal is to develop AGI that can replace humans at any intellectual task, although GPT-4 is not there yet. Shortly after the GPT-4 announcement, OpenAI CEO Sam Altman tweeted, "It is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it."

And it's true: GPT-4 is far from perfect. It still reflects biases in its training dataset, hallucinates (makes up plausible-sounding falsehoods), and could potentially generate misinformation or harmful advice.

Microsoft’s unhinged ace in the hole

With the right suggestions, researchers can "trick" a language model into spilling its secrets.
Credit: Aurich Lawson | Getty Images

Microsoft's simultaneous GPT-4 announcement means OpenAI has been sitting on GPT-4 since at least November 2022, when Microsoft first tested Bing Chat in India.

"We are happy to confirm that the new Bing is running on GPT-4, customized for search," writes Microsoft in a blog post. "If you’ve used the new Bing in preview at any time in the last six weeks, you’ve already had an early look at the power of OpenAI’s latest model. As OpenAI makes updates to GPT-4 and beyond, Bing benefits from those improvements to ensure our users have the most comprehensive copilot features available."

The Bing Chat timeline matches an anonymous tip Ars Technica heard last fall that OpenAI had GPT-4 ready internally but was reluctant to release it until better guardrails could be implemented. While the nature of Bing Chat's alignment was debatable, GPT-4's guardrails now come in the form of more alignment training. Using a technique called reinforcement learning from human feedback (RLHF), OpenAI used human feedback on GPT-4's outputs to train the neural network to refuse to discuss topics that OpenAI considers sensitive or potentially harmful.
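A central piece of the RLHF pipeline is a reward model trained on human preferences: raters pick which of two completions is better, and the reward model learns to score the preferred one higher. The sketch below shows the standard pairwise preference loss for that step in a toy form; it illustrates the general technique, not OpenAI's actual implementation, and the function name is my own.

```python
import math

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """-log(sigmoid(r_chosen - r_rejected)): the standard pairwise loss
    for reward-model training. It is near zero when the model already
    scores the human-preferred completion higher, and large when the
    model ranks the pair backwards."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model agrees with the human ranking -> small loss.
print(round(preference_loss(2.0, -1.0), 3))
# Reward model disagrees with the human ranking -> large loss.
print(round(preference_loss(-1.0, 2.0), 3))
```

In full RLHF, this reward model then steers further fine-tuning of the language model itself, which is how refusals for sensitive topics get reinforced.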

"We’ve spent 6 months iteratively aligning GPT-4 using lessons from our adversarial testing program as well as ChatGPT," OpenAI writes on its website, "resulting in our best-ever results (though far from perfect) on factuality, steerability, and refusing to go outside of guardrails."


Benj Edwards Senior AI Reporter
Benj Edwards is Ars Technica's Senior AI Reporter and founder of the site's dedicated AI beat in 2022. He's also a tech historian with almost two decades of experience. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.