
    Eliza alert: When AI passes this test, look out

    For years, AI systems were measured by giving new models a variety of standardized benchmark tests

    If you’re looking for a new reason to be nervous about artificial intelligence, try this: Some of the smartest humans in the world are struggling to create tests that AI systems can’t pass.

    For years, AI systems were measured by giving new models a variety of standardized benchmark tests. Many of these tests consisted of challenging, SAT-caliber problems in areas like math, science and logic. Comparing the models’ scores over time served as a rough measure of AI progress. But AI systems eventually got too good at those tests, so new, harder tests were created — often with the types of questions graduate students might encounter on their exams.

    Those tests aren’t in good shape, either. New models from companies like OpenAI, Google and Anthropic have been getting high scores on many doctorate-level challenges, limiting those tests’ usefulness and leading to a chilling question: Are AI systems getting too smart for us to measure?

    This week, researchers at the Center for AI Safety and Scale AI are releasing a possible answer to that question: a new evaluation, called “Humanity’s Last Exam,” which they claim is the hardest test ever administered to AI systems. Humanity’s Last Exam is the brainchild of Dan Hendrycks, a well-known AI safety researcher and director of the Center for AI Safety. (The test’s original name, “Humanity’s Last Stand,” was discarded for being overly dramatic.)

    Hendrycks worked with Scale AI, an AI company where he is an adviser, to compile the test, which consists of roughly 3,000 multiple-choice and short-answer questions designed to test AI systems’ abilities in areas including analytic philosophy and rocket engineering.

    Questions were submitted by experts in these fields, including college professors and prizewinning mathematicians, who were asked to come up with extremely difficult questions they knew the answers to.

    Here, try your hand at a question about hummingbird anatomy from the test:

    Hummingbirds within Apodiformes uniquely have a bilaterally paired oval bone, a sesamoid embedded in the caudolateral portion of the expanded, cruciate aponeurosis of insertion of m. depressor caudae. How many paired tendons are supported by this sesamoid bone? Answer with a number.

    Or, if physics is more your speed, try this one:

    A block is placed on a horizontal rail, along which it can slide frictionlessly. It is attached to the end of a rigid, massless rod of length R. A mass is attached at the other end. Both objects have weight W. The system is initially stationary, with the mass directly above the block. The mass is given an infinitesimal push, parallel to the rail. Assume the system is designed so that the rod can rotate through a full 360 degrees without interruption. When the rod is horizontal, it carries tension T1. When the rod is vertical again, with the mass directly below the block, it carries tension T2. (Both these quantities could be negative, which would indicate that the rod is in compression.) What is the value of (T1 − T2)/W?

    (I would print the answers here, but that would spoil the test for any AI systems being trained on this column. Also, I’m far too dumb to verify the answers myself.)

    The questions on Humanity’s Last Exam went through a two-step filtering process. First, submitted questions were given to leading AI models to solve. If the models couldn’t answer them (or if, in the case of multiple-choice questions, the models did worse than random guessing), the questions were given to a set of human reviewers, who refined them and verified the correct answers. Experts who wrote top-rated questions were paid between $500 and $5,000 per question, as well as receiving credit for contributing to the exam.

    Kevin Zhou, a postdoctoral researcher in theoretical particle physics at the University of California, Berkeley, submitted a handful of questions to the test. Three of his questions were chosen, all of which he told me were “along the upper range of what one might see in a graduate exam.”

    Hendrycks, who helped create a widely used AI test known as Massive Multitask Language Understanding, or MMLU, said he was inspired to create harder AI tests by a conversation with Elon Musk. (Hendrycks is also a safety adviser to Musk’s AI company, xAI.) Musk, he said, raised concerns about the existing tests given to AI models, which he thought were too easy.

    “Elon looked at the MMLU questions and said, ‘These are undergrad level. I want things that a world-class expert could do,’” Hendrycks said.

    There are other tests trying to measure advanced AI capabilities in certain domains, such as FrontierMath, a test developed by Epoch AI, and ARC-AGI, a test developed by AI researcher François Chollet. But Humanity’s Last Exam is aimed at determining how good AI systems are at answering complex questions across a wide variety of academic subjects, giving us what might be thought of as a general intelligence score.

    “We are trying to estimate the extent to which AI can automate a lot of really difficult intellectual labor,” Hendrycks said.

    Once the list of questions had been compiled, the researchers gave Humanity’s Last Exam to six leading AI models, including Google’s Gemini 1.5 Pro and Anthropic’s Claude 3.5 Sonnet. All of them failed miserably. OpenAI’s o1 system scored the highest of the bunch, with a score of 8.3%.

    Part of what’s so confusing about AI progress these days is how jagged it is. We have AI models capable of diagnosing diseases more effectively than human doctors, winning silver medals at the International Math Olympiad and beating top human programmers on competitive coding challenges. But these same models sometimes struggle with basic tasks, like arithmetic or writing metered poetry. That has given them a reputation as astoundingly brilliant at some things and totally useless at others, and it has created vastly different impressions of how fast AI is improving, depending on whether you’re looking at the best or the worst outputs.

    That jaggedness has also made measuring these models hard. I wrote last year that we need better evaluations for AI systems. I still believe that. But I also believe that we need more creative methods of tracking AI progress that don’t rely on standardized tests, because most of what humans do — and what we fear AI will do better than us — can’t be captured on a written exam.

    Zhou, the theoretical particle physics researcher who submitted questions to Humanity’s Last Exam, told me that while AI models were often impressive at answering complex questions, he didn’t consider them a threat to him and his colleagues, because their jobs involve much more than spitting out correct answers. “There’s a big gulf between what it means to take an exam and what it means to be a practicing physicist and researcher,” he said. “Even an AI that can answer these questions might not be ready to help in research, which is inherently less structured.”

    NYT Editorial Board