
Arc Prize Foundation Launches Challenging New AGI Benchmark, Exposing AI Weaknesses
On Monday, the non-profit Arc Prize Foundation announced a new benchmark, ARC-AGI-2, designed to challenge frontier AI models on reasoning and other human-level capabilities. The organization also announced a new contest, ARC Prize 2025, running from March to November, with a $700,000 Grand Prize for the winner.
In a rush? Here are the quick facts:
- The Arc Prize Foundation launched a new benchmark called ARC-AGI-2 to test AI models on human-level reasoning skills.
- Current top AI models failed the test, scoring between 0.0% and 4%, while humans scored up to 100%.
- The non-profit organization also announced the competition ARC Prize 2025 for the benchmark, and the winner will earn a $700,000 prize.
According to the information shared by the organization, the most popular AI models on the market have not managed to surpass a 4% score on ARC-AGI-2, while humans can solve the test with relative ease.
“Today we’re excited to launch ARC-AGI-2 to challenge the new frontier,” states the announcement. “ARC-AGI-2 is even harder for AI (in particular, AI reasoning systems), while maintaining the same relative ease for humans.”
ARC-AGI-2 is the successor to the organization's original benchmark, ARC-AGI-1, launched in 2019. On that earlier test, only OpenAI's o3 managed to score 85%, in December 2024.
This new version focuses on tasks that are easy for humans but hard, or until now impossible, for AI models. Unlike other benchmarks, ARC-AGI-2 does not test PhD-level knowledge or superhuman capabilities; instead, its tasks evaluate adaptability and problem-solving, measuring how well existing knowledge is applied to unfamiliar situations.
Arc Prize explained that every task in the test was solved by humans in fewer than two attempts, and AI models must follow similar rules while keeping per-task costs as low as possible. The tasks require symbolic interpretation (models must understand symbols as more than visual patterns), applying several rules simultaneously, and handling rules that change depending on context, something most AI reasoning systems fail at.
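For readers curious what such a task looks like in practice, the short Python sketch below loads and prints an ARC-style puzzle. It assumes the publicly documented ARC-AGI-1 task layout (a JSON file with "train" and "test" lists of input/output grids of small integers) carries over to ARC-AGI-2; the file name is purely illustrative.

```python
import json

def load_task(path):
    # ARC-style task files hold demonstration pairs under "train"
    # and held-out puzzles under "test" (assumed layout).
    with open(path) as f:
        task = json.load(f)
    return task["train"], task["test"]

def show_grid(grid):
    # Each grid is a list of rows; each cell is a colour code from 0 to 9.
    for row in grid:
        print(" ".join(str(cell) for cell in row))
    print()

if __name__ == "__main__":
    train_pairs, test_pairs = load_task("example_task.json")  # illustrative path
    for pair in train_pairs:
        print("Demonstration input:")
        show_grid(pair["input"])
        print("Demonstration output:")
        show_grid(pair["output"])
```

A solver is expected to infer the transformation rule from the demonstration pairs and produce the correct output grid for each test input, which is what makes context-dependent rules so difficult to automate.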
The organization tested the new benchmark on humans and publicly available AI models. Human panels scored 100% and 60%, while popular frontier systems such as DeepSeek's R1 and R1-Zero scored 0.3%, and GPT-4.5 (as a pure LLM) and o3-mini-high scored 0.0%. OpenAI's o3-low, using chain-of-thought reasoning, search, and synthesis, reached an estimated 4%, at a high cost per task.
Arc Prize also launched its latest open-source contest, ARC Prize 2025, hosted on the popular online platform Kaggle between March and November. The first team to surpass an 85% score on the ARC-AGI-2 benchmark, at an efficiency of $2.50 per task, will earn the $700,000 Grand Prize. There will also be paper awards and other prizes for top scores.
The foundation said that more details will be provided on its official website in the coming days.