Highest score on Humanity's Last Exam before Dec 31, 2026?
Short Answer
1. Executive Verdict
- AI models consistently achieve rapidly increasing scores on Humanity's Last Exam.
- Architectural innovations from major developers target improved reasoning capabilities.
- Humanity's Last Exam reveals frontier model gaps in domain-specific reasoning.
- The performance of OpenAI and Claude models is a key market catalyst.
- Public leaderboards regularly track AI model performance on Humanity's Last Exam.
- Models may optimize exam scores without genuine understanding, researchers acknowledge.
Who Wins and Why
| Outcome | Market | Model | Why |
|---|---|---|---|
| At least 50% | 87.0% | 87.1% | Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam. |
| At least 75% | 21.0% | 23.8% | Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam. |
| At least 80% | 11.0% | 13.1% | Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam. |
| At least 85% | 6.0% | 8.6% | Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam. |
| At least 90% | 7.0% | 8.6% | Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam. |
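The gap between the Model and Market columns can be read as an edge in percentage points. The snippet below is a minimal sketch of that arithmetic using only the figures in the table above; the names `rows`, `edge_points`, and `largest` are illustrative and not part of the report's methodology.

```python
# Market and model probabilities (percent) copied from the table above,
# keyed by the score threshold in percent.
rows = {
    50: (87.0, 87.1),
    75: (21.0, 23.8),
    80: (11.0, 13.1),
    85: (6.0, 8.6),
    90: (7.0, 8.6),
}


def edge_points(market: float, model: float) -> float:
    """Model probability minus market probability, in percentage points."""
    return round(model - market, 1)


# Edge per threshold, and the threshold where the model disagrees most.
edges = {threshold: edge_points(*pair) for threshold, pair in rows.items()}
largest = max(edges, key=edges.get)

for threshold in sorted(edges):
    print(f"At least {threshold}%: edge {edges[threshold]:+.1f} pts")
print(f"Largest edge at the {largest}% threshold")
```

A positive edge means the model estimate sits above the market price. It says nothing about whether either figure is well calibrated.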
Current Context
2. Market Behavior & Price Dynamics
Historical Price (Probability)
3. Market Data
Contract Snapshot
This market resolves to "Yes" if any language model achieves at least 60% accuracy on Humanity's Last Exam, as verified by agi.safe.ai, before December 31, 2026; otherwise it resolves to "No." The market closes early if the event occurs and otherwise expires on December 31, 2026. Insider trading by anyone employed by the source agencies or holding material non-public information is prohibited.
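The resolution rule above can be sketched as a small predicate. This is a hedged illustration, not the exchange's implementation: the function name `resolves_yes`, the `(date, accuracy)` record shape, and the reading of "before December 31, 2026" as strictly earlier than that date are all assumptions.

```python
from datetime import date

THRESHOLD = 60.0                # percent accuracy, from the contract text
DEADLINE = date(2026, 12, 31)   # "before December 31, 2026"


def resolves_yes(verified_scores):
    """Return True if any verified score meets the threshold in time.

    verified_scores: iterable of (verification_date, accuracy_percent)
    pairs. The record shape is hypothetical; the contract only says the
    outcome is verified by agi.safe.ai.
    """
    return any(
        when < DEADLINE and accuracy >= THRESHOLD
        for when, accuracy in verified_scores
    )


# Illustrative inputs only; the 61.0 entry is invented for the example.
print(resolves_yes([(date(2026, 5, 20), 46.4)]))
print(resolves_yes([(date(2026, 5, 20), 46.4), (date(2026, 10, 1), 61.0)]))
```

If the exchange treats the deadline as inclusive, the comparison `when < DEADLINE` would need to become `when <= DEADLINE`.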
Available Contracts
Market options and current pricing
| Outcome bucket | Yes (price) | No (price) | Last trade probability |
|---|---|---|---|
| At least 50% | $0.84 | $0.17 | 87% |
| At least 55% | $0.63 | $0.38 | 63% |
| At least 60% | $0.56 | $0.45 | 56% |
| At least 65% | $0.40 | $0.61 | 39% |
| At least 70% | $0.31 | $0.70 | 0% |
| At least 75% | $0.20 | $0.81 | 21% |
| At least 80% | $0.15 | $0.86 | 11% |
| At least 85% | $0.07 | $0.94 | 6% |
| At least 90% | $0.07 | $0.94 | 7% |
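Because the buckets are nested ("at least 75%" implies "at least 70%"), a coherent set of probabilities should never rise as the threshold rises. The following minimal sketch checks the table above for that property; the helper name `monotonic_violations` is hypothetical, and the check only flags inconsistencies without explaining them (a stale or missing last trade is one possible cause).

```python
# Values copied from the table above, keyed by score threshold (percent).
last_trade = {50: 87, 55: 63, 60: 56, 65: 39, 70: 0, 75: 21, 80: 11, 85: 6, 90: 7}
yes_price = {
    50: 0.84, 55: 0.63, 60: 0.56, 65: 0.40, 70: 0.31,
    75: 0.20, 80: 0.15, 85: 0.07, 90: 0.07,
}


def monotonic_violations(by_threshold):
    """Return (lower, higher) threshold pairs where the higher bucket is
    priced above the lower one, which nested outcomes should not allow."""
    ordered = sorted(by_threshold)
    return [
        (lo, hi)
        for lo, hi in zip(ordered, ordered[1:])
        if by_threshold[hi] > by_threshold[lo]
    ]


print("Last-trade violations:", monotonic_violations(last_trade))
print("Yes-price violations:", monotonic_violations(yes_price))
```

On these figures the Yes prices are consistent across buckets, while the last-trade column is not, so the Yes prices are probably the safer column to compare across thresholds.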
Market Discussion
As of May 20, 2026, the top reported scores on Humanity's Last Exam (HLE) were 44.7% for one model and 46.4% for another [^][^]. The benchmark evaluates advanced AI on 2,500 challenging, graduate-level questions that require multi-step reasoning, and it was developed to set a higher bar after earlier AI benchmarks became saturated [^][^][^][^][^][^]. Leading models scored significantly lower in early 2025, and scores have climbed sharply since then [^][^].
4. What architectural innovations are OpenAI's and Google DeepMind's 2025-2026 model pipelines expected to introduce that could significantly boost performance on Humanity's Last Exam?
| Item | Detail |
|---|---|
| DeepMind Titans Architecture | Explicitly designed for test-time long-term memory [^] |
| OpenAI Pipeline Innovations | Integrates reasoning depth, tool use, and agentic, long-horizon behavior [^][^] |
| DeepMind Memory Mechanism | Surprise/retention mechanism and adaptive forgetting [^][^] |
5. What specific question categories in Humanity's Last Exam have caused current frontier models like GPT-4 and Claude 3 to consistently fail, and what do these failures reveal about their core reasoning gaps?
| Item | Detail |
|---|---|
| Questions on exam | 2,500 to 3,000 [^][^][^] |
| Model Accuracy on Exam | Low [^][^][^] |
| Domains of failure | Deeply domain-specific knowledge, e.g., ancient languages, microanatomy [^] |
6. How do the research approaches of Google DeepMind and Anthropic differ in their focus on emergent reasoning versus AI safety, and which philosophy is better suited for the challenges posed by Humanity's Last Exam?
| Item | Detail |
|---|---|
| DeepMind AGI framework | 10 key cognitive faculties [^][^] |
| Anthropic AI alignment | Constitutional AI guides AI to be helpful, harmless, and honest [^][^][^][^][^][^] |
| HLE AI performance | Current leading AI models score quite low [^][^] |
7. What public leaderboards or datasets tracking performance on Humanity's Last Exam are available, and how frequently are they updated by model developers like OpenAI and Google?
8. What is the consensus among AI alignment researchers and benchmark creators regarding the risk of models optimizing for the exam score without achieving genuine understanding before the end of 2026?
| Item | Detail |
|---|---|
| HLE Paper Online Publication Date | 28 Jan 2026 [^][^] |
| Kalshi Market Resolution Date | Dec 31, 2026 [^] |
| Benchmark Vulnerability | Prominent benchmarks exploitable for near-perfect scores [^] |
9. What Could Change the Odds
Key Catalysts
Key Dates & Catalysts
- Strike Date: December 31, 2026
- Expiration: December 31, 2026
- Closes: December 31, 2026
10. Decision-Flipping Events
- Trigger: Key catalysts include the performance of AI models from major developers, specifically OpenAI GPT and Claude.
- Trigger: Polymarket contracts resolve based on whether any OpenAI GPT model reaches a specified threshold on the official HLE leaderboard by June 30, 2026 11:59 PM ET [^].
- Trigger: Another market resolves on whether a Claude model scores at or above a specified percentage on the official HLE leaderboard by June 30, 2026 11:59 PM ET [^].
- Trigger: The Humanity's Last Exam (HLE) leaderboards are updated as new models are evaluated, so reported top scores can shift [^][^].
12. Historical Resolutions
No historical resolution data available for this series.