Short Answer

Both the model and the market overwhelmingly agree that the highest score on Humanity's Last Exam will be at least 50%. This consensus is supported by AI models consistently achieving rapidly increasing scores due to architectural innovations targeting improved reasoning.

1. Executive Verdict

  • AI models consistently achieve rapidly increasing scores on Humanity's Last Exam.
  • Architectural innovations from major developers target improved reasoning capabilities.
  • Humanity's Last Exam reveals frontier model gaps in domain-specific reasoning.
  • The performance of OpenAI and Claude models is a key market catalyst.
  • Public leaderboards regularly track AI model performance on Humanity's Last Exam.
  • Models may optimize exam scores without genuine understanding, researchers acknowledge.

Who Wins and Why

Outcome        Market   Model    Why
At least 90%   7.0%     8.6%     Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam.
At least 85%   6.0%     8.6%     Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam.
At least 50%   87.0%    87.1%    Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam.
At least 80%   11.0%    13.1%    Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam.
At least 75%   21.0%    23.8%    Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam.
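The gap between the model and market estimates can be tabulated directly. The following is a minimal sketch using only the figures in the table above; the function name `edges` and the "model minus market" framing are illustrative assumptions, not part of the source methodology.

```python
# Market and model probabilities (in percent), copied from the table above.
estimates = {
    "At least 50%": (87.0, 87.1),
    "At least 75%": (21.0, 23.8),
    "At least 80%": (11.0, 13.1),
    "At least 85%": (6.0, 8.6),
    "At least 90%": (7.0, 8.6),
}


def edges(table):
    """Return the model-minus-market difference, in percentage points, per outcome."""
    return {name: round(model - market, 1) for name, (market, model) in table.items()}


if __name__ == "__main__":
    for name, edge in edges(estimates).items():
        print(f"{name}: model minus market = {edge:+.1f} pts")
```

On these figures the model sits slightly above the market at every listed threshold, with the largest gap at the 75% bucket.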

Current Context

Humanity's Last Exam challenges AI with complex, multimodal reasoning tasks. This benchmark comprises 2,500 expert-level questions spanning mathematics, physics, biology, computer science, and humanities, with approximately 14% requiring both text and image understanding [^][^][^]. The exam was developed to supersede earlier AI benchmarks like MMLU, which advanced models had saturated by achieving over 90% accuracy [^][^]. By contrast, frontier AI models initially scored between 3% and 13% on Humanity's Last Exam, indicating substantial limitations in complex, human-level reasoning [^][^].
AI in 2026 anticipates collaboration, agentic systems, and ethical challenges. AI is evolving from an experimental tool into a collaborative partner, influencing work, creation, and problem-solving [^][^][^]. AI agents are projected to become more prevalent, functioning as digital coworkers capable of autonomously executing multi-step tasks, such as booking travel or managing complex business operations [^][^][^][^]. However, widespread enterprise adoption of agentic AI systems may not fully materialize until 2027 due to existing infrastructure limitations and the necessity for robust data strategies and governance [^]. The focus is shifting towards augmenting human capabilities through AI rather than replacing them, seen in applications within medicine, software development, and scientific research where AI assists in generating hypotheses and controlling experiments [^][^][^].
Small Language Models (SLMs) are gaining traction for their efficiency, offering performance comparable to larger models with reduced computational demands, and are expected to power the majority of enterprise AI initiatives, addressing concerns about energy consumption and data privacy [^][^]. Efforts are also underway in Neuromorphic AI to develop brain-like computing for lower power consumption, countering the significant energy demands of modern GPUs [^]. Companies are increasingly prioritizing measurable business impact and return on investment from AI initiatives, moving beyond initial experimentation towards industrialization and optimization [^][^][^]. Experts anticipate a year of profound impact for AI in 2026, with many believing the future lies in AI amplifying human potential through collaboration [^][^][^].
Concerns are rising regarding the accelerated erosion of trust due to increasingly convincing and scalable AI-generated media, known as deepfakes, which blur the lines between reality and fabrication [^][^]. Building trust through robust security protections and ethical design is considered essential to mitigate risks associated with the proliferation of AI agents, with predictions of an increase in "death by AI" legal claims due to insufficient risk guardrails [^][^]. Some observers also question the sustainability of current AI spending, suggesting a potential "AI bubble" as revenues might underwhelm and large language models could be plateauing in performance, with technical and economic limits, including the trillions of dollars required for training and operating frontier models, becoming harder to disregard [^][^].
Key AI events in 2026 will precede prediction market outcomes. Major conferences scheduled for 2026 include AAAI (January 20-27), Gartner Data & Analytics Summit (March 9-11), NVIDIA GTC AI Conference (March 16-19), HumanX (April 6-9), SANS AI Cybersecurity Summit (April 20-27), AI Con USA (June 7-12), SuperAI (June 10-11), The AI Summit London (June 10-11), Databricks Data + AI Summit (June 15-18), AI World Congress (June 23-24), Ai4 (August 4-6), CDAO Government (September 23-24), The AI Conference San Francisco (September 29 - October 1), World Summit AI (October 7-8), and Big Data Conference Europe (November 25-27) [^][^][^][^]. On the Kalshi prediction market for the "Best AI of 2026," Anthropic's Claude is currently favored to top the LM Arena Leaderboard by December 31, 2026 [^]. This follows the April 2026 release of Claude 4.7 Opus, which is highly regarded for complex coding and long-form reasoning [^]. Google's Gemini, particularly after the March 2026 rollout of Gemini 3.1 Pro and 3.1 Flash, has also established a strong position, and OpenAI's ChatGPT has achieved significant adoption, especially among older demographics [^]. The market remains open until December 31, 2026, with further model developments anticipated to influence the odds [^].

2. Market Behavior & Price Dynamics

Historical Price (Probability)

[Chart: outcome probability by date]
This prediction market moved up quickly and then held steady. The price opened at 79.0% and immediately jumped to 87.0%, where it has since stabilized. This early move established bullish sentiment from the outset. Humanity's Last Exam is a highly difficult benchmark on which frontier models initially scored between 3% and 13%. The market's high probability suggests traders are largely discounting those limitations and are instead pricing in continued rapid advances in AI's complex reasoning capabilities before the market's 2026 resolution date.
The price action has established clear technical levels, with an initial support at 79.0% and a new, firm level at 87.0% that has acted as both a ceiling and the current support. While a total of 350 contracts have been traded, suggesting a degree of market participation, the provided sample data points show no volume, which could indicate that trading activity is sporadic. Overall, the chart reflects a very high degree of market conviction that a high score will be achieved. The sentiment is strongly optimistic about future AI progress, implying a belief that the performance gap noted in current reports will be closed well before the end of 2026.

3. Market Data

Contract Snapshot

This market resolves to "Yes" if any language model achieves at least 60% accuracy on Humanity's Last Exam, as verified by agi.safe.ai, before December 31, 2026. If this condition is not met by the deadline, the market resolves to "No." The market closes early if the event occurs; otherwise it expires on December 31, 2026. Insider trading by those employed by source agencies or holding material non-public information is prohibited.

Available Contracts

Market options and current pricing

Outcome bucket   Yes (price)   No (price)   Last trade probability
At least 50%     $0.84         $0.17        87%
At least 55%     $0.63         $0.38        63%
At least 60%     $0.56         $0.45        56%
At least 65%     $0.40         $0.61        39%
At least 75%     $0.20         $0.81        21%
At least 80%     $0.15         $0.86        11%
At least 90%     $0.07         $0.94        7%
At least 85%     $0.07         $0.94        6%
At least 70%     $0.31         $0.70        0%
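Because reaching a higher score implies reaching every lower one, an "at least X%" contract should not logically be priced above a contract with a lower bar. The sketch below checks the table above for that ordering and also computes how far each Yes/No pair exceeds $1.00. The helper names are illustrative assumptions, and the last-trade column may simply be stale for thinly traded buckets.

```python
# (threshold %, Yes price, No price, last trade %) copied from the table above.
contracts = [
    (50, 0.84, 0.17, 87), (55, 0.63, 0.38, 63), (60, 0.56, 0.45, 56),
    (65, 0.40, 0.61, 39), (70, 0.31, 0.70, 0), (75, 0.20, 0.81, 21),
    (80, 0.15, 0.86, 11), (85, 0.07, 0.94, 6), (90, 0.07, 0.94, 7),
]


def overround(yes, no):
    """Amount by which the Yes and No prices together exceed $1.00."""
    return round(yes + no - 1.0, 2)


def violations(rows, column):
    """Adjacent threshold pairs where the higher bar shows a larger value than the lower bar."""
    ordered = sorted(rows)  # sorts by threshold, the first tuple element
    return [(lo[0], hi[0]) for lo, hi in zip(ordered, ordered[1:]) if hi[column] > lo[column]]


if __name__ == "__main__":
    print("Yes-price ordering violations:", violations(contracts, 1))
    print("Last-trade ordering violations:", violations(contracts, 3))
    print("Overround per bucket:", {t: overround(y, n) for t, y, n, _ in contracts})
```

On these figures the Yes prices are consistently ordered, while the last-trade column is not (the 70% and 85% buckets sit below the next bucket up), which suggests the Yes prices are the more reliable column to read.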

Market Discussion

As of May 20, 2026, the top reported scores on Humanity's Last Exam (HLE) cited in this discussion were 44.7% and 46.4%, depending on the source [^][^]. The benchmark, built to evaluate advanced AI on 2,500 challenging, graduate-level questions requiring multi-step reasoning, aims to set a higher bar now that earlier AI benchmarks have become saturated [^][^][^][^][^][^]. Leading models scored far lower in early 2025, and scores have climbed dramatically since then [^][^].

4. What architectural innovations are OpenAI's and Google DeepMind's 2025-2026 model pipelines expected to introduce that could significantly boost performance on Humanity's Last Exam?

DeepMind Titans Architecture: Explicitly designed for test-time long-term memory [^]
OpenAI Pipeline Innovations: Integrates reasoning depth, tool use, and agentic, long-horizon behavior [^][^]
DeepMind Memory Mechanism: Surprise/retention mechanism and adaptive forgetting [^][^]
OpenAI and Google DeepMind are both advancing AI for sustained reasoning. OpenAI’s pipeline innovations focus on integrating reasoning depth, effective tool use, and agentic, long-horizon behavior into their flagship models, moving towards a single model line that combines reasoning, tool use, and conversational quality [^][^]. Google DeepMind’s Titans architecture, conversely, is specifically engineered for test-time long-term memory to enhance performance on exams requiring sustained reasoning [^][^].
Google DeepMind's Titans architecture specializes in long-term memory management. Incorporating MIRAS, it features a long-term memory module that uses attention to decide whether to include stored summaries [^][^]. This architecture also employs a “surprise” or retention mechanism to differentiate between momentary and past surprise, and utilizes adaptive forgetting with weight decay to manage finite memory capacity over extremely long sequences, which is crucial for tasks demanding sustained reasoning [^][^].
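To make the surprise and forgetting ideas concrete, here is a heavily simplified sketch of a surprise-driven memory update with weight decay. It assumes a linear associative memory (a plain matrix), a squared-error recall loss, and fixed scalar rates; the actual Titans architecture may use a deeper memory module and data-dependent gates, and all names here (`memory_step`, `eta`, `theta`, `alpha`) are illustrative assumptions rather than the authors' implementation.

```python
def matvec(M, k):
    """Multiply matrix M (list of rows) by vector k."""
    return [sum(m * x for m, x in zip(row, k)) for row in M]


def memory_step(M, S, k, v, eta=0.5, theta=0.1, alpha=0.01):
    """One update of a linear memory M with a running surprise term S.

    eta   : how much past surprise is retained (momentum)
    theta : step size on momentary surprise (the gradient of the recall error)
    alpha : forgetting rate, applied as weight decay on the memory
    """
    err = [p - t for p, t in zip(matvec(M, k), v)]  # recall error: M k - v
    # Gradient of ||M k - v||^2 with respect to M is 2 * err * k^T.
    grad = [[2.0 * e * x for x in k] for e in err]
    S_new = [[eta * s - theta * g for s, g in zip(s_row, g_row)]
             for s_row, g_row in zip(S, grad)]
    M_new = [[(1.0 - alpha) * m + s for m, s in zip(m_row, s_row)]
             for m_row, s_row in zip(M, S_new)]
    return M_new, S_new


if __name__ == "__main__":
    M = [[0.0, 0.0], [0.0, 0.0]]
    S = [[0.0, 0.0], [0.0, 0.0]]
    key, value = [1.0, 0.0], [2.0, -1.0]
    for _ in range(50):
        M, S = memory_step(M, S, key, value, alpha=0.0)
    print("recalled value:", matvec(M, key))
```

With `alpha=0.0` the memory converges to recalling the stored value for the repeated key; a positive `alpha` trades some recall accuracy for the ability to free capacity over very long sequences.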
OpenAI enhances agentic capabilities for complex multi-step reasoning. Their 2025 developer materials indicate a strategy of integrating reasoning depth, tool use, and conversational quality into a unified model line [^]. Pipeline enhancements include agent-building blocks such as the Responses API, Agents SDK, and AgentKit, which enable complex multi-step workflows [^]. The GPT‑5.5 announcement further emphasizes agentic, long-horizon behavior, demonstrating improved performance on agentic coding and intricate command-line tasks that necessitate planning, iteration, and tool coordination, all vital for exams requiring sustained reasoning [^].

5. What specific question categories in Humanity's Last Exam have caused current frontier models like GPT-4 and Claude 3 to consistently fail, and what do these failures reveal about their core reasoning gaps?

Questions on exam: 2,500 to 3,000 [^][^][^]
Model Accuracy on Exam: Low [^][^][^]
Domains of failure: Deeply domain-specific knowledge, e.g., ancient languages, microanatomy [^]
Humanity's Last Exam challenges frontier models with domain-specific knowledge. This academic benchmark is a closed-ended test featuring between 2,500 and 3,000 expert-vetted questions across dozens of subjects [^][^][^]. It is specifically designed to assess deeply domain-specific knowledge rather than general web-retrieval capabilities [^][^][^]. Although every question has a known, unambiguous solution, frontier models such as GPT-4 and Claude 3 consistently show low accuracy and poor calibration on the exam [^][^][^].
Models struggle with specialized expertise, and detailed failure analysis is lacking. These models frequently fail in domains demanding highly specialized expertise [^]. Examples include translating ancient languages, specifically Palmyrene inscriptions, and identifying microanatomical structures in birds [^]. However, existing research does not provide an official mapping of specific 'Science and Technology question categories' to the consistent failures observed in these models [^][^][^]. Nor is a structured taxonomy of their core reasoning gaps available [^][^][^]. Most sources discuss aggregate low performance and calibration, offering illustrative examples rather than comprehensive category-level failure analytics [^][^][^].
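Calibration here means how closely a model's stated confidence matches how often it is actually right. The sketch below computes a binned root-mean-square calibration error; the equal-width binning and the function name are assumptions for illustration, and the exact procedure used for the benchmark may differ.

```python
import math


def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS gap between stated confidence (0 to 1) and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, 1.0 if ok else 0.0))
    total = sum(len(b) for b in bins)
    squared = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(a for _, a in b) / len(b)
        squared += (len(b) / total) * (mean_conf - accuracy) ** 2
    return math.sqrt(squared)


if __name__ == "__main__":
    # A model that claims 90% confidence on ten questions but gets only one right.
    print(rms_calibration_error([0.9] * 10, [True] + [False] * 9))
```

A model that is confidently wrong, as in the usage example, scores poorly on calibration even before its low accuracy is considered.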

6. How do the research approaches of Google DeepMind and Anthropic differ in their focus on emergent reasoning versus AI safety, and which philosophy is better suited for the challenges posed by Humanity's Last Exam?

DeepMind AGI framework: 10 key cognitive faculties [^][^]
Anthropic AI alignment: Constitutional AI guides AI to be helpful, harmless, and honest [^][^][^][^][^][^]
HLE AI performance: Current leading AI models score quite low [^][^]
Google DeepMind and Anthropic pursue distinct AI research philosophies. DeepMind heavily emphasizes emergent reasoning, aiming for complex, intelligent behaviors from scaled AI systems, with its approach to AGI inspired by neuroscience through hierarchical and multi-task learning [^]. DeepMind proposes measuring AGI progress by deconstructing general intelligence into 10 key cognitive faculties, including reasoning, learning, and metacognition [^][^][^]. To mitigate risks from powerful AGI, DeepMind operates an AGI Safety Council [^][^][^][^][^]. In contrast, Anthropic's primary focus is on AI safety and alignment, striving to build reliable, interpretable, and steerable AI systems [^][^][^][^]. Their hallmark method, "Constitutional AI" (CAI), trains AI systems through self-improvement guided by a predefined "constitution" of human values and principles to ensure they are helpful, harmless, and honest [^][^][^][^][^][^]. Anthropic also explores "Teaching Why," an approach centered on explanation-driven learning, which they believe can significantly improve generalization and foster deeper internal reasoning structures for safer AI deployment [^].
Humanity's Last Exam tests advanced reasoning, a key differentiator for AI. This benchmark is specifically designed to assess true reasoning ability and deep understanding, often requiring graduate-level expertise and multi-modality, rather than mere memorization [^][^][^][^][^]. Current leading AI models generally score low on HLE, highlighting a significant gap in their capacity for complex reasoning and contextual understanding [^][^]. DeepMind's direct emphasis on emergent reasoning, broad cognitive abilities, and universal learning systems aligns closely with HLE's demands for profound understanding and the ability to tackle novel, expert-level problems across diverse academic fields [^][^][^][^][^]. While Anthropic's "Teaching Why" research could indirectly contribute to improved HLE performance by instilling deeper reasoning structures for safety [^], DeepMind's core research objectives appear to be more directly geared towards excelling at the explicit demands of such a comprehensive reasoning benchmark [^][^][^][^][^].

7. What public leaderboards or datasets tracking performance on Humanity's Last Exam are available, and how frequently are they updated by model developers like OpenAI and Google?

Claude Mythos Preview Score: 64.7% [^]
GPT-5.5 Pro Score: 57.2% [^]
Gemini 3.5 Pro Score: 46.9% [^]
Public leaderboards regularly track AI model performance on Humanity's Last Exam. Several prominent platforms monitor and display these results, including those from Scale Labs, Artificial Analysis, LLM Stats, Price Per Token, and Epoch AI [^][^][^][^][^]. The LLM Stats leaderboard is notably updated to reflect the latest model performances, and both the LLM Stats and Price Per Token leaderboards indicated recent updates as of May 20, 2026 [^][^].
Recent updates highlight varied performance from major model developers. Google DeepMind reported scores for its Gemini series, with Gemini 3.5 Flash achieving an academic reasoning score of 40.2% on the full HLE and Gemini 3.5 Pro scoring 46.9%, both as of May 19, 2026 [^]. OpenAI's GPT-5.5 Pro recorded a score of 57.2% on HLE as of May 20, 2026 [^]. Anthropic's Claude Mythos Preview currently leads the LLM Stats leaderboard with a score of 64.7% [^]. Earlier reported results include Gemini 3 Deep Think in February 2026 [^] and Gemini 3 Pro in November 2025 [^].
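The reported scores can be compared against the contract thresholds listed under Available Contracts. This is an illustrative sketch only: actual resolution depends on the source named in the contract rules, not on these third-party leaderboards, and the helper name `buckets_cleared` is an assumption.

```python
# Thresholds (in percent) from the Available Contracts table.
THRESHOLDS = [50, 55, 60, 65, 70, 75, 80, 85, 90]

# Scores (in percent) as reported in this section.
reported = {
    "Claude Mythos Preview": 64.7,
    "GPT-5.5 Pro": 57.2,
    "Gemini 3.5 Pro": 46.9,
    "Gemini 3.5 Flash": 40.2,
}


def buckets_cleared(score, thresholds=THRESHOLDS):
    """Return every 'at least X%' threshold that the given score meets."""
    return [t for t in thresholds if score >= t]


if __name__ == "__main__":
    for model, score in reported.items():
        print(f"{model} ({score}%): clears {buckets_cleared(score) or 'none'}")
```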

8. What is the consensus among AI alignment researchers and benchmark creators regarding the risk of models optimizing for the exam score without achieving genuine understanding before the end of 2026?

HLE Paper Online Publication Date: 28 Jan 2026 [^][^]
Kalshi Market Resolution Date: Dec 31, 2026 [^]
Benchmark Vulnerability: Prominent benchmarks exploitable for near-perfect scores [^]
AI alignment researchers and benchmark creators widely acknowledge the risk that models optimize for exam scores without achieving genuine understanding. This view stems from documented vulnerabilities in existing benchmarks and from general alignment arguments about proxy objectives.
Existing benchmarks are vulnerable to exploitation for high scores. Benchmark evaluators have documented instances where prominent benchmarks can be exploited, enabling models to achieve near-perfect scores without genuinely solving the intended tasks [^]. The problem arises because evaluations are often not designed to withstand optimization aimed at the score rather than the true objective. Paul Christiano further explains how "unaligned/competitive" optimization pressures can emerge from benchmarks, allowing systems to succeed without demonstrating the underlying behavior, thus turning the benchmark into a gameable proxy [^].
Humanity's Last Exam (HLE) authors caution against equating scores with understanding. Its official Nature paper, published online on January 28, 2026, describes the benchmark as resistant to simple internet lookup or database retrieval [^][^]. However, the paper also reports that state-of-the-art large language models exhibit low accuracy and poor calibration on HLE, and it explicitly notes that score gains should not automatically be equated with deeper autonomous understanding [^][^]. The Kalshi prediction market, "Highest score on Humanity's Last Exam before end of year?", pays out on this specific metric rather than on an independently verified notion of genuine understanding, with a resolution condition set before December 31, 2026 [^].

9. What Could Change the Odds

Key Catalysts

Key catalysts include the performance of AI models from major developers, specifically OpenAI GPT and Claude. Polymarket contracts resolve based on whether any OpenAI GPT model reaches a specified threshold on the official HLE leaderboard by June 30, 2026 11:59 PM ET [^]. Similarly, another market resolves on whether a Claude model scores at or above a specified percentage on the official HLE leaderboard by June 30, 2026 11:59 PM ET [^].
Because the Humanity's Last Exam (HLE) leaderboards are updated as new models are evaluated, reported scores can shift [^][^]. The HLE timeline shows that a dynamic fork, HLE-Rolling, was released on 2025-10-08 and that HLE was published in Nature on 2026-01-28, suggesting continued development and potential for new results [^]. Currently, Gemini 3.1 Pro Preview is among the leaders, with scores of about 44.7% reported by Artificial Analysis and 46.44±1.96 on the Scale Labs leaderboard as of a 2026-05-06 update [^][^].
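A score reported with a margin, such as 46.44±1.96, can be read as a range rather than a point. The sketch below treats the margin as a simple symmetric interval (what the margin represents statistically is an assumption here) and reports where that interval sits relative to a threshold; the function names are illustrative.

```python
def interval(score, margin):
    """Return the (low, high) range implied by score ± margin, rounded to 2 decimals."""
    return (round(score - margin, 2), round(score + margin, 2))


def position(score, margin, threshold):
    """Classify the interval as 'above', 'below', or 'straddles' relative to a threshold."""
    low, high = interval(score, margin)
    if low >= threshold:
        return "above"
    if high < threshold:
        return "below"
    return "straddles"


if __name__ == "__main__":
    print(interval(46.44, 1.96))
    print(position(46.44, 1.96, 50))
```

On the Scale Labs figure quoted above, even the upper end of the range falls short of the 50% threshold.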

Key Dates & Catalysts

  • Strike Date: December 31, 2026
  • Expiration: December 31, 2026
  • Closes: December 31, 2026

10. Decision-Flipping Events

  • Trigger: Performance of AI models from major developers, specifically OpenAI GPT and Claude.
  • Trigger: Polymarket contracts resolve based on whether any OpenAI GPT model reaches a specified threshold on the official HLE leaderboard by June 30, 2026 11:59 PM ET [^].
  • Trigger: Another market resolves on whether a Claude model scores at or above a specified percentage on the official HLE leaderboard by June 30, 2026 11:59 PM ET [^].
  • Trigger: Ongoing HLE leaderboard updates and new model evaluations can shift reported scores [^][^].

12. Historical Resolutions

No historical resolution data available for this series.