# Highest score on Humanity's Last Exam before Dec 31, 2026?

Before Dec 31, 2026

Updated: May 21, 2026

Category: Science and Technology

Tags: AI

HTML: /markets/science-and-technology/ai/highest-score-on-humanity-s-last-exam-before-dec-31-2026/

## Short Answer

**Both the model (87.1%) and the market (87.0%) assign a high probability to the highest score on Humanity's Last Exam reaching at least 50%.** The report attributes this agreement to rapidly rising AI scores on the exam and to architectural work aimed at improved reasoning.

## Key Claims (January 2026)

- AI models consistently achieve rapidly increasing scores on Humanity's Last Exam.
- Architectural innovations from major developers target improved reasoning capabilities.
- Humanity's Last Exam reveals frontier model gaps in domain-specific reasoning.
- Performance of OpenAI and Claude models is a key market catalyst.
- Public leaderboards regularly track AI model performance on Humanity's Last Exam.
- Researchers acknowledge that models may optimize exam scores without genuine understanding.

## Executive Verdict

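**Key takeaway.** The model's 87.1% probability versus the 87-cent market price on the "At least 50%" contract implies a gap of +0.1 percentage points, so the model and the market are effectively in agreement. Both reflect rapid increases in AI exam scores.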

### Who Wins and Why

| Outcome | Market | Model | Why |
| --- | --- | --- | --- |
| At least 90% | 7.0% | 8.6% | Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam. |
| At least 85% | 6.0% | 8.6% | Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam. |
| At least 50% | 87.0% | 87.1% | Ongoing AI architectural innovations are rapidly increasing scores on Humanity's Last Exam. |

## Model vs Market

| Outcome | Market Probability | Octagon Model Probability |
| --- | --- | --- |
| At least 50% | 87.0% | 87.1% |
| At least 55% | 63.0% | 63.9% |
| At least 60% | 56.0% | 57.3% |
| At least 65% | 39.0% | 41.3% |
| At least 70% | 0.0% | 23.8% |
| At least 75% | 21.0% | 23.8% |
| At least 80% | 11.0% | 13.1% |
| At least 85% | 6.0% | 8.6% |
| At least 90% | 7.0% | 8.6% |

The 0.0% market figure for "At least 70%" reflects a last price with no recorded volume; the Market Data table lists a 30% bid and a 31% ask for that contract.

- Expiration: December 31, 2026
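
The gap between model and market can be recomputed directly from the table. The following is a minimal sketch, assuming the gap is simply the model probability minus the market probability in percentage points; the names `quotes` and `edge_points` are illustrative and not part of any Octagon or exchange API.

```python
# Probabilities in percent, copied from the Model vs Market table:
# threshold -> (market probability, model probability)
quotes = {
    50: (87.0, 87.1),
    55: (63.0, 63.9),
    60: (56.0, 57.3),
    65: (39.0, 41.3),
    70: (0.0, 23.8),
    75: (21.0, 23.8),
    80: (11.0, 13.1),
    85: (6.0, 8.6),
    90: (7.0, 8.6),
}


def edge_points(market: float, model: float) -> float:
    """Model probability minus market probability, in percentage points."""
    return round(model - market, 1)


edges = {threshold: edge_points(*pair) for threshold, pair in quotes.items()}

for threshold in sorted(edges):
    print(f"At least {threshold}%: {edges[threshold]:+.1f} pp")
```

The large gap at the 70% threshold comes from the 0.0% market figure, so it is better read as a data artifact than as a trading signal.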

## Market Behavior & Price Dynamics

This prediction market has shown a rapid upward move. The price opened at 79.0% and immediately jumped to 87.0%, which matches the last price of the "At least 50%" contract, and it has since stabilized there. Early reports on Humanity's Last Exam described models scoring between 3% and 13%, whereas the leaderboard figures cited elsewhere in this report range from 44.7% to 64.7% as of May 2026. The market's high probability is consistent with traders pricing in continued rapid advances in complex reasoning before the market's 2026 resolution date.

The price action shows two levels: the 79.0% open, and 87.0%, which has acted as both a ceiling and the current support. A total of 350 contracts have been traded, but the sampled data points show no volume, which suggests that trading is sporadic. Overall, the chart reflects high market conviction that the threshold will be reached, implying a belief that any remaining performance gap will be closed well before the end of 2026.

## Contract Snapshot

Each contract resolves to "Yes" if any language model achieves at least the contract's threshold accuracy on Humanity's Last Exam, as verified by agi.safe.ai, before December 31, 2026. The snapshot describes the 60% threshold; the tables above list thresholds from 50% to 90%. If the condition is not met by the deadline, the contract resolves to "No." The market closes early if the event occurs; otherwise it expires on December 31, 2026. Insider trading by those employed by source agencies or holding material non-public information is prohibited.

## Market Discussion

As of May 20, 2026, the leaderboards cited here report top scores of 44.7% and 46.4% on Humanity's Last Exam (HLE), with the figure depending on the leaderboard [[^]](https://pricepertoken.com/leaderboards/benchmark/hle)[[^]](https://epoch.ai/benchmarks/hle). The benchmark was developed to evaluate advanced AI on 2,500 challenging, graduate-level questions requiring multi-step reasoning, and it aims to set a higher bar now that earlier AI benchmarks have become saturated [[^]](https://www.digitalbricks.ai/blog-posts/humanitys-last-exam---the-ultimate-test-of-ais-reasoning)[[^]](https://en.wikipedia.org/wiki/Humanity%27s_Last_Exam)[[^]](https://epoch.ai/benchmarks/hle)[[^]](https://labs.scale.com/leaderboard/humanitys_last_exam)[[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/)[[^]](https://www.datacamp.com/blog/what-is-humanitys-last-exam-ai-benchmark). Scores for leading AI models were significantly lower in early 2025 and have climbed dramatically since then [[^]](https://plus.sydney.edu.au/ai-is-failing-humanitys-last-exam-so-what-does-that-mean-for-machine-intelligence)[[^]](https://www.reddit.com/r/artificial/comments/1igsd7y/when_i_last_wrote_about_humanitys_last_exam_the/).

## Market Data

| Contract | Yes Bid | Yes Ask | Last Price | Volume | Open Interest |
| --- | --- | --- | --- | --- | --- |
| At least 50% | 83% | 84% | 87% | $350 | $350 |
| At least 55% | 62% | 63% | 63% | $8.73 | $8.73 |
| At least 60% | 55% | 56% | 56% | $17.32 | $17.32 |
| At least 65% | 39% | 40% | 39% | $7.96 | $7.96 |
| At least 70% | 30% | 31% | 0% | $0 | $0 |
| At least 75% | 19% | 20% | 21% | $166 | $166 |
| At least 80% | 14% | 15% | 11% | $350 | $350 |
| At least 85% | 6% | 7% | 6% | $620 | $620 |
| At least 90% | 6% | 7% | 7% | $751 | $751 |
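
Because the contracts are nested ("at least 55%" implies "at least 50%"), prices should not rise as the threshold rises. The sketch below checks that property for the rows above, once with last prices and once with bid/ask midpoints. The helper name `violations` is illustrative, and using the midpoint as a fair-value estimate is an assumption, not an exchange rule.

```python
# (threshold, yes_bid, yes_ask, last_price) in percent, from the Market Data table.
rows = [
    (50, 83, 84, 87),
    (55, 62, 63, 63),
    (60, 55, 56, 56),
    (65, 39, 40, 39),
    (70, 30, 31, 0),
    (75, 19, 20, 21),
    (80, 14, 15, 11),
    (85, 6, 7, 6),
    (90, 6, 7, 7),
]


def violations(prices):
    """Return (lower, higher) threshold pairs where the higher threshold is priced above the lower."""
    found = []
    for (t_lo, p_lo), (t_hi, p_hi) in zip(prices, prices[1:]):
        if p_hi > p_lo:
            found.append((t_lo, t_hi))
    return found


last_prices = [(t, last) for t, _, _, last in rows]
mid_quotes = [(t, (bid + ask) / 2) for t, bid, ask, _ in rows]

print("last-price violations:", violations(last_prices))
print("mid-quote violations:", violations(mid_quotes))
```

On these rows the last prices are out of order at 70%/75% and 85%/90%, while the bid/ask midpoints are consistently ordered, which suggests the stale last prices rather than the live quotes explain the oddities in the probability tables.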

## What architectural innovations are OpenAI's and Google DeepMind's 2025-2026 model pipelines expected to introduce that could significantly boost performance on Humanity's Last Exam?

| Item | Detail |
| --- | --- |
| DeepMind Titans Architecture | Explicitly designed for test-time long-term memory [[^]](https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/) |
| OpenAI Pipeline Innovations | Integrates reasoning depth, tool use, and agentic, long-horizon behavior [[^]](https://developers.openai.com/blog/openai-for-developers-2025)[[^]](https://openai.com/index/introducing-gpt-5-5/) |
| DeepMind Memory Mechanism | Surprise/retention mechanism and adaptive forgetting [[^]](https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/)[[^]](https://arxiv.org/pdf/2501.00663) |

**OpenAI and Google DeepMind are both advancing AI for sustained reasoning**

OpenAI's pipeline work focuses on integrating reasoning depth, effective tool use, and agentic, long-horizon behavior into its flagship models, moving toward a single model line that combines reasoning, tool use, and conversational quality [[^]](https://developers.openai.com/blog/openai-for-developers-2025)[[^]](https://openai.com/index/introducing-gpt-5-5/). Google DeepMind's Titans architecture, by contrast, is engineered for test-time long-term memory, which could help on exams requiring sustained reasoning [[^]](https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/)[[^]](https://arxiv.org/pdf/2501.00663).

Google DeepMind's Titans architecture specializes in long-term memory management. Incorporating MIRAS, it features a long-term memory module that uses attention to decide whether to include stored summaries [[^]](https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/)[[^]](https://arxiv.org/pdf/2501.00663). This architecture also employs a “surprise” or retention mechanism to differentiate between momentary and past surprise, and utilizes adaptive forgetting with weight decay to manage finite memory capacity over extremely long sequences, which is crucial for tasks demanding sustained reasoning [[^]](https://research.google/blog/titans-miras-helping-ai-have-long-term-memory/)[[^]](https://arxiv.org/pdf/2501.00663).
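
The surprise and forgetting mechanism described above can be sketched in a few lines. This is a simplified illustration, not DeepMind's implementation: it assumes a linear associative memory (a single matrix `M`) instead of a deep memory module, fixed constants `eta`, `theta`, and `alpha` instead of learned, data-dependent gates, and a squared-error loss. All names are hypothetical.

```python
import numpy as np


def memory_step(M, S, key, value, eta=0.9, theta=0.1, alpha=0.01):
    """One test-time update of a linear associative memory M.

    The gradient of ||M @ key - value||^2 plays the role of momentary surprise,
    S carries past surprise as momentum, and alpha is a weight-decay
    style forgetting rate.
    """
    error = M @ key - value
    momentary_surprise = 2.0 * np.outer(error, key)  # gradient w.r.t. M
    S_new = eta * S - theta * momentary_surprise     # blend past and momentary surprise
    M_new = (1.0 - alpha) * M + S_new                # forget a little, then write
    return M_new, S_new


rng = np.random.default_rng(0)
dim = 4
key = rng.normal(size=dim)
key /= np.linalg.norm(key)       # unit-norm key keeps the update stable
value = rng.normal(size=dim)

M = np.zeros((dim, dim))
S = np.zeros((dim, dim))
initial_error = np.linalg.norm(M @ key - value)

# Present the same (key, value) association repeatedly at "test time".
for _ in range(300):
    M, S = memory_step(M, S, key, value)

final_error = np.linalg.norm(M @ key - value)
print(f"recall error: {initial_error:.3f} -> {final_error:.5f}")
```

The recall error does not reach exactly zero because the forgetting term keeps pulling the memory toward zero. That trade-off is what lets a finite memory stay usable over very long sequences.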

OpenAI enhances agentic capabilities for complex multi-step reasoning. Its 2025 developer materials indicate a strategy of integrating reasoning depth, tool use, and conversational quality into a unified model line [[^]](https://developers.openai.com/blog/openai-for-developers-2025). Pipeline enhancements include agent-building blocks such as the Responses API, Agents SDK, and AgentKit, which enable complex multi-step workflows [[^]](https://developers.openai.com/blog/openai-for-developers-2025). The GPT-5.5 announcement further emphasizes agentic, long-horizon behavior, reporting improved performance on agentic coding and intricate command-line tasks that require planning, iteration, and tool coordination, skills that are also relevant to exams requiring sustained reasoning [[^]](https://openai.com/index/introducing-gpt-5-5/).

## What specific question categories in Humanity's Last Exam have caused current frontier models like GPT-4 and Claude 3 to consistently fail, and what do these failures reveal about their core reasoning gaps?

| Item | Detail |
| --- | --- |
| Questions on exam | 2,500 to 3,000 [[^]](https://agi.safe.ai/)[[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/)[[^]](https://www.artificialanalysis.ai/evaluations/humanitys-last-exam) |
| Model accuracy on exam | Low [[^]](https://agi.safe.ai/)[[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/)[[^]](https://www.artificialanalysis.ai/evaluations/humanitys-last-exam) |
| Domains of failure | Deeply domain-specific knowledge, e.g., ancient languages, microanatomy [[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/) |

**Humanity's Last Exam challenges frontier models with domain-specific knowledge**

This academic benchmark is a closed-ended test of between 2,500 and 3,000 expert-vetted questions across dozens of subjects [[^]](https://agi.safe.ai/)[[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/)[[^]](https://www.artificialanalysis.ai/evaluations/humanitys-last-exam). It is designed to assess deeply domain-specific knowledge rather than general web-retrieval capabilities [[^]](https://agi.safe.ai/)[[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/)[[^]](https://www.artificialanalysis.ai/evaluations/humanitys-last-exam). Although the questions have known, unambiguous solutions, the cited sources describe frontier models such as GPT-4 and Claude 3 as showing low accuracy and poor calibration [[^]](https://agi.safe.ai/)[[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/)[[^]](https://www.artificialanalysis.ai/evaluations/humanitys-last-exam).

The sources illustrate where models struggle but do not offer a systematic failure analysis. Models frequently fail in domains demanding highly specialized expertise [[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/). Examples include translating ancient languages, specifically Palmyrene inscriptions, and identifying microanatomical structures in birds [[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/). However, the sources do not provide an official mapping of specific question categories to the consistent failures observed in these models [[^]](https://agi.safe.ai/)[[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/)[[^]](https://www.artificialanalysis.ai/evaluations/humanitys-last-exam), nor a structured taxonomy of their core reasoning gaps [[^]](https://agi.safe.ai/)[[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/)[[^]](https://www.artificialanalysis.ai/evaluations/humanitys-last-exam). Most discuss aggregate low performance and calibration, offering illustrative examples rather than comprehensive category-level failure analytics [[^]](https://agi.safe.ai/)[[^]](https://stories.tamu.edu/news/2026/02/25/dont-panic-humanitys-last-exam-has-begun/)[[^]](https://www.artificialanalysis.ai/evaluations/humanitys-last-exam).
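
Calibration, as opposed to accuracy, measures whether a model's stated confidence matches how often it is right. The sketch below shows one common formulation, a binned root-mean-square calibration error. It is an assumption about how such a metric can be computed, not the benchmark's official scoring code, and the two toy inputs are hypothetical, not HLE results.

```python
def rms_calibration_error(confidences, correct, n_bins=10):
    """Binned RMS gap between stated confidence (0..1) and observed accuracy."""
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[index].append((conf, is_correct))

    total = len(confidences)
    weighted_sq_gap = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        weighted_sq_gap += (len(items) / total) * (avg_conf - accuracy) ** 2
    return weighted_sq_gap ** 0.5


# Hypothetical toy inputs: both models answer 2 of 10 questions correctly.
outcomes = [True] * 2 + [False] * 8
overconfident = rms_calibration_error([0.9] * 10, outcomes)  # claims 90%, gets 20%
calibrated = rms_calibration_error([0.2] * 10, outcomes)     # claims 20%, gets 20%

print(f"overconfident: {overconfident:.2f}, calibrated: {calibrated:.2f}")
```

Both toy models have the same accuracy, which is why the sources report calibration as a separate signal: a low score combined with high stated confidence is a distinct failure from a low score alone.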

## How do the research approaches of Google DeepMind and Anthropic differ in their focus on emergent reasoning versus AI safety, and which philosophy is better suited for the challenges posed by Humanity's Last Exam?

| Item | Detail |
| --- | --- |
| DeepMind AGI framework | 10 key cognitive faculties [[^]](https://singularityhub.com/2026/03/20/google-deepmind-plans-to-track-agi-progress-with-these-10-traits-of-general-intelligence/)[[^]](https://blog.google/innovation-and-ai/models-and-research/google-deepmind/measuring-agi-cognitive-framework/) |
| Anthropic AI alignment | Constitutional AI guides AI to be helpful, harmless, and honest [[^]](https://constitutional.ai/)[[^]](https://toloka.ai/blog/constitutional-ai-explained/)[[^]](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback)[[^]](https://medium.com/@ramdhanhdy/constitutional-ai-how-anthropic-teaches-claude-right-from-wrong-6caeb351c5e9)[[^]](https://www-cdn.anthropic.com/7512771452629584566b6303311496c262da1006/Anthropic_ConstitutionalAI_v2.pdf)[[^]](https://bytebridge.medium.com/anthropic-pioneering-ai-safety-and-innovation-28da9172a50d) |
| HLE AI performance | Current leading AI models score quite low [[^]](https://medium.com/write-a-catalyst/the-2-7-wall-why-ai-is-failing-humanitys-last-exam-94f717ae82c7)[[^]](https://edtechhub.org/2025/03/12/humanitys-last-exam-and-education-in-ai/) |

**Google DeepMind and Anthropic pursue distinct AI research philosophies**

DeepMind heavily emphasizes emergent reasoning, aiming for complex, intelligent behaviors from scaled AI systems, with an approach to AGI inspired by neuroscience through hierarchical and multi-task learning [[^]](https://www.geeksforgeeks.org/artificial-intelligence/understanding-deepminds-approach-to-artificial-general-intelligence-agi/). DeepMind proposes measuring AGI progress by deconstructing general intelligence into 10 key cognitive faculties, including reasoning, learning, and metacognition [[^]](https://www.geeksforgeeks.org/artificial-intelligence/understanding-deepminds-approach-to-artificial-general-intelligence-agi/)[[^]](https://singularityhub.com/2026/03/20/google-deepmind-plans-to-track-agi-progress-with-these-10-traits-of-general-intelligence/)[[^]](https://blog.google/innovation-and-ai/models-and-research/google-deepmind/measuring-agi-cognitive-framework/). To mitigate risks from powerful AGI, DeepMind operates an AGI Safety Council [[^]](https://deepmind.google/blog/taking-a-responsible-path-to-agi/)[[^]](https://deepmind.google/responsibility-and-safety/)[[^]](https://deepmind.google/blog/strengthening-our-frontier-safety-framework/)[[^]](https://deepmind.google/blog/introducing-the-frontier-safety-framework/)[[^]](https://deepmind.google/blog/protecting-people-from-harmful-manipulation/).

In contrast, Anthropic's primary focus is AI safety and alignment, striving to build reliable, interpretable, and steerable AI systems [[^]](https://constitutional.ai/)[[^]](https://bytebridge.medium.com/anthropic-pioneering-ai-safety-and-innovation-28da9172a50d)[[^]](https://www.lesswrong.com/posts/xhKr5KtvdJRssMeJ3/anthropic-s-core-views-on-ai-safety)[[^]](https://www.anthropic.com/research). Its hallmark method, "Constitutional AI" (CAI), trains AI systems through self-improvement guided by a predefined "constitution" of human values and principles so that they are helpful, harmless, and honest [[^]](https://constitutional.ai/)[[^]](https://toloka.ai/blog/constitutional-ai-explained/)[[^]](https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback)[[^]](https://medium.com/@ramdhanhdy/constitutional-ai-how-anthropic-teaches-claude-right-from-wrong-6caeb351c5e9)[[^]](https://www-cdn.anthropic.com/7512771452629584566b6303311496c262da1006/Anthropic_ConstitutionalAI_v2.pdf)[[^]](https://bytebridge.medium.com/anthropic-pioneering-ai-safety-and-innovation-28da9172a50d). Anthropic also explores "Teaching Why," an approach centered on explanation-driven learning, which it believes can significantly improve generalization and foster deeper internal reasoning structures for safer AI deployment [[^]](https://www.youtube.com/watch?v=jcsrEoGi_CA).

Humanity's Last Exam tests advanced reasoning, a key differentiator for AI. The benchmark is designed to assess reasoning ability and deep understanding, often requiring graduate-level expertise and multi-modality, rather than mere memorization [[^]](https://en.wikipedia.org/wiki/Humanity%27s_Last_Exam)[[^]](https://www.digitalbricks.ai/blog-posts/humanitys-last-exam---the-ultimate-test-of-ais-reasoning)[[^]](https://artificialanalysis.ai/evaluations/humanitys-last-exam)[[^]](https://medium.com/write-a-catalyst/the-2-7-wall-why-ai-is-failing-humanitys-last-exam-94f717ae82c7)[[^]](https://edtechhub.org/2025/03/12/humanitys-last-exam-and-education-in-ai/). Earlier coverage described leading AI models as scoring low on HLE, highlighting a significant gap in their capacity for complex reasoning and contextual understanding [[^]](https://medium.com/write-a-catalyst/the-2-7-wall-why-ai-is-failing-humanitys-last-exam-94f717ae82c7)[[^]](https://edtechhub.org/2025/03/12/humanitys-last-exam-and-education-in-ai/). DeepMind's direct emphasis on emergent reasoning, broad cognitive abilities, and universal learning systems aligns closely with HLE's demands for deep understanding and the ability to tackle novel, expert-level problems across diverse academic fields [[^]](https://www.geeksforgeeks.org/artificial-intelligence/understanding-deepminds-approach-to-artificial-general-intelligence-agi/)[[^]](https://singularityhub.com/2026/03/20/google-deepmind-plans-to-track-agi-progress-with-these-10-traits-of-general-intelligence/)[[^]](https://blog.google/innovation-and-ai/models-and-research/google-deepmind/measuring-agi-cognitive-framework/)[[^]](https://www.digitalbricks.ai/blog-posts/humanitys-last-exam---the-ultimate-test-of-ais-reasoning)[[^]](https://medium.com/write-a-catalyst/the-2-7-wall-why-ai-is-failing-humanitys-last-exam-94f717ae82c7).

Anthropic's "Teaching Why" research could indirectly improve HLE performance by instilling deeper reasoning structures for safety [[^]](https://www.youtube.com/watch?v=jcsrEoGi_CA). DeepMind's core research objectives nonetheless appear more directly geared toward the explicit demands of such a comprehensive reasoning benchmark [[^]](https://www.geeksforgeeks.org/artificial-intelligence/understanding-deepminds-approach-to-artificial-general-intelligence-agi/)[[^]](https://singularityhub.com/2026/03/20/google-deepmind-plans-to-track-agi-progress-with-these-10-traits-of-general-intelligence/)[[^]](https://blog.google/innovation-and-ai/models-and-research/google-deepmind/measuring-agi-cognitive-framework/)[[^]](https://www.digitalbricks.ai/blog-posts/humanitys-last-exam---the-ultimate-test-of-ais-reasoning)[[^]](https://medium.com/write-a-catalyst/the-2-7-wall-why-ai-is-failing-humanitys-last-exam-94f717ae82c7).

## What public leaderboards or datasets tracking performance on Humanity's Last Exam are available, and how frequently are they updated by model developers like OpenAI and Google?

| Model | HLE Score |
| --- | --- |
| Claude Mythos Preview | 64.7% [[^]](https://llm-stats.com/benchmarks/humanity's-last-exam) |
| GPT-5.5 Pro | 57.2% [[^]](https://llm-stats.com/benchmarks/humanity's-last-exam) |
| Gemini 3.5 Pro | 46.9% [[^]](https://deepmind.google/models/model-cards/gemini-3-5-flash/) |

**Public leaderboards regularly track AI model performance on Humanity's Last Exam**

Several platforms monitor and display these results, including Scale Labs, Artificial Analysis, LLM Stats, Price Per Token, and Epoch AI [[^]](https://labs.scale.com/leaderboard/humanitys_last_exam)[[^]](https://artificialanalysis.ai/evaluations/humanitys-last-exam)[[^]](https://llm-stats.com/benchmarks/humanity's-last-exam)[[^]](https://pricepertoken.com/leaderboards/benchmark/hle)[[^]](https://epoch.ai/benchmarks/hle). Both the LLM Stats and Price Per Token leaderboards showed updates as of May 20, 2026 [[^]](https://llm-stats.com/benchmarks/humanity's-last-exam)[[^]](https://pricepertoken.com/leaderboards/benchmark/hle).

Recent updates highlight varied performance from major model developers. Google DeepMind reported scores for its Gemini series, with Gemini 3.5 Flash achieving an academic reasoning score of 40.2% on the full HLE and Gemini 3.5 Pro scoring 46.9%, both as of May 19, 2026 [[^]](https://deepmind.google/models/model-cards/gemini-3-5-flash/). OpenAI's GPT-5.5 Pro recorded a score of 57.2% on HLE as of May 20, 2026 [[^]](https://llm-stats.com/benchmarks/humanity's-last-exam). Anthropic's Claude Mythos Preview currently leads the LLM Stats leaderboard with a score of 64.7% [[^]](https://llm-stats.com/benchmarks/humanity's-last-exam). Earlier results include Gemini 3 Deep Think in February 2026 [[^]](https://www.indiatoday.in/technology/news/story/google-gemini-3-deep-think-ai-scores-passing-marks-in-humanitys-last-exam-crushes-toughest-benchmarks-2867700-2026-02-13) and Gemini 3 Pro in November 2025 [[^]](https://blog.google/products-and-platforms/products/gemini/gemini-3/).

## What is the consensus among AI alignment researchers and benchmark creators regarding the risk of models optimizing for the exam score without achieving genuine understanding before the end of 2026?

| Item | Detail |
| --- | --- |
| HLE paper online publication date | 28 Jan 2026 [[^]](https://www.nature.com/articles/s41586-025-09962-4)[[^]](https://preview-www.nature.com/articles/s41586-025-09962-4.pdf) |
| Kalshi market resolution date | Dec 31, 2026 [[^]](https://kalshi.com/markets/kxlastexam/humanitys-last-exam/kxlastexam-26dec31) |
| Benchmark vulnerability | Prominent benchmarks exploitable for near-perfect scores [[^]](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/) |

**AI alignment researchers widely acknowledge the significant risk of models optimizing for exam scores without achieving genuine understanding**

The concern, shared by AI alignment researchers and benchmark creators, stems from documented vulnerabilities in existing benchmarks and from general AI alignment arguments about proxy objectives.

Existing benchmarks are vulnerable to exploitation for high scores. Benchmark evaluators have documented instances where prominent benchmarks can be exploited, enabling models to achieve near-perfect scores without genuinely solving the intended tasks [[^]](https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/). This problem arises because evaluations are often not designed to withstand score optimization over the true objective. Paul Christiano further explains how "unaligned/competitive" optimization pressures can emerge from benchmarks, allowing systems to succeed without demonstrating the underlying behavior, thus turning the benchmark into a gameable proxy [[^]](https://ai-alignment.com/an-unaligned-benchmark-b49ad992940b).

Humanity's Last Exam (HLE) authors caution against equating scores with understanding. The official Nature paper, published online on January 28, 2026, describes the benchmark as resistant to simple internet lookup or database retrieval [[^]](https://www.nature.com/articles/s41586-025-09962-4)[[^]](https://preview-www.nature.com/articles/s41586-025-09962-4.pdf). The paper also reports that state-of-the-art large language models exhibit low accuracy and calibration on HLE, and it explicitly notes that score gains should not automatically be equated with deeper autonomous understanding [[^]](https://www.nature.com/articles/s41586-025-09962-4)[[^]](https://preview-www.nature.com/articles/s41586-025-09962-4.pdf). The Kalshi prediction market, "Highest score on Humanity's Last Exam before end of year?", pays out on the score alone rather than on any independently verified notion of genuine understanding, with a resolution condition set before December 31, 2026 [[^]](https://kalshi.com/markets/kxlastexam/humanitys-last-exam/kxlastexam-26dec31).

## What Could Change the Odds

**Key catalysts include the performance of AI models from major developers, specifically OpenAI GPT and Claude.** Polymarket contracts resolve based on whether any OpenAI GPT model reaches a specified threshold on the official HLE leaderboard by June 30, 2026 11:59 PM ET [[^]](https://polymarket.com/event/openai-gpt-score-on-humanitys-last-exam-by-june-30). Similarly, another market resolves on whether a Claude model scores at or above a specified percentage on the official HLE leaderboard by June 30, 2026 11:59 PM ET [[^]](https://www.lines.com/prediction-markets/tech/anthropic-claude-score-on-humanitys-last-exam-by-june-30).

**The dynamic nature of the Humanity's Last Exam (HLE) leaderboards means that ongoing updates and new model evaluations can lead to score shifts [[^]](https://labs.scale.com/leaderboard/humanitys_last_exam)[[^]](https://lastexam.ai/?ueid=ecd0511fe8770f943568c96e37600be1).** The HLE timeline indicates that a dynamic fork, HLE-Rolling, was released on 2025-10-08 and that HLE was published in Nature on 2026-01-28, suggesting continued development and potential for new results [[^]](https://lastexam.ai/?ueid=ecd0511fe8770f943568c96e37600be1). Gemini 3.1 Pro Preview is currently among the leaders, with a score of about 44.7% reported by Artificial Analysis and 46.44±1.96 on the Scale Labs leaderboard as of a 2026-05-06 update [[^]](https://artificialanalysis.ai/evaluations/humanitys-last-exam)[[^]](https://labs.scale.com/leaderboard/humanitys_last_exam).
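
The ±1.96 attached to the Scale Labs score can be sanity-checked with a normal-approximation 95% interval for a proportion. This is a sketch under stated assumptions: that the interval is computed this way and that the evaluation covers 2,500 questions (the count cited earlier in this report). The leaderboard's actual method is not described in the source, and `ci_half_width` is an illustrative name.

```python
import math


def ci_half_width(accuracy_pct: float, n_questions: int, z: float = 1.96) -> float:
    """Half-width, in percentage points, of a normal-approximation interval."""
    p = accuracy_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n_questions)


score = 46.44
half_width = ci_half_width(score, n_questions=2500)  # 2,500 questions is an assumption
upper = score + half_width

print(f"{score} +/- {half_width:.2f} (upper bound {upper:.2f})")
print("upper bound clears 50%:", upper >= 50.0)
```

Under these assumptions the half-width comes out close to the reported 1.96 points, and the upper end of the interval stays below the 50% threshold, so this particular leaderboard entry would not by itself satisfy the lowest contract.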

## Key Dates & Catalysts

- **Strike Date:** December 31, 2026
- **Expiration:** December 31, 2026
- **Closes:** December 31, 2026

## Decision-Flipping Events

- Key catalysts include the performance of AI models from major developers, specifically OpenAI GPT and Claude.
- Polymarket contracts resolve based on whether any OpenAI GPT model reaches a specified threshold on the official HLE leaderboard by June 30, 2026 11:59 PM ET.
- Similarly, another market resolves on whether a Claude model scores at or above a specified percentage on the official HLE leaderboard by June 30, 2026 11:59 PM ET.
- The dynamic nature of the Humanity's Last Exam (HLE) leaderboards means that ongoing updates and new model evaluations can lead to score shifts.

## Historical Resolutions

No historical resolution data available for this series.

## Disclaimer

This content is for informational and educational purposes only and does not constitute financial, investment, legal, or trading advice.
Prediction markets involve risk of loss. Past performance does not guarantee future results.
We are not affiliated with Kalshi or any prediction market platform. Market data may be delayed or incomplete.

### Data Sources & Model Transparency

**Data Sources:** Octagon Deep Research aggregates information from multiple sources including news, filings, and market data.

**Freshness:** Analysis is generated periodically and may not reflect the latest developments. Verify critical information from primary sources.

## Attribution Policy

When quoting, summarizing, or reproducing Octagon AI content, attribute it to Octagon AI and link to the Octagon source URL: https://octagonai.co/markets/science-and-technology/ai/highest-score-on-humanity-s-last-exam-before-dec-31-2026
If a specific page was used, cite that page rather than only the site homepage.