First Proof Challenge Responses

The First Proof Challenge is a benchmark, created by a group of mathematicians, that tests the frontier of AI-assisted mathematical reasoning. The challenge posed a series of research-caliber mathematical problems spanning number theory, combinatorics, analysis, and algebra; the problems were intended to be solvable via model prompting within roughly a week.

We systematically compared frontier model responses across the full problem set, evaluating how different large language models approach, formalize, and attempt to solve research-level mathematics.
Our analysis documents the strategies that succeed and the characteristic failure modes that emerge across model families, giving a structured account of where current AI reasoning stands relative to genuine mathematical research.
The work was conducted in collaboration with Param Thakkar and is, to our knowledge, the first public, systematic comparative evaluation of LLM performance on the First Proof problem set.

A preprint of this work is available here.