CrowdMath: A Dataset of Crowdsourced Mathematical Research Discussions
Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers or complete proofs, missing collaborative open-problem solving. CrowdMath is a dataset of 164 expert-annotated progress chains from the MIT PRIMES-AoPS CrowdMath program (2016-2025). Each chain tracks multi-participant forum discussions from problem statement to completed proof, with posts labeled by functional roles. Six frontier models achieve 83-88% accuracy on next-post prediction but only 0.42 macro-F1 on post-role classification, highlighting a gap in understanding collaborative mathematical progress.
[Submitted on 2 Jun 2026]
Authors: Sherin Muckatira and 5 other authors
Abstract: Large language models have made substantial progress on mathematical reasoning, but existing benchmarks typically evaluate well-specified problems with final answers, step-by-step solutions, or complete proofs. They do not capture collaborative open-problem solving: a setting in which participants propose partial arguments, identify gaps or errors in prior steps, repair flawed reasoning, and gradually synthesize incremental contributions into a proof. We introduce CrowdMath, a dataset of 164 expert-annotated progress chains from the MIT PRIMES-Art of Problem Solving (AoPS) CrowdMath program (2016-2025), a collaborative research initiative whose discussions have led to peer-reviewed publications. Each chain traces a multi-participant forum discussion from an open-problem statement to a completed proof. Posts are labeled by their functional roles in the evolving solution process, including partial progress, proof completion, erroneous reasoning, and error identification. We define evaluation tasks and benchmark six frontier models. Models achieve 83-88% accuracy on next-post prediction, suggesting that they can follow the local flow of mathematical discussion. However, they struggle to identify the functional significance of individual contributions, with the best model achieving only 0.42 macro-F1 on post-role classification. CrowdMath exposes a gap between solving well-specified mathematical problems and understanding collaborative mathematical progress as it unfolds.
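The post-role result is reported as macro-F1, which averages per-role F1 scores with equal weight, so a classifier that never predicts a rare role is penalized however often it gets the common role right. The sketch below illustrates that property on a toy example. It is not the paper's evaluation code: the role identifiers are adapted from the role names in the abstract, and the gold and predicted labels are invented for illustration, including the assumption that one role dominates.

```python
# Role identifiers adapted from the abstract; the exact label set is an assumption.
ROLES = [
    "partial_progress",
    "proof_completion",
    "erroneous_reasoning",
    "error_identification",
]


def accuracy(gold, pred):
    """Fraction of posts whose predicted role matches the gold role."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1; a label with no TP, FP, or FN scores 0."""
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(labels)


# Hypothetical chain of 10 posts: 7 partial-progress posts, one of each other role.
gold = ["partial_progress"] * 7 + [
    "proof_completion",
    "erroneous_reasoning",
    "error_identification",
]
# A degenerate classifier that always predicts the majority role.
pred = ["partial_progress"] * 10

print(f"accuracy = {accuracy(gold, pred):.2f}")
print(f"macro-F1 = {macro_f1(gold, pred, ROLES):.2f}")
```

On this toy input the majority-role classifier is right on 7 of 10 posts, yet its macro-F1 is far lower because three of the four roles receive an F1 of zero.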
Comments: 16 pages, 4 figures
Subjects: Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2606.06526 [cs.AI]
(or arXiv:2606.06526v1 [cs.AI] for this version)
https://doi.org/10.48550/arXiv.2606.06526
arXiv-issued DOI via DataCite
Submission history
From: Sherin Muckatira
[v1] Tue, 2 Jun 2026 20:38:39 UTC (1,074 KB)