CUMath: A Benchmark and Evaluation Framework for LLMs on Mathematical Reasoning in Undergraduate Computational Math
Document Type
Poster
Publication Date
10-1-2025
Abstract
Large Language Models (LLMs) perform well on popular math benchmarks but still struggle with fundamental undergraduate tasks such as basic integrals. This points to a diagnostic gap: existing datasets are either trivial, synthetic, or overly advanced, limiting their usefulness for exposing reasoning failures. To close this gap, we introduce CUMath, a benchmark of 2,100 real problems from undergraduate courses in Calculus, Linear Algebra, Differential Equations, and related fields. Each problem includes a step-by-step solution, enabling evaluation of both final answers and intermediate reasoning. Moreover, current evaluations treat accuracy and reasoning separately, overlooking their joint role in problem-solving; we therefore propose a multi-layered evaluation framework that combines automatic metrics with an LLM-as-a-grader pipeline, integrating symbolic encoding and external verification. Using this setup, we evaluate 15 LLMs across a range of prompting strategies. Our results show that even advanced models often misuse symbolic methods and rely on shortcuts, producing polished but flawed solutions. These findings underscore the persistence of inconsistent reasoning and the need for stronger benchmarks, evaluation frameworks, and models with more reliable reasoning. The code and data will be available upon publication.
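
As a rough illustration of the external-verification step mentioned in the abstract, the sketch below checks whether a model's final answer is symbolically equivalent to a reference answer using SymPy. This is a minimal sketch under stated assumptions: the function name, parsing details, and fallback behavior are hypothetical, not the authors' actual pipeline.

# Hypothetical sketch of the symbolic external-verification step.
# Names are illustrative assumptions, not the authors' code.
import sympy
from sympy.parsing.sympy_parser import parse_expr

def symbolically_equivalent(model_answer: str, reference_answer: str) -> bool:
    """Return True if the two expressions simplify to the same thing."""
    try:
        # If the difference simplifies to zero, the answers agree.
        diff = sympy.simplify(parse_expr(model_answer) - parse_expr(reference_answer))
        return diff == 0
    except (sympy.SympifyError, SyntaxError, TypeError):
        # Unparsable output would fall through to the LLM-as-a-grader stage.
        return False

# Example: two algebraically equivalent forms of the same antiderivative.
print(symbolically_equivalent("sin(x)**2/2", "(1 - cos(2*x))/4"))  # True

A purely symbolic check like this catches answers that differ only in form, while clearly wrong or unparsable outputs are deferred to the grading model, which matches the multi-layered design the abstract describes.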
Department
Department of Mathematical Sciences, DePauw University, Greencastle, IN
Project Mentor
Sutthirut Charoenphon
Recommended Citation
Tran, Quyen and Charoenphon, Sutthirut, "CUMath: A Benchmark and Evaluation Framework for LLMs on Mathematical Reasoning in Undergraduate Computational Math" (2025). Annual Student Research Poster Session. 224.
https://scholarship.depauw.edu/srfposters/224
Funding and Acknowledgements
J. William Asher and Melanie J. Norton Endowed Fund in the Sciences