CUMath: A Benchmark and Evaluation Framework for LLMs on Mathematical Reasoning in Undergraduate Computational Math

CUMath: A Benchmark and Evaluation Framework for LLMs on Mathematical Reasoning in Undergraduate Computational Math

Files

Document Type

Poster

Publication Date

10-1-2025

Abstract

Large Language Models (LLMs) perform well on popular math benchmarks but still struggle with fundamental undergraduate tasks such as basic integrals. This suggests a diagnostic gap: existing datasets are either trivial, synthetic, or overly advanced, limiting their usefulness for exposing reasoning failures. To address this, we introduce CUMath, a benchmark of 2,100 real problems from undergraduate courses in Calculus, Linear Algebra, Differential Equations, and related fields. Each problem includes step-by-step solutions, enabling evaluation of both final answers and intermediate reasoning. Moreover, current evaluations treat accuracy and reasoning separately, overlooking their joint role in problem-solving. To address this, we propose a multi-layered evaluation framework that combines automatic metrics with an LLM-as-a-grader pipeline, integrating symbolic encoding and external verification. Using this setup, we evaluate 15 LLMs across various prompting strategies. Our results show that even advanced models often misuse symbolic methods and rely on shortcuts, leading to polished but flawed solutions. Our findings reveal the ongoing issue of inconsistent reasoning, highlighting the need for improved benchmarks, evaluation frameworks, and the development of models with enhanced consistency and reasoning capabilities. The code and data will be available upon publication.

Department

Department of Mathematical Sciences, DePauw University, Greencastle, IN

Project Mentor

Sutthirut Charoenphon

Funding and Acknowledgements

J. William Asher and Melanie J. Norton Endowed Fund in the Sciences

CUMath: A Benchmark and Evaluation Framework for LLMs on Mathematical Reasoning in Undergraduate Computational Math

Share

COinS