
Sharper LLMs: Enhancing Math and Reasoning Abilities

Today we release a new model in our continued commitment to improving the state of the art in open source LLMs: MetaMath-Bagel-DPO-34B. This time, we have focused on enhancing mathematical and reasoning capabilities in LLMs, primarily targeting an improvement in GSM8K scores without compromising performance on other benchmarks, using data enrichment and interleaved training techniques. This new model builds on our previous work by applying our MetaMath Fewshot dataset to the excellent Bagel models released by jondurbin, which in turn are fine-tunes of Yi 34B and Mixtral.

Our new model largely maintains the Bagel model's performance across the board while lifting GSM8K by nearly 13%, resulting in an overall improvement of about 1% on average. It is the second-best model in the 34B category on the Open LLM Leaderboard. (*Rank 1 based on our internal evaluations)

| Average | ARC   | HellaSwag | MMLU  | TruthfulQA | Winogrande | GSM8K |
|---------|-------|-----------|-------|------------|------------|-------|
| 75.54   | 69.20 | 84.34     | 76.46 | 67.58      | 82.87      | 72.78 |
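
The scores above correspond to the standard Open LLM Leaderboard task suite. For readers who want to reproduce an individual number, the sketch below shows one way to score GSM8K locally with EleutherAI's lm-evaluation-harness; the model id is an assumption and the harness API differs slightly between versions.

```python
# Minimal sketch: scoring GSM8K with lm-evaluation-harness (v0.4-style API).
# The model id below is an assumption; substitute the actual checkpoint path.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # HuggingFace transformers backend
    model_args="pretrained=abacusai/MetaMath-Bagel-DPO-34B,dtype=bfloat16",
    tasks=["gsm8k"],
    num_fewshot=5,  # the leaderboard evaluates GSM8K 5-shot
    batch_size=4,
)
print(results["results"]["gsm8k"])
```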

In the process of training these models, we have gained some insights which we share below so that others can continue to build on this work and keep pushing the envelope on open source research.

Path to Performance 

Our goal was to lift GSM8K scores, a widely accepted gold standard for measuring an LLM's performance on math and reasoning tasks, without compromising other essential metrics on the HuggingFace Open LLM Leaderboard. To demonstrate the approach, we improve the Bagel fine-tuned models released by jondurbin, LLMs with promising overall performance but a lagging GSM8K score.

Sharpening LLM Reasoning

The first step of the training process was a supervised fine-tuning (SFT) run on the MetaMathFewshot, Orca and ShareGPT datasets, starting from a Bagel SFT base model. Whilst this did improve the GSM8K score significantly, we found that it alone was not sufficient to come close to DPO-tuned models in the same class.
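
For illustration, a stripped-down version of this SFT stage is sketched below using HuggingFace TRL's SFTTrainer. The data file, hyperparameters, and trainer arguments are assumptions (and TRL's API has changed across versions); this is not the precise recipe used for the release.

```python
# Minimal sketch of the SFT stage (TRL >= 0.9 style API); not the exact recipe.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

# Assumption: the MetaMathFewshot + Orca + ShareGPT mix has been pre-rendered
# into a single JSONL file with one "text" column per example.
train_data = load_dataset("json", data_files="sft_mix.jsonl", split="train")

trainer = SFTTrainer(
    model="jondurbin/bagel-34b-v0.2",      # Bagel SFT base model (assumed id)
    train_dataset=train_data,
    args=SFTConfig(
        output_dir="metamath-bagel-sft",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
    ),
)
trainer.train()
```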

Interleaving DPO and SFT

Driven by the promising results of this first experiment, we repeated the process starting from a post-DPO model. However, this cost performance on TruthfulQA and ARC, largely because the model's high scores on those benchmarks come from its DPO training. We therefore followed our SFT step with a second round of DPO. This technique proved effective: it retained the GSM8K boost while recovering most of the drop in the other metrics.
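
The second DPO round can be sketched along the same lines with TRL's DPOTrainer. The preference data, hyperparameters, and argument names (which vary between TRL versions) are assumptions rather than the exact configuration we used.

```python
# Sketch of the follow-up DPO round on top of the SFT checkpoint (TRL-style API).
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# Assumption: a preference dataset with "prompt", "chosen" and "rejected" columns.
pref_data = load_dataset("json", data_files="dpo_pairs.jsonl", split="train")

model_id = "metamath-bagel-sft"            # output of the SFT stage above
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="bfloat16")
tokenizer = AutoTokenizer.from_pretrained(model_id)

trainer = DPOTrainer(
    model=model,
    ref_model=None,                        # TRL clones the policy as the reference
    train_dataset=pref_data,
    processing_class=tokenizer,
    args=DPOConfig(
        output_dir="metamath-bagel-dpo",
        beta=0.1,                          # strength of the implicit KL penalty
        per_device_train_batch_size=1,
        gradient_accumulation_steps=32,
        learning_rate=5e-7,
        bf16=True,
    ),
)
trainer.train()
```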

Key Observations and The Road Ahead

This series of experiments yielded several valuable observations:

  • First, datasets with rich reasoning information consistently catalyzed improvements in GSM8K performance. 
  • Second, interleaving DPO and SFT stages has potential for harmonizing multiple performance metrics. 
  • Lastly, we found that DPO does not necessarily have to be the final step; it can be alternated with SFT stages to finely balance model performance (see the schedule sketched after this list).
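
Put together, the schedule described above amounts to the alternating loop sketched below. Here run_sft and run_dpo are hypothetical wrappers around the stages shown earlier, and the starting checkpoint id is an assumption.

```python
# Hypothetical outline of the interleaved schedule; run_sft / run_dpo stand in
# for the SFT and DPO stages sketched in the sections above.

def run_sft(checkpoint: str, datasets: list[str]) -> str:
    """Supervised fine-tuning stage; returns the path of the new checkpoint."""
    ...

def run_dpo(checkpoint: str, preference_data: str) -> str:
    """Direct preference optimization stage; returns the new checkpoint path."""
    ...

checkpoint = "jondurbin/bagel-dpo-34b-v0.2"   # post-DPO Bagel model (assumed id)
checkpoint = run_sft(checkpoint, ["MetaMathFewshot", "Orca", "ShareGPT"])  # lifts GSM8K
checkpoint = run_dpo(checkpoint, "dpo_pairs.jsonl")   # recovers TruthfulQA / ARC
# Further SFT and DPO rounds can be alternated; DPO need not be the final step.
```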

At Abacus.AI, we are excited to continue partnering with the open source AI community and drive the performance of LLMs. Stay tuned as we continue inspiring and shaping the future of AI.

Contributions

by Deep Karkhanis, Arka Pal, Siddartha Naidu
