
Improving Open-Source LLMs – Datasets, Merging and Stacking

Here at Abacus.AI, we are strong believers in the potential of open source LLMs. The performance gap with the major commercial closed-source models has been steadily shrinking over the last six months, thanks first to Llama, then Llama 2, and more recently Mistral and Mixtral, among many others. We think this trend will continue: the future is one where everyone has access to their own LLM and can customise it as necessary for their own use cases.

The open source community has been incredibly innovative in the techniques it has used to close the gap. Today we would like to highlight two of those techniques, along with our experiments and contributions to each.

The Importance of the Data

As the open source community has gained experience with improving LLM performance, one realisation has become clear: the quality of the training data matters tremendously for how the final model turns out. As such, many people have produced excellent new datasets for the community to utilise. One dataset we particularly like is MetaMath. MetaMath directly addresses one of the biggest weaknesses in open source LLMs – that they are bad at math and reasoning. 

We noticed that the MetaMath dataset is always single-shot, so we constructed a new version, which we have released as ‘MetaMathFewshot’. Each MetaMathFewshot example prepends a random number of QA pairs to the prompt before the target question. Training on this dataset teaches the model the concept of few-shot prompting, a common usage pattern for end users interacting with LLMs.
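As a rough sketch, the construction looks something like the following (the field names, prompt format, and shot counts here are illustrative rather than the exact ones we used; see the released dataset below for the actual format):

```python
import random

def make_fewshot_example(dataset, target, max_shots=4):
    """Prepend a random number of QA pairs to a target question.

    `dataset` is assumed to be a list of dicts with 'query' and
    'response' keys (hypothetical field names; the real MetaMath
    fields may differ).
    """
    n_shots = random.randint(0, max_shots)
    shots = random.sample(dataset, n_shots)
    prompt = ""
    for shot in shots:
        prompt += f"Q: {shot['query']}\nA: {shot['response']}\n\n"
    prompt += f"Q: {target['query']}\nA:"
    return {"prompt": prompt, "completion": " " + target["response"]}
```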

Although our initial goal was simply to make the model robust to few-shot prompting, we submitted it to the HuggingFace Open LLM Leaderboard and found that it performed strongly. We then trained a new version on MetaMathFewshot mixed with the OrcaChat and Vicuna datasets. The result is a model that (as of the time of this post) scores nearly 2% better on average than the next-best non-merged (see below) MetaMath-based model, with substantial improvements in TruthfulQA and Winogrande, and even a slight improvement in GSM8K.

(Leaderboard screenshot: our Fewshot-Metamath-OrcaVicuna-Mistral is the best non-merged model when sorted by GSM8K score.)

Stacking and Merging

As we saw above, you can get big gains just by changing the training dataset. However, finetuning is a costly enterprise; not everyone has access to a rack of the latest NVIDIA cards, or the financial resources to rent them for the days to weeks of training time that may be required.

Enter techniques such as stacking and merging. These are ways to create new models without any training at all. 

Stacking involves taking the layers of one model and ‘gluing’ them together with the layers of another. For example, take layers 1-20 of a 30-layer model A and concatenate them with layers 11-30 of a 30-layer model B; the final model will have 40 layers, 20 from each of the originals.
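Here is a minimal PyTorch sketch of that example, assuming two Llama-style models whose decoder layers live in a list such as `model.model.layers` (embedding, norm, and output-head weights would typically be carried over from one of the parents):

```python
import torch.nn as nn

def stack_layers(model_a_layers, model_b_layers):
    """Glue the first 20 layers of model A onto the last 20 layers of B.

    Both inputs are assumed to be ModuleLists of identically shaped
    transformer blocks. Layers are 0-indexed here, so [:20] is
    "layers 1-20" and [10:30] is "layers 11-30" from the prose.
    """
    stacked = list(model_a_layers[:20]) + list(model_b_layers[10:30])
    return nn.ModuleList(stacked)  # 40 layers total, 20 from each parent
```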

Merging is even simpler: take the weights of two (or more) separate models (ideally with the same architecture) and combine them in some way to produce a single new set of weights. The combination can be as simple as averaging the weights of the two models, or slightly more refined, e.g. using SLERP (spherical linear interpolation) to determine the weightings.
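For intuition, here is a minimal sketch of SLERP between two weight tensors, applied per-parameter across a state dict. This simplified version treats each tensor as a flat vector and falls back to a plain average when the two are nearly parallel; production merging tools additionally handle per-layer interpolation schedules and dtype details:

```python
import torch

def slerp(w_a: torch.Tensor, w_b: torch.Tensor, t: float, eps: float = 1e-8):
    """Spherical linear interpolation between two weight tensors."""
    a, b = w_a.flatten(), w_b.flatten()
    # Angle between the two weight vectors.
    dot = torch.clamp((a / (a.norm() + eps)) @ (b / (b.norm() + eps)), -1.0, 1.0)
    theta = torch.acos(dot)
    if theta.abs() < 1e-4:
        # Nearly parallel: SLERP degenerates, use linear interpolation.
        merged = (1 - t) * a + t * b
    else:
        s = torch.sin(theta)
        merged = (torch.sin((1 - t) * theta) / s) * a + (torch.sin(t * theta) / s) * b
    return merged.view_as(w_a)

# Merging two state dicts key by key (t=0.5 gives an even blend):
# merged_sd = {k: slerp(sd_a[k], sd_b[k], 0.5) for k in sd_a}
```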

It may seem surprising that combining two different models by stacking or merging produces anything reasonable, but it appears to be very effective for many LLMs. In fact, the top 7B models on the HF Leaderboard are nearly all merges of various different models. We were inspired to try our hand at this as well, producing our own merged model, Slerp-CM-mist-dpo (linked under Contributions below). At the time it was submitted, it was the second-highest-scoring 7B model on the leaderboard! Specifically, we took a model with high scores in every leaderboard subcategory except TruthfulQA and GSM8K and SLERP-merged it with a model with high scores in those two categories, obtaining a model with high scores in all subcategories, only slightly below the better of its two parents in each.

(Leaderboard screenshot: models sorted by average score. Our Slerp-CM-mist-dpo is the second-best 7B model as of the time of this image.)

More Recipes to Make More Soup

There is a plethora of new techniques that the open source community is experimenting with and discovering every day, far more than we could cover in this short blog post, and we at Abacus are excited to contribute a few ideas of our own. Watch this space!

Contributions

MetaMath Fewshot Dataset: https://huggingface.co/datasets/abacusai/MetaMathFewshot

Fewshot-Metamath-OrcaVicuna-Mistral 7B model: https://huggingface.co/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral

Slerp-CM-mist-dpo 7B model: https://huggingface.co/abacusai/Slerp-CM-mist-dpo

by Arka Pal and Siddartha Naidu
