Closing the Gap to Closed Source LLMs – 70B Giraffe 32k

Today we release a larger 70B version of Giraffe, succeeding the 13B model we described in a previous blog post. Giraffe is a family of models that are fine-tuned from base Llama 2 and use context length extension techniques to increase their effective context length from 4096 to approximately 32000 tokens. As we discussed in that earlier post, a longer context window improves performance on many downstream tasks and enables new use cases that a short context length may not permit.
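This post does not say which extension technique the 70B model uses. Purely as an illustration of the general idea, the sketch below shows linear position interpolation on rotary position embeddings (the positional scheme used by Llama 2): positions in the extended window are divided by a scale factor so they fall back inside the range seen during pretraining. The function name `rope_angles`, the head dimension of 128, and the rounding of "approximately 32000" to 32768 are assumptions made for the example, not details taken from the Giraffe training code.

```python
import math

TRAINED_CONTEXT = 4096    # context length of base Llama 2 (from the post)
TARGET_CONTEXT = 32768    # assumption: "approximately 32000" rounded to 32k


def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotary angles for a single token position.

    scale > 1 applies linear position interpolation: the position is
    compressed by that factor before the angles are computed.
    (Hypothetical helper for illustration only.)
    """
    effective_pos = position / scale
    return [effective_pos * base ** (-2 * i / dim) for i in range(dim // 2)]


# Scale factor needed to squeeze the target window into the trained window.
scale = TARGET_CONTEXT / TRAINED_CONTEXT

# After interpolation, even the last position of the extended window maps
# to an effective position inside the original trained range.
last_effective = (TARGET_CONTEXT - 1) / scale
assert last_effective < TRAINED_CONTEXT

# Position 8 with scale 8 yields the same angles as position 1 unscaled.
assert rope_angles(8, dim=128, scale=scale) == rope_angles(1, dim=128)

# Angles per position: one per pair of dimensions in the attention head.
angles = rope_angles(TARGET_CONTEXT - 1, dim=128, scale=scale)
print(scale, len(angles), math.isclose(angles[0], last_effective))
```

Interpolation of this kind keeps every position inside the numeric range the base model was trained on, at the cost of a finer spacing between adjacent positions, which is why it is typically paired with fine-tuning on long sequences.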

We evaluated the 70B model on our set of benchmarks that probe LLM performance over long contexts. On the document QA task, the 70B model improves significantly over the 13B model at the longest context window (32k), scoring 61% accuracy versus 18% for 13B on our AltQA dataset. We also find that it outperforms the comparable LongChat-32k model at all context lengths, with the margin widening at the longest context lengths (61% vs 35% accuracy at 32k context length).

In addition, we ran 70B Giraffe on the MT-Bench evaluation set. MT-Bench examines the performance of LLMs in multi-turn settings (i.e. more than a single question and answer) across a variety of categories, such as Writing, Coding and Math. The results, along with a comparison to some other LLMs, can be seen in the figure above. 70B Giraffe 32k achieves an overall score of 7.01. It shows the best performance of all the open source models in the categories of Extraction, Coding and Math, and maintains a high score in the other categories. There is still a gap in performance in these categories to the best closed source models [1], but we here at Abacus are excited to try to close that gap further. Watch this space!

As before, our training and evaluation scripts are also available here for the community to build on our results.

  1. Note that MT-Bench evaluates answers using GPT-4 itself, and may therefore inflate GPT-4's own score relative to other models. Its judgments are also not necessarily fully aligned with human preferences.

If you are interested in trying this out or learning more about our offerings, please visit ChatLLM and book an expert consultation.
