Treating Attention Deficit Disorder in LLMs

We have seen an explosion of open-source LLMs lately. While these OSS LLMs have shown performance comparable to the closed-source LLM APIs offered by OpenAI, Google, and others, they suffer from one serious limitation: most support context lengths of only 2K tokens, versus the 8K or greater contexts of closed-source APIs. This makes them much less useful for building a custom LLM on top of your own knowledge base. You simply cannot send the model much data in a single call, which has a very negative effect on output quality.

We are releasing, for research purposes only, a version of Llama v1 extended to 8K and even 16K context that appears to retain accuracy on evaluation tasks designed to match real-world applications. For those interested in experimenting, we are releasing delta weights for AbaGiraffe v1 under the same license as Llama v1, along with some of the evaluation datasets. The same techniques could be applied to any other foundation model, so we are (of course) repeating the above work with Llama v2. We hope to release those weights soon.

There are some tasks which even relatively small (~10B) models are quite effective at solving using information supplied in the prompt. The key limitation in these tasks is often the amount of data we can fit into the prompt. A limited context forces complicated prompt-engineered solutions that require iterative generation, which has many downsides: it generally lowers quality and is slow. A longer context greatly simplifies the problem in a number of applications. Others in the community have come to the same conclusion, and there have been interesting posts discussing various methods to achieve this goal. Separately, researchers have published papers on alternative position embeddings that are possibly better suited to extrapolation. We have been experimenting with a variety of approaches and thought that sharing our results would be helpful for others exploring this issue.
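To make the position-embedding angle concrete, here is a minimal sketch of one approach discussed in the community, linear position interpolation for rotary embeddings (RoPE): positions are rescaled so that a longer sequence still maps into the angle range the model saw during training. This is an illustration of the general idea, not necessarily the exact method used in our experiments; the function name and shapes are our own.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary-embedding angles for each (position, frequency) pair.

    scale < 1 compresses positions (linear position interpolation),
    so a model trained on a short context can address longer
    sequences without seeing out-of-range angles.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return np.outer(np.asarray(positions) * scale, inv_freq)

# A model trained with a 2048-token context, evaluated at 8192 tokens:
# scaling positions by 2048/8192 keeps all angles within the range
# the model was trained on.
orig_ctx, new_ctx, dim = 2048, 8192, 128
angles = rope_angles(np.arange(new_ctx), dim, scale=orig_ctx / new_ctx)
assert angles.max() <= orig_ctx  # never exceeds the trained position range
```

The trade-off is resolution: interpolated positions are packed more densely, which is exactly why fine-tuning at the new length (as described below) matters.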

To this end we are sharing a repository that contains code and tooling specific to our fine-tuning projects for context extension. Perhaps of equal interest, we have also included evaluation scripts and datasets specifically targeting the question of whether the model preserves its capabilities as the context is extended. This means not just its ability to generate coherent text, but its ability to produce an answer that requires attention to a distant part of the context. The repository contains a detailed post reviewing a subset of the experiments we ran, a summary of the results, and instructions for reproducing or building on them.
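As a toy illustration of the kind of probe such an evaluation uses (not the actual datasets in the repository), one can plant a fact far from the question and check whether the model retrieves it. Everything below, including the helper name, is a hypothetical sketch:

```python
import random

def make_retrieval_example(filler_words, seed=0):
    """Build a prompt whose answer requires attending to a fact
    planted far from the question (a 'needle in a haystack' probe)."""
    rng = random.Random(seed)
    key = rng.randint(10000, 99999)
    needle = f"The secret number is {key}. "
    filler = "The grass is green. " * (filler_words // 4)
    # Plant the needle at the very start so answering correctly
    # requires attention across (almost) the full context.
    prompt = needle + filler + "\nQ: What is the secret number?\nA:"
    return prompt, str(key)

prompt, answer = make_retrieval_example(filler_words=8000)
```

Scoring is then a simple string match between the model's completion and `answer`, swept over context positions and lengths; accuracy that stays flat as the needle moves farther from the question is the signal that the extended context is genuinely usable.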


Since Llama is such a popular foundation for LLM experiments, we are sharing delta weights with respect to Llama v1. The weights (and datasets) are available for download from Hugging Face.
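Delta weights are recovered by adding the released tensors to the corresponding tensors of the base Llama v1 checkpoint. The sketch below shows the common convention with toy NumPy arrays standing in for real checkpoint tensors; the repository's own merge script is authoritative for the actual format.

```python
import numpy as np

def apply_deltas(base_state, delta_state):
    """Recover full model weights by adding released deltas to the
    base checkpoint, tensor by tensor."""
    assert base_state.keys() == delta_state.keys(), "checkpoint mismatch"
    return {name: base_state[name] + delta_state[name]
            for name in base_state}

# Toy example with two tiny "tensors" in place of real Llama weights:
base = {"wq": np.ones((2, 2)), "wk": np.zeros((2, 2))}
delta = {"wq": np.full((2, 2), 0.5), "wk": np.eye(2)}
merged = apply_deltas(base, delta)
```

Distributing only the deltas lets the fine-tuned weights inherit the base model's license: anyone without access to the original Llama v1 checkpoint cannot reconstruct the full model.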


We hope other LLM builders will find this information useful, and we look forward to connecting with people interested in discussing the problem and possibly collaborating on new ideas. While adapting pretrained models to new context lengths may remain relevant, we also hope that teams building new open foundation models will find some of this data useful in guiding future iterations of the position embeddings in their models.
