Giraffe – Long Context LLMs

Following up on the work from our previous blog post, today we are releasing an arXiv paper titled “Giraffe: Adventures in Expanding Context Lengths in LLMs”.

Giraffe is a new family of models finetuned from base LLaMA and LLaMA 2. It includes a 4k Giraffe and a 16k Giraffe finetuned from LLaMA, and a 32k Giraffe finetuned from LLaMA 2; we release their weights on HuggingFace. We also release our training code, evaluation datasets, and evaluation scripts to the research community.

This paper explores context length extrapolation in Large Language Models (LLMs), which have been highly successful at natural language modelling tasks in recent years. Context length extrapolation means taking an LLM trained on a short context length and evaluating it on longer context lengths, without any further training on those long contexts. A model capable of extrapolating to longer contexts is valuable for a range of tasks. For example, if you want an LLM to retrieve information from a large corpus of data that you own, a larger context capacity lets the model attend to more of the corpus (or perhaps even all of it!) at once, so it can perform more complex retrievals with fewer mistakes. The same holds for maintaining long conversations (think of an AI-powered chatbot), or for asking an LLM to help with coding on a large existing codebase.

Why can’t we just train the model on longer contexts, though? The primary reason is that a key component of modern LLM architectures – called self-attention – scales quadratically in both memory and compute as context length increases, so you quickly reach a point where you don’t have sufficient GPUs, time, or money to train on longer contexts. Hence a method that can zero-shot extrapolate to context lengths never seen before is key.
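The quadratic scaling is easy to see concretely: the attention score matrix for a sequence of length n has n² entries per head. A rough back-of-the-envelope sketch (the head count and fp16 precision here are illustrative assumptions, not any particular model's exact configuration):

```python
def attention_score_memory(seq_len, num_heads=32, bytes_per_elem=2):
    """Memory in bytes for the full attention score matrices of a single
    layer: one (seq_len x seq_len) matrix per head, fp16 elements."""
    return num_heads * seq_len * seq_len * bytes_per_elem

for n in (2048, 4096, 32768):
    gib = attention_score_memory(n) / 2**30
    print(f"context {n:>6}: {gib:8.2f} GiB per layer")
```

Doubling the context quadruples this cost, which is why naively training at 32k+ tokens becomes prohibitively expensive.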

Recent work has proposed many methods for context length extrapolation. In this paper we collate what we believe are the most prominent approaches and test them thoroughly to determine which are the most effective. We also propose a couple of new approaches, one of which, called truncation, shows some promising results.
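For intuition only: LLaMA encodes position with rotary position embeddings (RoPE), which use a basis of sinusoid frequencies. One way to read "truncation" is as zeroing out the slowest frequencies, whose full periods the model rarely (if ever) observed during training. The sketch below is our simplified illustration; the `cutoff` value and the exact truncation rule are assumptions, not the paper's specification:

```python
import numpy as np

def rope_frequencies(dim, base=10000.0):
    """Standard RoPE frequency basis: theta_i = base^(-2i/dim)."""
    return base ** (-2.0 * np.arange(dim // 2) / dim)

def truncated_frequencies(dim, cutoff=1e-3, base=10000.0):
    """Illustrative truncated basis: keep the fast frequencies intact
    and zero out the slowest ones below an arbitrary cutoff."""
    freqs = rope_frequencies(dim, base)
    return np.where(freqs >= cutoff, freqs, 0.0)
```

The high frequencies, which encode fine-grained relative position, are left untouched; only the slow components that behave unpredictably at unseen lengths are removed.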

One of the difficulties of assessing LLM performance is choosing the right evaluation methodology. The most commonly used metric in the literature is next-token perplexity, which measures how well the model predicts the next token given the preceding context. However, we believe the documents on which perplexity is measured often permit good performance simply by producing reasonably coherent text conditioned on a small subset of the available context. We show in this paper that, for this reason, perplexity is less sensitive at distinguishing long context performance between models than the new tasks we introduce, which focus more strongly on the accuracy of model recall than on general text coherence.
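For reference, next-token perplexity is simply the exponentiated average negative log-likelihood the model assigns to the true next tokens; a minimal sketch:

```python
import math

def next_token_perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-probability that the
    model assigned to each actual next token in the sequence."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# A model assigning probability 0.5 to every true next token:
print(next_token_perplexity([math.log(0.5)] * 10))  # → 2.0
```

Because the average is taken over all tokens, a model that predicts most tokens well from nearby context alone can score a low perplexity even if it ignores information far back in the window, which is exactly the insensitivity discussed above.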

These new tasks are LongChat-Lines, FreeFormQA, and AlteredQA. The first extends a key-value retrieval task introduced by LongChat to longer contexts. FreeFormQA and AlteredQA are question-answering datasets based on Wikipedia, constructed from the Natural Questions dataset. We release all three tasks as HuggingFace datasets.
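To give a flavour of the key-value retrieval setup: the prompt contains many lines, each pairing a key with a value, and the model must return the value for one queried key. The generator below is our illustrative sketch; the key names and prompt template are assumptions, not the actual dataset's format:

```python
import random

def make_kv_retrieval_example(num_lines, seed=0):
    """Sketch of a LongChat-Lines-style example: many 'key: value'
    lines followed by a question about a single key. Context length
    is controlled by num_lines."""
    rng = random.Random(seed)
    keys = [f"line-{i}" for i in range(num_lines)]
    kv = {k: rng.randint(10000, 99999) for k in keys}
    target = rng.choice(keys)
    prompt = "\n".join(f"{k}: value is {v}" for k, v in kv.items())
    prompt += f"\nWhat is the value of {target}?"
    return prompt, kv[target]
```

Accuracy on such a task directly measures whether the model can recall a specific fact from anywhere in the context, rather than merely continue text coherently.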

There are still many open questions in the space of context length extrapolation for LLMs. As we show in the paper, none of the methods we investigate fully satisfies the brief of truly extrapolating without some degradation in performance creeping in. We are interested in researching this topic further and finding ways to resolve these issues; watch this space!

Posted by Arka