
The launch of ChatGPT has created a pivotal moment, enabling enterprises to envision new use cases and accelerating the adoption of AI within these corporations. One such use case in the enterprise space is the ability to have a conversation with a chatbot and get answers to questions based on the information in the company’s knowledge base. However, ChatGPT, or any other LLM, has not been trained on this data and cannot answer questions based on this internal knowledge base. An obvious solution that comes to mind is to provide the internal knowledge base as context to the model, i.e., as part of the prompt. However, the majority of LLMs have a token limit in the mere thousands, which is insufficient for accommodating the comprehensive knowledge bases of most enterprises. Therefore, an off-the-shelf LLM alone is not sufficient to address this challenge. There are two popular approaches, outlined below, that can be employed individually or in conjunction to tackle this problem.
Fine-Tuning an open source LLM
In this approach, an open-source LLM like Llama2 is fine-tuned on the customer’s corpus. This enables the fine-tuned model to internalize and comprehend the customer’s specific domain, allowing it to answer questions without requiring additional context. It is important to note, however, that many customers have limited corpora, which often contain grammatical errors; this can make fine-tuning the LLM challenging. Even so, promising outcomes have been observed when a fine-tuned LLM is used within the Retrieval Augmented Generation technique discussed below.
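As a rough illustration of what this looks like in practice, the sketch below uses the Hugging Face transformers, peft, and datasets libraries (an assumption; any fine-tuning stack works) to train small LoRA adapters on a hypothetical plain-text corpus file. It is a sketch of the general approach, not a description of any particular implementation.

```python
# A minimal LoRA fine-tuning sketch. Assumes the Hugging Face transformers, peft,
# and datasets libraries; "knowledge_base.txt" is a hypothetical one-document-per-line file.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from peft import LoraConfig, get_peft_model
from datasets import load_dataset

model_name = "meta-llama/Llama-2-7b-hf"  # any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Train small LoRA adapters instead of updating all of the base model's weights.
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))

dataset = load_dataset("text", data_files={"train": "knowledge_base.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama2-finetuned", num_train_epochs=1,
                           per_device_train_batch_size=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```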
Retrieval Augmented Generation
The second approach to the problem is Retrieval Augmented Generation (RAG), which involves chunking the data, storing it in a vector store, retrieving only the chunks relevant to the query, and passing them to an LLM to answer the question. A few open-source techniques involving an LLM, a vector store, and an orchestration framework have become popular on the internet. Below is an illustration of one such solution that uses the RAG technique.
Source: https://mohitdsoni.medium.com/training-chat-gpt-on-your-data-efaa7b7f521b
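As a rough sketch of the flow in the illustration above, a minimal RAG pipeline might look like the following. It assumes the sentence-transformers library for embeddings; the final LLM call is left as a placeholder, and the query and corpus are made up for illustration.

```python
# A minimal RAG sketch: chunk documents, embed the chunks, retrieve the most
# relevant ones for a query, and assemble a prompt for an LLM.
import numpy as np
from sentence_transformers import SentenceTransformer

documents = ["...your knowledge base documents..."]                 # placeholder corpus
chunks = [doc[i:i + 500] for doc in documents for i in range(0, len(doc), 500)]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
chunk_vectors = embedder.encode(chunks, normalize_embeddings=True)  # the "vector store"

def retrieve(question: str, top_k: int = 3) -> list[str]:
    """Return the top_k chunks most similar to the question (cosine similarity)."""
    q = embedder.encode([question], normalize_embeddings=True)[0]
    scores = chunk_vectors @ q
    return [chunks[i] for i in np.argsort(-scores)[:top_k]]

question = "What is our refund policy?"                             # hypothetical query
context = "\n\n".join(retrieve(question))
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
# `prompt` is then sent to the LLM of your choice to generate the answer.
```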
However, there are a few challenges in building a solution using the above approach. The performance of this solution depends on a number of factors, such as the chunk size, the overlap between chunks, and the embedding technique, and the onus is on the user to tune each of these. Below is a list of factors that can affect performance.
The chunk size to split the documents
As discussed earlier, LLMs are limited in context length, and hence it is necessary to chunk the documents. However, the size of the chunks plays a critical role in the performance of your solution. Smaller chunks cannot answer questions that require analyzing information spread across multiple paragraphs, while larger chunks quickly eat up the context length, which means the context can accommodate fewer of them. Also, the chunk size and the embedding technique together determine the relevancy of the chunks retrieved to answer the question.
Overlap between adjacent chunks
Overlap is necessary to ensure information is not abruptly cut off when chunking your documents. Ideally, you want all the context required to answer a question to be fully present in at least one chunk. While using a high overlap can solve this problem, it introduces a new challenge: multiple overlapping chunks contain similar information, filling up the top search results with duplicates.
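To make the interplay between chunk size and overlap concrete, here is a minimal sliding-window chunker in plain Python. The character-based sizes are illustrative; real splitters often work on tokens or sentence boundaries.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` characters, each starting
    `chunk_size - overlap` characters after the previous one, so adjacent
    chunks share `overlap` characters of context."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A larger overlap reduces the risk of cutting an answer in half, but neighbouring
# chunks become more similar, so search results can fill up with near-duplicates.
chunks = chunk_text("...your document text...", chunk_size=500, overlap=100)
```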
Embedding Technique
The embedding technique is the algorithm that converts your chunks into vectors, which are then stored in the document retriever. The technique used to embed the chunks and the question determines the relevancy of the chunks retrieved and served to the LLM.
Document Retriever
The document retriever, also commonly referred to as a vector store, is a database used to store the embeddings and retrieve them with minimal latency. The algorithm the document retriever uses to match nearest neighbors (e.g., dot product, cosine similarity) determines the relevancy of the chunks that are retrieved and served to the LLM. The document retriever should also be able to scale horizontally to accommodate large knowledge bases.
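The small numpy sketch below shows the two common nearest-neighbour scoring functions mentioned above; the vectors are made up for illustration.

```python
import numpy as np

stored = np.array([[0.2, 0.9, 0.1],      # embeddings already in the document retriever
                   [0.8, 0.1, 0.5]])     # (illustrative values)
query = np.array([0.3, 0.8, 0.2])        # embedding of the user's question

dot_scores = stored @ query                                   # dot product
cos_scores = dot_scores / (np.linalg.norm(stored, axis=1) * np.linalg.norm(query))

best_by_dot = int(np.argmax(dot_scores))
best_by_cos = int(np.argmax(cos_scores))
# The two rankings can differ when stored vectors have different magnitudes,
# which is why the retriever's similarity metric affects which chunks are returned.
```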
LLM
A key component of your solution is the choice of Large Language Model (LLM). The selection of the best LLM depends on various factors, including your dataset and the other factors discussed above. To optimize your solution, it is recommended to experiment with different LLMs and determine which one delivers the best results. While some organizations embrace this approach, others have restrictions that prevent them from utilizing closed-source LLMs like GPT-4, PaLM, or Claude. Abacus.AI offers a range of LLM options, including GPT-3.5, GPT-4, PaLM, Azure OpenAI, Claude, and Llama2, as well as Abacus.AI’s proprietary LLMs. Additionally, Abacus.AI can fine-tune an LLM on your data and use it in the Retrieval Augmented Generation technique, giving you the best of both worlds.
Number of chunks
Answering certain questions requires information present in different sections of a document, or sometimes across documents. For instance, answering a question such as “List a few movies that contain wildlife” would require excerpts or chunks from different movies. Other times, the most relevant chunk might not surface at the top of the vector search. In these cases, it is important to pass multiple chunks of data for the LLM to evaluate and generate a response.
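In code, the number of chunks is simply the top-k cut-off applied to the similarity ranking. Continuing the retrieval sketch above (the variables `chunks` and `chunk_vectors` and the query embedding `q` are assumed from that sketch, and k is illustrative):

```python
import numpy as np

top_k = 5                                    # how many chunks the LLM gets to see
scores = chunk_vectors @ q                   # similarity of every stored chunk to the query
selected = [chunks[i] for i in np.argsort(-scores)[:top_k]]
context = "\n\n".join(selected)              # the context grows with top_k, using up tokens
```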
Tuning each of the above parameters requires serious effort from the user and involves a great deal of manual evaluation.
Abacus.AI’s approach
To solve this problem, Abacus.AI has taken a novel approach that provides the user with AutoML capabilities to iterate through various combinations, including fine-tuning an LLM, and find the combination that works best for the use case. All you need, in addition to your documents, is an evaluation dataset containing a list of questions and a handcrafted answer to each of them. Abacus.AI compares this evaluation set to the responses generated by each combination of the above to determine the optimal combination.
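Conceptually, this is a search over the parameter space described earlier, scored against the evaluation set. The sketch below illustrates the idea with a simple grid search; `build_pipeline` and `score_answers` are hypothetical stand-ins for whatever constructs a RAG pipeline from a parameter combination and scores its answers against the handcrafted ones. It illustrates the concept only, not Abacus.AI’s internal implementation.

```python
from itertools import product

# Hypothetical evaluation set: questions paired with handcrafted reference answers.
eval_set = [
    ("What is our refund policy?", "Refunds are issued within 30 days of purchase."),
    ("How do I reset my password?", "Use the 'Forgot password' link on the login page."),
]

search_space = {
    "chunk_size": [250, 500, 1000],
    "overlap":    [0, 50, 100],
    "embedding":  ["all-MiniLM-L6-v2", "text-embedding-ada-002"],
    "llm":        ["gpt-3.5-turbo", "llama2-finetuned"],
}

best = None
for combo in product(*search_space.values()):
    params = dict(zip(search_space.keys(), combo))
    pipeline = build_pipeline(**params)                   # hypothetical: builds the RAG stack
    answers = [pipeline.answer(q) for q, _ in eval_set]   # hypothetical: answers each question
    metric = score_answers(answers, [a for _, a in eval_set])  # e.g., one of the metrics below
    if best is None or metric > best[0]:
        best = (metric, params)

print("Winning combination:", best[1])
```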
Abacus.AI generates the following metrics, and the user can choose their preferred metric to determine the winning combination.
BLEU Score
BLEU (Bilingual Evaluation Understudy) Score is an automatic evaluation metric commonly used to assess the quality of machine translation output. It was introduced to address the need for a quantitative measure of translation quality that correlates well with human judgments.
The BLEU Score compares the candidate translation (output) to one or more reference translations (human-generated translations) and computes a score based on the degree of n-gram overlap between the candidate and reference translations. It evaluates the precision of n-grams (sequences of n words) in the candidate translation against the reference translations.
METEOR Score
METEOR (Metric for Evaluation of Translation with Explicit Ordering) Score is an automatic evaluation metric commonly used to assess the quality of machine translation output. It was designed to address some of the limitations of other evaluation metrics like BLEU (Bilingual Evaluation Understudy) by incorporating explicit word order matching and considering synonyms and paraphrases.
BERT Score
BERT Score is an automated evaluation metric designed for assessing the quality of text generation. It calculates a similarity score between tokens in the candidate sentence and tokens in the reference sentence. Rather than relying on exact matches, this metric leverages contextual embeddings to determine token similarity.
ROUGE Score
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) Score is a set of automatic evaluation metrics commonly used in natural language processing and text summarization tasks. It was originally developed for evaluating the quality of text summarization systems but has since been widely used for other tasks, such as machine translation and text generation.
The ROUGE Score evaluates the quality of a generated text (such as a summary or translation) by comparing it to one or more reference texts (usually human-generated summaries or translations). It measures the overlap of n-grams (sequences of n words) and word sequences between the candidate and reference texts.
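These metrics can also be computed outside the platform. For example, the sketch below uses the Hugging Face `evaluate` library (an assumption; any implementation of these metrics works) to score a generated answer against a handcrafted reference; the example sentences are made up.

```python
# Scoring a generated answer against a handcrafted reference with the four metrics.
# Assumes the Hugging Face `evaluate` library (plus the bert-score and nltk packages
# for the BERTScore and METEOR backends).
import evaluate

predictions = ["The refund is processed within 30 days of purchase."]
references  = ["Refunds are issued within 30 days of the purchase date."]

bleu   = evaluate.load("bleu").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)
rouge  = evaluate.load("rouge").compute(predictions=predictions, references=references)
bert   = evaluate.load("bertscore").compute(predictions=predictions,
                                            references=references, lang="en")

print(bleu["bleu"], meteor["meteor"], rouge["rougeL"], bert["f1"][0])
```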

Each of the aforementioned scores falls within the range of 0 to 1, with a higher score indicating superior model performance. With Abacus.AI, you can try out different models and metrics to quickly figure out what gives the best results on your data for your use case.
If you are interested in trying this out, or want us to do a free POC on your data, please reach out to us at contact@abacus.ai.