This document explores a multi-layer approach to ingesting and retrieving large numbers of documents in RAG applications.
Every database must be designed with its primary goal in mind, and the goal of the knowledge base of RAG applications is efficient and accurate retrieval.
Efficiency has to do with performance: retrieval operations can be slow, and slowness may be acceptable in asynchronous services, but much less so in chatbots.
Accuracy has to do with the quality of the response. Low-accuracy cooking advice may be tolerable because the risk of adding too much salt is low. On the other hand, medical advice must be as accurate as possible, for obvious reasons.
Retrieval
Splitting a document into chunks is necessary because LLMs have a limited context length. A chunk can be anything from a single sentence to a set of paragraphs. The smaller the chunks, the higher the probability that a single chunk does not mention the main topic of the whole document. On the other hand, smaller chunks mean that more of them can fit in the LLM’s context.
When a chunk does not contain explicit references to a topic, it is less likely to be retrieved for that topic. That happens because retrievers explore chunks without their context. A chunk may mention rockets in the context of space exploration or in the context of warfare, but the two contexts may be indistinguishable in isolation. Moreover, the question may refer to rocket salad.
A solution to that problem is to manipulate the chunks: appending a sentence describing the document’s topic increases the likelihood that a chunk is retrieved for that topic and avoids confusion.
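As a minimal illustration of that manipulation (the function and variable names below are made up for the example, not taken from any library):

```python
def contextualize_chunks(chunks: list[str], topic_sentence: str) -> list[str]:
    """Append a sentence describing the whole document's topic to each chunk,
    so the chunk stays retrievable for that topic even when it never mentions it."""
    return [f"{chunk}\n\n(Document topic: {topic_sentence})" for chunk in chunks]

# Made-up example: the chunk never mentions space exploration on its own.
chunks = ["The first stage separates about two minutes after lift-off."]
topic = "A history of the Saturn V rocket and space exploration."
print(contextualize_chunks(chunks, topic)[0])
```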
If a document is particularly long, there may be a significant distance between the whole document’s topic and single chunks’ topics. For these cases, it is advisable to split the main document into large chunks and process each chunk as a “sub-document” to create a second layer of small chunks. The retrieval then works on two layers: the primary retriever explores the top layer to find the most relevant sub-documents, and then the secondary retriever explores the small chunks belonging to the most interesting sub-documents found in the first step. It is possible to implement this by tagging the chunks with their type (subdocument / smallchunk) and the small chunks with their sub-document id.
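A minimal sketch of that two-layer scheme, assuming a generic vector store that exposes a `search(query, filter, k)` method (the interface, the filter syntax, and the metadata keys are assumptions for the example, not a specific library):

```python
def two_layer_retrieve(store, question: str, top_subdocs: int = 3, top_chunks: int = 5):
    """Layer 1: find the most relevant sub-documents.
    Layer 2: search only the small chunks belonging to those sub-documents."""
    subdocs = store.search(question, filter={"type": "subdocument"}, k=top_subdocs)
    subdoc_ids = [d.metadata["subdoc_id"] for d in subdocs]
    return store.search(
        question,
        filter={"type": "smallchunk", "subdoc_id": {"$in": subdoc_ids}},
        k=top_chunks,
    )
```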
When creating chunks, it is recommended to consider another potential problem as well. A concept may require multiple sentences or even several paragraphs, and retrieval is less effective if chunking splits a concept across multiple chunks. The simplest strategy to overcome the problem is to have overlapping chunks, so that every chunk contains a small portion of the previous one, thus preserving part of the flow.
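For example, a minimal word-based splitter with overlap could look like this (the window sizes are arbitrary):

```python
def split_with_overlap(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks where each chunk repeats the last
    `overlap` words of the previous one, preserving part of the flow."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = start + chunk_size
        chunks.append(" ".join(words[start:end]))
        if end >= len(words):
            break
        start = end - overlap  # step back to create the overlap
    return chunks
```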
Document Type
Ingesting raw text is simple, PDFs represent a challenge, and it is barely possible to ingest a rough description of an image. However, PDFs are probably the most interesting documents to ingest because they are the default format for books and scientific papers. Being able to parse them correctly can make a big difference in RAG applications.
The complex layouts of PDFs can confuse the reader: if you have ever tried to select text in a PDF, you may have noticed that it is not always as easy as it looks. Tables are typically broken, the text is not always selected in the expected order, and the most creative layouts make everything worse.
Raw text extraction from PDFs is not a problem in general, but the result is mediocre at best, with inevitable pollution of the sentences. For example, a paragraph close to a picture may end up with the picture’s caption in its middle, or list elements may be chained into an unpunctuated sentence that does not make much sense.
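For reference, raw extraction itself is a one-liner with a library such as pypdf, which is exactly why it is blind to layout:

```python
from pypdf import PdfReader

reader = PdfReader("paper.pdf")
# Concatenates the text of every page with no notion of columns, captions, or lists.
raw_text = "\n".join(page.extract_text() or "" for page in reader.pages)
```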
A solution for that is to use layout-aware parsers that analyse the position of text on the page to identify the relationships between elements. They are not perfect, but they are pretty good at breaking down text into consistent paragraphs and attaching them to the relevant heading. They can successfully identify column layouts and produce acceptable fine-grained chunks that can be refined or aggregated in post-processing.
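As one example (not the only option), the unstructured library exposes a layout-aware partitioner; a rough sketch of attaching paragraphs to their heading could look like this, keeping in mind that the element categories may vary between versions:

```python
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(filename="paper.pdf")

chunks, current_heading = [], None
for el in elements:
    if el.category == "Title":
        # Treat titles and section headings as markers for the following paragraphs.
        current_heading = el.text
    elif el.category == "NarrativeText":
        # Attach each paragraph to the heading it belongs to.
        chunks.append(f"{current_heading}\n{el.text}" if current_heading else el.text)
```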
There are also OCR-based parsers that essentially transform the PDF pages into images and then try to read them. They are much slower and less accurate than the other solutions.
Reranking
Vector stores excel at finding documents containing words close to the question, but they are not very smart. Even with the best strategies, they may ignore important documents only because they do not contain relevant keywords.
Increasing the number of retrieved documents is an effective strategy to mitigate the issue: it increases the probability of also retrieving interesting documents that happen to have a low score.
Note: since multiple documents must fit in the LLM’s context, it is typically better to work with short summaries instead of whole documents, unless the documents themselves are short.
When a list of potentially interesting documents is found, LLMs can perform reranking with prompts such as:
"A list of documents is given in <doc> tags. A question is also provided in <question> tags. Rank the documents in order of relevance to answer the question. The output should be in the following format:
<list>
<doc><id>id1</id><score>score</score></doc>
<doc><id>id2</id><score>score</score></doc>
<doc><id>id3</id><score>score</score></doc>
</list>
The answer is:"
That approach is becoming a standard, but it requires some tweaking. Even with standard libraries implementing the best practices, parsing the answer sometimes fails because the LLM does not follow the expected format, or it directly answers the question while ignoring the documents.
Conclusions
Document retrieval is the core of RAG applications and the main challenge to solve. Although some strategies are becoming standard practice, there is still a lot of tinkering needed to create the ideal pipeline and find the right trade-offs for the final solution.
Some of my experiments can be found on my GitHub. It’s a bit messy, but the READMEs contain lots of notes.
It’s still a work in progress, so I will publish some updates for sure.