Hi everyone, it’s Oscar from Fuzzy Labs. If you’ve been following us on LinkedIn, you may have heard about our Lunch & Learn sessions. These take place every two weeks and are where we come together to present interesting topics and have lunch as a team. This is the first part in a blog series sharing more of what we’ve learned in these sessions with everyone.
Without further ado, let’s jump into the first post of our blog series. Today, it’s about re-ranking, a topic I covered in one of our recent sessions. Before we dive in, let's take a step back and quickly talk about RAG (Retrieval-Augmented Generation).
What is Retrieval-Augmented Generation (RAG)?
Unless you’ve been living under a rock, you’ve probably heard of ChatGPT, large language models (LLMs), and RAG, which is the main topic we’ll talk about today. So, what exactly is RAG? In simple terms, RAG is when we provide additional context to a language model so it can answer questions that it wouldn’t otherwise know about.
For example, imagine you’re using ChatGPT to ask about a newly published research paper. Since ChatGPT’s training data doesn’t include this recent paper, it won’t know about it. But, if you manually copy a relevant paragraph from the paper and feed it to the model, ChatGPT can use that context to help answer your question.
In a RAG system, we have a knowledge base (or database) containing information that the system can retrieve to provide context for the model. The model then uses this context to generate an answer to the user's query. This is the core idea of RAG.
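To make that flow concrete, here’s a toy sketch of the idea (not our actual implementation): retrieval here is naive word overlap purely for illustration, and the LLM call itself is left out, since that depends on whichever model or provider you use.

```python
# A toy illustration of the RAG flow: retrieve context from a small knowledge base,
# then build the prompt we would send to the language model.
# Real systems use vector search for retrieval, which the next sections cover.

knowledge_base = [
    "Gradient descent updates parameters in the direction of the negative gradient.",
    "RAG retrieves documents and feeds them to a language model as extra context.",
    "Transformers use self-attention to capture relationships between tokens.",
]

def retrieve(query: str, docs: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k documents sharing the most words with the query (naive scoring)."""
    query_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(query_words & set(d.lower().split())), reverse=True)
    return scored[:top_k]

query = "How does RAG give a language model extra context?"
context = "\n".join(retrieve(query, knowledge_base))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the LLM to generate the answer
```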
Retrieving relevant data from the knowledge base
In a chatbot, user experience is crucial, and no one wants to wait too long for a response. To get fast responses from a RAG system, we need a way to retrieve relevant documents efficiently. This is where vector search comes into play.
How does vector search work?
When we build our knowledge base, we transform text into vectors: compressed numerical representations of the semantic meaning behind the text. To do this, each sentence is passed through a transformer model, which generates its corresponding vector. These vectors are then placed into a vector space in our knowledge base.
When a user asks a question, we want to encode the query into a query vector so that we can compare the query vector to the document vectors in our database, returning the closest matches as relevant documents.
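If you’d like to see roughly what this looks like in code, here’s a minimal sketch using the sentence-transformers library; the model name and the example documents are just illustrative choices, not what we use in production.

```python
# A minimal vector search sketch with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

documents = [
    "Gradient descent is an optimisation algorithm that minimises a loss function.",
    "The Eiffel Tower is located in Paris.",
    "Learning rates control the step size of gradient descent updates.",
]

# Embed every document once up front; these precomputed vectors are our knowledge base.
doc_embeddings = model.encode(documents, convert_to_tensor=True)

# Encode the user's query into the same vector space and compare it to every document.
query = "How does gradient descent work?"
query_embedding = model.encode(query, convert_to_tensor=True)
scores = util.cos_sim(query_embedding, doc_embeddings)[0]

# The closest matches become the retrieved context.
top_k = 2
for idx in scores.argsort(descending=True)[:top_k]:
    i = int(idx)
    print(f"{scores[i].item():.3f}  {documents[i]}")
```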
Although vectorising our documents allows us to do vector search to efficiently retrieve documents, it inevitably compresses information, meaning some details are lost. This price is generally worth paying because vector search offers speed and relatively accurate results.
Vector search is a common approach in RAG systems due to its simplicity and speed. After all, no one wants to wait five minutes for an answer.
What is the limitation of vector search?
Although vector search is fast, creating embeddings from text means that some of the semantic meaning of the text might get lost during compression. This loss of information leads to a recall problem, where important documents are not retrieved. The term recall is key here: it’s the metric we’ll aim to improve through reranking.
But what exactly do I mean by recall and missing relevant documents? Well, there are two types of recall we need to understand:
- Retrieval Recall: the proportion of relevant documents retrieved compared to the total number of relevant documents in the knowledge base. For example, if I ask a question about gradient descent and there are three documents in the knowledge base on this topic, perfect retrieval recall would mean retrieving all three (there’s a small worked example below).
- LLM Recall: the language model’s ability to effectively use the context it’s given. This research shows that as the number of tokens in the context window increases, the LLM’s recall tends to decrease. Essentially, feeding too much information to the LLM can overwhelm it and reduce its effectiveness.
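Here’s the small worked example of retrieval recall mentioned above, using the gradient descent scenario; the document IDs are made up for illustration.

```python
# Retrieval recall: the fraction of all relevant documents that the search actually returned.

def retrieval_recall(retrieved_ids: set[str], relevant_ids: set[str]) -> float:
    """Proportion of the relevant documents that were retrieved."""
    return len(retrieved_ids & relevant_ids) / len(relevant_ids)

relevant = {"doc_gd_1", "doc_gd_2", "doc_gd_3"}    # all three gradient descent documents
retrieved = {"doc_gd_1", "doc_gd_3", "doc_other"}  # what the top-k vector search returned

print(retrieval_recall(retrieved, relevant))  # 2 of 3 relevant docs retrieved -> ~0.667
```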
The challenge
When we do vector search, we typically limit the results to the top-k most relevant documents because we want to ensure that the LLM receives only the most useful and relevant information to maximise its recall.
So what happens if one of the useful documents falls below the top-k cut-off, like result 8 in the diagram above? This can happen when, for example, most of the text in result 8 discusses an unrelated topic, or it mainly contains an equation, so vector search doesn’t rank it among the top 3 most relevant results.
When this happens, we will get a low retrieval recall, as we may miss relevant documents that didn’t make it into the top-k results.
One way to improve retrieval recall is to increase the number of documents returned by the vector search. However, this comes at the cost of LLM recall, as too much information can degrade the LLM’s performance.
So, the challenge becomes finding a balance to maximise:
- Retrieval Recall: Retrieve as many relevant documents as possible to ensure none are missed.
- LLM Recall: Pass only the most relevant documents to the LLM to avoid overwhelming it.
Re-ranking
Re-ranking is a two-step process: the idea is to first retrieve a large number of documents from our vector store using computationally cheap queries, and then apply more computationally expensive techniques to identify the highest-quality matches.
Re-ranker model
A re-ranking model is typically a cross-encoder, a type of transformer model designed for evaluating the relevancy of query-document pairs.
Unlike vector search, which compares precomputed embeddings, a cross-encoder processes the query-document pair together in a single inference step. This ensures that no information is lost during vector compression and enables the model to deeply evaluate the semantic relevance of the pair.
But why? What makes reranking better than just vector search? Well, transformers tend to be better at capturing relationships between words and understanding semantic nuance, thanks to their self-attention mechanism: a key component of modern LLMs that helps models focus on the important parts of the input and capture contextual relationships for a deeper understanding of meaning and relevance. If you are interested in how it works, you should definitely check out the original paper. In short, transformers provide a more in-depth relevance comparison between two pieces of text than simpler approaches like cosine similarity, which is commonly used for vector search. If you'd like to explore more about the metrics used in vector search, you can learn about them here.
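To make this a bit more concrete, here’s how scoring query-document pairs with a cross-encoder might look using the sentence-transformers library; the model name is just a commonly used open reranker, picked for illustration.

```python
# A minimal cross-encoder sketch with sentence-transformers.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # example reranker model

query = "How does gradient descent work?"
candidates = [
    "Gradient descent is an optimisation algorithm that minimises a loss function.",
    "The Eiffel Tower is located in Paris.",
]

# Each (query, document) pair is scored in a single forward pass, so the model
# sees both texts together rather than comparing two precomputed vectors.
scores = reranker.predict([(query, doc) for doc in candidates])
for doc, score in sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```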
The Two-Step Process
- Vector Search: We build our knowledge base from the embeddings of our documents. Because these embeddings are precomputed, similarity search over them is fast. To increase retrieval recall, we retrieve a much larger set of results, for example the top 20 instead of only the top 3.
- Reranking: We take the top 20 results from vector search and pass them to a reranker for evaluation. Unlike vector search, rerankers are more computationally expensive models, which allows them to accurately analyse the semantic relevance of each query-document pair.
Because reranking is applied to a smaller subset of documents, we can afford to use these more complex and resource-intensive models without significantly impacting performance. This is the step that ensures the documents we feed to the LLM are the most useful and valuable.
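Putting the two steps together, a sketch of the retrieve-then-rerank flow could look something like this; the model names and the 20/3 cut-offs are example choices rather than anything prescribed.

```python
# A sketch of the full two-step process: cheap vector search over the whole
# knowledge base first, then an expensive cross-encoder rerank over the shortlist.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

retriever = SentenceTransformer("all-MiniLM-L6-v2")                 # example embedding model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")      # example reranker model

def retrieve_and_rerank(query: str, documents: list[str], doc_embeddings,
                        retrieve_k: int = 20, final_k: int = 3) -> list[str]:
    # Step 1: fast vector search, casting a wide net to protect retrieval recall.
    query_embedding = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_embedding, doc_embeddings, top_k=retrieve_k)[0]
    shortlist = [documents[hit["corpus_id"]] for hit in hits]

    # Step 2: slower cross-encoder rerank, keeping only the best few for the LLM.
    scores = reranker.predict([(query, doc) for doc in shortlist])
    ranked = sorted(zip(shortlist, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked[:final_k]]

# Usage: embed the knowledge base once, then retrieve and rerank per query.
documents = ["...your documents here..."]  # placeholder corpus
doc_embeddings = retriever.encode(documents, convert_to_tensor=True)
context = retrieve_and_rerank("How does gradient descent work?", documents, doc_embeddings)
```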
Simple, right? It’s just another step added in.
The idea is that retrieving a larger set of documents increases retrieval recall, and by reranking them we ensure only the most relevant ones make it through, thus increasing LLM recall.
Why should I use re-ranking?
Think of it this way: vector search is fast but only moderately accurate. Rerankers, on the other hand, are slower but far more precise. By combining the two, we get the best of both worlds: the speed of vector search and the more accurate query-document relevancy evaluation of reranking.
What’s next?
By now, you should have a solid understanding of what reranking is, how it works, and where it fits in a RAG system. But don’t just take my word for it, try it out yourself! We’ve provided a notebook example that demonstrates the improvement in results when using a reranker compared to naive vector search. Feel free to check it out, and don’t hesitate to reach out if you have any questions.