
This blog is one of a series that originated from our Lunch and Learn sessions at Fuzzy Labs: an opportunity for us to share interesting topics, practise our public speaking, and have a nice lunch together!
Let’s get into it: this blog is about hybrid search, a topic covered by Shubham as part of his investigation into optimising LLM RAG systems.
What are the different components of RAG?
Before diving into the specifics of how a Retrieval-Augmented Generation (RAG) system is composed, it’s worth answering the question: What is RAG? In essence, it’s a way of providing additional information to a large language model so it can answer questions about that information. For example, you might have a company policy document. An LLM that you’ve downloaded from Hugging Face won’t know about that, especially if it’s a private document. RAG enables you to pass that document, or the relevant parts of it (how we find those parts is what we’ll cover in this blog!), so the LLM can use it to answer your question.
Now that we’ve established what RAG is, let’s look at how it’s composed.

Here are the core stages of a basic RAG application:
- Indexing: Raw documents are ingested, pre-processed into text chunks, and enriched with metadata. Embeddings for each chunk are then stored in a vector database.
- Retrieval: For a given query, its embedding is generated using the same model used in the Indexing stage. This embedding is used to fetch semantically similar documents from the vector store.
- Generation: The query and retrieved documents are combined into a prompt, which the LLM then uses to generate the final response.
In this blog, we’ll focus on the first two stages, Indexing and Retrieval, and explore how data enters and exits the vector database.
Indexing and Retrieval
Building Our Database

One of the first things we do when building a RAG system is setting up our database. This means pre-processing and storing the data we think will be useful for the LLM when answering the kinds of questions we expect users to ask.
This step is called indexing. But what do I mean by pre-processing the data? Well, when a user asks a question, we need a way to find the most relevant documents. That’s where vector search comes in. But as the name suggests, it requires vectors.
To create these vectors (which are just numerical representations of text), we use an embedding model. This model converts our text data into dense vector representations, which we then store in a vector database alongside the original text and metadata.
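As a rough sketch, here’s what that can look like in Python using the sentence-transformers library. The chunks, the metadata, and the in-memory list standing in for a real vector database are all invented for illustration:

```python
# A minimal indexing sketch: embed text chunks and store them
# alongside the original text and metadata. The list here is a
# stand-in for a real vector database.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

chunks = [
    "Employees may work remotely up to three days per week.",
    "Annual leave requests must be submitted two weeks in advance.",
]

index = [
    {"text": chunk, "embedding": emb, "metadata": {"source": "policy.pdf"}}
    for chunk, emb in zip(chunks, model.encode(chunks))
]
```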
Finding Relevant Documents - Retrieval Stage

In the retrieval stage, our goal is to find relevant documents that can help the LLM answer the user’s query. To do this, we take the query, convert it into an embedding vector using the same embedding model we used for indexing, and compare it against the stored embeddings.
Different Types of Retrieval Method
Dense
Dense retrieval, also known as embedding or semantic retrieval, is one of the most commonly used methods for finding relevant documents.
The idea is simple: the closer a document’s embedding is to the query’s embedding, the more relevant that document is likely to be.
When we retrieve documents, the query's embedding is compared to the embeddings in the vector database, and the closest matches are returned.
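Continuing the sketch from the indexing step, retrieval boils down to embedding the query with the same model and ranking the stored chunks by similarity. Cosine similarity is used here, and the query is invented:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: 1.0 means identical direction, 0.0 unrelated.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_embedding = model.encode("How many days can I work from home?")

# Rank stored chunks by how close their embeddings are to the query's.
ranked = sorted(
    index,
    key=lambda entry: cosine_similarity(query_embedding, entry["embedding"]),
    reverse=True,
)
print(ranked[0]["text"])  # the most semantically similar chunk
```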
Pros:
- Captures the semantic meaning of text, making it easier to find relevant information even if different words are used.
Cons:
- Embedding models require a lot of memory.
- Storing dense vectors takes up significant storage space.
- If the pre-trained model doesn’t align well with new data, retraining may be necessary.
- Not great for exact keyword matching, which can be a drawback in some cases.
Sparse Retrieval: A Term-Based Approach
The second method we’ll look at is sparse retrieval, also known as term-based retrieval. With sparse retrieval, we focus on finding relevant data based on keywords and their significance in a query, rather than on dense embeddings.
These models don’t just identify important terms; they also account for variations like synonyms. By using a dictionary, they map each term to its importance, making the search more precise and keyword-focused. Below is an example of what I mean by using a dictionary: each word is associated with a number that signifies its importance.
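The weights below are invented for illustration; a learned sparse model such as SPLADE produces this kind of term-to-weight mapping, and can also add related terms that never appeared in the original text:

```python
# Illustrative only: the weights below are invented.
# A learned sparse model maps each term to an importance weight,
# and may add expansion terms that never appear in the query itself.
sparse_query = {
    "running": 1.8,
    "shoes": 1.6,
    "marathon": 1.4,
    "training": 0.9,
    "sneakers": 0.7,  # expansion term added by the model
    "best": 0.3,
}
```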

Pros:
- Precisely captures keywords and their synonyms.
- Sparse vectors require less storage compared to dense vectors.
Cons:
- Still memory-intensive due to the transformer-based model.
- Requires retraining for mismatched data.
Full-Text Retrieval with BM25
So far, we’ve looked at Dense Retrieval (which understands meaning) and Sparse Retrieval (which focuses on specific words). But there’s another approach that powers traditional search engines: BM25, an improved version of TF-IDF (Term Frequency Inverse Document Frequency).
TF-IDF ranks words by importance in a document, boosting rare but meaningful terms while downplaying common ones.
BM25 improves on TF-IDF by taking into account things like document length and term saturation.
I won’t go too deep into the formula for BM25 and how it works, but if you are interested, here’s a good article where you can learn more about the mathematics behind BM25.
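For reference, the standard BM25 scoring function for a query Q with terms q_1, …, q_n against a document D is:

$$\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

Here f(q_i, D) is the term’s frequency in the document, |D| is the document’s length, avgdl is the average document length in the collection, and k_1 and b are tuning parameters (commonly k_1 ≈ 1.2 and b = 0.75) that control term saturation and length normalisation respectively.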
BM25 improves ranking by factoring in these elements. For example, if you're searching for "machine learning models" in a collection of research papers, a simple TF-IDF approach would rank documents higher if they contain those words frequently. But that can be misleading: just because a term appears a lot doesn’t necessarily mean the document is the best match. BM25 adjusts for this by:
- Handling Term Saturation: If "machine learning" appears 100 times in a long paper and 5 times in a shorter but more relevant article, BM25 won’t blindly favor the longer one. It limits how much repeated terms boost a document’s score.
- Adjusting for Document Length: Shorter documents don’t get unfairly penalised for having fewer total words. BM25 balances this so that a concise, relevant document can still rank highly.

Unlike dense or sparse retrieval methods, BM25 is purely statistical. It doesn’t rely on vector embeddings but instead looks for exact matches between the query and document content. This makes it simple, efficient, and a great choice when you need quick, precise results.
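If you want to try this yourself, here’s a minimal sketch using the rank_bm25 package; the corpus and query are invented, and tokenisation is naive whitespace splitting:

```python
# pip install rank-bm25
from rank_bm25 import BM25Okapi

corpus = [
    "machine learning models for image classification",
    "a survey of deep learning",
    "training machine learning models at scale",
]
bm25 = BM25Okapi([doc.split() for doc in corpus])

query = "machine learning models".split()
print(bm25.get_scores(query))              # one BM25 score per document
print(bm25.get_top_n(query, corpus, n=1))  # the best-matching document
```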
Pros:
- Fast computation due to its mathematical nature.
- Efficient for exact string matching.
- Minimal storage requirements.
Cons:
- Doesn’t account for synonyms—only exact word matches.
- Needs scores to be recalculated when new data is added.
A Combination of Different Approaches
Instead of relying on just one retrieval method, we can combine all three approaches—dense, sparse, and full-text retrieval—to get the best of each. This is known as hybrid retrieval, where each method finds relevant documents independently, and a separate algorithm decides which results to keep.
For example, imagine you're searching for "best running shoes for marathon training."
- Dense retrieval might return articles discussing "long-distance running footwear" because it understands the semantic meaning.
- Sparse retrieval could prioritize documents with exact matches for "best running shoes."
- Full-text retrieval (BM25) might highlight product reviews where "running shoes" appear frequently.
By combining these methods, you get a more complete and relevant set of results.
How Do We Merge the Results?
There are different ways to combine and rank documents from multiple retrieval methods. The component responsible for this is called a result collector, and we’ll look at three types of result collector today.

- Reciprocal Rank Fusion (RRF): Uses a mathematical formula to merge rankings, ensuring that even lower-ranked results from one method can contribute if they are consistently relevant across multiple methods (see the sketch after this list).


- Simple Weighted Fusion: Assigns different weights to each method and combines the scores accordingly. For instance, if sparse retrieval is more reliable for your use case, it might get a higher weight.

- Re-Ranker Models: These are cross-encoder models that take the top retrieved documents and rescore them based on their actual relevance to the query.
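To make the first of these concrete, here’s a minimal sketch of Reciprocal Rank Fusion. The document IDs are invented, and k = 60 is the constant commonly used in the RRF literature:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    # Each document scores sum(1 / (k + rank)) across the ranked lists
    # it appears in, so documents ranked consistently well by several
    # retrievers rise to the top of the fused ranking.
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical rankings from dense, sparse, and BM25 retrievers:
dense  = ["doc3", "doc1", "doc7"]
sparse = ["doc1", "doc3", "doc2"]
bm25   = ["doc1", "doc9", "doc3"]
print(reciprocal_rank_fusion([dense, sparse, bm25]))
# "doc1" and "doc3" rank highest: they are relevant across all methods.
```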

Hybrid Retrieval Pros and Cons
Pros:
- We can maximise precision and recall by leveraging the strengths of each method.
- Combines complementary strategies for improved relevance.
Cons:
- More complex to implement and tune.
- Slows down the retrieval process, since multiple methods have to run and their results then need to be merged.
What’s Next?
By now, you should have a solid understanding of the different types of retrieval methods, how they work, and their strengths and weaknesses. But don’t just take my word for it—try it out for yourself!
We’ve put together a notebook example that demonstrates the strengths and weaknesses of each method.
That’s it for this Lunch & Learn insight. We hope you enjoyed learning about hybrid search. If you’d like to learn about other ways to improve RAG, check out a couple of our other Lunch and Learn blogs:
Improving RAG Performance: WTF are Re-Ranking Techniques?
Improving RAG Performance: WTF is Semantic Chunking?
Thanks for reading, and see you next time!