
This blog is one of a series that originated from our Lunch and Learn sessions at Fuzzy Labs, an opportunity for us to share interesting topics, practise our public speaking, and have a nice lunch together!
Let's get into it: this blog is about semantic chunking, a topic covered by Oscar as part of his investigation into optimising LLM RAG systems.
What is Chunking?
Before getting into semantic chunking, let’s just take a step back and ask ourselves a fundamental question: why do we even chunk text at all?
When building a Retrieval-Augmented Generation (RAG) system, the first step is to create a knowledge base. This involves processing our data, which is typically text, and storing it in a database so it can be used to answer user questions later.
If we simply ingest the raw text from these documents, we're left with massive blocks of text. For example, a book might have thousands of words in a single chapter, which isn't ideal. Think of it like studying for an exam: you wouldn't try to memorise the entire book in one go. Instead, you'd focus on one chapter at a time, practise questions for that chapter, and then move on to the next. Otherwise, it's just too much information to handle at once.
Language models also have their limits: they can't process unlimited amounts of text at once, and there are two key reasons for this.
- Context Limit: Language models have a fixed context window, meaning there's a cap on how much text we can pass to them in a single request.
- Signal To Noise: Models perform better when given data that's directly relevant. If we feed a model thousands of words where only a small portion is useful, we're relying on it to sift through and figure things out, which is inefficient and often degrades the quality of the answer. Instead, we should proactively remove irrelevant information to make the input as focused as possible.
Goal of Chunking
Text splitting or chunking is the process of breaking data into smaller, more manageable pieces, optimising it for both the task at hand and the language model. The real objective is to prepare the data in a way that makes it as useful and relevant as possible for our specific needs.

There are various methods we can use to chunk our data; however, before deciding which one to use, we should ask: how can we process our data so that it's optimised for our language model? There's no one-size-fits-all solution; what works for someone else might not work for us.
How Do I Effectively Chunk Data For An LLM?
If you’re new to chunking and don’t know much about different strategies, a simple approach is to chunk text by a fixed character or word length. For example, if you have a document with 1,000 words, you could divide it into 10 chunks of 100 words. This approach has its pros and cons:
Pros: Extremely straightforward, easy to implement, and an improvement on simply feeding the model an entire document.
Cons: Fixed-size chunking ignores the structure of the text. It may break at arbitrary points, splitting sentences mid-thought and disrupting the flow of information and context.
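To make this concrete, here's a minimal sketch of fixed-size chunking. It splits on words rather than raw characters, and the `chunk_size` parameter is just an illustrative choice:

```python
def fixed_size_chunks(text: str, chunk_size: int = 100) -> list[str]:
    """Split text into chunks of roughly `chunk_size` words each."""
    words = text.split()
    return [
        " ".join(words[i : i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]

# A 1,000-word document yields roughly 10 chunks of 100 words each.
document_text = " ".join(f"word{i}" for i in range(1000))  # stand-in document
print(len(fixed_size_chunks(document_text)))  # -> 10
```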
How Can I Improve Chunking?
To improve on fixed-size chunking, we can split the text based on natural breaks like sentence-ending punctuation (., ?, !) and set a maximum chunk size. This keeps some of the natural flow intact and reduces the risk of losing important context.
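As a rough sketch of this idea (both the sentence-splitting regular expression and the `max_chars` cap here are illustrative assumptions, not a fixed recipe):

```python
import re

def sentence_chunks(text: str, max_chars: int = 500) -> list[str]:
    """Split on sentence-ending punctuation, then pack whole
    sentences into chunks without exceeding `max_chars`."""
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding the next sentence would exceed the cap.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```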

Even with this improved approach, there are still drawbacks. Chunks can still span different topics, and when a chunk is compressed into a single vector embedding, the individual meanings become diluted. This is shown above, where three sentences about different topics (swimming, computers, and coffee) are all grouped together into the same chunk.
Ideally, a vector embedding will capture the meaning of one specific topic; however, with a fairly naive approach that splits based only on punctuation, we risk having mixed topics in each embedding. If you are interested in learning more about embeddings and their role in vector search, check out our blog on embeddings!
Semantic Chunking
Here's the problem with naive approaches: they rely on global constants like chunk size or common delimiters without considering the actual content. Think about it like this: when I'm writing this blog, I group sentences into paragraphs so that each paragraph focuses on the same topic. Semantic chunking aims to address this by grouping similar information together to ensure each chunk captures a clear, focused meaning.
How Does Semantic Chunking Work?
Instead of relying on arbitrary limits, we take an embedding-based approach. Embeddings allow us to capture the semantics of each input. Initially, we chunk the text using a naive method, then embed each chunk. The key idea is that we evaluate the embedding distances between chunks. If two chunks have embeddings that are close in distance, we group them together. If not, we leave them as separate chunks.
This approach involves more work and is slower than naive chunking, but it produces chunks that are more meaningful and contextually relevant, improving RAG responses.
To summarise, this is the idea:
- Generate embeddings for each sentence in a document, based on the assumption that each sentence typically focuses on a single topic.
- Compare the embeddings to measure their similarity. Close embeddings imply related content, while distant embeddings suggest unrelated topics.
- Group sentences with similar embeddings together using a threshold. This way, chunks represent cohesive ideas or topics.
- Imagine you have a long essay. By embedding each sentence and comparing them, you can group sentences that “belong together” based on their meaning.

What the above figure shows is that we aim to separate sentences or paragraphs into chunks, grouping only those that “belong together” based on their meaning. Otherwise, we treat them as separate chunks.
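To illustrate the comparison step, here's a small sketch using the sentence-transformers library. The all-MiniLM-L6-v2 model and the example sentences are just illustrative choices; any sentence embedding model would work:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I swim laps at the local pool every morning.",
    "Front crawl is the fastest swimming stroke.",
    "My computer needs a new graphics card.",
]
embeddings = model.encode(sentences)  # one vector per sentence

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare each sentence with the next: a drop in similarity
# suggests a topic change, i.e. a natural chunk boundary.
for i in range(len(sentences) - 1):
    sim = cosine_similarity(embeddings[i], embeddings[i + 1])
    print(f"sentence {i} -> sentence {i + 1}: similarity = {sim:.2f}")
```

Here the two swimming sentences should score noticeably higher against each other than either does against the computer sentence, which is exactly the signal we use to place a boundary.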
Crafting Meaningful Chunks
There isn't a go-to formula for semantic chunking, just as there isn't a single best chunking strategy; it's all about experimentation and iteration. The goal of semantic chunking is to make your data as valuable as possible to your language model for your specific task.
However, we can start with a simple approach (shown in the figure below, with a code sketch after it):
- Split the document into sentences using punctuation (e.g., ., ?, !) or tools like spaCy or NLTK for more nuanced breaks.
- Calculate distances between sentence embeddings.
- Group similar sentences together or split sentences that aren’t similar.

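Putting those three steps together, here's one possible sketch, not the notebook's exact implementation: each sentence is compared only with its immediate predecessor, and the similarity `threshold` is a hypothetical starting point you'd tune for your own data.

```python
import re

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(text: str, threshold: float = 0.5) -> list[str]:
    """Greedily merge consecutive sentences whose embeddings are similar."""
    # Step 1: split the document into sentences on punctuation.
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    embeddings = model.encode(sentences)

    chunks: list[str] = []
    current = [sentences[0]]
    for i in range(1, len(sentences)):
        # Step 2: cosine similarity between neighbouring sentence embeddings.
        a, b = embeddings[i - 1], embeddings[i]
        similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        # Step 3: keep grouping while sentences stay on topic;
        # otherwise close the current chunk and start a new one.
        if similarity >= threshold:
            current.append(sentences[i])
        else:
            chunks.append(" ".join(current))
            current = [sentences[i]]
    chunks.append(" ".join(current))
    return chunks
```

From here, there's plenty of room to experiment: comparing each sentence against the running average of the current chunk rather than just its neighbour, or choosing the threshold from the distribution of similarities instead of fixing it up front.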
What’s Next?
By now, you should have a solid understanding of the purpose of semantic chunking, how it works, and how it differs from naive chunking. But don't just take my word for it: try it out yourself! We've provided a notebook example that demonstrates the improvement in results when we semantically chunk text compared to naive chunking. Feel free to check it out, and don't hesitate to reach out if you have any questions.
That's it! I hope you enjoyed learning about semantic chunking; check out our other blogs on improving RAG performance.
You can find the code for everything discussed in this blog here.