This blog follows our “MindGPT: DataOps for Large Language Models” blog, which can be found here.
In this blog, we delve into the concept of vector databases and how they can be used to enhance response times when interacting with LLMs. We explore the potential of these databases to improve efficiency, extract relevant documents from large datasets, and optimise the performance of MindGPT.
This blog forms part of a series on MindGPT, where we invite you to join us on our journey of developing a specialised LLM dedicated to mental health information. Each blog entry will delve into the technical aspects of a specific stage in our custom LLM's development. For a comprehensive introduction to the project, we recommend starting with our first blog in this series. To follow along with the code, check out our GitHub Repository.
Throughout this project, we aim to showcase the potential of open-source tools and models by demonstrating how you can create your own LLM. Additionally, we'll explore why open source plays a crucial role in ensuring the transparency and reliability of these LLMs. Stay tuned for useful insights and exciting updates on our progress!
What is a vector database?
Vector databases share similarities with traditional databases, but they are specifically designed and optimised for efficiently storing and retrieving high-dimensional vectors. Unlike traditional databases, which excel at managing structured tabular data, vector databases are tuned to the particular demands and characteristics of vector data.
Computers can’t read, so how can we teach them? To enable computers to understand and learn from human language, natural language processing techniques rely on mathematical representations of language. These representations, known as embeddings, are stored as vectors. This approach allows for the transformation of words, sentences, and documents into a format that machines can process and analyse effectively.
You can read our “Journey to (Embedding) Space!” blog for more details on embeddings.
Many tasks within machine learning, especially in natural language processing, require computing the similarity between embeddings and returning the closest matches; see, for example, K-nearest neighbours (supervised) or K-means clustering (unsupervised).
Vector databases are optimised for these operations. They use Approximate Nearest Neighbour (ANN) algorithms: a family of techniques that approximate an exact nearest-neighbour search at a much lower computational cost. These generally fall into three categories: quantisation methods, space-partitioning methods, and graph-based methods. Some vector databases also support GPU hardware acceleration for faster vector operations.
This makes real-time document retrieval possible over vector representations of huge numbers of documents in time-sensitive applications, something that would not be feasible with a regular database.
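To make the graph-based category concrete, here is a minimal sketch of an HNSW index built with the open-source hnswlib library (the same family of index that Chroma uses under the hood). The vectors, dimensions and parameters below are purely illustrative and are not taken from the MindGPT codebase.

```python
import hnswlib
import numpy as np

dim = 384              # illustrative embedding dimension
num_elements = 10_000  # illustrative corpus size

# Random vectors standing in for document embeddings.
rng = np.random.default_rng(42)
data = rng.random((num_elements, dim), dtype=np.float32)

# Build an HNSW index, a graph-based ANN method, over the embeddings.
index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data, np.arange(num_elements))
index.set_ef(50)  # query-time speed/accuracy trade-off

# Find the 5 approximate nearest neighbours of a new query vector.
query = rng.random((1, dim), dtype=np.float32)
labels, distances = index.knn_query(query, k=5)
print(labels, distances)
```

The query avoids comparing against every stored vector, which is exactly the optimisation a vector database relies on at scale.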
Vector Databases for MindGPT
Going back to MindGPT. We're using two website data sources, one from NHS Mental Health and the other from Mind Charity. With this data we want to repurpose a base LLM, specifically Google’s Flan-T5, to answer questions about the text in the data sources.
To repurpose the generic LLM for use in MindGPT we have two options:
Fine-tune the model
This is where the model's weights (or a subset of them) are updated based on new data fed to the model. Fine-tuning can be expensive and time-consuming and usually requires the data to be in a specific format. For example, a question-answering task requires three fields: Question, Context, and Answer. Our current data is not in this format, and datasets like these take a lot of time and expert knowledge to produce manually.
Use in-context learning
This is where surrounding information from various documents is prepended to an inference input to provide the relevant context for extracting information from and producing an answer.
In the current version, we decided to use in-context learning as it is significantly cheaper and faster to develop and iterate.
Although large language models are inherently large, their context windows have limits. For example, the maximum context length for GPT-4 (the version specifically designed to have a large context) is approximately 32,000 tokens. For reference, the book Pride and Prejudice has 156,644 words, and roughly 75 words correspond to 100 tokens. If we wanted to summarise a character within the book, it would not be possible to do so by adding the whole book to the context.
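A quick back-of-the-envelope calculation, using the same ~100 tokens per 75 words rule of thumb, shows just how far the whole book exceeds that window:

```python
# Rough token-budget arithmetic using the ~100 tokens per 75 words rule of thumb.
words_in_book = 156_644                       # Pride and Prejudice
approx_tokens = round(words_in_book * 100 / 75)
gpt4_32k_context = 32_000                     # approximate maximum context length

print(f"~{approx_tokens:,} tokens")           # ~208,859 tokens
print(approx_tokens / gpt4_32k_context)       # roughly 6.5x the available context
```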
So how can we obtain only the relevant sections of the NHS Mental Health and Mind webpages to provide context in the MindGPT LLM input?
The first step is to split our cleaned data into sections. These can be sentences, paragraphs or longer passages; depending on the task and the data at hand, one choice may be more advantageous than another.
These sections are then embedded into sentence/document embeddings which represent the meaning of the entire piece of text, not just individual words. See the “Embedding Documents” diagram below for a simplified representation.
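As a minimal sketch of this chunk-and-embed step, the snippet below uses the sentence-transformers library. The chunking strategy, model name and example text are illustrative assumptions, not the exact choices made in MindGPT (as noted later, the embedding model may vary with the document length we settle on).

```python
from sentence_transformers import SentenceTransformer

# Illustrative embedding model; the model used in MindGPT may differ.
model = SentenceTransformer("all-MiniLM-L6-v2")

def chunk_text(text: str, max_words: int = 200) -> list[str]:
    """Naively split cleaned text into roughly fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

# Hypothetical cleaned text from one of the scraped pages.
cleaned_page = "Anxiety is a feeling of unease, such as worry or fear ..."
chunks = chunk_text(cleaned_page)

# One embedding per chunk, representing the meaning of the whole chunk.
embeddings = model.encode(chunks)
print(embeddings.shape)  # (num_chunks, embedding_dim)
```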
The next part is retrieval: a prompt/input is also embedded, and we retrieve the documents whose embeddings are closest to the prompt's/query's embedding. See the “Document Retrieval” diagram below.
After the N closest documents are selected, they can simply be concatenated and prepended to the LLM query to serve as contextual information for information extraction.
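The sketch below ties the retrieval and prepending steps together with a plain brute-force nearest-neighbour search. The example chunks, embedding model and prompt template are all illustrative assumptions rather than the actual MindGPT implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

# Hypothetical pre-embedded document chunks (produced by the chunking step above).
chunks = [
    "Anxiety is a feeling of unease, such as worry or fear, that can be mild or severe.",
    "Common symptoms of anxiety include restlessness, a racing heart and trouble sleeping.",
    "Mindfulness means paying attention to the present moment.",
]
embeddings = model.encode(chunks)

def retrieve(query: str, n: int = 2) -> list[str]:
    """Brute-force nearest-neighbour search: compare the query to every chunk embedding."""
    q = model.encode([query])[0]
    # Cosine similarity against every stored embedding (O(n) comparisons).
    sims = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:n]]

question = "What are the symptoms of anxiety?"
context = "\n".join(retrieve(question))

# Illustrative template: retrieved context prepended to the question for the LLM.
prompt = f"Answer the question using the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
```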
When we have a large document knowledge base, this retrieval can be slow. A standard Nearest Neighbour search (NNS) will require a distance calculation between a single embedding and every other document embedding within the database (time complexity O(n)). This is where the optimisations made by vector databases come in handy. These techniques allow for faster document retrieval, reducing the overall time required for model inference.
Our implementation will include a new ‘embedding pipeline’ which will take the preprocessed data, embed it into vectors, and upload it to the vector database. The vector database will be accessible in the LLM inference server which will retrieve the relevant documents to prepend onto the LLM input query.
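A simplified sketch of what this embedding pipeline does is shown below as plain Python rather than as the ZenML pipeline in our repository. The collection name, embedding model and example chunks are hypothetical placeholders.

```python
import chromadb
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def embed_and_upload(chunks: list[str], client) -> None:
    """Embed each preprocessed text chunk and upload it to the vector database."""
    collection = client.get_or_create_collection(name="mind_data")  # hypothetical collection name
    collection.add(
        ids=[f"chunk-{i}" for i in range(len(chunks))],  # unique id per chunk
        documents=chunks,                                # raw text, returned at query time
        embeddings=model.encode(chunks).tolist(),        # pre-computed embeddings
    )

# In-memory client for local testing; the deployed pipeline talks to the Chroma server instead.
client = chromadb.Client()
embed_and_upload(
    ["Anxiety is a feeling of unease ...", "Depression is a low mood that lasts ..."],
    client,
)
```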
For MindGPT we have chosen to implement Chroma DB, an open-source vector database that aims to simplify embedding documents and document retrieval. It also has integrations with LangChain, which we may use in the future to aid document retrieval and to create effective, standardised prompts for the LLM. Some experimentation will be required to figure out the optimal text length to embed: if our documents are too short, we risk missing important relevant information, and retrieval will be slower because there will be a greater number of documents; if they are too long, we risk filling the input context with too much irrelevant information.
Deploying Chroma DB
We want Chroma to be deployed as a service that can be managed, scaled and monitored independently of the LLM whilst still being accessible where required. As we are using Kubernetes to run our data pipelines and deployments it makes sense to use Kubernetes to deploy the Chroma server. We use our open-source tool Matcha to provision the required infrastructure including a Kubernetes cluster (AKS) on Azure.
Note that, as Chroma DB is a very new tool, the deployment method is subject to frequent change.
How will the model and the Streamlit app interact with Chroma? The diagram below shows how Chroma will be integrated into our infrastructure on Kubernetes. The database will be accessible by the Streamlit app (also hosted on K8s) which will query Chroma to retrieve the docs and create the whole query to be processed by the LLM.
Chroma DB provides a Docker image and we have created a Kubernetes manifest which allows us to deploy the Chroma server and backend database (Clickhouse) to AKS in a sidecar container pattern. Note that there exists a bug in the public Docker image.
The first step is to build the Chroma Docker container and push it to the Azure Container Registry (ACR). After this is completed, the Kubernetes manifests we created can be run to deploy the images onto the cluster. When the Chroma DB server is deployed on AKS it is accessible through a port within the cluster.
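Once the server is running, other services in the cluster (such as the Streamlit app) can reach it over that port using the Chroma Python client. The sketch below uses a hypothetical in-cluster service name, port and collection name; the exact client API also differs between Chroma versions, so treat this as illustrative.

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Hypothetical in-cluster service DNS name and port; real values come from our Kubernetes manifests.
client = chromadb.HttpClient(host="chroma-service.mindgpt.svc.cluster.local", port=8000)
collection = client.get_or_create_collection(name="mind_data")  # hypothetical collection name

# Embed the user question with the same (illustrative) model used by the embedding pipeline.
model = SentenceTransformer("all-MiniLM-L6-v2")
question = "How can I manage anxiety?"
results = collection.query(query_embeddings=model.encode([question]).tolist(), n_results=3)

# The closest document chunks, to be concatenated and prepended to the LLM input as context.
retrieved_docs = results["documents"][0]
context = "\n".join(retrieved_docs)
```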
The Chroma database is populated by a new ZenML pipeline. This pipeline first collects a specified data version with DVC, embeds each piece of text, and uploads it to the database, making it accessible to the LLM deployment image. To create the embeddings, we will use an embedding model specifically created and fine-tuned for embedding larger pieces of text. This model may vary depending on the document length we decide to use.
This is only a brief overview of our approach; a more comprehensive guide on deploying Chroma yourself will follow soon, providing detailed insights into the process. Stay tuned for a deeper understanding of how we have deployed Chroma.
What's next?
In our next blog, we will look at a run-through of our prototype implementation of MindGPT, encapsulating everything we have discussed throughout this and previous blogs. We look forward to seeing you there!