In the previous blog, we introduced Matcha, our open source tool for provisioning machine learning infrastructure on Microsoft® Azure. Matcha provides an easy way for data scientists and ML engineers to provision the infrastructure required for running their machine learning workflows. Matcha is completely open source and built on top of awesome open source tools: MLflow, to track experiments; ZenML, for workflow orchestration; Kubernetes, for model training workloads, as well as for hosting everything else; and Seldon, to deploy and serve models in production.
In this blog, we showcase how we use Matcha to fine-tune and deploy a Large Language Model (LLM) in a production setting.
Introduction
LLMs are the talk of the town! For those living under a rock, LLMs are a family of machine learning models built on the transformer architecture, introduced to the world in the seminal “Attention Is All You Need” paper published in 2017.
Creating an LLM
Creating a performant LLM from scratch is beyond the means of most of us. In short, the recipe to create an LLM from the ground up is to train a large transformer model, consisting of billions of parameters, on an even larger dataset, representing trillions of tokens, using a ridiculous amount of compute, which we expect would amount to millions of dollars. The output of this process is a model with emergent abilities, such as being able to generate natural-language responses, reason (up to a certain degree), and provide endless, fun interactions. The outputs produced by these models, however, are not always consistent with human expectations.
Fine-Tuning
Fine-tuning is the everyman’s approach to creating an LLM. Fine-tuning refers to the process of taking an established LLM and training that model on a domain-specific dataset. Fine-tuning has an added benefit compared to using an off-the-shelf, pre-trained LLM: because the model is trained on a particular task or dataset, it learns to perform especially well at that task. We want to demonstrate how to fine-tune an LLM using open source tools and modest hardware requirements.
In the following sections, we explain the LLM task and dataset, outline the criteria we used for choosing an LLM, talk briefly about how to fine-tune an LLM, and show how Matcha can be used to run this LLM example. Finally, we discuss how the Matcha tool is helpful for LLMOps.
LLM task
The LLM task that we are focusing on in this example is fine-tuning an LLM to summarize a given segment of legal text. We use the ‘Plain English Summarization of Contracts’ dataset, curated and open sourced by Laura Manor and Junyi Jessy Li. The dataset contains summaries of privacy policies and terms & conditions from various websites. Here’s one example extracted from the dataset:
Input text: We may also automatically collect device specific information when you install access or use our services. this information may include information such as the hardware model operating system information app version app usage and debugging information browser information ip address and device identifiers.
Summarized text: the service may use tracking pixels web beacons browser fingerprinting and or device fingerprinting on users
Open Source LLMs
Almost every week we get a new, breakthrough LLM that represents a step forward in efficiency; it’s getting difficult to keep track. There are many options to choose from in the world of LLMs. They often come in different sizes, where the smallest are typically in the millions-of-parameters range and the largest variants comprise many billions of parameters. Distributing these models can be a challenge, and most vendors use the Hugging Face Hub, a platform for sharing ML models. Hugging Face provides easy access to all of these models through the huggingface_hub library.
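To give a feel for how easy that access is, here’s a minimal sketch of downloading a model from the Hugging Face Hub using huggingface_hub (we use the flan-t5-small checkpoint that features later in this post):
<pre><code># Minimal sketch: download every file for a Hub-hosted model locally.
from huggingface_hub import snapshot_download

# google/flan-t5-small is the checkpoint we fine-tune later in this post.
local_dir = snapshot_download(repo_id="google/flan-t5-small")
print(f"Model files downloaded to: {local_dir}")
</code></pre>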
Not all vendors choose the open source approach; some models are closed source and can be accessed only through a proprietary API. Closed source examples include the three GPT model variants by OpenAI, and Claude by Anthropic; this limits access and commercial use. Some models, like Meta's LLaMA, are labelled as open source because the weights are available to the public. To access LLaMA, however, the public has to explicitly fill out a form to download the model weights. The LLaMA derivatives (models derived from the LLaMA weights, known as deltas) include Alpaca, Koala, and Vicuna, all of which are tainted by the same restrictive license, making these models available only for non-commercial purposes. There has been rapid progress in the availability of genuinely open source LLMs through projects like RedPajama, which replicates the exact process used to train LLaMA, and Open-Assistant, which replicates ChatGPT with the data, model, and code all available for anyone to create their own. The quality of these open source LLM ecosystems represents a meaningful closing of the gap in performance relative to their closed source counterparts.
As you can see, the devil's in the details when it comes to choosing the right LLM. Fear not, we’ve got you covered. Next week, we will be releasing a blog covering this exact topic. Keep your eyes peeled 👀.
Choosing an LLM
Having decided what task we were focusing on, it was time to shop for an LLM. We defined the following criteria to help us select the best open source LLM for the task.
- License: We are a company doing open source MLOps. Open source software means that it can be seen, modified, and distributed by anyone. Many LLMs have proprietary licenses, and others are disguised as open source but aren’t in practice; only a small number of LLMs can be considered truly open source. These include Google’s T5 and UL2 model families, EleutherAI’s Pythia models, and the OpenLLaMA model from OpenLM Research. A more comprehensive list of open source LLMs can be found here.
- Model size and parameter count: LLMs come in various sizes, as measured by file size and number of parameters. For example, consider the flan-t5 family of models released by Google. These models come in 5 different variants. The flan-t5-small variant contains 80 million parameters and uses approximately 300 MB of disk space. Similarly, flan-t5-base contains 250 million parameters and takes up about 1 GB of space. The largest variant, flan-t5-xxl, contains a hefty 11 billion parameters and weighs in at around 50 GB.
- Fine-tuning and deployment resources: We have to take into consideration the resources required for fine-tuning and deploying our selected LLM. As you can imagine, fine-tuning and deploying the flan-t5-xxl variant is not going to be a walk in the park. Just to ballpark the scale of the task, it might require something in the region of 8 NVIDIA A100 GPUs with 40 GB of vRAM each and 1000 GB of disk space to be provisioned. We also have to take into account the cost of this fine-tuning; we’d (well, ChatGPT would) guesstimate around $10,000, but don’t hold us to it.
- Fine-tuning and inference time: The time required to fine-tune our LLM and to get responses back from it should also be considered. We don’t want to be sitting, staring at a blank screen for five minutes, waiting for our model to yield a response.
After experimenting with different flan-t5 variants, we chose to stick with the smallest variant, flan-t5-small, as it ticked all the boxes above:
- License: Apache 2.0 license
- Model size and parameter count: a respectable 80 million parameters and approximately 300 MB of disk space.
- Fine-tuning and deployment resources: One Standard_DS3_v2 Azure VM instance is sufficient for both fine-tuning and deployment.
- Fine-tuning and inference time: It took about 10-15 minutes to fine-tune our model over 5 epochs on a Kubernetes CPU cluster. Inference is also rather quick, taking only around 200 ms per response.
- The only notable downside is the quality of the model’s predictions relative to the larger members of the flan-t5 family.
How our LLM example works
LLM fine-tuning
Trained LLMs, i.e. LLMs trained from scratch, contain model weights derived from training on a diverse set of tasks, summarization being one of them, in a self-supervised or semi-supervised fashion. For our summarization task, we could simply use the pre-trained LLM and run inference on our legal text input. However, the performance of this pre-trained LLM will not be comparable to an LLM fine-tuned for a specific task. Fine-tuning is the process by which we modify the weights of the trained model using a training dataset representing a specific task. There are different ways to perform fine-tuning for LLMs, but that’s a topic for another day; in the coming weeks, we will be writing a deep-dive blog on the specifics of LLM fine-tuning. Here, through fine-tuning, we aim to change all the weights of the pre-trained LLM, adapting it to our specific task, i.e. summarizing legal text.
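To make that contrast concrete, here’s a minimal sketch of zero-shot inference with the pre-trained checkpoint via the transformers pipeline API; the input text is illustrative:
<pre><code># Minimal sketch: zero-shot summarization with the pre-trained (not yet
# fine-tuned) flan-t5-small checkpoint.
from transformers import pipeline

summarizer = pipeline("summarization", model="google/flan-t5-small")

# Illustrative input; the "summarize: " prefix tells the model which task to perform.
legal_text = (
    "We may also automatically collect device specific information when you "
    "install, access or use our services."
)
print(summarizer("summarize: " + legal_text, max_length=64)[0]["summary_text"])
</code></pre>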
LLMOps Pipelines
To create an LLM machine learning workflow, we use our favorite tool: ZenML. ZenML allows us to create MLOps pipelines that are tool and cloud agnostic. We break this task down into two ZenML pipelines. A ZenML pipeline can be thought of as a collection of ZenML steps, where each step is like a function performing a particular task, e.g. downloading a dataset, preprocessing inputs, etc. The pipeline brings together and runs all the steps, which may have interdependencies. For this task, we created one pipeline for fine-tuning our LLM and a separate pipeline for deploying it.
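To give a flavour of how pipelines and steps fit together, here’s a minimal sketch using the ZenML API; the step and pipeline names are illustrative, not the exact ones from our example:
<pre><code># Minimal sketch of a ZenML pipeline (illustrative names, recent ZenML API).
from zenml import pipeline, step

@step
def download_dataset() -> dict:
    # Each step is a plain Python function; ZenML stores its output as an artifact.
    return {"text": ["some legal text"], "summary": ["a short summary"]}

@step
def preprocess_dataset(dataset: dict) -> dict:
    # Steps can consume the outputs of earlier steps.
    return {key: [value.lower() for value in values] for key, values in dataset.items()}

@pipeline
def fine_tuning_pipeline():
    # The pipeline wires the steps together and runs them in dependency order.
    dataset = download_dataset()
    preprocess_dataset(dataset)

if __name__ == "__main__":
    fine_tuning_pipeline()
</code></pre>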
Fine-tuning LLM pipeline: This pipeline consists of 5 steps; a sketch of the core fine-tuning logic follows the list below. Note: we used the Hugging Face transformers library, which provides easy access to the flan-t5 models through the Hugging Face Hub, along with optimized classes for training and fine-tuning transformer models.
- download_dataset: This step is responsible for downloading the dataset and saving it to a JSON file.
- convert_to_hg_dataset: This step reads the dataset saved in our JSON file and converts it into the Hugging Face dataset format, yielding a Dataset object.
- get_huggingface_model: This step downloads the tokenizer and model for the flan-t5-small model from the Hugging Face Hub.
- preprocess_dataset: This step preprocesses the input by prepending “summarize: ” to each input text; the LLM identifies that this is a summarization task based on this prefix. It also truncates and pads the input texts so they are all the same length. Finally, the preprocessor function tokenizes the input texts and labels, converting text to numbers using the tokenizer from the previous step, and then splits the processed dataset into training and testing sets.
- tune_model: The final step in the pipeline takes the tokenized training and testing datasets and combines them with the tokenizer and model from the get_huggingface_model step to perform fine-tuning. ZenML pushes the output of this step, i.e. the fine-tuned model and tokenizer, to an artifact store. The ZenML artifact store uses cloud storage to save the outputs of every step as artifacts.
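Here is a minimal sketch of the preprocessing and fine-tuning logic using the transformers and datasets libraries; the tiny in-memory dataset, its column names, and the training arguments are illustrative stand-ins, not the exact ones from our pipeline:
<pre><code># Minimal sketch: preprocessing + fine-tuning flan-t5-small with transformers.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")

# Illustrative stand-in for the Dataset produced by convert_to_hg_dataset.
dataset = Dataset.from_dict({
    "original_text": [
        "We may also automatically collect device specific information.",
        "You grant us a licence to use the content you post on the service.",
    ],
    "reference_summary": [
        "the service collects device information",
        "the service gets a licence to your content",
    ],
})

def preprocess(batch):
    # Prepend the task prefix so the model treats this as summarization,
    # then tokenize inputs and labels.
    inputs = ["summarize: " + text for text in batch["original_text"]]
    model_inputs = tokenizer(inputs, max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["reference_summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)
splits = tokenized.train_test_split(test_size=0.5)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="flan-t5-summarizer", num_train_epochs=5),
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
    # Pads each batch dynamically instead of padding the whole dataset up front.
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
</code></pre>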
Deployment LLM pipeline: This pipeline consists of the 2 steps required for deploying our LLM. We use Seldon Core for both deploying and serving our LLM; a sketch of a custom model class follows the list below.
- fetch_trained_model: This step is specific to ZenML. Its output is the location of the fine-tuned model and tokenizer in the ZenML artifact store, detailed above.
- deploy_model: We use an advanced approach to deploying our LLM on Seldon Core with ZenML: we create a custom deployment class and a custom ZenML step that provisions an endpoint for our deployed LLM.
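To illustrate what a custom deployment class can look like, here’s a minimal sketch following Seldon Core’s Python wrapper convention; the class name and model path are hypothetical, not the exact ones from our example:
<pre><code># Minimal sketch of a Seldon Core custom model wrapper (Python wrapper convention).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer


class SummarizerModel:
    # Hypothetical path where the fine-tuned artifacts are mounted in the container.
    def __init__(self, model_uri: str = "/mnt/models"):
        self.model_uri = model_uri
        self.loaded = False

    def load(self):
        # Seldon calls load() at startup: fetch the fine-tuned model and tokenizer.
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_uri)
        self.model = AutoModelForSeq2SeqLM.from_pretrained(self.model_uri)
        self.loaded = True

    def predict(self, X, features_names=None):
        # Seldon routes each request here; X is a list of input strings.
        if not self.loaded:
            self.load()
        inputs = self.tokenizer(
            ["summarize: " + str(x) for x in X],
            return_tensors="pt", truncation=True, padding=True,
        )
        outputs = self.model.generate(**inputs, max_new_tokens=128)
        return self.tokenizer.batch_decode(outputs, skip_special_tokens=True)
</code></pre>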
Both pipelines use ZenML for workflow orchestration, which in turn uses the Kubernetes cluster provisioned on Microsoft® Azure to run the pipelines. That same cluster supports deployment with Seldon Core. Finally, we created a Streamlit application that uses the provisioned endpoint to query and fetch a summary for an input legal text.
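Here’s a minimal sketch of what such a Streamlit app can look like, assuming Seldon Core’s standard REST protocol; the endpoint URL is a placeholder, not the real one:
<pre><code># Minimal sketch: a Streamlit front end calling a Seldon Core REST endpoint.
import requests
import streamlit as st

# Placeholder following Seldon Core's REST convention; the real ingress IP,
# namespace, and deployment name come from the provisioned infrastructure.
SELDON_ENDPOINT = "http://INGRESS_IP/seldon/NAMESPACE/DEPLOYMENT/api/v1.0/predictions"

st.title("Legal Text Summarizer")
text = st.text_area("Paste a segment of legal text:")

if st.button("Summarize") and text:
    # Seldon's REST protocol wraps inputs in a JSON "data" payload.
    payload = {"data": {"ndarray": [text]}}
    response = requests.post(SELDON_ENDPOINT, json=payload, timeout=60)
    st.write(response.json()["data"]["ndarray"][0])
</code></pre>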
All the code for our LLM example can be found in our matcha-examples GitHub repository: https://github.com/fuzzylabs/matcha-examples
Matcha to fine-tune and deploy LLM example
You must be wondering what made it possible to provision all of the resources required for running this LLM example on Microsoft® Azure. Is this a fairy tale? No! That’s where our new open source tool, Matcha, really shines. Matcha strips away the complexity involved in provisioning all of the resources required for this example. In fact, it’s so easy that you can follow along in 10 simple steps:<pre><code>git clone git@github.com:fuzzylabs/matcha-examples.git
cd matcha-examples/llm
python3 -m venv venv
source venv/bin/activate
pip install matcha-ml
matcha provision
./setup.sh
python run.py --train
python run.py --deploy
streamlit run app/llm_demo.py
</code></pre>
For a little extra color, the README for the LLM example in our matcha-examples repository provides a step-by-step guide to recreating this yourself. It can be found here.
Matcha for LLMOps
LLMOps (Large Language Model Operations) is a catchy new term that describes the processes, philosophy, and culture by which we deploy and maintain LLM applications in production. The space of LLM application development is rapidly evolving, but not much attention is focused on the practical challenge of reliably building real, production-ready LLM systems. LLMOps rests upon many familiar MLOps concepts:
- Model experimentation: A typical ML application requires tweaking different hyperparameters during experimentation. It is essential to keep track of all these different experiments to get an optimal model.
- Monitoring and logging: This involves tracking various system metrics such as CPU usage, latency, throughput; and ML metrics such as performance and accuracy.
- Infrastructure Management: LLMs are expensive to train and maintain, requiring a large amount of compute. Special care should be taken to manage, monitor and maintain these resources.
- Model inference and serving: LLMs, due to their size, bring an extra dimension of complexity to the resources required for serving them.
- Model governance: This involves tracking model and data lineage, in addition to model and data versioning throughout the lifecycle of an LLM application.
- Automation: Based on monitoring signals, a trigger should automatically retrain, deploy, and serve the model; every step of this process should be automated.
Matcha helps developers provision and destroy the infrastructure required for any ML application. It also offers additional functionality by provisioning open source tools such as MLflow for experiment tracking, where any developer can simply use the provisioned endpoint in their application. Model serving is supported through Seldon Core, which is provisioned as part of the infrastructure. There are some exciting features coming in the next few releases, such as collaboration, where one environment can be shared across teams to run experiments. Other features, like model governance and monitoring, are planned on our roadmap. You can view our public roadmap here.
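As an example of how the experiment tracking piece slots in, here’s a minimal sketch of logging a fine-tuning run to the MLflow server Matcha provisions; the tracking URL is a placeholder, and the parameter and metric values are illustrative:
<pre><code># Minimal sketch: log a fine-tuning run to a provisioned MLflow server.
import mlflow

# Placeholder: in practice this is the MLflow endpoint Matcha provisions.
mlflow.set_tracking_uri("http://MLFLOW_ENDPOINT")
mlflow.set_experiment("flan-t5-legal-summarization")

with mlflow.start_run():
    mlflow.log_param("base_model", "google/flan-t5-small")
    mlflow.log_param("epochs", 5)
    mlflow.log_metric("eval_loss", 1.23)  # illustrative value
</code></pre>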
If you encounter a bug or find a missing feature, please let us know by raising an issue on GitHub.
Conclusion
We demonstrated how the Matcha tool can be used for fine-tuning and deploying an LLM. We are excited about the LLM landscape and will be releasing a series of blogs in the coming weeks. Next up, a blog introducing open source LLMs and the overall LLM landscape, to get you quickly up to speed. Following that, a blog detailing the fine-tuning process in the context of LLMs. Finally, a blog on in-context learning for LLMs. Stay tuned.