If you have any familiarity with machine learning, you'll know that a large part of it is evaluating a model's performance on data it hasn't seen before. This allows you to gauge how well the model will perform once deployed into a system, but how does this translate to large language models?
Evaluation in an LLM context is more complex because there are multiple components at play. In a traditional setting, you have a test dataset, a trained model, and an evaluation metric that you're optimising (there could be multiple). In an LLM system, there's the model, hyper-parameters (e.g., temperature), retrieved context, and the prompt template, so you need a comprehensive set of evaluation metrics to keep track of system performance whenever one of these components changes.
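As a rough sketch of what that means in practice, you might log each evaluation run together with the components that produced it. The structure below is purely illustrative and isn't tied to any particular framework.
<pre><code>from dataclasses import dataclass, field

@dataclass
class EvalRun:
    """One evaluation run of an LLM system (illustrative fields, not from any framework)."""
    model_name: str         # e.g. 'facebook/bart-large-cnn'
    temperature: float      # sampling hyper-parameter
    prompt_template: str    # template used to build the final prompt
    retrieved_context: list # documents injected by a retrieval step, if any
    metrics: dict = field(default_factory=dict)  # e.g. {'bleu': 0.42}</code></pre>
Whenever one of these components changes, re-running the same evaluation and comparing the metrics tells you whether the change helped or hurt.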
In this blog, we’ll be focusing on why it’s important to evaluate LLMs, what the common techniques are, and demonstrate how you might do it in a worked example. We’ll also dig a little deeper and discuss more advanced techniques.
What are the common LLM evaluation techniques?
There are two general approaches to evaluating LLMs: using benchmark datasets and computing metrics. We will primarily focus on the latter in this blog, but will briefly cover a few popular datasets. LLMs are a fast-growing field, so as you can imagine, there are many benchmark datasets for evaluating large language models. MMLU and HellaSwag are two notable examples that have been used to evaluate many state-of-the-art LLMs, such as GPT-4 and Llama 2.
The MMLU dataset covers 57 subjects, from elementary-level maths to law and ethics, and it's used to assess how accurately the model can answer the example questions. HellaSwag, on the other hand, tests commonsense reasoning by asking the model to pick the most plausible continuation of a given sentence.
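If you want to poke around these benchmarks yourself, both are available through Hugging Face's datasets library. The sketch below assumes the Hub identifiers 'cais/mmlu' and 'hellaswag' and the field names they expose at the time of writing.
<pre><code>from datasets import load_dataset

# MMLU: multiple-choice questions across 57 subjects ('all' loads every subject)
mmlu = load_dataset('cais/mmlu', 'all', split='test')
print(mmlu[0]['question'], mmlu[0]['choices'])

# HellaSwag: pick the most plausible ending for a given context
hellaswag = load_dataset('hellaswag', split='validation')
print(hellaswag[0]['ctx'], hellaswag[0]['endings'])</code></pre>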
Metrics behave in a similar fashion to traditional machine learning metrics in that they give you a numerical score describing how the model performs relative to some benchmark or heuristic. These can be categorised into two groups: context-free and context-dependent.
Context-free
These are metrics that evaluate a model based on predetermined benchmarks, for example, accuracy on the MMLU benchmark. The output generated by the model is only compared with the provided gold references. As they’re task agnostic, they’re easier to apply to a wide variety of tasks. The downside of these metrics is that they may not accurately reflect real-world performance in specific applications.
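As a simple illustration, a context-free metric such as exact-match accuracy needs nothing beyond the model's answers and the gold references:
<pre><code># A minimal context-free metric: exact-match accuracy against gold references
def exact_match_accuracy(predictions: list, references: list) -> float:
    matches = sum(
        pred.strip().lower() == ref.strip().lower()
        for pred, ref in zip(predictions, references)
    )
    return matches / len(references)

# Two of the three (made-up) answers match their gold reference
print(exact_match_accuracy(['Paris', '42', 'blue'], ['Paris', '42', 'red']))  # 0.666...</code></pre>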
Context-dependent
These are metrics that evaluate the performance of the model in the specific context in which it will be used, such as the BLEU (Bilingual Evaluation Understudy) score. These metrics can provide better insights into how a model will perform in its intended application. We will focus on the BLEU score in the next section and provide an example of how to calculate it.
Using BLEU to evaluate an LLM
Let’s dive into the details of the BLEU score and how we can use it to evaluate our LLM, but first, what is BLEU?
It’s a metric for automatically evaluating machine-translated text. The score ranges between 0 and 1 and measures how similar the machine-translated text is to a set of high-quality reference translations:
- A value of 0 means that the output has no overlap with the reference, hence the quality is low.
- A value of 1 means that there is perfect overlap with the reference data.
So how can we apply this metric to evaluate the quality of an LLM's output?
When we design an LLM system, it's important to have a set of benchmark questions with "good quality" answers that we can use to evaluate our LLM. These benchmark questions serve as our reference data, similar to a test dataset in more traditional machine learning tasks like image classification.
Let's take a look at some code examples. Imagine a scenario where we are evaluating an LLM that can summarise health information - similar to our MindGPT project.
<pre><code>from transformers import Pipeline, pipeline, set_seed

# Fix the random seed so the generated summaries are reproducible
set_seed(42)

def create_pipeline(
    task: str = 'summarization',
    model: str = 'facebook/bart-large-cnn',
    tokenizer: str = 'facebook/bart-large-cnn',
) -> Pipeline:
    # Build a Hugging Face summarisation pipeline
    return pipeline(task=task, model=model, tokenizer=tokenizer)

summarizer = create_pipeline()</code></pre>
For this example, we have a reference dataset in the form of a text file, which will serve as our benchmark. In this case, the dataset contains information from the NHS website defining what depression is, and we'll ask the LLM to summarise this into something shorter and, hopefully, more concise.
<pre><code>def read_information(file_name: str = 'depression_information.txt') -> str:
    # Read the reference text from disk, closing the file handle when done
    with open(file_name, 'r') as f:
        return f.read()

data = read_information()</code></pre>
To compute the BLEU score, we will use the evaluate library from Hugging Face, which implements the BLEU metric.
<pre><code>import evaluate

# Summarise the reference text and pull out the generated summary
result = summarizer(data, max_length=70)
summarised_text = result[0]['summary_text']

# Score the summary against the reference text
bleu = evaluate.load('bleu')
scores = bleu.compute(predictions=[summarised_text], references=[data])</code></pre>
The code above computes the BLEU score for our summarised text against the reference data. The model output and our reference data are shown below.
<pre><code>
Summarised text: Depression is more than simply feeling unhappy or fed up for a few days. Most people go through periods of feeling down, but when you're depressed you feel persistently sad for weeks or months. Some people think depression is trivial and not a genuine health condition. They're wrong - it is a real illness with real symptoms.
Reference data: Depression is more than simply feeling unhappy or fed up for a few days. Most people go through periods of feeling down, but when you're depressed you feel persistently sad for weeks or months, rather than just a few days. Some people think depression is trivial and not a genuine health condition. They're wrong - it is a real illness with real symptoms. Depression is not a sign of weakness or something you can "snap out of" by "pulling yourself together". The good news is that with the right treatment and support, most people with depression can make a full recovery.
Summarised text BLEU score: 0.421
</code></pre>
You can see that the BLEU score is 0.421, which is good: a score between 0.4 and 0.5 typically represents a high-quality output. For reference, a score above 0.6 is often considered to be better than a human translation.
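If you want to see how that number was arrived at, the compute call actually returns a dictionary: the overall score sits under the 'bleu' key, alongside the individual n-gram precisions and the brevity penalty used by Hugging Face's BLEU implementation.
<pre><code>print(round(scores['bleu'], 3))   # overall BLEU score, e.g. 0.421
print(scores['precisions'])       # 1- to 4-gram precisions
print(scores['brevity_penalty'])  # penalises outputs much shorter than the reference</code></pre>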
If we were to compare a random sentence against our reference, we would expect a score of 0, meaning that there is no overlap between the sentences.
<pre><code>random_text = 'This should produce a poor score'
random_score = bleu.compute(predictions=[random_text], references=[data])</code></pre>
We would expect the random text above to produce a low BLEU score.
<pre><code>Random text: This should produce a poor score
Reference data: Depression is more than simply feeling unhappy or fed up for a few days. Most people go through periods of feeling down, but when you're depressed you feel persistently sad for weeks or months, rather than just a few days. Some people think depression is trivial and not a genuine health condition. They're wrong - it is a real illness with real symptoms. Depression is not a sign of weakness or something you can "snap out of" by "pulling yourself together". The good news is that with the right treatment and support, most people with depression can make a full recovery.
Random text BLEU score: 0.0</code></pre>
As expected, since the random text is completely unrelated to our reference data, we have a score of 0. Now you’ve seen one way of evaluating the performance of an LLM, but some of these metrics have been around since before LLMs, so what are the more advanced methods? We’ll explore this next.
What other methods exist for evaluating LLMs?
Going beyond conventional evaluations, like the one we've just worked through, more recent approaches have taken to using a 'stronger' LLM, such as GPT-4, as an evaluator. An example of this approach is the G-Eval framework. The idea is to provide the evaluator with a task introduction (i.e., this is what you're going to do) and a set of evaluation criteria. From there, the evaluator LLM is asked to generate Chain-of-Thought evaluation steps, producing not just a judgement but also the logic behind that judgement.
As a more concrete example, in the task introduction you would prompt an LLM such as GPT-4 with: “You will be given a sentence generated by an LLM. Your task is to rate the sentence.” Alongside this, an evaluation criterion is also included, which in this case is a measure of quality: “Coherence (1-5) - the collective quality of all sentences”. Finally, the evaluation steps are also included: “first read the sentence, compare it to the reference dataset, and assign a score for coherence.”
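To make this concrete, here's a minimal sketch of how you might send such an evaluator prompt to GPT-4 via the OpenAI Python client; the prompt wording and the 1-5 coherence scale are illustrative rather than the official G-Eval templates.
<pre><code>from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

candidate = summarised_text  # the text we want the evaluator LLM to judge

evaluator_prompt = f"""You will be given a sentence generated by an LLM. Your task is to rate the sentence.

Evaluation criteria:
Coherence (1-5) - the collective quality of all sentences.

Evaluation steps:
1. Read the sentence carefully.
2. Compare it to the reference dataset.
3. Assign a coherence score from 1 to 5 and explain your reasoning.

Sentence: {candidate}
Reference: {data}"""

response = client.chat.completions.create(
    model='gpt-4',
    messages=[{'role': 'user', 'content': evaluator_prompt}],
)
print(response.choices[0].message.content)  # the score and the logic behind it</code></pre>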
Conclusion
Building solid evaluation procedures should be the starting point for any LLM-based system. Unfortunately, it's not plug-and-play; not all existing benchmarks and metrics work across all LLM use-cases. So, instead of relying on off-the-shelf benchmarks, we can start by collecting a set of task-specific evals, each consisting of a prompt, context, and expected output to use as a reference, as sketched below. These evaluations will guide prompt engineering, model selection and so on, which will ultimately improve your LLM system's performance.
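In practice, such a task-specific eval set can be as simple as a file of hand-curated records, one per benchmark question. The JSONL layout and field names below are just one hypothetical way of organising it.
<pre><code>import json

# Hypothetical task-specific eval records: a prompt, its retrieved context, and the
# answer you would be happy to ship, used as the reference during evaluation.
eval_set = [
    {
        'prompt': 'Summarise the NHS guidance on what depression is.',
        'context': 'Depression is more than simply feeling unhappy or fed up for a few days...',
        'expected_output': 'Depression is a real illness with real symptoms, not something you can simply snap out of.',
    },
]

with open('task_specific_evals.jsonl', 'w') as f:
    for record in eval_set:
        f.write(json.dumps(record) + '\n')</code></pre>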
What's next?
In this blog, we've explored how evaluation differs between large language models and traditional machine learning, discussed some of the common approaches, worked through an example of evaluating a text summarisation LLM, and examined more advanced techniques. Following this, we'll focus on how you can control the input and output to your LLM with guardrails and how we've applied them in the MindGPT project.