This blog is part of our MindGPT series. You can find out more about MindGPT here.
We're working on creating a specialised language model that can effectively summarise mental health information from two popular UK sources: the NHS and Mind websites. Our goal is to provide an accessible and transparent view of our progress for anyone interested in following along. The GitHub repository containing all of our code is available to view at any time.
In this blog, we will be focusing on the dark art of “prompt engineering”, a term that has popped up with the rise of large language models.
What is prompt engineering?
Prompt engineering is the term given to constructing and structuring a prompt in order to achieve the desired output from a given model, similar to providing a clear recipe for a chef to follow. It's a crucial skill for making LLMs perform specific tasks, such as question answering or logical reasoning.
A prompt is an input or instruction for an LLM. Its format and structure depend on the task, and it may need to include context or background information. Feedback and adjustments to the prompt are often necessary to get the desired response. Constructing prompts also involves considering factors like domain expertise, prompt length, and language tone, as well as ethical and safety concerns.
Why do we need prompt engineering?
Prompt engineering is about systematically finding an optimal prompt which produces the desired output in the majority of cases. We need it because otherwise, it's very difficult to predict and control an LLM's behaviour. As well as influencing the task that the LLM will perform, it allows us to optimise the quality, clarity, and accuracy of the LLM's output, and squeeze out the best performance.
For our use case, in-context learning, it is essential to structure our prompts well. The general idea behind in-context learning is that the “context”, various relevant pieces of text retrieved using a similarity search based on the user's input (see our blog on vector databases), is appended to the LLM input alongside the question. Without a well-defined structure, we risk the question getting lost within the context; the model may not understand what to do with the information given to it and hallucinate; or, if a question appears within the retrieved context, the LLM could answer the wrong question.
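In code, that retrieval-and-assembly step looks roughly like the sketch below. Note that `vector_db` and its `similarity_search` method are hypothetical stand-ins for whichever vector store and client you use, not the exact MindGPT code:
<pre><code># Hypothetical retrieval step: `vector_db` stands in for the vector store
# described in our vector database blog.
user_question = "How can I manage anxiety?"
documents = vector_db.similarity_search(user_question, k=3)  # hypothetical API

# Join the retrieved passages into a single block of context, which is then
# placed into the prompt template alongside the user's question.
context = "\n\n".join(doc.text for doc in documents)</code></pre>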
How have we done it in MindGPT?
In order to answer a user's question, we need three things: the question, some context, and a description of the task we want the LLM to perform, e.g. "answer this question using this context".
We combine these three things into a prompt template, which gets sent to the LLM. Using a template lets us combine the context and question consistently while also taking precautions to prevent prompt injection attacks.
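As a rough illustration of what this looks like in practice, here is a minimal Python sketch of a prompt template with `{context}` and `{question}` placeholders. The template wording and function name are illustrative rather than the exact code used in MindGPT:
<pre><code># Illustrative template: the wording differs from the exact MindGPT templates.
TEMPLATE = (
    "Use the following pieces of context to answer the question at the end.\n"
    "If you don't know the answer, just say that you don't know.\n\n"
    "{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

def build_prompt(question: str, context: str) -> str:
    """Fill the template with the retrieved context and the user's question."""
    return TEMPLATE.format(context=context, question=question)

prompt = build_prompt(
    question="What is anxiety?",
    context="Anxiety is a feeling of unease, such as worry or fear...",
)</code></pre>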
MindGPT is fundamentally trying to perform a question-answering task. Luckily for us, this is a task that the majority (if not all) of the major open-source LLMs have been directly trained on. This makes our lives a bit easier, as the LLMs should already have a very good ‘understanding’ of the task.
To understand what instructions we needed to give the model, we evaluated different prompt templates, varying from simple to complex, and measured the differences in the model's responses. A subset of the prompts we initially used is as follows:
Simple:
<pre><code>Context: {context}\n\nQuestion: {question}\n\n</code></pre>
Intermediate:
<pre><code>Use the following pieces of context to answer the question at the end.
If you don't know the answer, just say that you don't know, don't try to make up an answer.
Use three sentences maximum and keep the answer as concise as possible.
{context}
Question: {question}</code></pre>
Complex:
<pre><code>You are a highly skilled AI trained in question answering.
I would like you to read the following text and answer it with a concise abstract paragraph. Use the following pieces of context to answer the question at the end.
Aim to retain the most important points, providing a coherent and readable answer.
Please avoid unnecessary details or tangential points.
{context}
Question: {question}</code></pre>
For each test, only the template changed; the values ‘{question}’ and ‘{context}’ are placeholders for the question and the fixed context respectively.
In the development process of MindGPT, a fundamental principle was maintaining consistency across various aspects to ensure robustness and reliability. To test the performance of different prompts, we required a number of variables to be fixed; without this, it would be impossible to correctly attribute changes in performance to the specific prompts. Some key points in our testing and evaluation process include:
Consistency in Testing Questions
We began by carefully selecting and maintaining a consistent set of questions for testing purposes. This set of questions served as the foundation for evaluating the performance and capabilities of MindGPT. This consistency allowed us to make meaningful comparisons between different model iterations and track progress effectively.
Fixed Input Context
To minimise variables and maintain a stable environment for testing, we fixed the input context for each question. This means that the information provided to the model before presenting the question remained constant. This approach ensured that the model's responses were primarily influenced by the question itself rather than variations in the input context.
Uniform Model Hyperparameters
Model hyperparameters play a crucial role in shaping the behaviour of language models like MindGPT. To ensure a fair evaluation and comparison, we kept hyperparameters, such as temperature (which controls the randomness of responses) and max output length (to limit response length), constant across all experiments. This standardisation allowed us to focus on the intrinsic capabilities of the models without external factors affecting the results.
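As a sketch of what “fixed hyperparameters” means in practice, settings like those below (the values here are illustrative, not the exact ones we used) would be defined once and reused for every prompt-template experiment:
<pre><code># Generation settings shared by every experiment (illustrative values).
GENERATION_KWARGS = {
    "temperature": 0.1,     # low temperature to reduce randomness between runs
    "max_new_tokens": 256,  # cap the length of the model's answer
    "do_sample": True,      # sampling enabled but kept near-deterministic
}</code></pre>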
Each question was tested against each combination of model and template, and the outputs were captured and reviewed by our team members. This method is by far the quickest way to set up an evaluation of the templates; however, its downside is that it is completely subjective, which can introduce bias into the assessment of the responses.
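The evaluation itself amounts to a simple grid over models, templates and questions, reusing the fixed generation settings from the sketch above. The model names, template strings and `generate` helper below are hypothetical stand-ins for the project's actual inference code:
<pre><code>from itertools import product

MODELS = ["model-a", "model-b"]  # hypothetical model names
TEMPLATES = {
    "simple": "Context: {context}\n\nQuestion: {question}\n\n",
    "intermediate": "Use the context to answer the question.\n{context}\nQuestion: {question}",
}
QUESTIONS = ["What is depression?", "How can I manage anxiety?"]
FIXED_CONTEXT = "..."  # the same retrieved passages for every run

results = []
for model_name, (template_name, template), question in product(
    MODELS, TEMPLATES.items(), QUESTIONS
):
    prompt = template.format(context=FIXED_CONTEXT, question=question)
    answer = generate(model_name, prompt, **GENERATION_KWARGS)  # hypothetical inference helper
    results.append(
        {"model": model_name, "template": template_name, "question": question, "answer": answer}
    )</code></pre>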
For more information on our experimentation method, you can see our notebook on our MindGPT GitHub repo.
What have we learnt about prompt engineering?
There is no ‘silver bullet’ prompt
The main outcome of our testing is that no single prompt template works best across all the models we have used. The prompt needs to be tuned for each individual model; this can even be true within a model family, e.g. Llama-2-7B versus Llama-2-13B.
Small prompt changes can have a big impact
Even minor changes to the prompt, for example re-ordering the instructions for the LLM, can lead to large changes in the outputs. LLMs are generally sensitive to their inputs, and this sensitivity increases with model size. This could be explained by the fact that larger models can capture and extract more information, giving them more capacity to focus on nuance in the input text.
Use prompt engineering only on the model you will be using
In general, a bigger model will produce better results, so putting a lot of time into tuning prompts for a model you won't ultimately use may be a waste of your resources. Prompt tuning should be performed only on the model you will actually use, as a prompt's performance may not generalise well across models. If your model is regularly updated (e.g. ChatGPT/GPT-4), your prompt's performance may degrade (or improve) over time with those updates. This is one reason why open-source models are advantageous: there's no time pressure to migrate away from deprecated or continually changing models.
Read the model’s guide for prompting
Some models have requirements for their inputs based on how they were trained; for example, Llama-2 requires special tokens such as `[INST]` and `<<SYS>>` to format the model's input. Models like this are trained on this specific format and will perform best when their inputs are similar to the data they were trained on. Therefore, you should read the documentation for the model you are using; its creators want you to get the most out of their model!
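For example, a single-turn Llama-2 chat prompt with a system message is expected to look roughly like this (the system prompt and user message below are placeholders):
<pre><code><s>[INST] <<SYS>>
{system_prompt}
<</SYS>>

{user_message} [/INST]</code></pre>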
Keep experiments objective and consistent
We strongly recommend that you make your experiments as objective and consistent as possible; if you don't, you can end up going down prompt-engineering rabbit holes and wasting a lot of time. If you are restricted to a smaller model, you can leverage larger models to evaluate your LLM's outputs, reducing the amount of human evaluation required. However, a large amount of caution is needed with this approach, as it can accentuate biases within the models.
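As an illustration of the idea, a larger “judge” model could be given an evaluation prompt along these lines (this is a sketch of the technique, not a prompt we used in MindGPT):
<pre><code>You will be shown a question, the context it should be answered from, and two candidate answers.
Say which answer is more faithful to the context and more clearly written, and briefly explain why.

Context: {context}
Question: {question}
Answer A: {answer_a}
Answer B: {answer_b}</code></pre>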
You should also check the maximum input length of your model, in tokens, to ensure that your prompts do not exceed it. If they do, the model will likely truncate important information from your prompt, the output will degrade, and any optimisations you made to the prompt will be rendered redundant.
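A quick way to check this is to tokenise the fully formatted prompt with the same tokenizer as your model and compare the count against the context window. The model name and context length below are examples (4,096 tokens is Llama-2's limit); substitute your own:
<pre><code>from transformers import AutoTokenizer

# Example model; use the tokenizer that matches the LLM you are actually serving.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
CONTEXT_WINDOW = 4096  # Llama-2's maximum input length in tokens

prompt = "...the fully formatted prompt, retrieved context included..."
n_tokens = len(tokenizer.encode(prompt))
if n_tokens > CONTEXT_WINDOW:
    print(f"Prompt is {n_tokens} tokens and will be truncated by the model.")</code></pre>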
What difficulties did we face with prompt engineering?
The testing and feedback loops feel very non-scientific when evaluating large language model outputs. The majority of our testing was subjective: for example, showing the output for each prompt to our team members and having each person vote for the best one. In cases where we did not have the time or resources to do so, we relied on a single person's subjective opinion.
Ideally, your set of input questions (for question answer evaluation) should have a ‘gold standard’ answer that can be compared to the output. Keep your eyes peeled for our blog on LLM evaluations for more information.
Even with a large number of variables held constant, the remaining search space was still massive, and it would have been easy to lose track of the experiments and objectives during the prompt-engineering stage.
Other types of prompting
In our work on MindGPT, we mainly focused on using single prompts. However, there are other methods worth mentioning that can improve model performance: Few-Shot Prompting and Chain-of-Thought Prompting, both of which can be useful for shorter context windows. We’ll be releasing a blog on these in the near future.
Few-Shot Prompting
Few-shot prompting is about including a small number of worked examples in the prompt so the model can infer the task and the desired style of response. Unlike some other methods, few-shot prompting doesn't involve a sequence of prompts; it relies on a single prompt to guide the model's response, typically containing previous examples of questions and answers for the model to mimic.
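A few-shot prompt for a question-answering task might look something like this, with a couple of worked examples followed by the real question (the example question-and-answer pairs here are purely illustrative):
<pre><code>Answer the question in one or two concise sentences.

Question: What is a panic attack?
Answer: A panic attack is a sudden episode of intense fear with physical symptoms such as a racing heart and shortness of breath.

Question: What are common symptoms of anxiety?
Answer: Common symptoms include restlessness, difficulty concentrating and a racing heart.

Question: {question}
Answer:</code></pre>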
Chain-of-Thought Prompting
Chain-of-thought prompting encourages the model to reason in natural language before giving its answer. The prompt provides an example question and answer in which the reasoning is spelled out step by step, prompting the model to break the new problem down into smaller steps before arriving at a final answer.
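A chain-of-thought prompt typically includes a worked example whose answer spells out the reasoning, encouraging the model to do the same for the new question (the example below is illustrative):
<pre><code>Q: A clinic sees 12 patients in the morning and half as many in the afternoon. How many patients does it see in total?
A: The clinic sees 12 patients in the morning. Half of 12 is 6, so it sees 6 in the afternoon. 12 + 6 = 18. The answer is 18.

Q: {question}
A: Let's think step by step.</code></pre>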
In a paper called "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models" (available at https://arxiv.org/pdf/2201.11903.pdf), it's shown that chain-of-thought prompting can improve a model's performance when dealing with tasks that have multiple steps.
However, it's important to note that for a question-answering chatbot like MindGPT, using chain-of-thought prompts might not be the best fit. The reason is that in chatbot conversations, users usually expect quick and clear responses. Complex prompting methods like this could disrupt the flow of the conversation and lead to longer and more complicated answers. In such cases, it's crucial to prioritise user-friendly and clear interactions to meet user expectations for simplicity and efficiency.
What's next?
In this blog post, we've looked at the use of prompt engineering and its pitfalls. In the next blog in the MindGPT series, we will look at integrating the Guardrails package into MindGPT, which we will use to restrict the model's output and prevent potentially harmful LLM responses.