No, we’re not talking about going to Mars or Jupiter. In this blog, we will briefly discuss the idea of embeddings, how to create image embeddings with pre-trained models, what they are used for, and what MLOps tooling we can leverage to help us build production systems using image embeddings. It’s not rocket science!
What are Embeddings?
Embeddings are a way to represent high-dimensional data in a lower-dimensional space, where each dimension corresponds to a meaningful feature of the data. Embeddings ideally capture the semantics of the input by grouping semantically similar inputs closely together in the embedding space.
A popular example of capturing semantics, which you have probably seen before, is the classic case of word embeddings:
In this example, we can see that the distance and direction between the ‘man’ and ‘woman’ representations are roughly equal to those between ‘king’ and ‘queen’. These vectors exist in what is called an embedding space or latent space.
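As a toy illustration (with made-up three-dimensional vectors purely to keep the arithmetic readable; real word vectors have hundreds of dimensions), we can check this kind of analogy with simple vector arithmetic:

```python
# A toy illustration of the king/queen analogy. The vectors below are
# hypothetical placeholders, not real learned word embeddings.
import numpy as np

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

king  = np.array([0.8, 0.7, 0.1])
queen = np.array([0.8, 0.1, 0.7])
man   = np.array([0.2, 0.9, 0.1])
woman = np.array([0.2, 0.2, 0.8])

# If the analogy holds, 'king - man + woman' should land close to 'queen'.
analogy = king - man + woman
print(cosine_similarity(analogy, queen))  # close to 1.0 for these toy vectors
```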
Embeddings can also be generated for images, which is beneficial in various machine learning tasks. Image embeddings are useful because they allow us to represent high-dimensional image data in a more compact and meaningful way. An image embedding is a vector of numerical values that captures the essence of an image and can be used as input to a machine learning model or algorithm for various tasks such as image classification, object detection, image search, and image generation.
In an embedding space, the distance between two embeddings generally indicates how similar they are. Smaller distances mean greater similarity. However, the specific interpretation of distance can vary depending on the embedding and the context in which it is used. There are different ways to measure distance, such as Euclidean or cosine distance, which can lead to different interpretations of similarity.
In a visual embedding space, two images that are close together might have similar visual features or depict similar objects or scenes.
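To make this concrete, here is a minimal sketch comparing Euclidean and cosine distance between two hypothetical embedding vectors (random numbers standing in for real embeddings):

```python
# A minimal sketch contrasting Euclidean and cosine distance between two
# hypothetical image embeddings (random vectors used as placeholders).
import numpy as np

rng = np.random.default_rng(42)
emb_a = rng.normal(size=128)
emb_b = rng.normal(size=128)

euclidean = np.linalg.norm(emb_a - emb_b)
cosine = 1 - np.dot(emb_a, emb_b) / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))

print(f"Euclidean distance: {euclidean:.3f}")
print(f"Cosine distance:    {cosine:.3f}")
```

Note that cosine distance ignores vector magnitude and only compares direction, which is why the two metrics can rank the same pairs of embeddings differently.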
Encoder-Decoder Models
How are these embeddings actually generated? One method is by using an encoder-decoder model. This is a type of neural network that can compress input data into a smaller representation, called an embedding. The model consists of two parts: an encoder and a decoder.
The encoder takes in the input data and applies a series of non-linear transformations, such as convolutional layers or self-attention layers, to compress it into a smaller representation. This compressed representation is the embedding.
The decoder then takes this embedding and applies a series of transformations, such as deconvolutional layers or self-attention layers, to reconstruct the original input.
The output of the encoder is obtained at the "bottleneck", which is the compressed representation of the input data. This compressed representation can be used for various purposes such as data compression or feature extraction.
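As a rough sketch of this encoder-decoder idea, here is a minimal convolutional autoencoder in PyTorch. The layer sizes, the 28x28 single-channel input, and the 64-dimensional bottleneck are illustrative assumptions rather than recommendations:

```python
# A minimal convolutional autoencoder sketch in PyTorch, purely illustrative.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, embedding_dim=64):
        super().__init__()
        # Encoder: convolutions compress a 1x28x28 image down to the bottleneck.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # -> 16x14x14
            nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # -> 32x7x7
            nn.ReLU(),
            nn.Flatten(),
            nn.Linear(32 * 7 * 7, embedding_dim),                   # bottleneck
        )
        # Decoder: mirrors the encoder to reconstruct the original image.
        self.decoder = nn.Sequential(
            nn.Linear(embedding_dim, 32 * 7 * 7),
            nn.ReLU(),
            nn.Unflatten(1, (32, 7, 7)),
            nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2, padding=1, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2, padding=1, output_padding=1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        embedding = self.encoder(x)            # the compressed representation
        reconstruction = self.decoder(embedding)
        return embedding, reconstruction

model = ConvAutoencoder()
images = torch.rand(8, 1, 28, 28)              # a dummy batch of images
embeddings, reconstructions = model(images)
print(embeddings.shape)                         # torch.Size([8, 64])
```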
We want to obtain a compressed representation for images for our use case, but what deep learning architectures can we use for image data within this encoder-decoder model?
Convolutional Neural Networks
The first architecture we will look at is the Convolutional Neural Network (CNN). CNNs are the most widely used architecture for generating image embeddings. CNNs use convolutional layers made up of filters that slide across the input image to identify patterns and features. An activation function, like ReLU or sigmoid, is then applied to the output. The resulting feature maps from each layer are passed to the next layer, allowing the network to learn more complex features. Finally, the output is flattened and fed into a fully-connected layer that produces the final embedding.
Some examples of pre-trained CNNs that can be used to generate image embeddings include (but are not limited to) VGG, ResNet, and EfficientNet.
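For example, a common pattern is to take a pre-trained CNN and chop off its classification head so that it outputs the pooled feature vector instead of class scores. The sketch below assumes a recent version of torchvision and a hypothetical image file path:

```python
# A minimal sketch of using a pre-trained ResNet-50 from torchvision as an
# image encoder: dropping the final classification layer leaves the pooled
# 2048-dimensional feature vector as the output.
import torch
import torch.nn as nn
from torchvision import models, transforms
from PIL import Image

resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
encoder = nn.Sequential(*list(resnet.children())[:-1])  # remove the final fc layer
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.open("artwork.jpg").convert("RGB")   # hypothetical image path
batch = preprocess(image).unsqueeze(0)

with torch.no_grad():
    embedding = encoder(batch).flatten(1)          # shape: (1, 2048)
```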
Vision Transformers
In recent years, Vision Transformers (ViT) have emerged as a promising alternative.
These are an adaptation of the popular transformer architecture, introduced in the “Attention Is All You Need” paper, to images. ViTs process images by breaking them down into a grid of patches, which are flattened and transformed into a sequence of tokens. These tokens are then processed by a transformer architecture, which uses self-attention to focus on important parts of the sequence. The output sequence is pooled to produce a fixed-size representation of the image, the embedding.
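The patching step itself is straightforward; the sketch below shows one way to turn an image tensor into a sequence of flattened patch tokens, assuming a 224x224 RGB image and 16x16 patches as in the original ViT paper:

```python
# An illustrative sketch of the ViT patching step: split an image tensor into
# non-overlapping patches and flatten each patch into a token.
import torch

image = torch.rand(1, 3, 224, 224)      # a dummy RGB image batch
patch_size = 16

# unfold splits the height and width into non-overlapping 16x16 patches
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.contiguous().view(1, 3, -1, patch_size, patch_size)   # (1, 3, 196, 16, 16)
tokens = patches.permute(0, 2, 1, 3, 4).flatten(2)                      # (1, 196, 768)

print(tokens.shape)  # 196 tokens, each of dimension 3 * 16 * 16 = 768
```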
Transformers also have the advantage of allowing variable-sized inputs and have been shown to achieve better classification accuracy than CNNs in some cases. However, reaching comparable or better performance than CNNs typically requires a longer training process and more data. CNNs have also been around for longer, so their behaviour is better understood.
Some examples of pre-trained Vision Transformer models include ViT (from the first paper to successfully train a Transformer encoder on ImageNet, attaining results comparable to familiar convolutional architectures) and ViTMAE, which builds on the ViT model but uses the concept of ‘token masking’ (popular in word embedding models) to train the model.
Creating Embeddings for Artwork
Creating embeddings for artwork can provide valuable insights and applications. By representing art in a lower-dimensional space, embeddings can capture the semantics of art and group similar pieces of art together in the embedding space. This can be useful for various tasks such as recommender systems and fraud detection.
In this example, the goal is to create a vector representation of a piece of art which allows us to measure how similar two or more artworks are. The first idea is to take the image of a piece and use a pre-trained image encoder with either a CNN or ViT architecture as discussed in the previous section.
A pre-trained image encoder is an encoder-decoder model that has already been trained on a large dataset of images to extract meaningful features. This saves time, as we do not have to train our own model from scratch.
We can also fine-tune this model with our specific data if we need to. This approach allows the model to generalise better to new data and can improve performance, especially when the target dataset is limited.
To extract embeddings from these models, we can remove the ‘decoder’ section so that the model’s output for a given image is the embedding vector itself.
Many pre-trained autoencoders exist, such as ViTMAE on HuggingFace. Using these, we can simply pass the model an image and receive a vector representation back.
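As an illustration, the sketch below uses a plain pre-trained ViT encoder from the Hugging Face `transformers` library (ViTMAE can be used in a similar way, although by default it masks a large fraction of patches during the forward pass). The model name and the choice of mean-pooling over patch tokens are assumptions, and a recent version of `transformers` is assumed:

```python
# A minimal sketch of extracting an image embedding with a pre-trained ViT
# from the Hugging Face `transformers` library.
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

model_name = "google/vit-base-patch16-224-in21k"   # illustrative model choice
processor = AutoImageProcessor.from_pretrained(model_name)
model = ViTModel.from_pretrained(model_name)
model.eval()

image = Image.open("artwork.jpg").convert("RGB")   # hypothetical image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the token representations into a single embedding vector.
embedding = outputs.last_hidden_state.mean(dim=1).squeeze(0)  # shape: (768,)
```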
Image embedding improvements
Does it make sense, however, to use encoded images without any additional context? Is there a relationship between the content of the artwork (the image itself) and the demand for the art or artist, or is the demand driven by external factors such as its provenance, history and the current market conditions?
I’m sure you’ve heard at least one person say “a child could have drawn that” about a piece of abstract art that has gone for millions at auction. Therefore, the value of a painting (ignoring political factors and economic advantages for the buyer) is not determined only by its contents but also by external factors such as who made it, when it was made, and the meaning behind it. A person who buys a Gerhard Richter painting (see the photo below) is unlikely to buy a child's painting, no matter how visually similar the paintings are.
Furthermore, not all art consists of static images; what do we do when the artwork is a sculpture or another medium? For these reasons, adding additional features to the embedding is essential to capture the right relationships for tasks such as recommendation systems.
We can add additional features by concatenating them to the end of the embedding vector, assuming they are numeric or have been transformed into numeric features (e.g. categorical features via one-hot encoding).
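A minimal sketch of this concatenation, assuming a hypothetical ‘medium’ categorical feature and a numeric ‘year created’ feature:

```python
# A minimal sketch of appending extra numeric features to an image embedding.
# The embedding, categories, and feature values are illustrative placeholders.
import numpy as np

image_embedding = np.random.rand(768)           # placeholder for a real embedding

media = ["painting", "sculpture", "print"]      # hypothetical categorical feature
medium = "painting"
medium_one_hot = np.array([1.0 if m == medium else 0.0 for m in media])

year_created = np.array([1995.0])               # a simple numeric feature

# Concatenate the extra features onto the end of the embedding vector.
combined = np.concatenate([image_embedding, medium_one_hot, year_created])
print(combined.shape)                            # (772,)
```

In practice, numeric features such as the year would usually be scaled so they do not dominate distance calculations in the combined space.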
Embedding considerations
Depending on the features added, the data can become sparse (sparse data refers to a data set in which a large portion of the values are missing or zero). When working with very sparse data, some similarity metrics such as Euclidean distance can break down and not work as effectively, so caution should be used when adding features. Having more features is not always better.
How can we evaluate the quality of embedding spaces?
The quality of an embedding space can be evaluated based on its ability to capture and represent the underlying structure and relationships in the data. Here are some common ways to evaluate an embedding space:
- Visual inspection: One way to evaluate the quality of an embedding space is to visualise it and see if the embedding clusters similar items together and separates dissimilar ones. Although the embedding vectors will most likely have hundreds of dimensions, we can use techniques such as Principal Component Analysis (PCA) to reduce them to two or three dimensions and visualise the points on a 2D or 3D graph (see the sketch after this list). A good embedding space should have clear and distinct clusters of similar items.
- Nearest neighbour evaluation: Another way to evaluate the quality of an embedding space is to use a set of test data and compare the nearest neighbours in the embedding space to the nearest neighbours in the original data. A good embedding space should have similar nearest neighbours in both the embedding space and the original data.
- Extrinsic evaluation: Testing embeddings on downstream tasks (extrinsic evaluation) is another effective way to evaluate their quality, as it provides a more objective measurement of their effectiveness for a specific task. By comparing the performance of a downstream model using the embeddings against a baseline model using raw data, we can determine whether the embeddings provide any improvement in downstream task performance. These downstream tasks can include image classification or image retrieval.
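Below is a minimal sketch of the visual inspection approach mentioned above, assuming `embeddings` is a NumPy array of shape `(n_items, n_dims)` and `labels` are known categories used only to colour the plot:

```python
# A minimal sketch of visual inspection via PCA. Random data stands in for
# real image embeddings and labels.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 768))   # placeholder for real embeddings
labels = rng.integers(0, 4, size=200)      # placeholder category labels

# Reduce the high-dimensional embeddings to two principal components.
coords = PCA(n_components=2).fit_transform(embeddings)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="tab10", s=10)
plt.title("Embeddings projected onto the first two principal components")
plt.show()
```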
Note that the first two of these methods require manual inspection of the embeddings which can be time-consuming. Hence, we can see that evaluating embeddings is not a straightforward task.
Applications to Recommender Systems
A recommender system is a type of software or algorithm that suggests items, such as products, movies, or music, to users based on their preferences or behaviour. The goal of a recommender system is to help users discover new items that they are likely to enjoy and to provide personalised recommendations that reflect their unique tastes and interests. Once we have vector embeddings for each item we have a strong base to build a recommender system.
Cold Start problem
One problem that arises when implementing a recommendation system is the "cold start problem." This occurs when a new product is added to a recommender system, and there is not enough data available to make accurate recommendations. In these cases, the system may resort to making random recommendations or using a default recommendation until more data is collected.
One potential solution to this problem is to use content-based recommendations, where the system analyses the characteristics of the new product and recommends similar items based on those characteristics. This is where our embeddings come in: when adding a new item to the system, we have a way of representing the product without any existing interaction data.
On top of a content-based recommendation system (using the embeddings), we can add collaborative filtering methods to make more informed recommendations while avoiding the cold start problem for new items.
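A minimal sketch of this content-based approach, assuming `catalogue_embeddings` holds the embeddings of existing artworks and `new_item_embedding` is the embedding of a brand-new item with no interaction data:

```python
# A minimal sketch of content-based recommendations for a new item using
# nearest neighbours in embedding space. Random vectors are placeholders.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(7)
catalogue_embeddings = rng.normal(size=(1000, 768))   # placeholder embeddings
new_item_embedding = rng.normal(size=(1, 768))

# Index the catalogue with cosine distance and look up the closest artworks.
index = NearestNeighbors(n_neighbors=5, metric="cosine")
index.fit(catalogue_embeddings)

distances, indices = index.kneighbors(new_item_embedding)
print(indices[0])   # positions of the five most similar existing artworks
```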
Where Does Machine Learning Operations (MLOps) Fit In?
Embeddings are great but if we don’t have MLOps infrastructure in place it can be hard to leverage their full potential.
When it comes to scaling embeddings up to large volumes of data and using them in a deployed service, we need to make sure that our systems are robust, reliable and reproducible. This is where MLOps comes in. Just a few specific use cases for MLOps in embedding creation and recommender systems include:
- Data pipeline: A data pipeline should be designed to retrieve data from various sources and preprocess it in a consistent manner before feeding it into the encoder model to obtain embeddings. This data pipeline should also contain some data validation which can either be created manually or using a tool such as Great Expectations or DeepChecks depending on the task.
- Versioning: It can be important to version the embeddings and the pre-trained model used to generate them to ensure reproducibility and consistency across different deployments. This is where an experiment tracking tool such as MLflow can be useful as a central place to store this information (see the sketch after this list).
- Deployment: Embeddings and recommender systems can be deployed in a variety of ways, including as part of a containerised application or as a service via an API. The deployment process should be automated and integrated with a continuous integration/continuous deployment (CI/CD) pipeline to ensure that changes are tested and deployed consistently and reliably.
- Pipeline testing: Without robust data and training pipelines, you are likely to run into issues when it comes to deployment. To avoid some of these issues, unit testing each step and every function within a pipeline is essential. This ensures that every function and step does exactly what you expect it to do and allows you to update pipelines without breaking them. Depending on their size, pipelines can take a while to run, so adding unit tests speeds up development and reduces the chance of failures partway through a run.
- Feature stores: Feature stores are a powerful tool in MLOps that allow for the efficient storage, retrieval, and sharing of features, including embeddings. A feature store is a centralised repository that stores pre-computed features, including embeddings, that can be used by multiple models. This can reduce the time and resources needed for training and deploying models, as the features don't need to be computed each time a model is trained or deployed. In addition, feature stores can help ensure consistency and reproducibility of feature computation across different models and versions.
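As a small example of the versioning point above, the sketch below logs the encoder name and the resulting embeddings as an MLflow run; the experiment name, parameters and file paths are illustrative:

```python
# A minimal sketch of tracking an embedding run with MLflow, assuming the
# embeddings have already been computed elsewhere.
import mlflow
import numpy as np

embeddings = np.random.rand(100, 768)          # placeholder embeddings

mlflow.set_experiment("artwork-embeddings")    # illustrative experiment name

with mlflow.start_run():
    # Record which pre-trained encoder and settings produced these embeddings.
    mlflow.log_param("encoder_model", "google/vit-base-patch16-224-in21k")
    mlflow.log_param("embedding_dim", embeddings.shape[1])

    # Store the embeddings themselves as a run artifact for reproducibility.
    np.save("embeddings.npy", embeddings)
    mlflow.log_artifact("embeddings.npy")
```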
Conclusion
We’ve discussed embeddings and how they can be used to represent high-dimensional data in a lower-dimensional space. Specifically, we focused on image embeddings and encoder-decoder models, such as CNNs and Vision Transformers, to create them. We also explored the application of embeddings to artwork, where they can power recommendation systems, and took a brief look at the types of MLOps tools that can help ensure success when generating embeddings.
In a later blog, we will focus on the application of MLOps tooling to recommender systems and explore in more detail how these tools can be used to optimise the development, performance and scalability of recommender systems. We will dive deeper into specific MLOps tools and techniques, such as model versioning, automated deployment, pipeline testing, and feature stores, and discuss how they can be applied to various recommender systems.