What is data for ML models?
The basic idea behind machine learning is that a model learns patterns from past data, much as humans learn best from examples. Because the training data stands in for the real-world data the model will see later, the accuracy of your model depends heavily on the quality of your training data.
What exactly is data drift, and why should you be concerned?
Imagine you have a model that estimates house prices in Hong Kong based on the size of the house and the number of bedrooms. It was trained on data collected in 2012, when 90% of houses were around 1,000 square feet with 2 bedrooms. It's now 2022, and 90% of houses are around 500 square feet with 3 bedrooms. Data drift refers to such changes in the distribution of the input data. These changes may indicate a new relationship between the model's features and outcomes, and as a result, the model may no longer make accurate predictions.
Data drift is a likely reason that your model’s accuracy degrades over time. It is the deviation of the data seen during inference from the data used for training.
NOTE: Data drift is also known as feature drift, input drift, or covariate shift; these terms essentially mean the same thing.
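To make the house example concrete, here is a minimal sketch that simulates the two periods and compares them. The samples are synthetic and hypothetical (normally distributed sizes with an assumed spread of 50 square feet); they are not real Hong Kong housing data.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical synthetic samples of house sizes in square feet.
sizes_2012 = [random.gauss(1000, 50) for _ in range(1000)]  # training-time data
sizes_2022 = [random.gauss(500, 50) for _ in range(1000)]   # inference-time data

mean_2012 = statistics.mean(sizes_2012)
mean_2022 = statistics.mean(sizes_2022)

# The model only ever saw sizes near the 2012 mean, so inputs near the
# 2022 mean fall outside the range it was trained on.
print(f"2012 mean size: {mean_2012:.0f} sq ft")
print(f"2022 mean size: {mean_2022:.0f} sq ft")
print(f"Shift in mean:  {mean_2022 - mean_2012:.0f} sq ft")
```

Comparing means is only a first look. A proper drift check compares the whole distributions, not a single summary number.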
Is it okay if I pretend it doesn't exist?
Monitoring data drift helps prevent model performance issues. How much this matters depends on the use case: a wrong movie suggestion to a Netflix user is less harmful than a wrong property value estimate. If data drift is not identified in time, predictions will become inaccurate, and business decisions based on them may have a negative impact, such as lost revenue.
How to compute data drift?
To compute data drift, we must first understand what it means. People frequently confuse outliers and drift, so what is the distinction between the two?
When we talk about drift detection, we look at the “global” data distribution across the whole dataset.
An outlier (also known as an anomaly), by contrast, is an individual object that appears “unusual” in the input data.
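The difference can be sketched with two synthetic batches: one that matches the reference data except for a single extreme record, and one where every record comes from a shifted distribution. The data, the z-score threshold of 4 and the helper names `outliers` and `median_shift` are all assumptions made for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

reference = [random.gauss(1000, 50) for _ in range(500)]

# Case 1: same distribution as the reference, plus one extreme record (an outlier).
with_outlier = [random.gauss(1000, 50) for _ in range(499)] + [5000.0]

# Case 2: every record comes from a shifted distribution (drift).
drifted = [random.gauss(500, 50) for _ in range(500)]


def outliers(sample, baseline, z_threshold=4.0):
    """Return the values lying more than z_threshold baseline standard
    deviations away from the baseline mean (a per-record check)."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return [x for x in sample if abs(x - mu) / sigma > z_threshold]


def median_shift(sample, baseline):
    """Difference between the medians of the two samples (a dataset-level check)."""
    return statistics.median(sample) - statistics.median(baseline)


for name, batch in [("with_outlier", with_outlier), ("drifted", drifted)]:
    flagged = len(outliers(batch, reference))
    shift = median_shift(batch, reference)
    print(f"{name}: {flagged} records flagged, median shift {shift:.0f}")
```

In the first batch the dataset-level median barely moves even though one record is clearly unusual. In the second, the median itself has moved, which is the signature of drift, not of an anomaly.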
The simplest approach to detecting data drift is to use statistical tests that compare the distribution of each feature in the training data with its distribution during inference. Popular statistical methods for measuring the difference between two distributions include the Kolmogorov-Smirnov test (KS test), the Chi-squared test, the Wasserstein metric and the Jensen–Shannon divergence. A detailed explanation of how each of these works is beyond the scope of this blog. If you are interested, you may start by looking at statistical distance.
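As an illustration of one of these tests, the sketch below implements the two-sample KS statistic in plain Python: the largest gap between the two empirical cumulative distribution functions. The decision rule uses the standard large-sample critical value, so it is an approximation. The function names and the synthetic data are assumptions made for this example, and in practice a library implementation is the better choice.

```python
import math
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    d_max = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Move both pointers past every value equal to x so ties are handled correctly.
        while i < n and a[i] == x:
            i += 1
        while j < m and b[j] == x:
            j += 1
        # i / n and j / m are the empirical CDFs of each sample evaluated at x.
        d_max = max(d_max, abs(i / n - j / m))
    return d_max


def ks_drift_detected(reference, current, alpha=0.05):
    """Flag drift when the statistic exceeds the asymptotic critical value."""
    n, m = len(reference), len(current)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    d = ks_statistic(reference, current)
    return d > critical, d, critical


random.seed(42)  # fixed seed so the sketch is reproducible
train_sizes = [random.gauss(1000, 50) for _ in range(1000)]      # hypothetical reference data
inference_sizes = [random.gauss(500, 50) for _ in range(1000)]   # hypothetical production data

drifted, d, critical = ks_drift_detected(train_sizes, inference_sizes)
print(f"KS statistic {d:.3f}, critical value {critical:.3f}, drift detected: {drifted}")
```

The KS test suits numerical features. For categorical features, a test such as Chi-squared is the usual counterpart.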
How can we track data drifts?
Rather than implementing each statistical test from scratch and then integrating them to monitor your data in production, it is more convenient and efficient to use a library like Evidently. It's also open source, which means it can be adapted and tailored to your specific requirements.
Data drift with Evidently
Evidently is a tool to analyse and monitor data and machine learning models. It includes tools to generate interactive reports on both data and models locally.
1. Importing Evidently: In addition to data drift, Evidently offers a range of other functionality, such as model performance monitoring, data quality monitoring and target drift monitoring. In this blog, we will focus on data drift, but if you are interested, check out Evidently’s official website.
2. Loading the dataset: To perform drift detection, we need two datasets. The reference dataset serves as a benchmark. Evidently analyses the change by comparing the current production dataset to the reference data, using a statistical test appropriate for your data.
3. Feature selection: Before calculating feature drift, we need to tell Evidently which features to monitor. We can do this by using the ColumnMapping class.
4. Creating a data drift dashboard: There are currently 3 ways to compute and visualise data drift using Evidently: “Dashboards”, “JSON Profile” and “Monitors”. We will first look at “Dashboards”. Dashboards come in two versions, a short one and a full one, which can be selected with the verbose_level parameter.
- verbose_level == 1 - Full report
- verbose_level == 0 - Short report
5. Calculate and show: After creating our data drift dashboard, we can tell Evidently to calculate data drift for the features we selected previously and display the result using the .show() function.
The Evidently dashboard below shows the drifting features first. We can also choose to sort the rows by feature name or type.
If we want, we can also save the dashboard as an HTML file.
Instead of displaying the results on a dashboard, we can use a JSON profile, which is essentially a normal JSON file, to compute and save the calculation results. This is useful if you are only interested in the numbers themselves.
Monitoring in production 💻
To perform real-time monitoring, we can use Evidently's built-in monitors to collect data from a deployed ML service and compute metrics with Evidently's analyser. Evidently will be used as a monitoring service in this scenario.
The following is an example of a typical MLOps production workflow involving a data drift monitor.
In addition to the monitoring service, we need a way to store and visualise the calculated metrics. We can use Prometheus as a database service that collects the metrics, then create a Grafana dashboard that reads them from the database and displays them. Docker Compose 🐬 would be a good choice for running all of the different components.
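To give a feel for what the monitoring service hands to Prometheus, here is a minimal sketch that renders per-feature drift scores in the Prometheus text exposition format. The metric name `data_drift_score`, the feature names and the score values are hypothetical placeholders, and this is not how Evidently itself exports metrics. A real service would normally use a Prometheus client library and not build the text by hand.

```python
def render_prometheus_gauges(metric_name, help_text, values_by_feature):
    """Render one gauge sample per feature in the Prometheus text exposition format."""
    lines = [
        f"# HELP {metric_name} {help_text}",
        f"# TYPE {metric_name} gauge",
    ]
    for feature, value in sorted(values_by_feature.items()):
        # Label values must escape backslashes, double quotes and newlines.
        label = feature.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
        lines.append(f'{metric_name}{{feature="{label}"}} {value}')
    return "\n".join(lines) + "\n"


# Hypothetical drift scores, made up purely for illustration.
drift_scores = {"size_sqft": 0.91, "bedrooms": 0.12}

text = render_prometheus_gauges(
    "data_drift_score",
    "Hypothetical per-feature drift score",
    drift_scores,
)
print(text)
```

Prometheus would scrape this text from an HTTP endpoint on the monitoring service at a regular interval, and Grafana would then chart the stored series per feature.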
Going into detail about how to set up a pipeline to monitor your data in production would make this blog too long. For more details, I recommend checking out our Data Monitoring with Evidently GitHub repo, which includes a step-by-step guide for running each component of the pipeline.
In summary, drift detection is an important step in the MLOps cycle and should always be considered when designing your MLOps pipeline.