Agile data science: Evaluation and baseline model

Infrastructure that enables rapid prototyping for model development

Tuesday 19 Nov 2019


When you work on a data science problem, are you jumping straight into the cool machine learning models? You are probably keen to play with some of the popular algorithms, like deep learning, straightaway and see how they perform on your dataset. In this article we look at why that may not be a good idea, and how we structure work on a machine learning project to get the most out of modelling. This process has worked for us across different data science projects, from time series modelling to writing AI agents in a simulation environment.

The agile data science modelling process has two ingredients:

  • An evaluation platform allows objective assessment of a machine learning model.
  • A baseline model provides a basis for comparison.

Together, these form an infrastructure that allows rapid prototyping and model development. Below we look at each ingredient in more detail to see how together they form the basis of the agile data science method.

Evaluation platform

The evaluation platform is an automated way to answer "How well am I doing?" when fitting a machine learning model. It should be the first step in creating the infrastructure for a data science project. In short, the evaluation platform takes a machine learning model and computes a metric of how well it performs on a hold-out test dataset. For an evaluation platform to be effective, it should satisfy the following criteria:

  • Return a single number.
  • Be fully automated end-to-end.

It should return a single number to make comparisons of competing models straightforward. It should be fully automated to ensure that it actually gets run every time a model is changed - ideally as a single bash script or makefile target that evaluates the performance of an algorithm.

The evaluation platform may be as simple as calculating the mean squared error between a model's predictions on a test set and the true labels. Separate the error calculation step from the model fitting, automate it in a simple script, and you have a basic evaluation platform.
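
A minimal sketch of such a basic evaluation platform might look as follows. The function names `mean_squared_error` and `evaluate`, the toy data and the toy model are assumptions made for illustration; the only idea taken from the text is that error calculation lives apart from model fitting and returns a single number.

```python
from typing import Callable, Sequence


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Return the mean squared error between true labels and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot evaluate on an empty test set")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def evaluate(predict: Callable[[float], float],
             test_inputs: Sequence[float],
             test_labels: Sequence[float]) -> float:
    """Score an already-fitted model on hold-out data; lower is better.

    The model is passed in as a plain prediction function, so the
    evaluation code never needs to know how the model was fitted.
    """
    predictions = [predict(x) for x in test_inputs]
    return mean_squared_error(test_labels, predictions)


# Toy hold-out set (illustrative only): the true relationship is y = 2x.
test_inputs = [1.0, 2.0, 3.0]
test_labels = [2.0, 4.0, 6.0]

# A hypothetical fitted model that is off by one everywhere.
score = evaluate(lambda x: 2.0 * x + 1.0, test_inputs, test_labels)
print(score)  # prints 1.0
```

Because `evaluate` only needs a prediction function, any model, however it was fitted, can be scored in the same way.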

However, in many interesting real-world problems, the evaluation may not be as straightforward. In one of our past projects, we worked with causal time-series datasets that had neither a straightforward split into training and test set, nor a clear evaluation of performance without knowledge of the counterfactual. An advantage of an explicit evaluation platform is that it promotes discussion of what counts as success, especially in such complex cases.

Also, be aware that the quantitative evaluation criterion is typically only an approximation of the real-world success criterion. If the metric is poorly designed, optimising its value may not achieve the true objective.

The actual implementation of an evaluation platform depends on your use case. Again, it may be as simple as a bash script that loads model outputs from a CSV file and runs a mean squared error function on them. Or it may be more complex and involve connecting to a database, spinning up VMs, running containers in the cloud, or running programs on a cluster. The important point is that it should be automated and simple to run as often as necessary.
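
As a sketch of the simple end of that spectrum, written in Python rather than bash, the following reads model outputs from a CSV file and returns the mean squared error. The column names `y_true` and `y_pred` and the function names are hypothetical; a real project would use whatever format its models write out.

```python
import csv
from typing import Iterable


def evaluate_lines(lines: Iterable[str]) -> float:
    """Return the mean squared error for CSV text with a header row.

    Assumes (hypothetically) two numeric columns named 'y_true' and 'y_pred'.
    """
    squared_errors = []
    for row in csv.DictReader(lines):
        error = float(row["y_true"]) - float(row["y_pred"])
        squared_errors.append(error ** 2)
    if not squared_errors:
        raise ValueError("no rows to evaluate")
    return sum(squared_errors) / len(squared_errors)


def evaluate_csv(path: str) -> float:
    """Open a CSV file of model outputs and return its single-number score."""
    with open(path, newline="") as handle:
        return evaluate_lines(handle)
```

A small command-line wrapper or makefile target could then call `evaluate_csv` on each new predictions file, so that the evaluation runs with a single command.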

We can almost treat the evaluation platform as tests in software development - only now the evaluations get run on machine learning models every time they are changed, and provide a numerical measure instead of a pass-fail response. Ultimately, the evaluation platform keeps modelling efforts honest, with explicitly defined success measures.


Baseline model

The second ingredient of agile model development is the baseline model. This is the most basic, almost stupid model that can be applied to solve the machine learning problem at hand. Such a model can then serve as the baseline for evaluating more sophisticated models, and provide a clear basis for comparison.

The baseline model should be a simple algorithm, the simplest one that can be applied to the task. For example, the nearest neighbour algorithm can be applied to both classification and regression tasks without much effort, yielding reasonable performance out of the box. Or it may not involve any machine learning at all: a random sample can provide a good sanity check. Or it may be slightly more involved: when predicting events with a weekly period, the baseline may be to use "what happened at the same time last week" as a prediction. Whatever model you choose, it should be easy to fit without creating additional complexity.
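
The "same time last week" baseline can be sketched in a few lines. This assumes one observation per day, so that a week corresponds to a period of 7; the function name `same_time_last_period` and the toy series are made up for illustration.

```python
from typing import List, Sequence


def same_time_last_period(history: Sequence[float],
                          horizon: int,
                          period: int = 7) -> List[float]:
    """Forecast `horizon` future values by repeating the last full period.

    With daily data and period=7, the prediction for each future day is
    the value observed on the same weekday in the last week of `history`.
    """
    if period <= 0:
        raise ValueError("period must be positive")
    if len(history) < period:
        raise ValueError("need at least one full period of history")
    last_period = list(history[-period:])
    return [last_period[step % period] for step in range(horizon)]


# Two weeks of toy daily counts (illustrative only).
history = [10, 12, 11, 13, 15, 20, 22,
           11, 13, 12, 14, 16, 21, 23]
forecast = same_time_last_period(history, horizon=3)
print(forecast)  # prints [11, 13, 12]
```

There is nothing to fit here, which is the point: the baseline takes minutes to write, yet it already captures the weekly pattern that a more sophisticated model would need to beat.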

Overall, a good baseline model provides:

  • Better understanding of the nature of the task.
  • Baseline performance.
  • A way to run the evaluation process end-to-end.

Once we run the baseline model through the evaluation platform, we get a single number that quantifies the performance we can achieve with the simplest approach. It may be good enough already - or it can serve as something to beat.


Agile model development


The evaluation platform and the baseline model form the basis of agile model development in data science, allowing fast iteration and progress. The baseline model gives us an initial threshold to improve upon. The evaluation platform provides a clear performance measure. Now we are ready to create more interesting machine learning models while explicitly tracking their performance.

Because agile is such an overloaded word in software development, we should specify what we mean by agile in this context. Here, agile represents an iterative process of creating increasingly sophisticated machine learning models with fast feedback provided by the evaluation platform. Fitting machine learning models is often a playful curiosity-driven activity - the evaluation platform with the baseline enables productive experimentation with clear performance checks.

Although everyone wants to play with the cool new algorithms, it's a good idea to start with simpler models and then iteratively improve them. The model development also doesn't have to be purely incremental - we can fit a wide variety of models. We only need to make sure to evaluate them consistently using the evaluation platform.
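
One way to picture consistent evaluation is a small leaderboard in which every candidate, including the baseline, is scored by the same function on the same hold-out data. The sketch below assumes mean squared error as the metric; the model names, toy data and the `compare_models` helper are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Return the mean squared error between true labels and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def compare_models(models: Dict[str, Callable[[float], float]],
                   test_inputs: Sequence[float],
                   test_labels: Sequence[float]) -> List[Tuple[str, float]]:
    """Score every model on the same hold-out set, best (lowest error) first."""
    scores = {}
    for name, predict in models.items():
        predictions = [predict(x) for x in test_inputs]
        scores[name] = mean_squared_error(test_labels, predictions)
    return sorted(scores.items(), key=lambda item: item[1])


# Toy hold-out set (illustrative only): the true relationship is y = 2x.
test_inputs = [1.0, 2.0, 3.0, 4.0]
test_labels = [2.0, 4.0, 6.0, 8.0]

models = {
    "baseline_constant": lambda x: 5.0,    # always predicts the same value
    "linear_guess": lambda x: 2.0 * x + 0.5,  # a slightly biased candidate
}

leaderboard = compare_models(models, test_inputs, test_labels)
for name, score in leaderboard:
    print(f"{name}: {score}")
```

Adding a new model then means adding one entry to `models`; its score is immediately comparable with the baseline and with every earlier attempt.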

By using this workflow, you will always have a set of models with a clear assessment of their performance. This is the data science equivalent of always having "working software" in agile software development.

People familiar with machine learning competitions at sites like Kaggle will recognise the same principles in our data science workflow. Kaggle uses clearly defined questions with specific evaluation criteria, an automated evaluation process and a script with a trivial algorithm to show example usage. This recipe creates a low bar for entry into a machine learning competition, and allows participants to improve their models iteratively.

Indeed, the same principles have proven to be successful in machine learning research. Linguist Mark Liberman calls this the Common Task Method, and David Donoho calls this process one of the secret sauces of data science in his 50 Years of Data Science paper. The basis of the Common Task Method is to have a public training dataset for a predictive task, with an objective scoring referee that evaluates submissions against a held-out test set. By having multiple teams share the same common task with an explicit evaluation and compete to get the best performance, this process leads to stable progress through small improvements over time. The same process is behind many of the recent successes in deep learning, such as the ImageNet image dataset and its associated competition. In our experience at the Turing, it is worth replicating a similar workflow even within an internal project.


Laying the groundwork for a data science project

The described workflow of an explicit evaluation platform together with a baseline model comes from our experience of working on data science projects here at the Turing. Some of the points in this article may sound obvious - for example, testing and validation of machine learning algorithms is one of the first things you learn when you get into data science - but making this step an explicit part of the project workflow enables us to spend time on formulating it properly and automating it.

Working on such an infrastructure for future machine learning work may not look very productive from the outside. Stakeholders can often push for flashy data science results to be produced as soon as possible. But it's worth spending time laying the groundwork in the form of the evaluation platform and baseline model to allow fast and productive iterative work in the future.

Both the evaluation platform and the baseline model should form an important part of the data science workflow. As data science establishes itself as a standalone discipline, this should be considered an integral part of the job, not just a 'nice-to-have'.