The source of many common problems in data science projects is placing more emphasis on modelling than on data. Data science project success is defined largely by data and the complex process of getting it ready for analysis. However, project progress is commonly defined purely in terms of implementing models and analysis outcomes. The issue is that "getting data ready for analysis" is an ill-defined concept, which makes it hard to plan this phase and allocate sufficient resources to it. In our experience, defining clear data-readiness goals around which the project timeline is built is key to project planning and success.
By definition, data is at the core of data science projects. Broadly, we need to acquire and access the data (often not as straightforward as it sounds), understand it and prepare it for analysis. This is the foundation that everything else rests on. As the common joke goes, data science is 80% data cleaning and 20% complaining about data cleaning.
Unfortunately, data science as a field has done a much better job of defining modelling goals than defining data-centric goals. After all, modelling is the exciting part of any project, both for the data scientist and the project stakeholders. However, there is a lot of work that needs to be done before one can even begin modelling. Trying to demonstrate progress purely in terms of modelling outcomes leads to unnecessary stress and frustration when everyone feels there is nothing to show for a long period of hard work. In the worst-case scenario, it leads to not allocating sufficient time and resources to data understanding and wrangling in favour of "just doing something", which results in working with data not suited to the question or building models not suited to the data.
The issue is that there isn't a well-established workflow or vocabulary for evaluating data readiness. Despite data being the most important aspect of any project, data access, understanding and preparation are often underestimated and, as a result, under-resourced. We have found adopting Neil Lawrence's data readiness framework very useful for scoping and communicating this process.
The data readiness framework
The data readiness framework defines distinct data readiness levels, or bands, each with criteria that must be met before a project can claim to have progressed to the next band. Broadly, the outcome of each band is:
Band C: Data can be loaded into analysis software
Band B: Data and its limitations are understood
Band A: Appropriate data are available to answer a specific question
The bands can be divided into any number of further sub-levels. For example, band C might run from level C4, which represents only a vague knowledge that some data are available somewhere, up to C1, at which the data are available and accessible and can be analysed by the data scientist. The intermediate steps this requires are generally project specific.
Defining clear data readiness goals that must be satisfied gives us the licence to allocate the necessary amount of time to this task. For example, the framework makes it clear that data readiness includes the data scientist gaining data access and data understanding. These tasks are commonly taken for granted or undervalued. Progressing through each band and satisfying these requirements should correspond to explicit stages and goals in a project timeline.
Altogether, the framework makes it easier to plan our project and to monitor and communicate our progress. Saying "we've already gone through band C, we're currently around B2 looking at statistical properties of the data, and will soon progress to work on band A" is much more specific than saying "we're cleaning the data and will continue to do so for a while yet". And if it takes longer than anticipated, even after all our careful planning, we can more easily identify and explain why that has happened.
Data readiness challenges
Getting access to data in a usable way that satisfies security considerations is in itself a huge task and a common project blocker. The challenges are not only technical but also legal and ethical. It can be a long process that requires coordinated effort between the data science team and the project stakeholders providing the data. A good litmus test for this phase is whether data has been shared in the past, which implies that at least some of the required infrastructure is already in place.
Data science projects come from various domains, and every new domain brings new technical jargon that we have to learn. What is more, many domains use what seem like everyday words in a very specific and technical sense. This is deceptive because it is not obvious at first glance that there is something new to learn. On one project, for example, we ended up spending a great deal of time just defining what exactly we mean when we talk about a "flight". This might sound trivial, but it wasn't: is a flight what is represented by a flight number, or the trip between two airports, or what passengers buy tickets for, including transfers? Without agreeing on what we meant by this term, we were often talking about very different concepts and data even though, at first glance, it sounded as if we were talking about the same thing.
"When you ask a domain expert for a data description, more often than not they give you the 'mostly correct answer'"
Another challenge is making implicit knowledge explicit and transferring it from the domain experts to the data scientists. When you ask a domain expert for a data description, more often than not they give you the "mostly correct answer", which skips over all the implicit assumptions and exceptions embedded in the data. A very simple example might be duplicate rows in a database. Everyone who already works with the data treats this as common knowledge and automatically filters out the duplicates. But they often forget that they had to learn it in the first place, and may no longer even realise they are applying it. This means they forget to mention it to any external data scientists.
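As a sketch of how such an implicit assumption can be surfaced, the snippet below makes the duplicate check explicit rather than leaving it in the experts' heads. The file name and the code itself are hypothetical, purely for illustration; the point is that a simple, automated check exposes the unwritten "everyone deduplicates first" rule to a newcomer.

```python
import pandas as pd

# Hypothetical table, used only to illustrate surfacing an implicit assumption.
bookings = pd.read_csv("bookings.csv")

# Compare the raw row count with the deduplicated row count.
n_rows = len(bookings)
n_distinct = len(bookings.drop_duplicates())
print(f"{n_rows - n_distinct} duplicate rows out of {n_rows}")

# If duplicates are expected and routinely removed by domain experts, that rule
# belongs in the shared data description, not only in people's heads.
bookings = bookings.drop_duplicates()
```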
Finally, there are occasions when people don't share some information about the data because they simply don't know it. Most importantly, they may be unaware of this gap in their knowledge.
All this means that, even when working in a familiar domain, a data scientist must always go through a lengthy process of understanding the data in order to get it ready for analysis. This involves formulating explicit hypotheses about the data and testing them by profiling the data, for example by looking at summary statistics or visualising it. We have also found it helpful to write down our understanding of the data, in plain English and/or as pseudocode, in as much detail as possible. Formulating explicit definitions and statements quickly exposes any misunderstandings, and it allows us to establish a common vocabulary with our collaborators and check the accuracy of our understanding with domain experts.
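To illustrate what these explicit, testable statements might look like in practice, the sketch below encodes a few hypothetical assumptions about a flight-level table as runnable checks. The file, column names and expectations are invented for the example rather than taken from any particular project.

```python
import pandas as pd

# Hypothetical flight-level table, used purely for illustration.
flights = pd.read_csv("flights.csv", parse_dates=["departure_time", "arrival_time"])

# Each assertion encodes one explicit hypothesis about the data.
# A failing check is a prompt to go back to the domain experts, not a bug to silence.
assert flights["flight_id"].notna().all(), "every flight has an identifier"
assert flights["flight_id"].is_unique, "one row per flight (no duplicates)"
assert (flights["arrival_time"] >= flights["departure_time"]).all(), \
    "arrivals never precede departures"

# Summary statistics as a quick profile of ranges, missingness and cardinality.
print(flights.describe(include="all"))
```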
Conclusion
Data readiness is key to the success of any data science project. It is not sufficient that someone says there is some data and that they understand it. The data scientist analysing the data must have access to it and be confident in their own understanding of it. The process of getting to this point, like any complex task, tends to follow Hofstadter's Law: it always takes longer than you expect, even when you take into account Hofstadter's Law.
The data readiness framework proposes a language for planning, communicating and tracking the process of getting data ready for analysis. This allows us to define progress in a data science project beyond just modelling outcomes and gives us the licence to allocate the necessary amount of time and resources to data readiness. We have found adopting this framework very helpful when scoping our data science projects.