Data visualisation is essential to modern science and science communication, but has been criticised for hiding patterns in the raw data, obscuring statistical assumptions and distorting effect size. Using dependency-tracking techniques, this project is developing a transparent data visualisation framework which integrates provenance information directly into visualisations, allowing scientists to interactively explore how parts of diagrams are derived from data and/or code. Phase two of the project will allow users to explore the impact of changes in a dataset or model on the associated visualisations, so they can assess the implications of modelling decisions by considering alternative scenarios.
Explaining the science
This project adapts techniques from programming languages to the data visualisation domain.
Phase one of the project builds on prior work in dynamic program analysis. The approach, called 'Galois slicing', formalises a two-way relationship between parts of programs and parts of their outputs, identifying the 'least' part of the program able to account for a given output feature, and conversely the 'least' output feature explained by a given part of the program.
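The two-way relationship can be illustrated with a toy Galois connection over powerset lattices. This is only a hypothetical sketch, not the project's actual formalism: the dependency table, output names ('sum', 'prod') and program parts ('a', 'b', 'c') are all invented for illustration.

```python
from itertools import chain, combinations

# Hypothetical dependencies: each output part -> the program parts it needs.
DEPS = {
    "sum":  {"a", "b"},
    "prod": {"b", "c"},
}

def backward(outputs):
    """Least set of program parts accounting for the given output parts."""
    return set().union(*(DEPS[o] for o in outputs)) if outputs else set()

def forward(parts):
    """Output parts fully determined by the given program parts."""
    return {o for o, needed in DEPS.items() if needed <= parts}

def subsets(xs):
    xs = list(xs)
    return [set(c) for c in
            chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))]

# The defining Galois-connection law: backward(O) <= P  iff  O <= forward(P),
# checked exhaustively over all subsets of this toy example.
assert all((backward(O) <= P) == (O <= forward(P))
           for P in subsets({"a", "b", "c"})
           for O in subsets(DEPS))
```

For example, `backward({"sum"})` yields `{"a", "b"}`: the smallest part of the toy program that explains the 'sum' output.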
Here, this idea is extended to charts and other graphics by defining a simple visualisation language suitable for use by data scientists, and implementing an interpreter which records the fine-grained forward and backward dependencies between parts of images and the code and data from which they are derived. This dependency information will also form the foundation for phase two of the project, which will allow the exploration of 'counterfactual scenarios' in the form of hypothetical model or data changes.
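One common way to realise such an interpreter is to let every value carry the set of data sources it was computed from, propagating those sets through each operation. The sketch below assumes this design for illustration; the class and source names are invented, not taken from the project's implementation.

```python
# Minimal sketch of provenance propagation: a Traced value pairs a result
# with the (frozen) set of data-source names it depends on.
class Traced:
    def __init__(self, value, sources):
        self.value = value
        self.sources = frozenset(sources)

    def __add__(self, other):
        # Derived values depend on the union of their inputs' sources.
        return Traced(self.value + other.value, self.sources | other.sources)

    def __mul__(self, other):
        return Traced(self.value * other.value, self.sources | other.sources)

# Hypothetical dataset: yearly counts, each tagged with its own source name.
data = {
    "y2020": Traced(10, {"y2020"}),
    "y2021": Traced(15, {"y2021"}),
}

# A derived quantity, e.g. the height of a 'total' bar in a chart.
total = data["y2020"] + data["y2021"]
print(total.value, sorted(total.sources))  # 25 ['y2020', 'y2021']
```

The recorded source set is exactly what a backward slice over this tiny example would report: the bar's height depends on both yearly counts.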
The aim of the project is to bring substantially greater transparency to data visualisation. In a notebook system such as Jupyter, this will involve automatically linking diagrams to data and code so that fine-grained dependencies can be explored interactively. For example, moving the mouse over part of a chart will highlight which specific data elements were relevant.
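The hover interaction can be thought of as a backward lookup from a chart mark to the data rows it was derived from. The mark names, rows, and provenance table below are hypothetical, sketching only the shape of the lookup.

```python
# Hypothetical provenance captured at render time: each chart mark records
# which data rows it was computed from (all names and values illustrative).
data = [("2020", 10), ("2021", 15), ("2022", 12)]

marks = {
    "bar_total_early": {"rows": [0, 1]},  # e.g. a bar summing 2020 and 2021
    "bar_2022":        {"rows": [2]},
}

def on_hover(mark_id):
    """Return the data rows the UI would highlight for this mark."""
    return [data[i] for i in marks[mark_id]["rows"]]

print(on_hover("bar_total_early"))  # [('2020', 10), ('2021', 15)]
```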
A secondary goal is to provide this transparency to readers as well as authors, developing technology that makes it possible to create interactive web content (e.g. online versions of papers) containing charts which provide transparent access to the raw data and underlying code. The project researchers believe this is essential to realising the vision of open science, and addresses the growing recognition of the need for more statistically robust, transparent approaches to data visualisation.
The second phase of the project will build on our dependency-tracking approach to allow scientists to explore alternative data or modelling scenarios interactively, obtaining visual feedback on precisely how a chart might change as a consequence of analysing or manipulating data in a different way. The research hypothesis is that interpreting a model or analysis properly involves understanding counterfactual scenarios – how the model would behave under different assumptions or different inputs.
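A counterfactual scenario of this kind can be sketched as re-evaluating an analysis under a hypothetical data change and reporting which derived chart quantities differ. The toy analysis and data below are invented for illustration; in the project, the tracked dependencies would additionally tell the system which parts of a chart need recomputing at all.

```python
# Sketch of counterfactual exploration: run a toy analysis twice and diff
# the derived quantities a chart might visualise.
def analyse(values):
    """A toy 'model': summary statistics that chart marks could encode."""
    return {"mean": sum(values) / len(values), "max": max(values)}

baseline = analyse([2, 4, 6])
scenario = analyse([2, 4, 60])  # counterfactual: one data point altered

# Which derived quantities (and hence which chart parts) would change?
changed = {k for k in baseline if baseline[k] != scenario[k]}
print(sorted(changed))  # ['max', 'mean'] - both respond to the outlier
```

Visual feedback would then highlight precisely the marks encoding `mean` and `max`, while leaving unaffected parts of the chart untouched.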
The project is developing a proof-of-concept implementation which will integrate into the Wrattler notebook being developed here at the Turing. The researchers are looking at application areas in urban analytics, geocomputation, and data-centric engineering, but the techniques and software being developed should be applicable in any domain that makes use of data visualisation.
It is also intended that the technique will be integrated with related approaches to explainability in machine learning, such as guided back-propagation.
The team presented the work at a Tools, Practices and Systems workshop in Cambridge in April 2019.
Tomas Petricek - [email protected]