Introduction

Data analytics is the process of transforming a raw dataset into useful knowledge. Drawing on new advances in artificial intelligence and machine learning, this project aims to develop systems that help automate the data analytics process.

Explaining the science

Data analytics comprises many stages. While some of these stages are well supported by existing software and tools, there has been little methodological research into data ‘wrangling’, even though it is laborious and time-consuming and can account for up to 80% of a typical data science project.

Data wrangling includes understanding what data is available, integrating data from multiple sources, identifying missing, messy or anomalous data, and extracting features to prepare the data for modelling.
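
To make these steps concrete, here is a minimal sketch in pandas of a typical wrangling pass. The file names, column names and threshold are hypothetical and are not part of the AIDA tooling:

    import pandas as pd

    # Integrate data from multiple sources (hypothetical files and columns).
    sales = pd.read_csv("sales.csv")
    stores = pd.read_csv("stores.csv")
    df = sales.merge(stores, on="store_id", how="left")

    # Identify missing or anomalous data.
    print(df.isna().sum())  # count of missing values per column
    z = (df["revenue"] - df["revenue"].mean()) / df["revenue"].std()
    outliers = df[z.abs() > 3]  # crude z-score anomaly check

    # Extract a feature to prepare the data for modelling.
    df["month"] = pd.to_datetime(df["date"]).dt.month

Each of these steps is typically ad hoc and dataset-specific, which is precisely what makes wrangling so time-consuming and hard to automate.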

Project aims

The project draws on new advances in artificial intelligence and machine learning to produce technology that will help automate each stage of the data analytics process, revolutionising the speed and efficiency with which data can be transformed into useful knowledge.

The project has the potential to dramatically improve the productivity of working data scientists and benefit researchers, industry, and government.

Recent updates

September 2019

The first step in data science is importing a dataset into an analysis program. Tabular data is often stored in comma-separated value (CSV) files. However, these files do not have a standardised format and therefore often require manual inspection or repair before they can be imported. In Wrangling Messy CSV Files by Detecting Row and Type Patterns, recently published in the journal Data Mining and Knowledge Discovery, members of the AIDA team present a method for automatically detecting the formatting parameters of such files. The method achieves 97% accuracy and improves on the previous state of the art by almost 22% on messy CSV files. A software package that implements this technique is available here.
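
The task the paper addresses can be illustrated with the Python standard library, whose csv.Sniffer heuristically guesses a file's dialect (delimiter, quote character, and so on). The file name below is hypothetical, and the AIDA method is a more robust replacement for exactly this kind of heuristic:

    import csv

    # Guess the formatting parameters (dialect) of a CSV file using the
    # standard library's heuristic sniffer, then parse with that dialect.
    with open("messy.csv", newline="") as f:  # hypothetical file
        dialect = csv.Sniffer().sniff(f.read(8192))
        f.seek(0)
        rows = list(csv.reader(f, dialect))
    print(dialect.delimiter, dialect.quotechar)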

May 2019

The AIDA project aims to develop a family of systems that provide a better (semantic) understanding of tabular data. ColNet, the first member of this family, predicts the semantic type of a table column. ColNet was presented at the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019), and an extended version has recently been accepted at the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019). See related papers and software packages here.
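
As a schematic illustration of the column-typing task (not of ColNet itself, which trains neural classifiers using a knowledge graph), one could simply vote for the semantic type whose known entities best cover a column's cells. The toy lookup table below is invented for the example:

    # Toy 'knowledge graph' lookup; real systems draw on large knowledge
    # graphs rather than hand-written sets like these.
    KNOWN_ENTITIES = {
        "City": {"london", "paris", "tokyo"},
        "Country": {"france", "japan", "united kingdom"},
    }

    def predict_column_type(cells):
        """Return the type whose entity set covers the most cells."""
        def coverage(entities):
            return sum(cell.lower() in entities for cell in cells) / len(cells)
        return max(KNOWN_ENTITIES, key=lambda t: coverage(KNOWN_ENTITIES[t]))

    print(predict_column_type(["London", "Paris", "Tokyo"]))  # -> City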

March 2019

The AIDA team is co-organising, together with IBM Research, a Semantic Web Challenge on Tabular Data to Knowledge Graph Matching. The challenge will be co-located with the 18th International Semantic Web Conference and the 14th International Workshop on Ontology Matching.

August 2018

The first data analytics user-interface component to be implemented as part of this project is called Data Diff. It allows repeated data analysis tasks to be carried out more easily on different datasets, and initial results are very promising. Data Diff was presented at the 2018 Knowledge Discovery and Data Mining Conference (KDD). See the related paper and software package.

Contact info

[email protected]

Research Engineering

Members of the Research Engineering Group at the Turing are contributing their expertise to this project.

They have been involved in the development of the Datadiff tool, which algorithmically detects, and suggests fixes for, cases where columns in tabular datasets have been renamed or mislabelled, or differ in some other way, such as a change of units.
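
As a rough illustration of the underlying idea (not the actual Datadiff algorithm), one can pair each column of a new dataset with the old column whose value distribution it most resembles; all names here are hypothetical:

    import pandas as pd
    from scipy.stats import ks_2samp

    def match_columns(old: pd.DataFrame, new: pd.DataFrame) -> dict:
        """Map each numeric column of `new` to its closest match in `old`,
        using the two-sample Kolmogorov-Smirnov statistic as a distance."""
        matches = {}
        for new_col in new.select_dtypes("number"):
            distances = {
                old_col: ks_2samp(new[new_col].dropna(),
                                  old[old_col].dropna()).statistic
                for old_col in old.select_dtypes("number")
            }
            matches[new_col] = min(distances, key=distances.get)
        return matches

A mismatch between a column's name and its best-matching counterpart then suggests a rename, while a column that matches well only after rescaling suggests a change of units.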

They are also involved in the development of the Wrattler notebook, a literate-programming application that intersperses markdown text with code in several languages, with an emphasis on self-consistency and reproducibility.

Furthermore, members of the group are carrying out several trial analyses on public datasets using these tools, and documenting the procedure.