Data analytics is the process of transforming a raw dataset into useful knowledge. By drawing on new advances in artificial intelligence and machine learning, this project aims to develop systems that help to automate the data analytics process.
Explaining the science
Data analytics comprises many different stages. While some elements of the data analytics process have benefited from considerable development through software and tools, there has been little methodological research into so-called data ‘wrangling’, even though this is often laborious and time-consuming, and accounts for up to 80% of a typical data science project.
Data wrangling includes understanding what data is available, integrating data from multiple sources, identifying missing, messy or anomalous data, and extracting features in order to prepare data for computer modelling.
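The wrangling tasks listed above can be made concrete with a small sketch. The data and the missing-value indicators here are entirely hypothetical; this simply illustrates normalising messy entries, coercing types, and extracting a feature with pandas:

```python
import pandas as pd
import numpy as np

# Hypothetical raw data exhibiting typical wrangling problems:
# an unparsed date column, a textual missing marker, and a sentinel value.
raw = pd.DataFrame({
    "date": ["2020-01-01", "2020-01-02", "not recorded", "2020-01-04"],
    "temperature": ["21.5", "NA", "22.1", "-999"],  # "-999" is a missing sentinel
})

# Normalise messy or anomalous entries to proper missing values.
df = raw.replace({"not recorded": np.nan, "NA": np.nan, "-999": np.nan})

# Coerce columns to the types the analysis expects; unparseable
# entries become NaT/NaN rather than raising errors.
df["date"] = pd.to_datetime(df["date"], errors="coerce")
df["temperature"] = pd.to_numeric(df["temperature"], errors="coerce")

# Extract a simple feature to prepare the data for modelling.
df["day_of_week"] = df["date"].dt.day_name()
```

Each of these steps is trivial in isolation; the research challenge addressed by this project is deciding automatically which such steps a given dataset needs.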
The project draws on new advances in artificial intelligence and machine learning to produce technology that will help automate each stage of the data analytics process. This technology will revolutionise the speed and efficiency with which data can be transformed into useful knowledge.
The project has the potential to dramatically improve the productivity of working data scientists and benefit researchers, industry, and government.
It is widely recognised that most of an analyst's time is taken up by data engineering tasks such as acquiring, understanding, cleaning and preparing the data. In Data Engineering for Data Analytics: A Classification of the Issues, and Case Studies, we provide a description and classification of such tasks into high-level groups, namely data organisation, data quality, and feature engineering. A repository with the analysis performed in four case studies exhibiting a wide variety of these problems is also available here. Our main goal with this work is to encourage the development of tools and techniques that help reduce this burden and push research towards the automation or semi-automation of the data engineering process.
One of the main phases of the data mining process is "data understanding", where the aim is to discover the characteristics of the data, such as its data types (e.g., date, float, integer, and string). To accelerate data understanding, a data type can be automatically inferred for each column in a table of data. However, previous approaches often fail when the data contain missing entries and anomalies, which are common in real-world data sets. To this end, we have developed ptype — a probabilistic model that detects such entries and robustly infers data types. The article that introduces ptype has recently been published in the journal Data Mining and Knowledge Discovery.
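As a rough illustration of the problem ptype solves (this heuristic is not ptype itself, and the patterns and thresholds are illustrative), a type inferrer can vote over per-entry pattern matches while discounting likely missing-value indicators, so that a few anomalous cells do not derail the inference:

```python
import re

def infer_column_type(values, threshold=0.5):
    """Infer a column's type by majority vote over its entries,
    ignoring common missing-value indicators. A simplified,
    non-probabilistic stand-in for what ptype does robustly."""
    missing_indicators = {"", "na", "n/a", "null", "none", "-", "?"}
    patterns = {
        "integer": re.compile(r"^[+-]?\d+$"),
        "float": re.compile(r"^[+-]?\d*\.\d+([eE][+-]?\d+)?$"),
        "date": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    }
    counts = {t: 0 for t in patterns}
    n_valid = 0
    for v in values:
        s = str(v).strip()
        if s.lower() in missing_indicators:
            continue  # treat as missing, not as evidence for any type
        n_valid += 1
        for t, pat in patterns.items():
            if pat.match(s):
                counts[t] += 1
                break
    if n_valid == 0:
        return "string"
    best = max(counts, key=counts.get)
    # Fall back to string when no single type dominates the column.
    return best if counts[best] / n_valid >= threshold else "string"
```

A naive inferrer that required every entry to match would label a column such as `["1", "2", "NA", "3"]` as string; discounting the `"NA"` entry recovers the intended integer type.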
Change point detection is an important problem in time series analysis, since the presence of a change point indicates that the distribution of the data changed abruptly and significantly. Moreover, a change point can signal a data quality issue such as a change in the reporting method of a variable. While many different approaches to change point detection exist, little attention has been paid to the evaluation of these methods on real-world time series. In An Evaluation of Change Point Detection Algorithms we introduce a novel dataset designed specifically for this purpose, and use it to compare a wide variety of existing methods. The dataset and the benchmark study are made freely available to encourage the development of change point detection algorithms that perform well in practice.
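To make the task concrete, the textbook baseline below locates a single mean-shift change point by minimising within-segment squared error. It is a deliberately simple sketch, not one of the algorithms evaluated in the benchmark study:

```python
def detect_change_point(series):
    """Return the index at which a single mean-shift change point is most
    likely, by minimising total within-segment squared error (a classical
    baseline; real detectors handle multiple change points and noise)."""
    n = len(series)
    best_tau, best_cost = None, float("inf")
    for tau in range(1, n):  # candidate split: [0, tau) and [tau, n)
        left, right = series[:tau], series[tau:]
        mean_l = sum(left) / len(left)
        mean_r = sum(right) / len(right)
        cost = sum((x - mean_l) ** 2 for x in left) \
             + sum((x - mean_r) ** 2 for x in right)
        if cost < best_cost:
            best_tau, best_cost = tau, cost
    return best_tau

# A series whose mean jumps from roughly 0 to roughly 5 at index 5.
print(detect_change_point([0.1, -0.2, 0.0, 0.3, -0.1,
                           5.2, 4.9, 5.1, 5.0, 4.8]))  # 5
```

The difficulty the benchmark addresses is precisely that such clean jumps are rare in practice: real series mix gradual drift, seasonality, and reporting changes, and methods that excel on synthetic data can fail on them.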
The existence of outliers in real-world data is a problem data scientists face daily. In Robust Variational Autoencoders for Outlier Detection in Mixed-Type Data we treat the problem of unsupervised outlier detection and repair of cells in mixed-type datasets. We show experimentally that our Robust Variational Autoencoder (RVAE) not only performs better than several state-of-the-art methods in cell outlier detection and repair for tabular data, but is also robust to the initial hyper-parameter selection. This work will be presented at AISTATS 2020 and a repository with the code employed for the experiments is available here.
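For contrast with the RVAE approach, a classical per-column baseline for flagging outlying cells is the modified z-score based on the median and the median absolute deviation (MAD). This sketch is a standard statistical heuristic, not the paper's method:

```python
import statistics

def flag_cell_outliers(column, threshold=3.5):
    """Flag outlying cells in a numeric column via the modified z-score
    (median/MAD), a robust classical baseline. Unlike RVAE, it looks at
    one column at a time and handles only numeric data."""
    med = statistics.median(column)
    mad = statistics.median(abs(x - med) for x in column)
    if mad == 0:
        return [False] * len(column)  # no spread: nothing can be flagged
    return [abs(0.6745 * (x - med) / mad) > threshold for x in column]
```

A model like RVAE goes beyond this by learning the joint distribution across mixed-type columns, so it can flag a cell that is unremarkable on its own but inconsistent with the rest of its row.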
The first step in data science is importing a dataset into an analysis program. Tabular data is commonly stored in comma-separated value (CSV) files. However, these files do not have a standardised format and therefore often require manual inspection or repair before they can be imported. In Wrangling Messy CSV Files by Detecting Row and Type Patterns — recently published in the journal Data Mining and Knowledge Discovery — members of the AIDA team present a method for automatically detecting the formatting parameters of these kinds of files. Their method achieves 97% accuracy and improves the previous state of the art by almost 22% on messy CSV files. A software package that implements this technique is available here.
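For readers unfamiliar with the problem, Python's standard library already ships a simple dialect detector, `csv.Sniffer`, which the paper's method substantially outperforms on messy files. The snippet below (with a made-up messy file) shows the kind of detection involved:

```python
import csv
import io

# A hypothetical messy CSV fragment: non-standard delimiter and quote character.
messy = "name;age;city\n'Ada';36;'London'\n'Alan';41;'Wilmslow'\n"

# csv.Sniffer is the stdlib baseline for guessing formatting parameters;
# it infers a dialect (delimiter, quote character, ...) from a sample.
dialect = csv.Sniffer().sniff(messy)
rows = list(csv.reader(io.StringIO(messy), dialect))
print(dialect.delimiter)  # ;
```

Heuristics like this break down on files with inconsistent rows, embedded delimiters, or unusual quoting, which is where detecting row and type patterns jointly pays off.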
The AIDA project aims at developing a family of systems to provide a better (semantic) understanding of tabular data. ColNet is the first member of this family, predicting the semantic type of a table column. ColNet was presented at the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019) and an extended version has recently been accepted at the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019). See related papers and software packages here.
The AIDA team co-organises, together with IBM Research, a Semantic Web Challenge on Tabular Data to Knowledge Graph Matching. The challenge will be co-located with the 18th International Semantic Web Conference and the 14th International Workshop on Ontology Matching.
The first data analytics user-interface component to be implemented as part of this project is called Data Diff. It allows repeated data analysis tasks to be carried out more easily across different datasets, and has shown promising initial results. Data Diff was presented at the 2018 Knowledge Discovery and Data Mining Conference (KDD). See the related paper and software package.