Data analytics is the process of transforming a raw dataset into useful knowledge. By drawing on new advances in artificial intelligence and machine learning, this project aims to develop systems that help automate the data analytics process.
Explaining the science
Data analytics comprises many stages. While some elements of the data analytics process have benefited from considerable development through software or tools, there has been little methodological research into so-called data ‘wrangling’, even though this is often laborious and time-consuming, and accounts for up to 80% of a typical data science project.
Data wrangling includes understanding what data is available, integrating data from multiple sources, identifying missing, messy or anomalous data, and extracting features in order to prepare data for computer modelling.
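As a small illustration of the "identifying missing, messy or anomalous data" step, the following is a minimal sketch only, not the project's method. It assumes a hypothetical set of missing-value tokens and uses a common median-based outlier rule (a modified z-score with an assumed threshold of 3.5) on a single column of raw string values.

```python
import statistics

# Assumed spellings of "missing"; real datasets will need their own list.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "?"}


def is_missing(value):
    """Treat None and a few common placeholder strings as missing."""
    return value is None or str(value).strip().lower() in MISSING_TOKENS


def profile_column(values, threshold=3.5):
    """Return (indices of missing cells, indices of numeric outliers)."""
    missing = [i for i, v in enumerate(values) if is_missing(v)]
    missing_set = set(missing)

    # Keep only the cells that parse as numbers, remembering their positions.
    numeric = {}
    for i, v in enumerate(values):
        if i in missing_set:
            continue
        try:
            numeric[i] = float(v)
        except (TypeError, ValueError):
            pass  # non-numeric text is ignored in this sketch

    anomalies = []
    if len(numeric) >= 3:
        median = statistics.median(numeric.values())
        # Median absolute deviation: a spread estimate robust to outliers.
        mad = statistics.median(abs(x - median) for x in numeric.values())
        if mad > 0:
            anomalies = [
                i for i, x in numeric.items()
                if 0.6745 * abs(x - median) / mad > threshold
            ]
    return missing, anomalies


ages = ["21", "23", "", "22", "N/A", "950", "24"]
missing, anomalies = profile_column(ages)
print("missing at:", missing, "anomalous at:", anomalies)
```

A median-based rule is used here because a single extreme value can inflate the mean and standard deviation enough to hide itself. Other choices are equally reasonable.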
The project draws on new advances in artificial intelligence and machine learning to produce technology that will help automate each stage of the data analytics process. This technology will revolutionise the speed and efficiency with which data can be transformed into useful knowledge.
The project has the potential to dramatically improve the productivity of working data scientists and benefit researchers, industry, and government.
The first step in data science is importing a dataset into an analysis program. Tabular data is often stored in comma-separated value (CSV) files. However, these files have no standardised format and therefore often require manual inspection or repair before they can be imported. In "Wrangling Messy CSV Files by Detecting Row and Type Patterns", recently published in the journal Data Mining and Knowledge Discovery, members of the AIDA team present a method for automatically detecting the formatting parameters of these kinds of files. Their method achieves 97% accuracy and improves on the previous state of the art by almost 22% on messy CSV files. A software package that implements this technique is available here.
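To make "detecting the formatting parameters" concrete, here is a minimal sketch using the heuristic sniffer in Python's standard `csv` module. This is a stand-in for illustration and is not the AIDA team's method or package. The helper name `detect_and_parse` and the list of candidate delimiters are assumptions made for this example.

```python
import csv
import io


def detect_and_parse(text, candidates=",;\t|"):
    """Guess the delimiter of CSV text, then parse it with the guessed dialect."""
    # csv.Sniffer inspects the sample and returns a Dialect subclass;
    # it raises csv.Error when it cannot settle on a delimiter.
    dialect = csv.Sniffer().sniff(text, delimiters=candidates)
    rows = list(csv.reader(io.StringIO(text), dialect))
    return dialect.delimiter, rows


# A semicolon-separated file that a default comma-based import would misread.
sample = "name;age;city\nAlice;30;Paris\nBob;25;Berlin\n"
delimiter, rows = detect_and_parse(sample)
print("delimiter:", repr(delimiter))
print("header:", rows[0])
```

Simple frequency-based heuristics like this one can fail on messy files, for example when free-text fields contain the delimiter. That is the kind of case that motivates a more robust detection method.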
The AIDA project aims to develop a family of systems that provide a better (semantic) understanding of tabular data. ColNet, the first member of this family, predicts the semantic type of a table column. ColNet was presented at the 33rd AAAI Conference on Artificial Intelligence (AAAI 2019), and an extended version has recently been accepted at the 28th International Joint Conference on Artificial Intelligence (IJCAI 2019). See related papers and software packages here.
The AIDA team is co-organising, together with IBM Research, a Semantic Web Challenge on Tabular Data to Knowledge Graph Matching. The challenge will be co-located with the 18th International Semantic Web Conference and the 14th International Workshop on Ontology Matching.
The first data analytics user-interface component to be fully implemented as part of this project is called Data Diff. It makes it easier to repeat data analysis tasks on different datasets, and is showing extremely promising initial results. Data Diff was presented at the 2018 Knowledge Discovery and Data Mining Conference (KDD). See the related paper and software package.