Introduction
Scientists and businesses seek to understand ever larger and more complex datasets. This involves developing and then testing hypotheses, using a combination of human skill and computation. Any such undertaking requires both visualisations and statistical methods that help the researcher investigate hypotheses. There is a growing number of toolsets for data visualisation (e.g. D3.js, Bokeh, Datashader) and for machine learning (e.g. Weka, Scikit-Learn, Mahout), but the effort to integrate them in a single project falls on the user. This project aims to develop integrated solutions for high- and hyper-dimensional data modelling in health and the physical sciences.
Project aims
The project aims to develop proof-of-concept tools integrating JupyterLab, Scikit-Learn and Bokeh, providing an interactive environment for iteratively developing models of datasets. It will pair visualisation methods with machine learning methods, so that a single data science environment guides and aids the user both in building a predictive model and in producing a visualisation that shows the model's strengths and weaknesses.
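As a minimal sketch of the kind of integration envisaged (the dataset, classifier and 2-D projection below are illustrative placeholders, not project decisions), a Scikit-Learn model can be trained in a JupyterLab notebook and its predictions displayed through an interactive Bokeh plot:

```python
# Illustrative only: a Scikit-Learn classifier trained in JupyterLab,
# with its predictions shown through an interactive Bokeh scatter plot.
from bokeh.io import output_notebook, show
from bokeh.plotting import figure
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

output_notebook()  # render Bokeh plots inline in the notebook

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Project the high-dimensional data to 2-D purely for display.
X_2d = PCA(n_components=2).fit_transform(X_test)

p = figure(title="Predicted classes (PCA projection)",
           x_axis_label="PC1", y_axis_label="PC2",
           tools="pan,wheel_zoom,box_select,reset")
colours = ["navy", "olive", "firebrick"]
p.scatter(X_2d[:, 0], X_2d[:, 1],
          color=[colours[c] for c in y_pred], size=8, alpha=0.7)
show(p)
```

Bokeh's selection and zoom tools are the kind of interactivity the project would build on, letting the user interrogate individual predictions rather than a static summary.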
Many metrics exist for model quality, and users need help to understand which is most relevant in a particular context, so a significant part of the work will be devoted to visualising the quality of models. This will form part of a workflow that offers alternative predictive models and ultimately allows the user to build a consensus model.
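A hedged sketch of how such a workflow step might look, assuming Scikit-Learn's standard cross-validation and ensembling APIs (the candidate models and metrics are placeholders chosen for illustration):

```python
# Illustrative only: score several candidate classifiers on multiple
# metrics, then combine them into a consensus model with VotingClassifier.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

candidates = {
    "logreg": LogisticRegression(max_iter=5000),
    "forest": RandomForestClassifier(random_state=0),
    "svm": SVC(probability=True, random_state=0),
}

# Different metrics can rank the same models differently, which is why
# the project plans to visualise them rather than report a single score.
scoring = ["accuracy", "roc_auc", "f1"]
for name, clf in candidates.items():
    scores = cross_validate(clf, X, y, scoring=scoring, cv=5)
    summary = ", ".join(f"{m}={scores['test_' + m].mean():.3f}"
                        for m in scoring)
    print(f"{name}: {summary}")

# Consensus model: soft-vote over the candidates' predicted probabilities.
consensus = VotingClassifier(list(candidates.items()), voting="soft")
scores = cross_validate(consensus, X, y, scoring=scoring, cv=5)
print("consensus:", ", ".join(f"{m}={scores['test_' + m].mean():.3f}"
                              for m in scoring))
```

The envisaged tool would replace the printed table with linked Bokeh views, so the user can see where each candidate fails before committing to a consensus.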
Applications
The work will focus on high- and hyper-dimensional classification problems, but it will conclude with a short scoping study reviewing the potential of similar approaches for more complex outputs, e.g. multi-output regression. Further work could follow this initial project to produce a tool at a high technology readiness level, ready for use in the target areas of health and physical sciences.
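For context, Scikit-Learn already exposes a basic form of the multi-output regression mentioned above; the sketch below (using a synthetic dataset, purely for illustration) wraps a single-output regressor so that it predicts several targets at once:

```python
# Illustrative only: multi-output regression by wrapping a single-output
# regressor with Scikit-Learn's MultiOutputRegressor.
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.multioutput import MultiOutputRegressor

# Synthetic problem with three target variables.
X, y = make_regression(n_samples=200, n_features=10, n_targets=3,
                       random_state=0)

model = MultiOutputRegressor(GradientBoostingRegressor(random_state=0))
model.fit(X, y)
print(model.predict(X[:2]))  # one row per sample, three outputs each
```

The open question for the scoping study is not fitting such models but visualising their quality, since per-output metrics no longer reduce to a single score.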