Current machine learning (ML) development is frustrated by the 'two-language' problem: users and algorithm designers work in a high-level language (e.g. Python) for rapid development, while performance-critical code must be written in a low-level language, such as C. By solving the two-language problem, the Julia programming language can dramatically shorten the innovation cycle.
A machine learning toolbox provides a unified interface for interacting with multiple learning algorithms. An increasing number of such algorithms are available to Julia users, but no mature pure-Julia ML toolbox exists. The present project will transform an existing proof-of-concept into a fully featured working prototype, while reaching out to end users to drive future development.
Explaining the science
Two of the most important functions of a machine learning toolbox are parameter tuning and model composition.
The solution to a machine learning task rarely involves the application of a single machine learning algorithm. Simple model composition involves inserting pre-processing operations, such as data cleaning and dimension reduction, into a 'pipeline' which finishes in a predictive model. However, it may also be advantageous to combine the predictions of multiple models in innovative ways. For example, in a process called 'stacking', the predictions of multiple models are forwarded to an 'adjudicating' model, which learns how to combine the individual predictions optimally.
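The stacking idea can be sketched in a few lines. The toolbox itself targets Julia; the following is an illustrative Python sketch with hypothetical model names, in which two simple base models (a constant predictor and a least-squares line) are blended by an adjudicator that learns a single mixing weight:

```python
# Illustrative sketch of 'stacking': two base models make predictions,
# and an adjudicating model learns how to combine them. All names and
# models here are hypothetical stand-ins for real learners.

def fit_constant(y):
    """Base model 1: always predict the training mean."""
    mean = sum(y) / len(y)
    return lambda Xs: [mean for _ in Xs]

def fit_linear(X, y):
    """Base model 2: least-squares line through (x, y), x scalar."""
    n = len(X)
    mx, my = sum(X) / n, sum(y) / n
    slope = (sum((x - mx) * (t - my) for x, t in zip(X, y))
             / sum((x - mx) ** 2 for x in X))
    return lambda Xs: [my + slope * (x - mx) for x in Xs]

def fit_stack(X, y):
    """Adjudicator: learn a weight w blending the two base predictions."""
    base1, base2 = fit_constant(y), fit_linear(X, y)
    p1, p2 = base1(X), base2(X)

    def loss(w):
        # Squared error of the blend w*p2 + (1-w)*p1 on the training data.
        # (A real stack would train the adjudicator on out-of-sample
        # predictions to avoid over-fitting.)
        return sum((w * b + (1 - w) * a - t) ** 2
                   for a, b, t in zip(p1, p2, y))

    w = min((i / 100 for i in range(101)), key=loss)  # coarse grid over [0, 1]
    return lambda Xs: [w * b + (1 - w) * a
                       for a, b in zip(base1(Xs), base2(Xs))]

X = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]   # exactly linear, so the blend favours base2
stacked = fit_stack(X, y)
```

Because the training data is exactly linear here, the adjudicator learns to weight the linear base model fully, and the stack inherits its predictions.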
Every machine learning algorithm depends on a number of auxiliary parameters which the data scientist must tune to optimise its performance. Parameter tuning has two important aspects. First is the question of how to efficiently carry out a search over multiple parameters, each of which may take on a large range of conceivable values. Besides the naive systematic 'grid search', there are algorithms that perform a random search. Some random search techniques, known as 'genetic algorithms', are inspired by evolutionary processes in nature.
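The contrast between grid search and random search can be made concrete. Below is a hedged Python sketch (the hyperparameter names `depth` and `rate`, and the `score` function standing in for a real model evaluation, are all hypothetical):

```python
# Sketch of a search over two hyperparameters: an exhaustive grid search
# next to a random search with the same evaluation budget. 'score' is a
# stand-in for a real (expensive) model evaluation.
import itertools
import random

def score(depth, rate):
    # Hypothetical objective, maximised at depth=3, rate=0.1.
    return -((depth - 3) ** 2 + (rate - 0.1) ** 2)

depths = [1, 2, 3, 4, 5]
rates = [0.01, 0.1, 1.0]

# Grid search: systematically evaluate every combination.
best_grid = max(itertools.product(depths, rates), key=lambda p: score(*p))

# Random search: sample the same number of candidate pairs at random.
rng = random.Random(0)
budget = len(depths) * len(rates)
candidates = [(rng.choice(depths), rng.choice(rates)) for _ in range(budget)]
best_rand = max(candidates, key=lambda p: score(*p))
```

Random search needs no predefined grid and often finds good settings with fewer evaluations when only a few parameters matter, which is one reason a toolbox should support both strategies behind one interface.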
The second concern of tuning is how to avoid over-fitting the training data. A model's estimated performance will be overly optimistic if the data used to evaluate performance is the same data used to train it. The simplest strategy for mitigating this problem is to test performance on a holdout set, but more sophisticated resampling strategies exist.
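The holdout strategy amounts to a single random split of the data. A minimal Python sketch (the helper name `train_test_split` is illustrative, not part of any stated API):

```python
# Minimal holdout split: reserve a random fraction of the data for
# evaluation only, so performance is measured on data the model never saw.
import random

def train_test_split(X, y, test_fraction=0.25, seed=0):
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)           # random, reproducible split
    cut = int(len(idx) * (1 - test_fraction))
    train, test = idx[:cut], idx[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])

X = list(range(20))
y = [2 * x for x in X]
Xtr, ytr, Xte, yte = train_test_split(X, y)    # 15 train, 5 test examples
```

More sophisticated resampling strategies, such as cross-validation, repeat this idea over several complementary splits and average the resulting performance estimates.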
The present phase of development will deliver a Julia ML toolbox providing the following functionality:
- A flexible API for complex model composition, such as stacking
- A repository of externally implemented model metadata, for facilitating composite model design and for matching models to problems, through an MLR-like task interface
- Systematic tuning and benchmarking of models having possibly nested hyperparameters
- Unified interface for handling probabilistic predictors and multivariate targets
- Agnostic data containers
- Careful handling of categorical data types
A carefully designed and well documented interface will be key to encouraging implementation by existing and new Julia algorithm developers. A key aim for the project is to engage others already working in the Julia machine learning space, and to help them integrate their projects with the Turing's.
As a good machine learning toolbox is an essential ingredient in any data science workflow, the project can potentially impact any application area of AI and data science, including healthcare, criminal justice, public policy, urban analytics, multiple areas of scientific research, data-centric engineering, finance and economics.
- Registered version of MLJ software released