Current data science pipelines often lack reproducibility and transparency. Both can be enabled in the open-source machine learning library Shogun by adding support for two important data science standards: openML for structured pipeline representations and coreML for portable model representations. This will allow data scientists to exchange and analyse their workflows more easily.
Explaining the science
To support both openML and coreML, the Shogun API needs to expose model parameters, and in particular this access needs to be unified across models. As Shogun is being developed with unification across different interface languages in mind, we can use Shogun’s parameter framework as the main building block to serve both standards.
A particular challenge in serving the openML standard is the difference between statically typed, compiled languages (such as C++, where every variable has a fixed type known before the program is executed) and dynamically typed, interpreted languages (such as Python, where the types of variables are only known once the program is executed).
For example, in Python code, model parameters can often be extracted by a simple
param = model.get_parameter("my_param") or even
param = model.my_param, where the parameter is identified by a string and its type is not specified before code execution. In contrast, an equivalent call in C++ code would traditionally be
DoubleMatrix mat = Model::get_my_param(), i.e. the return type needs to be known in advance and the parameter is identified by a method name.
To export or serialise models to the openML Python package, the names and types of the parameters are queried first, and the values are then extracted one by one. Shogun’s parameters can already be accessed in a unified manner via
DoubleMatrix mat = Model::get("my_param"). While specifying the parameter with a string (as in Python) simplifies parameter identification at runtime, the return type (DoubleMatrix) still needs to be known before executing the code. Thus, an initial challenge of this project is to dispatch on the parameter’s type automatically within the Python interface. This will allow Shogun’s C++ code to be called from Python with a simple
mat = model.get("matrix").
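The dispatch idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class TypedModel, its typed getters (get_float, get_int, get_float_matrix), the method parameter_type, and the type tags are all hypothetical stand-ins for a statically typed backend and are not part of Shogun’s API.

```python
from typing import Any, Dict, List

Matrix = List[List[float]]


class TypedModel:
    """Hypothetical stand-in for a statically typed backend.

    Each parameter is registered with a type tag, and each getter returns
    exactly one type, mimicking typed C++ accessors.
    """

    def __init__(self) -> None:
        # name -> (type tag, value); the tags are made up for this sketch
        self._params: Dict[str, Any] = {
            "C": ("float", 1.0),
            "max_iter": ("int", 100),
            "matrix": ("float_matrix", [[1.0, 0.0], [0.0, 1.0]]),
        }

    def parameter_type(self, name: str) -> str:
        return self._params[name][0]

    def _typed_value(self, name: str, expected: str) -> Any:
        type_tag, value = self._params[name]
        if type_tag != expected:
            raise TypeError(f"{name!r} is registered as {type_tag}, not {expected}")
        return value

    def get_float(self, name: str) -> float:
        return self._typed_value(name, "float")

    def get_int(self, name: str) -> int:
        return self._typed_value(name, "int")

    def get_float_matrix(self, name: str) -> Matrix:
        return self._typed_value(name, "float_matrix")


def get(model: TypedModel, name: str) -> Any:
    """Look up the registered type tag, then call the matching typed getter."""
    getters = {
        "float": model.get_float,
        "int": model.get_int,
        "float_matrix": model.get_float_matrix,
    }
    try:
        type_tag = model.parameter_type(name)
    except KeyError:
        raise KeyError(f"unknown parameter: {name}") from None
    return getters[type_tag](name)


model = TypedModel()
mat = get(model, "matrix")  # the caller never names the return type
print(mat)
print(type(get(model, "max_iter")).__name__)
```

The caller only supplies a string, while the choice of typed accessor is made at runtime from the registered type tag. A real binding layer would perform this lookup inside the interface code rather than in user-facing Python.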
Connecting Shogun to the openML standard will allow Shogun pipelines to be exported in the structured openML format. This format serves two main purposes. First, it makes workflows easy to exchange: instead of separately sharing code, data, and parameters in the traditional way, scientists simply share a workflow which contains all the information necessary to reproduce it locally. Second, the openML website allows workflows to be uploaded to its database, where they are indexed and can be queried. This allows researchers to learn from each other and, more importantly, to perform meta learning across the workflows of potentially thousands of users.
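To make the idea of a self-describing workflow concrete, here is a minimal sketch that bundles component names, parameter names, types, and values into one structured document using only Python’s standard library. It does not use the openML package and does not follow the actual openML format; the function names, the field names, and the example components ("scaler", "svm", "example-dataset") are assumptions made purely for illustration.

```python
import json
from typing import Any, Dict, List, Tuple


def describe_component(name: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Return a JSON-serialisable description of one pipeline component."""
    return {
        "name": name,
        "parameters": [
            {"name": key, "type": type(value).__name__, "value": value}
            for key, value in sorted(params.items())
        ],
    }


def export_workflow(components: List[Tuple[str, Dict[str, Any]]], dataset: str) -> str:
    """Bundle the dataset reference and all components into one JSON document."""
    workflow = {
        "dataset": dataset,
        "components": [describe_component(name, params) for name, params in components],
    }
    return json.dumps(workflow, indent=2, sort_keys=True)


# Hypothetical two-step pipeline; names and values are illustrative only.
workflow_json = export_workflow(
    [("scaler", {"with_mean": True}), ("svm", {"C": 1.0, "kernel_width": 2.0})],
    dataset="example-dataset",
)

# A recipient can restore the full description from the single document.
restored = json.loads(workflow_json)
print(restored["components"][1]["parameters"])
```

Because every parameter is recorded with its name, type, and value, the recipient needs nothing beyond this one document and the referenced dataset to rebuild the same configuration.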
Adding support for the coreML standard will allow users to store machine learning models in a library-independent manner. Once a model has been trained (for example in one of Shogun’s interfaces, scikit-learn, or Google’s TensorFlow), it can be exported and exchanged in the coreML format. These stored models can then be executed on any platform which supports coreML models (e.g. any Apple product, open source libraries).
In particular, the execution is independent of the framework and of the data that were used to build and train the model in the first place. Furthermore, since the coreML model execution framework is stand-alone, it can be specifically engineered for on-device performance (e.g. on smartphones).
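The separation between training and execution can be illustrated with a toy example. The sketch below stores a trained linear classifier as plain data and evaluates it with a stand-alone function that knows nothing about the training library. It is not the coreML format and does not use any coreML tooling; the JSON layout, the function names, and the weights are hypothetical.

```python
import json
from typing import List


def export_linear_model(weights: List[float], bias: float) -> str:
    """Serialise a trained linear classifier as framework-independent data."""
    return json.dumps(
        {"model_type": "linear_classifier", "weights": list(weights), "bias": bias}
    )


def predict(serialised: str, features: List[float]) -> int:
    """Stand-alone executor: needs only the stored model, not the training code."""
    spec = json.loads(serialised)
    if spec["model_type"] != "linear_classifier":
        raise ValueError(f"unsupported model type: {spec['model_type']}")
    if len(features) != len(spec["weights"]):
        raise ValueError("feature vector has the wrong length")
    score = sum(w * x for w, x in zip(spec["weights"], features)) + spec["bias"]
    return 1 if score >= 0 else -1


# Weights as they might come out of any training library (illustrative values).
portable = export_linear_model([0.5, -1.0], 0.25)

print(predict(portable, [1.0, 0.0]))
print(predict(portable, [0.0, 1.0]))
```

Since the executor depends only on the stored description, it can be reimplemented and optimised separately for each target platform without touching the library that trained the model.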
Many industrial software systems are constrained in their choice of computing environment. For example, Java and C# are very commonly used outside academia, yet there are few open-source machine learning frameworks available for these platforms. On the other hand, academic communities are sometimes isolated from each other’s code because each has settled on a single language (e.g. R vs Matlab vs Python vs Julia). Shogun, with its unified interface across multiple languages, therefore accommodates requirements that a single-language framework cannot.
Combining this lower integration barrier with the transparency of openML workflows and the portability of coreML will enable Shogun to serve a wide range of machine learning use cases in industrial and academic applications.
Transparency and portability are key for industrial machine learning applications. Transparency about how a particular prediction was produced (as represented in a reproducible openML workflow) is one of the most important requirements for embedding machine learning models within decision making systems, in particular for critical decisions that involve people (intensive care, insurance, etc.), or critical utility industries (gas pipe networks, drilling, etc.).
As Apple has demonstrated by embedding coreML into its mobile products (Siri, etc.), being able to reliably roll out previously built machine learning models at a large scale makes it easier to embed prediction algorithms into applications in various domains, such as entertainment. Mobile devices are not limited to entertainment, however: they are present in a much wider range of industries, such as autonomous vehicles, mobile monitoring devices, and security monitoring systems, to name just a few.
- Additional C++ types are exposed to the Shogun interfaces
- Shogun’s Python interface becomes compatible with openML
- First openML workflow exported
- First coreML model exported
- Eleftherios Avramidis joins the project as second developer
- Project start, Gil Hoben joins as main developer, Viktor Gal and Heiko Strathmann join as project coordinators.