Modern computer systems often expose many tunable parameters so that they can accommodate a range of workloads, including ones that require parameters to change dynamically. Even for experts, a lengthy trial-and-error process is needed to gain a sufficient understanding of a customer's workloads.
Tuning complex, high-dimensional systems with standard optimisation methods is a resource-intensive task, and individual experiments may take hours. This project attempts to alleviate these constraints by injecting expert knowledge into the optimisation procedure, guiding the model towards high-performing configuration regions using a novel approach called 'structured Bayesian optimisation' (SBO). The project also investigates reinforcement learning (RL), aiming to bring similar improvements to the control of dynamically evolving tasks such as scheduling or resource management. Optimisation based on SBO or RL can be used to build an adaptive and robust tuner that targets optimal performance.
These methodologies are being applied to real-world applications to demonstrate the efficacy of a new generation of machine learning-based optimisation techniques. The applications include ASIC design on a combined hardware and software simulation platform, and traffic signal control in London.
Visit the Cambridge project page for more information.
Explaining the science
Structured Bayesian optimisation
The project's 'BespOke Auto-Tuner' (BOAT) framework allows developers to build efficient bespoke auto-tuners that complete their iterative evaluations much faster. The core of BOAT is a novel extension of Bayesian optimisation, called 'structured Bayesian optimisation' (SBO), which leverages contextual information in the form of a probabilistic model of the system's behaviour.
BOAT provides a probabilistic C++ library for building such a model. Adding structural information to the probabilistic model of the objective function allows Bayesian optimisation to outperform standard Gaussian processes by orders of magnitude. An initial case study has tuned the hyper-parameters of a convolutional neural network to optimise the model's accuracy, and has examined the relationships between these hyper-parameters.
This probabilistic model will be used to obtain optimal hyper-parameter values for target tasks. A diverse range of optimisation objectives could be studied, such as model size or robustness to adversarial images.
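As a rough illustration of what 'adding structural information' can mean, the sketch below fits a small parametric model in place of a generic surrogate. Everything in it is an assumption made for illustration: the runtime form a / n + b * n, the function names and the sample measurements are hypothetical, the fit is plain least squares rather than a full probabilistic model, and none of it reflects BOAT's actual C++ API.

```python
def fit_runtime_model(observations):
    """Least-squares fit of runtime(n) = a / n + b * n.

    `observations` is a list of (n_workers, measured_runtime) pairs.
    The functional form is the assumed expert knowledge: work that
    parallelises (a / n) plus coordination overhead that grows (b * n).
    """
    s11 = s12 = s22 = t1 = t2 = 0.0
    for n, y in observations:
        x1, x2 = 1.0 / n, float(n)  # the two features of the assumed model
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        t1 += x1 * y
        t2 += x2 * y
    # Solve the 2x2 normal equations for a and b.
    det = s11 * s22 - s12 * s12
    a = (t1 * s22 - t2 * s12) / det
    b = (s11 * t2 - s12 * t1) / det
    return a, b


def predict(a, b, n):
    """Predicted runtime for n workers under the fitted model."""
    return a / n + b * n


def best_worker_count(a, b, candidates):
    """Candidate with the lowest predicted runtime."""
    return min(candidates, key=lambda n: predict(a, b, n))


# Hypothetical measurements from three experiments.
observations = [(2, 60.6), (8, 17.4), (32, 13.35)]
a, b = fit_runtime_model(observations)
best = best_worker_count(a, b, range(1, 65))
print(f"a={a:.2f}, b={b:.2f}, suggested workers={best}")
```

With only three measurements the structured model already commits to a specific region of the search space, whereas a generic surrogate with no knowledge of the runtime's shape would typically need more samples to locate the same region.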
Reinforcement learning
RL has distinct advantages for computer systems. Combinatorial optimisation and discrete decision-making problems have been identified as promising avenues, because they are difficult to address with optimisation methods that target continuous functions (e.g. Bayesian optimisation). The lack of standard software and tools is an obstacle in contemporary research: researchers must either repeatedly re-implement the same set of standard algorithms or rely on open-source implementations that are poorly understood or poorly motivated.
Computer systems experiments differ from typical RL research domains. They are more expensive to execute than common RL benchmarks such as games, but are easier to parallelise and restart than other traditional RL domains such as robotics. Moreover, contemporary computer systems provide vast amounts of real-time monitoring and performance information, which can be used in the form of historical traces to extract initial model information, thus shortening training times.
This project will produce a software stack that addresses two levels: the algorithmic level, through a standard interface to common algorithms, and the system model level. This approach will help to connect typical systems, such as databases or stream engines, to a reinforcement learning training and execution cycle. An initial case study could be resource management or scheduling in stream processing.
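One way to picture such a connection is a small environment interface that any system can implement, so that the same training loop drives all of them. The sketch below is an assumption for illustration only: `SystemEnvironment`, `ToyStreamQueue` and `run_episode` are hypothetical names, the queue dynamics and reward are invented, and this is not the interface of the project's own tools.

```python
from abc import ABC, abstractmethod


class SystemEnvironment(ABC):
    """Hypothetical minimal interface between a system and an RL loop."""

    @abstractmethod
    def reset(self):
        """Put the system in its initial state and return the first observation."""

    @abstractmethod
    def step(self, action):
        """Apply an action and return (observation, reward, done)."""


class ToyStreamQueue(SystemEnvironment):
    """Toy stand-in for a stream engine: a queue drained by workers."""

    def __init__(self, arrivals_per_tick=3, horizon=10, worker_cost=0.5):
        self.arrivals_per_tick = arrivals_per_tick
        self.horizon = horizon
        self.worker_cost = worker_cost
        self.queue = 0
        self.tick = 0

    def reset(self):
        self.queue = 0
        self.tick = 0
        return self.queue

    def step(self, action):
        workers = max(0, int(action))            # action = workers to run this tick
        self.queue += self.arrivals_per_tick     # new items arrive
        self.queue -= min(self.queue, workers)   # each worker drains one item
        self.tick += 1
        # Penalise both the remaining backlog and the resources used.
        reward = -float(self.queue) - self.worker_cost * workers
        return self.queue, reward, self.tick >= self.horizon


def run_episode(env, policy):
    """Drive any SystemEnvironment with a policy; return the total reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done = env.step(policy(obs))
        total += reward
    return total


idle_total = run_episode(ToyStreamQueue(), lambda backlog: 0)
matched_total = run_episode(ToyStreamQueue(), lambda backlog: 3)
print(idle_total, matched_total)
```

Because `run_episode` depends only on `reset` and `step`, a database or a real stream engine could in principle be swapped in behind the same interface without changes to the learning code.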
Daniel Hernández-Lobato, José Miguel Hernández-Lobato, Amar Shah, Ryan P. Adams: Predictive Entropy Search for Multi-objective Bayesian Optimization, ICML, 2016.
M. Schaarschmidt, S. Mika, K. Fricke and E. Yoneki: RLgraph: Flexible Computation Graphs for Deep Reinforcement Learning, 2018.
V. Dalibard, M. Schaarschmidt, and E. Yoneki: BOAT: Building Auto-Tuners with Structured Bayesian Optimization. WWW, Systems and Infrastructure Track, 2017.
The proposed research will investigate how Bayesian optimisation and deep reinforcement learning can be applied to the automatic performance tuning of systems whose parameter space is high-dimensional and complex. Modern computer systems often have many tunable parameters to accommodate a wide range of workloads, but fine-tuning a computer system is challenging and requires extensive knowledge and experience. A lengthy trial-and-error process is needed to gain a sufficient understanding of customer workloads, such as the training of machine learning models.
Methods like Bayesian optimisation can reliably identify high-performing configurations in few iterations, often outperforming default configurations in throughput, latency, and resource usage by more than 50%. As compute infrastructure is increasingly deployed in multi-cloud, hybrid-cloud and heterogeneous environments, modern data-driven service architectures need to be constantly tuned to evolving conditions. The proposed project focuses on the two aspects below and applies them to the case studies.
First, tuning complex configurations that exceed 20 to 30 parameters is a resource-intensive task when using standard optimisation methods, because the space of possible configurations grows exponentially. Contemporary work on evolutionary methods or reinforcement learning often seeks to address these limitations by introducing massive parallelism and leveraging large-scale cluster computing resources, which limits the practical utility of these methods to a few organisations.
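The exponential growth is easy to make concrete with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that every parameter takes one of three values and that each experiment takes one hour; both figures and the helper name `grid_size` are hypothetical.

```python
def grid_size(num_parameters, values_per_parameter):
    """Number of configurations in an exhaustive grid over all parameters."""
    return values_per_parameter ** num_parameters


HOURS_PER_YEAR = 24 * 365

for num_parameters in (5, 10, 20, 30):
    configs = grid_size(num_parameters, 3)   # assumed: 3 values per parameter
    years = configs / HOURS_PER_YEAR         # assumed: 1 hour per experiment
    print(f"{num_parameters:>2} parameters: {configs:,} configurations, "
          f"about {years:,.1f} years of sequential experiments")
```

Even under these modest assumptions, exhaustive search is out of reach well before 20 parameters, which is why the number of evaluations an optimiser needs matters more than the cost of any single step.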
This project alleviates these constraints by injecting human expert knowledge into the optimisation procedure, quickly guiding the model towards high-performing configuration regions. Recent work on 'structured Bayesian optimisation' (SBO) has produced better configurations and faster optimisation in domains such as distributed machine learning training.
Second, reinforcement learning will be investigated with the aim of bringing similar improvements to the control of dynamically evolving tasks such as scheduling or resource management. Classical reinforcement learning methods suffer from large training data requirements and a lack of stability, and consequently from impractical training times. The proposed work seeks to address these limitations by leveraging existing trace data and expert knowledge to guide the learning process via human demonstration.
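A very simple way to see how trace data can give a learner a head start is to derive an initial policy from recorded decisions before any training takes place. The sketch below is a hypothetical illustration, not the project's method: it uses a tabular majority vote over (state, action) pairs, and the state and action names are invented.

```python
from collections import Counter, defaultdict


def policy_from_traces(traces):
    """Map each state seen in the traces to the action taken most often in it."""
    counts = defaultdict(Counter)
    for state, action in traces:
        counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


def act(policy, state, default_action):
    """Follow the demonstrated action, or fall back for states never seen."""
    return policy.get(state, default_action)


# Hypothetical operator decisions extracted from monitoring logs.
traces = [
    ("low_load", "scale_down"),
    ("low_load", "scale_down"),
    ("low_load", "hold"),
    ("high_load", "scale_up"),
    ("high_load", "scale_up"),
]

initial_policy = policy_from_traces(traces)
print(act(initial_policy, "high_load", "hold"))
print(act(initial_policy, "never_seen", "hold"))
```

A policy initialised this way starts from sensible behaviour in the states the traces cover, so subsequent reinforcement learning would only need to refine it and to handle the states that were never demonstrated.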
The project's researchers have implemented an initial open-source tool to realise the above goals. These methodologies will be applied to real-world applications to demonstrate the efficacy of a new generation of machine learning-based optimisation techniques. The current plans for case studies include ASIC design on a combined hardware and software simulation platform, and traffic signal control in London.
Moreover, benchmarking SBO and RL on various case studies (e.g. device allocation in neural network model training, LLVM-based compiler optimisation, neural network hyper-parameter tuning, JVM garbage collection, cluster scheduling, database query indexing, stream processing) will demonstrate the significance of the methodologies.
As the project evolves and its findings reach more potential users and domains, the case studies will be extended to wider communities.