An increasing number of non-specialist academics are interested in exploring popular machine learning and artificial intelligence software, but lack the skills required to configure and deploy these tools effectively on their centrally managed university HPC systems. This project will explore the use of community-developed HPC build and deployment tools to provision some of the most requested data science software on the University of Warwick HPC clusters. This will enable a wider range of researchers to implement data science workflows at scale, exploiting existing investment in HPC hardware infrastructure at Warwick and elsewhere.
Explaining the science
Despite the ever-increasing power of laptop and desktop computing hardware, some data science projects inevitably require access to large-scale high performance computing (HPC) clusters to tackle large data sets and simulations. In the academic context, this usually means university-managed HPC clusters.
The prevailing model of software provision on a managed HPC system is to build each package from source code, optimised and tested for the specific cluster hardware. Software is not visible to all users by default, to prevent conflicts between packages; each user loads only the software they need (and its dependencies) via 'environment modules'.
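To make the environment-modules model concrete, the sketch below imitates what a `module load` command does behind the scenes: an environment module is essentially a scripted, reversible set of environment-variable edits. The installation path here is hypothetical, and on a real cluster these edits would be performed by the module system (e.g. Lmod) rather than by hand.

```shell
# Hypothetical install prefix; on a real cluster this would be set by
# something like 'module load Python/3.10.4' rather than by hand.
SOFT_ROOT=/opt/hpc/software/python/3.10.4

# A module file typically prepends the package's directories to the
# relevant search paths, so this build shadows any system default.
export PATH="$SOFT_ROOT/bin:$PATH"
export LD_LIBRARY_PATH="$SOFT_ROOT/lib:${LD_LIBRARY_PATH:-}"

# The module's bin directory is now searched first.
echo "$PATH" | cut -d: -f1   # prints /opt/hpc/software/python/3.10.4/bin
```

Because the edits are scripted, the module system can undo them again (`module unload`), which is what allows many conflicting versions of a package to coexist on one cluster.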
This model poorly supports the software tools needed by users wishing to move data science workflows onto managed HPC systems. For example, installation instructions for popular deep learning, computer vision and statistical packages typically assume a single-user environment in which the user has administrative (root) privileges, and in which globally installing specific versions of library dependencies has no consequences for other software on the system. In many cases, the developers of these tools actively discourage users from building the software from source, because it depends on a large number of immature and rapidly changing libraries. This lack of portability makes optimal use of existing HPC platforms problematic at best.
Workarounds such as containerisation and virtualisation, while valuable for capturing snapshots of a software stack for reproducibility, still require users to build suitable containers. When targeting an HPC system, building an effective container with appropriate support for GPU co-processors and the cluster's specific low-latency network technology is not a task most users are willing or equipped to undertake. Similarly, pre-built packages available from repositories (e.g. pypi.org) are generic binaries that are rarely well suited to a managed HPC environment.
This project is implementing support for popular data science and artificial intelligence toolkits within community-developed HPC tools for building and deploying scientific software.
This project will proactively develop improved support in community HPC build tools for the most frequently requested data science toolkits (e.g. TensorFlow, Keras, PyTorch), their many dependencies, and the various interfaces to programming languages popular in the data science community. The work spans both x86 and IBM OpenPOWER HPC platforms, which are in use at Warwick and at other Turing partner universities.
In the ideal scenario, the latest stable versions of the most popular toolkits will be deployed automatically and with hardware-specific optimisation, giving researchers rapid access to the very latest tools without the time sink of fighting build dependencies and software stack incompatibilities.
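Although the text does not name specific build tools, Spack and EasyBuild are two widely used community frameworks of the kind described. Purely as an illustrative, hedged sketch (package names follow Spack's conventions, but the versions and variants a site pins would differ, and the commands require a working Spack installation on the cluster), automated deployment might look like:

```shell
# Hedged sketch, not a documented workflow for this project.
# Spack resolves each toolkit's dependency tree and builds it
# against the cluster's own compilers, CUDA stack and interconnect.
spack install py-tensorflow +cuda   # TensorFlow with GPU support
spack install py-torch +cuda        # PyTorch likewise
spack install py-keras              # resolved against the same stack

# Regenerate module files so users can load the optimised builds
# through the familiar environment-modules interface.
spack module tcl refresh -y
module load py-tensorflow           # real module names include a version
```

The point of such tooling is that the recipe, not the user, encodes the dependency and optimisation knowledge, so a successful build can be reproduced on other clusters with different hardware.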
This project underpins any data science project that requires the use of shared computing facilities. The HPC facilities managed by the Scientific Computing Research Technology Platform at the University of Warwick are being used as a test case. Current use of this facility for data science workflows includes applications in particle physics, engineering and the Warwick Business School, demonstrating the need to support non-specialist users outside the traditional data science domains of mathematics, statistics and computer science.