There is growing academic and commercial interest in the measurement and analysis of hardware failures in pivotal data centres and high performance computing (HPC) systems. Recent error detection and failure diagnosis frameworks, which use data about how much computing resource is being used in combination with failure logs, have shown increased accuracy over using failure logs alone. In collaboration with Intel, this project aims to develop new frameworks to better detect, diagnose, and predict system errors and failures.
Explaining the science
Computing systems for modern-day HPC and data centres are changing rapidly as new technologies and software emerge. These systems can generate massive amounts of often unstructured data of many different types. It is therefore crucial to find the right types of data and analyse them rapidly in order to detect, diagnose, and predict system errors and failures efficiently. This is an important and challenging task for improving the reliability and uptime of computing systems, and its importance is demonstrated by the increasing number of large-scale failure analysis studies being published.
A significant body of research has also shown the value of failure logs for managing failures. Recent error detection and failure diagnosis frameworks, which use data about how much computing resource is being used in combination with failure logs, have shown increased accuracy over using failure logs alone.
As an example, when the use of system resources for computing processes correlates with the use of resources for memory allocation, and these activities occur at the same time as memory errors, this indicates that memory allocation activities are associated with the generation of memory errors. It is therefore possible to use the monitoring data, or ‘counters’, for these correlated activities to assess the state of memory allocation in a system, and then to use any associated memory errors to identify which applications are causing them.
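The correlation idea in this example can be illustrated with a minimal sketch. It is not the project's framework: the counter names, the sample values, and the choice of the Pearson coefficient are all assumptions made for illustration. It measures how strongly a memory allocation counter and a count of logged memory errors move together across the same monitoring intervals.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no correlation signal
    return cov / sqrt(var_x * var_y)

# Hypothetical per-interval samples for one node (illustrative values only).
alloc_counter = [120, 135, 400, 420, 130, 410]  # memory allocation activity
memory_errors = [0, 0, 3, 4, 0, 2]              # memory errors logged per interval

r = pearson(alloc_counter, memory_errors)
print(f"correlation between allocation activity and memory errors: {r:.2f}")
```

A coefficient close to 1 would flag an association worth investigating. On its own it does not show that memory allocation causes the errors.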
This project involves studying the nature and characteristics of system errors and failures, developing new data-processing methodologies, and implementing tools for testing on actual large cluster systems. The knowledge gained from the study can then be used to develop error recovery strategies. The reports generated by these data-processing methodologies can be used to support data centre systems administrators in system diagnosis and failure prediction. In addition, the tools that will be implemented have the potential to be used in automating diagnostics workflows.
Specifically, this project is producing a framework for analysing and reporting error propagation patterns and degrees of success and failure of error recovery protocols. The framework uses both failure logs and resource use data in its analysis. It has the potential to be adapted for application to any cluster system or supercomputer that generates resource use data and failure logs.
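One way to picture the combination of failure logs and resource use data is a join on node and time interval. The sketch below rests on assumptions and is not the framework itself: the record layouts, field names, application names, and the 600-second sampling interval are all hypothetical. It attributes each logged error to the application that the resource use data shows as active on that node during that interval.

```python
from collections import Counter

# Hypothetical records; field names and values are illustrative assumptions.
resource_samples = [
    # (node, interval_start_seconds, application)
    ("node01", 0, "app_a"),
    ("node01", 600, "app_b"),
    ("node02", 0, "app_a"),
    ("node02", 600, "app_c"),
]
failure_log = [
    # (node, timestamp_seconds, error_type)
    ("node01", 650, "memory"),
    ("node01", 900, "memory"),
    ("node02", 100, "file_system"),
]
INTERVAL = 600  # assumed sampling interval in seconds

def errors_per_application(samples, log, interval):
    """Count logged errors by the application active on that node and interval."""
    active = {(node, start): app for node, start, app in samples}
    counts = Counter()
    for node, timestamp, error_type in log:
        start = (timestamp // interval) * interval  # align to interval start
        app = active.get((node, start))
        if app is not None:  # skip errors with no matching resource sample
            counts[(app, error_type)] += 1
    return counts

counts = errors_per_application(resource_samples, failure_log, INTERVAL)
for (app, error_type), n in sorted(counts.items()):
    print(app, error_type, n)
```

Keying the lookup on node and interval start keeps the join simple. Real logs would also need clock alignment between sources and handling for nodes shared by several applications.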
The framework has been applied to resource use data and failure logs on three different large HPC systems operated by the Texas Advanced Computing Center. The analyses generated by the framework have revealed many interesting insights into patterns of memory allocation and memory leaks, communication and file-system I/O errors, and chipset and memory errors.
The framework will continue to be tested on the HPC systems at the Texas Advanced Computing Center as well as at other data centres that operate HPC systems that generate resource use data and failure logs.
For more information, please contact The Alan Turing Institute