Edward commenced his doctoral studies at the University of Warwick in October 2016. His current interests lie at the intersection of fault tolerance, distributed systems and
Modern data centres and HPC systems comprise complex combinations of networks, processors, storage systems and operating systems. Recent research has demonstrated the value of combining system failure logs with resource utilisation data for failure diagnosis and error detection. However, the massive volume of data that large HPC systems generate makes it challenging to process that data for effective failure diagnosis.
Edward's PhD research addresses this challenge by developing a new data-driven framework for error propagation and failure diagnosis. The framework uses resource usage data and system logs in its analyses. He evaluated different feature extraction methods and correlation algorithms and implemented two diagnostic workflows, CORRMEXT and EXERMEST. CORRMEXT successfully identified error propagation and recovery patterns that occur frequently, while EXERMEST successfully identified error propagation paths and error recovery attempts that are rare.