High performance computing (HPC) systems are transforming how information and data are stored and processed. This paradigm, however, brings the challenge of troubleshooting failures within complex infrastructures while keeping them operational. Analyzing failure behavior in these environments is difficult for two main reasons. First, these infrastructures contain thousands of nodes with a vast number of components and processors, any of which will eventually fail. Second, anomalies arise from variations in hardware, software, and human-computer interaction. Diagnosing failures also poses a practical dilemma for the systems administrators who monitor these environments. Shutting down an HPC cluster, even temporarily, must be carefully considered: it requires extensive preparation and planning and can impede the organizational processes that depend on the cluster. On the other hand, resuming operations without a proper failure diagnosis is risky, because anomalies can accumulate and seriously harm the system. Care must therefore be taken both to troubleshoot failures in these infrastructures and to monitor anomalies in real time to ensure overall system resiliency.

At CRC, we’re developing ways of addressing these concerns from both systems and resource allocation perspectives. From a systems perspective, we examine the nuances of failures in these environments by understanding job allocation at both the micro and the enterprise level. To do this, we model and simulate failure behavior, examine the resulting system failure logs, and automatically correlate the resulting information. From a resource allocation perspective, we analyze how jobs are scheduled by examining system throughput to identify potential bottleneck scenarios.
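To make the log-correlation step concrete, the sketch below groups failure-log entries that occur close together in time, a simple form of temporal correlation. The log entries, node names, and the five-minute window are hypothetical assumptions for illustration, not the actual CRC pipeline; real system logs would first be parsed into (timestamp, node, message) tuples.

```python
from datetime import datetime, timedelta

# Hypothetical failure-log entries: (timestamp, node, message).
LOGS = [
    ("2024-03-01 02:14:05", "node017", "ECC memory error"),
    ("2024-03-01 02:14:40", "node018", "ECC memory error"),
    ("2024-03-01 02:15:10", "node019", "kernel panic"),
    ("2024-03-01 09:30:00", "node042", "disk I/O timeout"),
]

def correlate(logs, window=timedelta(minutes=5)):
    """Group entries whose timestamps fall within `window` of the
    previous entry: a minimal temporal-correlation pass."""
    parsed = sorted(
        (datetime.strptime(ts, "%Y-%m-%d %H:%M:%S"), node, msg)
        for ts, node, msg in logs
    )
    clusters, current = [], []
    for entry in parsed:
        # Start a new cluster when the gap to the last entry is too large.
        if current and entry[0] - current[-1][0] > window:
            clusters.append(current)
            current = []
        current.append(entry)
    if current:
        clusters.append(current)
    return clusters

clusters = correlate(LOGS)
print(len(clusters))                 # number of temporal clusters
print([e[1] for e in clusters[0]])   # nodes involved in the first burst
```

Here the three early-morning entries form one cluster, flagging node017-node019 as a correlated failure burst, while the isolated disk timeout stands alone; a production analysis would additionally correlate by node locality and message content.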