CISE Faculty Seminar: Dr. Guanpeng Li

Date: February 27, 2025
Time: 12:00 PM - 1:00 PM
Location: 1889 Museum Road, Gainesville, Florida, 32611
Host: Department of CISE; Faculty Host: Dr. Mohammad Al-Saad
Admission: Free

Zoom Link: https://ufl.zoom.us/j/91351934359

Biography: Dr. Guanpeng Li is an Assistant Professor in the Department of Computer Science at the University of Iowa, where he has been a faculty member since 2020. He received his B.Sc. and Ph.D. from the University of British Columbia, Canada, in 2014 and 2019, respectively. Before joining Iowa, he was a postdoctoral scholar at the University of Illinois Urbana-Champaign in 2020. Dr. Li’s research focuses on fault tolerance in High-Performance Computing (HPC), data reduction for scientific applications, machine learning dependability, and the safety of autonomous driving systems. His work has been recognized with multiple Best Paper Awards and finalist recognitions at premier venues, including SC, DSN, and ISSRE, spanning 2024, 2022, 2021, 2020, and 2018. He is a recipient of the NSF CAREER Award and the IEEE Computer Society TCHPC Early Career Researchers Award for Excellence in High-Performance Computing (2024). Additionally, his research was selected for IEEE Top Picks in Test and Reliability in 2023 and 2024.

Title of the Talk: Towards Fault-Tolerant HPC for AI and Scientific Computation

Abstract: As High-Performance Computing (HPC) systems continue to scale, hardware errors are becoming an increasingly pressing challenge due to shrinking transistor sizes, growing manufacturing variability, and the complexity of heterogeneous architectures. These faults can significantly impact computational correctness and efficiency, leading to substantial financial and societal consequences. Ensuring the reliability of HPC workloads is particularly critical in the era of AI-driven scientific discovery and large-scale computation. In this talk, I will present innovative techniques to enhance fault tolerance in HPC, making systems more resilient to unreliable hardware. I will focus on two key areas: (1) compiler-based techniques that automatically enhance the robustness of scientific applications, and (2) symptom-based error detection methods tailored for AI computation infrastructure. These approaches enable modern HPC systems to maintain correctness and efficiency despite the challenges of hardware unreliability.