Error Resilient Machine Learning Systems

Machine learning applications play a key role in many domains, including autonomous robots. As a society, we have come to rely increasingly on these applications, and this dependence is projected to grow manyfold as we move towards ever greater degrees of automation. It is therefore vital that these applications be reliable and safe, as a fault can have catastrophic consequences (e.g., loss of life). Considerable attention has been paid to software bugs in machine learning algorithms, and a variety of techniques have been developed to address them. In addition to software bugs, errors in computer chips are an important source of faults. These errors, known as soft errors, occur when cosmic rays and other particles strike the electronic devices on a chip, causing them to malfunction. Soft errors are increasing in frequency as chip manufacturers move towards smaller devices so that they can pack more of them onto a chip, thereby making chips faster. Soft errors can compromise the safety of machine learning applications and need to be mitigated. Traditional techniques for mitigating soft errors, such as running two or more copies of the program, incur very high overheads in performance and energy consumption, making them impractical to deploy in commodity systems.

In this project, which is funded by a multi-year NSERC strategic project grant awarded in late 2017, we investigate joint hardware-software-algorithmic techniques for designing error-resilient machine learning algorithms in the context of safety-critical applications, namely robots. In particular, we will develop fundamental innovations to systematically reason about redundancy and introduce it into machine learning applications in a controlled manner, so as to mitigate only the most consequential soft errors while keeping overheads and costs low. Through this research, we will be able to develop chips that are both efficient and reliable, and that execute machine learning applications safely. This will ensure that our society continues to enjoy the fruits of the rapid advances in machine learning for many more years, and that Canada continues to be a leader in this area.
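
To illustrate why it can pay to target only the most consequential soft errors, the sketch below flips a single bit in a 32-bit floating-point value and prints the result. It is not taken from the project's own tools: the helper name flip_bit and the example value 0.5 (standing in for a hypothetical model parameter) are assumptions made purely for illustration.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value`, stored as an IEEE 754 float32, with one bit flipped.

    Bit 0 is the least significant mantissa bit; bit 31 is the sign bit.
    This is a hypothetical helper used only to emulate a soft error.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bit pattern as an unsigned 32-bit integer.
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    # Flip the chosen bit and reinterpret the result as a float32 again.
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


weight = 0.5  # hypothetical model parameter, exactly representable in float32

# Compare a low mantissa bit, the top mantissa bit, and the top exponent bit.
for bit in (0, 22, 30):
    corrupted = flip_bit(weight, bit)
    print(f"bit {bit:2d}: {weight} -> {corrupted!r}")
```

A flip in a low mantissa bit barely changes the value, while a flip in the most significant exponent bit changes it by many orders of magnitude. Under this simplified view, redundancy that protects only the high-impact bits would cover the errors most likely to matter at a fraction of the cost of duplicating everything.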

Publications

Deval Shah, Zi Yu Xue, Karthik Pattabiraman, Tor Aamodt, Characterizing and Improving Resilience of Accelerators to Memory Errors in Autonomous Robots, To appear in ACM Transactions on Cyber-Physical Systems.
Deval Shah, Ningfeng Yang, Tor M. Aamodt, Energy-Efficient Realtime Motion Planning, In proceedings of the IEEE/ACM International Symposium on Computer Architecture (ISCA 2023), Orlando, FL, USA, June 17-21, 2023. (acceptance rate: 79/372 ≈ 21.2%)
Deval Shah, Tor M. Aamodt, Learning Label Encodings for Deep Regression, In proceedings of the 11th International Conference on Learning Representations (ICLR 2023), Kigali, Rwanda, May 1-5, 2023. (Spotlight presentation)
Deval Shah, Zi Yu Xue, Tor Aamodt, Label Encoding for Regression Networks, To appear in proceedings of the 10th International Conference on Learning Representations (ICLR 2022), Virtual Conference, Apr 25-29, 2022. (Spotlight presentation)