Dave’s paper, AC-GC: Lossy Activation Compression with Guaranteed Convergence, was accepted to appear at the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021)!

Abstract: Training Deep Neural Networks (DNNs) is memory-intensive, which can lead to poor performance on parallel devices (e.g., GPUs) or decreased accuracy when the batch size must be restricted. Previous attempts to address this issue, using lossy compression and offloading of activations, rely on expensive hyperparameter search to achieve a suitable trade-off between convergence and compression, and there is no guarantee that compression rates tuned in this way will transfer well to other models. Our key insight is to use recent results on Stochastic Gradient Descent (SGD) convergence to prove an upper bound on the expected loss increase when training with compressed activation storage. We then express the activation compression error in terms of this bound, allowing the compression rate to adapt to training conditions automatically. When combined with error-bounded methods, our approach achieves 15.5× compression on CIFAR10/ResNet50 versus 11.0× for the best manually tuned approach. Our method applies to any model, layer, and dataset, with an average accuracy change of 0.10%.
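
To make the idea concrete, the sketch below is a toy illustration (not the authors' implementation) of error-bounded lossy activation storage in PyTorch: the activation saved for backpropagation is uniformly quantized so that each element's error stays within a bound `eps`, and the backward pass works from the lossy reconstruction. The fixed `eps` here is a placeholder assumption; in AC-GC the error bound is derived automatically from the SGD convergence analysis rather than set by hand.

```python
# Illustrative sketch of error-bounded activation compression during training.
# Not the AC-GC code: the quantization scheme and the fixed `eps` are
# placeholder assumptions to show the general mechanism.

import torch


class CompressedReLU(torch.autograd.Function):
    """ReLU whose saved activation is stored in quantized (lossy) form."""

    @staticmethod
    def forward(ctx, x, eps=1e-2):
        y = torch.relu(x)
        # Uniform quantization with step 2*eps guarantees
        # |y - dequantized(y)| <= eps element-wise.
        step = 2.0 * eps
        q = torch.round(y / step).to(torch.int16)   # compressed storage
        ctx.save_for_backward(q)
        ctx.step = step
        return y

    @staticmethod
    def backward(ctx, grad_out):
        (q,) = ctx.saved_tensors
        y_approx = q.to(grad_out.dtype) * ctx.step  # decompress (lossy)
        # ReLU gradient mask computed from the lossy reconstruction.
        mask = (y_approx > 0).to(grad_out.dtype)
        return grad_out * mask, None                # no gradient for eps


# Minimal usage: gradients flow through the compressed activation.
x = torch.randn(4, 8, requires_grad=True)
out = CompressedReLU.apply(x, 1e-2)
out.sum().backward()
print(x.grad.shape)  # torch.Size([4, 8])
```

The point of the paper is choosing the error bound: instead of hand-tuning a value like `eps` per model, AC-GC sets it from an upper bound on the expected loss increase, so the compression rate adapts to training conditions automatically.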