Muhammad Adnan

4025 Fred Kaiser, UBC

I am a Ph.D. student in the Electrical and Computer Engineering department at The University of British Columbia, advised by Prof. Prashant Nair. My research lies at the intersection of architecture and systems, with a particular focus on addressing the challenges posed by large machine learning models, including recommendation, large language, and multimodal models.

I received my Master of Applied Science (M.A.Sc.) from ECE, UBC, advised by Prof. Prashant Nair.

news

Nov 19, 2024	Our paper titled Slipstream: Semantic-Based Training Acceleration for Recommendation Models has been accepted at DATE 2025.
Sep 11, 2024	We will be giving a tutorial on Training Big Sparse Recommendation Models on Commodity Servers at IISWC 2024.
Apr 30, 2024	I received the prestigious NSERC Canada Graduate Scholarship - Doctoral (CGS D) award.
Mar 20, 2024	Our paper titled Heterogeneous Acceleration Pipeline for Recommendation System Training has been accepted at ISCA 2024.
Mar 18, 2024	Our paper titled Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference has been accepted at MLSys 2024.
May 18, 2023	I was selected as a Machine Learning and Systems Rising Star in the 2023 cohort by MLCommons.
Jan 26, 2023	We will be giving a tutorial on Training Big Sparse Recommendation Models on Commodity Servers at HPCA 2023.

selected publications

VLDB
Accelerating Recommendation System Training by Leveraging Popular Choices

Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, and Prashant J. Nair

In Proceedings of the 48th International Conference on Very Large Data Bases (VLDB) 2021


Recommender models are commonly used to suggest relevant items to a user for e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representations of items’ and users’ categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models is evolving to require increasing data and compute resources. The highly parallel neural network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper deep dives into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000X more often. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the executions of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3X and 1.52X in comparison to XDL CPU-only and XDL CPU-GPU execution while maintaining baseline accuracy.
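The skewed-access observation in the abstract above can be illustrated with a small sketch. This is not the FAE implementation; it only assumes that embedding lookups are available as a trace of IDs and that GPU memory can hold a fixed number of embedding rows. The names `select_hot_embeddings`, `hot_coverage`, and `gpu_budget_entries` are hypothetical.

```python
from collections import Counter


def select_hot_embeddings(access_trace, gpu_budget_entries):
    """Pick the most frequently accessed embedding IDs that fit a GPU budget.

    access_trace: iterable of embedding IDs in lookup order (hypothetical input).
    gpu_budget_entries: assumed number of embedding rows GPU memory can hold.
    """
    counts = Counter(access_trace)
    # most_common returns (id, count) pairs sorted by descending count.
    return {emb_id for emb_id, _ in counts.most_common(gpu_budget_entries)}


def hot_coverage(access_trace, hot_ids):
    """Fraction of all lookups that hit the hot set."""
    trace = list(access_trace)
    if not trace:
        return 0.0
    return sum(1 for emb_id in trace if emb_id in hot_ids) / len(trace)


# Toy skewed trace: ID 7 dominates, while several IDs are touched only once.
trace = [7] * 90 + [3] * 6 + [1, 2, 4, 5]
hot = select_hot_embeddings(trace, gpu_budget_entries=2)
coverage = hot_coverage(trace, hot)
print(sorted(hot), coverage)
```

In this toy trace, two of the six distinct IDs account for 96 of the 100 lookups, which is the kind of skew a hot-embedding-aware layout tries to exploit.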
@inproceedings{adnan2022hotembeddings,
  title = {Accelerating Recommendation System Training by Leveraging Popular Choices},
  author = {Adnan, Muhammad and Ebrahimzadeh Maboud, Yassaman and Mahajan, Divya and Nair, Prashant J.},
  booktitle = {Proceedings of the 48th International Conference on Very Large Data Bases (VLDB)},
  location = {Sydney, Australia},
  series = {VLDB 2022},
  doi = {10.14778/3485450.3485462},
  issn = {2150-8097},
  year = {2021},
  issue_date = {September 2021},
  publisher = {VLDB Endowment},
  journal = {Proc. VLDB Endow.},
  volume = {15},
  number = {1},
}
ISCA
Heterogeneous Acceleration Pipeline for Recommendation System Training

Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, and Prashant J. Nair

In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) 2024


Recommendation models rely on deep learning networks and large embedding tables, resulting in computationally and memory-intensive processes. These models are typically trained using hybrid CPU-GPU or GPU-only configurations. The hybrid mode combines the GPU’s neural network acceleration with the CPUs’ memory storage and supply for embedding tables but may incur significant CPU-to-GPU transfer time. In contrast, the GPU-only mode utilizes High Bandwidth Memory (HBM) across multiple GPUs for storing embedding tables. However, this approach is expensive and presents scaling concerns. This paper introduces Hotline, a heterogeneous acceleration pipeline that addresses these concerns. Hotline develops a data-aware and model-aware scheduling pipeline by leveraging the insight that only a few embedding entries are frequently accessed (popular). This approach utilizes CPU main memory for non-popular embeddings and GPUs’ HBM for popular embeddings. To achieve this, the Hotline accelerator fragments a mini-batch into popular and non-popular micro-batches (μ-batches). It gathers the necessary working parameters for non-popular μ-batches from the CPU, while GPUs execute popular μ-batches. The hardware accelerator dynamically coordinates the execution of popular embeddings on GPUs and non-popular embeddings from the CPU’s main memory. Real-world datasets and models confirm Hotline’s effectiveness, reducing average end-to-end training time by 2.2× compared to the Intel-optimized CPU-GPU DLRM baseline.
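The mini-batch fragmentation step mentioned in the abstract above can be sketched in software as follows. This is only an illustration, not the Hotline accelerator's logic: it assumes each sample is a list of embedding IDs and that a sample counts as popular only when every ID it touches is in the popular set. The function name `fragment_minibatch` and this classification rule are assumptions.

```python
def fragment_minibatch(minibatch, popular_ids):
    """Split a mini-batch into popular and non-popular micro-batches.

    minibatch: list of samples, each a list of embedding IDs (hypothetical format).
    popular_ids: set of embedding IDs assumed to reside in GPU memory.
    """
    popular, non_popular = [], []
    for sample in minibatch:
        # Assumed rule: a sample is popular only if all of its lookups are popular.
        if all(emb_id in popular_ids for emb_id in sample):
            popular.append(sample)
        else:
            non_popular.append(sample)
    return popular, non_popular


popular_ids = {0, 1, 2}
minibatch = [[0, 1], [2, 9], [1, 2], [8, 7]]
popular_mb, non_popular_mb = fragment_minibatch(minibatch, popular_ids)
print(popular_mb)
print(non_popular_mb)
```

Under this assumption, the popular micro-batch needs no parameters from CPU main memory, while the non-popular one requires gathering at least one embedding from the CPU side.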
@inproceedings{adnan2024heterogeneous,
  title = {Heterogeneous Acceleration Pipeline for Recommendation System Training},
  author = {Adnan, Muhammad and Ebrahimzadeh Maboud, Yassaman and Mahajan, Divya and Nair, Prashant J.},
  booktitle = {2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)},
  location = {Buenos Aires, Argentina},
  series = {ISCA 2024},
  year = {2024},
  doi = {10.1109/ISCA59077.2024.00081},
}
MLSys
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference

Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, and Purushotham Kamath

In Proceedings of Machine Learning and Systems 2024


Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces “Keyformer”, an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as “key” tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer’s performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer’s reduction of KV cache reduces inference latency by 2.1× and improves token generation throughput by 2.4×, while preserving the model’s accuracy.
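The idea of keeping only high-attention tokens in the KV cache can be illustrated with a simplified sketch. It does not reproduce Keyformer's score function, which the abstract does not spell out; as a stand-in it scores each token by its accumulated attention weight over a few decoding steps and keeps the top entries. The fixed context length, the function names, and the `budget` parameter are all assumptions.

```python
def select_key_tokens(attn_rows, budget):
    """Return indices of the `budget` tokens with the highest accumulated attention.

    attn_rows: list of attention distributions, one per decoding step, all over
    the same context tokens (a simplifying assumption).
    """
    num_tokens = len(attn_rows[0])
    scores = [0.0] * num_tokens
    for row in attn_rows:
        for idx, weight in enumerate(row):
            scores[idx] += weight
    ranked = sorted(range(num_tokens), key=lambda i: scores[i], reverse=True)
    # Return kept positions in their original order.
    return sorted(ranked[:budget])


def retained_attention(attn_rows, kept):
    """Fraction of total attention mass that falls on the kept tokens."""
    total = sum(sum(row) for row in attn_rows)
    kept_mass = sum(row[i] for row in attn_rows for i in kept)
    return kept_mass / total


attn_rows = [
    [0.60, 0.05, 0.30, 0.05],
    [0.50, 0.10, 0.35, 0.05],
]
kept = select_key_tokens(attn_rows, budget=2)
mass = retained_attention(attn_rows, kept)
print(kept, round(mass, 3))
```

Here half of the tokens are dropped while most of the attention mass is retained, which mirrors the observation that attention concentrates on a small subset of tokens.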
@inproceedings{adnan2024keyformer,
  title = {Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference},
  author = {Adnan, Muhammad and Arunkumar, Akhil and Jain, Gaurav and Nair, Prashant J. and Soloveychik, Ilya and Kamath, Purushotham},
  booktitle = {Proceedings of Machine Learning and Systems},
  location = {Santa Clara, CA, USA},
  series = {MLSys 2024},
  year = {2024},
}
DATE
Slipstream: Semantic-Based Training Acceleration for Recommendation Models

Yassaman Ebrahimzadeh Maboud, Muhammad Adnan, Divya Mahajan, and Prashant J. Nair

In 28th Design, Automation and Test in Europe Conference (DATE) 2025


Training recommendation models poses significant challenges regarding resource utilization and performance. Prior research has proposed an approach that categorizes embeddings into popular and non-popular classes to reduce the training time for recommendation models. We observe that, even among the popular embeddings, certain embeddings undergo rapid training and exhibit minimal subsequent variation, resulting in saturation. Consequently, updates to these embeddings do not contribute to model quality. This paper presents Slipstream, a software framework that identifies stale embeddings on the fly and skips their updates to enhance performance. This capability enables Slipstream to achieve substantial speedup, optimize CPU-GPU bandwidth usage, and eliminate unnecessary memory access. Slipstream showcases training time reductions of 2x, 2.4x, 1.2x, and 1.175x across real-world datasets and configurations, compared to Baseline XDL, Intel-optimized DLRM, FAE, and Hotline, respectively.
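Skipping updates for stale embeddings, as described in the abstract above, can be sketched as follows. This is not Slipstream's detection mechanism; it assumes a plain SGD update and marks an embedding as stale once its largest per-element update falls below a fixed threshold. The function name, the `lr` and `threshold` values, and the staleness rule are assumptions.

```python
def sgd_step_skipping_stale(embeddings, grads, stale, lr=0.1, threshold=1e-3):
    """Apply one SGD step, skipping embeddings already marked stale.

    embeddings: dict mapping embedding ID to a list of floats (updated in place).
    grads: dict mapping embedding ID to a gradient list of the same length.
    stale: set of embedding IDs whose updates are skipped (updated in place).
    """
    for emb_id, grad in grads.items():
        if emb_id in stale:
            continue  # Skip the update and its memory traffic entirely.
        delta = [lr * g for g in grad]
        embeddings[emb_id] = [w - d for w, d in zip(embeddings[emb_id], delta)]
        # Assumed rule: a tiny update means the embedding has saturated.
        # The small update itself is still applied before marking it stale.
        if max(abs(d) for d in delta) < threshold:
            stale.add(emb_id)
    return stale


embeddings = {0: [1.0, 1.0], 1: [0.5, 0.5]}
stale = set()
# Step 1: embedding 1 receives a tiny update and is marked stale.
sgd_step_skipping_stale(embeddings, {0: [1.0, -1.0], 1: [0.001, 0.0]}, stale)
# Step 2: embedding 1 is skipped even though a gradient is supplied for it.
sgd_step_skipping_stale(embeddings, {0: [1.0, 1.0], 1: [1.0, 1.0]}, stale)
print(sorted(stale), embeddings)
```

A real system would likely need a way to revisit stale embeddings if the data distribution shifts; that is left out of this sketch.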
@inproceedings{maboud2024slipstream,
  title = {Slipstream: Semantic-Based Training Acceleration for Recommendation Models},
  author = {Ebrahimzadeh Maboud, Yassaman and Adnan, Muhammad and Mahajan, Divya and Nair, Prashant J.},
  booktitle = {28th Design, Automation and Test in Europe Conference (DATE)},
  location = {Lyon, France},
  series = {DATE 2025},
  year = {2025},
  eprint = {2404.04270},
  archiveprefix = {arXiv},
  primaryclass = {cs.IR},
}