Publications | Muhammad Adnan

2025

DATE
Slipstream: Semantic-Based Training Acceleration for Recommendation Models

Yassaman Ebrahimzadeh Maboud, Muhammad Adnan, Divya Mahajan, and Prashant J. Nair

In 28th Design, Automation and Test in Europe Conference (DATE) 2025

Training recommendation models poses significant challenges regarding resource utilization and performance. Prior research has proposed an approach that categorizes embeddings into popular and non-popular classes to reduce the training time for recommendation models. We observe that, even among the popular embeddings, certain embeddings undergo rapid training and exhibit minimal subsequent variation, resulting in saturation. Consequently, updates to these embeddings do not contribute to model quality. This paper presents Slipstream, a software framework that identifies stale embeddings on the fly and skips their updates to enhance performance. This capability enables Slipstream to achieve substantial speedup, optimize CPU-GPU bandwidth usage, and eliminate unnecessary memory access. Slipstream showcases training time reductions of 2x, 2.4x, 1.2x, and 1.175x across real-world datasets and configurations, compared to Baseline XDL, Intel-optimized DLRM, FAE, and Hotline, respectively.
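The idea of skipping updates to saturated embeddings can be illustrated with a minimal sketch. It is not Slipstream's actual detection mechanism: the `StaleTracker` class, the `threshold` and `patience` parameters, and the rule "a row is stale after several consecutive tiny updates" are all assumptions made for illustration.

```python
from typing import Dict, List


class StaleTracker:
    """Hypothetical helper: flags embedding rows whose updates stay tiny."""

    def __init__(self, threshold: float, patience: int) -> None:
        self.threshold = threshold  # assumed cutoff on update magnitude
        self.patience = patience    # assumed count of consecutive tiny updates
        self.small_streak: Dict[int, int] = {}

    def is_stale(self, row: int) -> bool:
        return self.small_streak.get(row, 0) >= self.patience

    def observe(self, row: int, update: List[float]) -> None:
        # Use the largest absolute component as the update magnitude.
        magnitude = max(abs(u) for u in update)
        if magnitude < self.threshold:
            self.small_streak[row] = self.small_streak.get(row, 0) + 1
        else:
            self.small_streak[row] = 0


def apply_updates(
    table: Dict[int, List[float]],
    grads: Dict[int, List[float]],
    lr: float,
    tracker: StaleTracker,
) -> List[int]:
    """Apply plain SGD updates, skipping rows flagged stale.

    Returns the list of rows whose update was skipped.
    """
    skipped = []
    for row, grad in grads.items():
        if tracker.is_stale(row):
            skipped.append(row)  # no memory access or transfer for this row
            continue
        update = [lr * g for g in grad]
        table[row] = [w - u for w, u in zip(table[row], update)]
        tracker.observe(row, update)
    return skipped
```

In this sketch a skipped row stays skipped; a practical scheme would also need a way to revisit rows that were flagged too early.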
@inproceedings{maboud2024slipstream, title = {Slipstream: Semantic-Based Training Acceleration for Recommendation Models}, author = {Ebrahimzadeh Maboud, Yassaman and Adnan, Muhammad and Mahajan, Divya and Nair, Prashant J.}, booktitle = {28th Design, Automation and Test in Europe Conference (DATE)}, location = {Lyon, France}, series = {DATE 2025}, year = {2025}, eprint = {2404.04270}, archiveprefix = {arXiv}, primaryclass = {cs.IR}, }
NeurIPS
Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation

Muhammad Adnan, Nithesh Kurella, Akhil Arunkumar, and Prashant J. Nair

In Proceedings of Neural Information Processing Systems 2025

Diffusion Transformers (DiTs) achieve state-of-the-art results in text-to-image, text-to-video generation, and editing. However, their large model size and the quadratic cost of spatial-temporal attention over multiple denoising steps make video generation computationally expensive. Static caching mitigates this by reusing features across fixed steps but fails to adapt to generation dynamics, leading to suboptimal trade-offs between speed and quality. We propose Foresight, an adaptive layer-reuse technique that reduces computational redundancy across denoising steps while preserving baseline performance. Foresight dynamically identifies and reuses DiT block outputs for all layers across steps, adapting to generation parameters such as resolution and denoising schedules to optimize efficiency. Applied to OpenSora, Latte, and CogVideoX, Foresight achieves up to 1.6x end-to-end speedup, while maintaining video quality.
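A minimal sketch of threshold-based layer reuse across denoising steps follows. It is an illustration under stated assumptions, not Foresight's published policy: the `ReuseCache` class, the mean-squared-error criterion, the `threshold` value, and the `force_compute` refresh flag are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


class ReuseCache:
    """Hypothetical cache: reuse a layer's output while it changes slowly."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold           # assumed reuse cutoff
        self.cached: Dict[int, Vector] = {}  # layer -> last computed output
        self.delta: Dict[int, float] = {}    # layer -> change between last two computed outputs

    def run_layer(
        self,
        layer: int,
        compute: Callable[[Vector], Vector],
        x: Vector,
        force_compute: bool = False,
    ) -> Tuple[Vector, bool]:
        """Return (output, reused). Reuse only if the last measured change was small."""
        can_reuse = layer in self.delta and self.delta[layer] < self.threshold
        if can_reuse and not force_compute:
            return self.cached[layer], True
        out = compute(x)
        if layer in self.cached:
            self.delta[layer] = mse(out, self.cached[layer])
        self.cached[layer] = out
        return out, False
```

The caller is expected to pass `force_compute=True` at periodic refresh steps so the measured change is updated; without that, a layer that starts being reused would never be recomputed in this sketch.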
@inproceedings{adnan2025foresight, title = {Foresight: Adaptive Layer Reuse for Accelerated and High-Quality Text-to-Video Generation}, author = {Adnan, Muhammad and Kurella, Nithesh and Arunkumar, Akhil and Nair, Prashant J.}, booktitle = {Proceedings of Neural Information Processing Systems}, location = {San Diego, CA, USA}, series = {NeurIPS 2025}, year = {2025}, }

2024

ISCA
Heterogeneous Acceleration Pipeline for Recommendation System Training

Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, and Prashant J. Nair

In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA) 2024

Recommendation models rely on deep learning networks and large embedding tables, resulting in computationally and memory-intensive processes. These models are typically trained using hybrid CPU-GPU or GPU-only configurations. The hybrid mode combines the GPU’s neural network acceleration with the CPUs’ memory storage and supply for embedding tables but may incur significant CPU-to-GPU transfer time. In contrast, the GPU-only mode utilizes High Bandwidth Memory (HBM) across multiple GPUs for storing embedding tables. However, this approach is expensive and presents scaling concerns. This paper introduces Hotline, a heterogeneous acceleration pipeline that addresses these concerns. Hotline develops a data-aware and model-aware scheduling pipeline by leveraging the insight that only a few embedding entries are frequently accessed (popular). This approach utilizes CPU main memory for non-popular embeddings and GPUs’ HBM for popular embeddings. To achieve this, the Hotline accelerator fragments a mini-batch into popular and non-popular micro-batches (μ-batches). It gathers the necessary working parameters for non-popular μ-batches from the CPU, while GPUs execute popular μ-batches. The hardware accelerator dynamically coordinates the execution of popular embeddings on GPUs and non-popular embeddings from the CPU’s main memory. Real-world datasets and models confirm Hotline’s effectiveness, reducing average end-to-end training time by 2.2× compared to the Intel-optimized CPU-GPU DLRM baseline.
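The fragmentation of a mini-batch into popular and non-popular μ-batches can be sketched in software as below. Hotline performs this step in a hardware accelerator; the function name `split_mini_batch`, the representation of a sample as a list of embedding indices, and the rule that a sample counts as popular only when every index it touches is popular are assumptions made for illustration.

```python
from typing import List, Set, Tuple

Sample = List[int]  # embedding indices touched by one training sample


def split_mini_batch(
    batch: List[Sample], popular_ids: Set[int]
) -> Tuple[List[Sample], List[Sample]]:
    """Split a mini-batch into (popular, non_popular) micro-batches.

    Assumed rule: a sample is popular only if all of its embedding
    indices are in popular_ids, so it can run entirely from GPU memory.
    """
    popular: List[Sample] = []
    non_popular: List[Sample] = []
    for sample in batch:
        if all(idx in popular_ids for idx in sample):
            popular.append(sample)
        else:
            non_popular.append(sample)
    return popular, non_popular


if __name__ == "__main__":
    hot = {1, 2, 3}
    batch = [[1, 2], [1, 9], [2, 3]]
    popular, non_popular = split_mini_batch(batch, hot)
    print(popular)      # samples that need only GPU-resident embeddings
    print(non_popular)  # samples that need at least one CPU-resident embedding
```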
@inproceedings{adnan2024heterogeneous, title = {Heterogeneous Acceleration Pipeline for Recommendation System Training}, author = {Adnan, Muhammad and Ebrahimzadeh Maboud, Yassaman and Mahajan, Divya and Nair, Prashant J.}, booktitle = {2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA)}, location = {Buenos Aires, Argentina}, series = {ISCA 2024}, year = {2024}, doi = {10.1109/ISCA59077.2024.00081}, }
MLSys
Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference

Muhammad Adnan, Akhil Arunkumar, Gaurav Jain, Prashant J. Nair, Ilya Soloveychik, and Purushotham Kamath

In Proceedings of Machine Learning and Systems 2024

Transformers have emerged as the underpinning architecture for Large Language Models (LLMs). In generative language models, the inference process involves two primary phases: prompt processing and token generation. Token generation, which constitutes the majority of the computational workload, primarily entails vector-matrix multiplications and interactions with the Key-Value (KV) Cache. This phase is constrained by memory bandwidth due to the overhead of transferring weights and KV cache values from the memory system to the computing units. This memory bottleneck becomes particularly pronounced in applications that require long-context and extensive text generation, both of which are increasingly crucial for LLMs. This paper introduces “Keyformer”, an innovative inference-time approach, to mitigate the challenges associated with KV cache size and memory bandwidth utilization. Keyformer leverages the observation that approximately 90% of the attention weight in generative inference focuses on a specific subset of tokens, referred to as “key” tokens. Keyformer retains only the key tokens in the KV cache by identifying these crucial tokens using a novel score function. This approach effectively reduces both the KV cache size and memory bandwidth usage without compromising model accuracy. We evaluate Keyformer’s performance across three foundational models: GPT-J, Cerebras-GPT, and MPT, which employ various positional embedding algorithms. Our assessment encompasses a variety of tasks, with a particular emphasis on summarization and conversation tasks involving extended contexts. Keyformer’s reduction of KV cache reduces inference latency by 2.1× and improves token generation throughput by 2.4×, while preserving the model’s accuracy.
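Retaining only a subset of tokens in the KV cache can be sketched as follows. This uses accumulated attention weight as a stand-in score and does not reproduce Keyformer's score function; the function names, the `budget` and `recent` parameters, and the choice to always keep a window of the most recent tokens are assumptions made for illustration.

```python
from typing import List


def select_key_tokens(
    attn_history: List[List[float]], budget: int, recent: int
) -> List[int]:
    """Pick which cached token positions to keep.

    attn_history holds one row of attention weights per decoding step,
    each over the same n cached tokens (a simplification). The score is
    the accumulated attention weight, a stand-in for a real score function.
    """
    n = len(attn_history[0])
    scores = [sum(step[i] for step in attn_history) for i in range(n)]
    # Assumed policy: the most recent tokens are always kept.
    recent_ids = set(range(max(0, n - recent), n))
    candidates = [i for i in range(n) if i not in recent_ids]
    k = max(0, budget - len(recent_ids))
    top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:k]
    return sorted(set(top) | recent_ids)


def prune_cache(cache: List[str], keep: List[int]) -> List[str]:
    """Drop every cached entry whose position is not in keep."""
    return [cache[i] for i in keep]


if __name__ == "__main__":
    history = [
        [0.5, 0.1, 0.1, 0.2, 0.1],
        [0.6, 0.05, 0.05, 0.2, 0.1],
    ]
    keep = select_key_tokens(history, budget=3, recent=1)
    print(keep)
    print(prune_cache(["k0", "k1", "k2", "k3", "k4"], keep))
```

With a budget of 3 out of 5 tokens, the cache in this toy example shrinks by 40%, and the same index list would be applied to both keys and values.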
@inproceedings{adnan2024keyformer, title = {Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference}, author = {Adnan, Muhammad and Arunkumar, Akhil and Jain, Gaurav and Nair, Prashant J. and Soloveychik, Ilya and Kamath, Purushotham}, booktitle = {Proceedings of Machine Learning and Systems}, location = {Santa Clara, CA, USA}, series = {MLSys 2024}, year = {2024}, }

2023

arXiv
Ad-Rec: Advanced Feature Interactions to Address Covariate-Shifts in Recommendation Networks

Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, and Prashant J. Nair

2023

Recommendation models are vital in delivering personalized user experiences by leveraging the correlation between multiple input features. However, deep learning-based recommendation models often face challenges due to evolving user behaviour and item features, leading to covariate shifts. Effective cross-feature learning is crucial to handle data distribution drift and adapt to changing user behaviour. Traditional feature interaction techniques have limitations in achieving optimal performance in this context. This work introduces Ad-Rec, an advanced network that leverages feature interaction techniques to address covariate shifts. This helps eliminate irrelevant interactions in recommendation tasks. Ad-Rec leverages masked transformers to enable the learning of higher-order cross-features while mitigating the impact of data distribution drift. Our approach improves model quality, accelerates convergence, and reduces training time, as measured by the Area Under Curve (AUC) metric. We demonstrate the scalability of Ad-Rec and its ability to achieve superior model quality through comprehensive ablation studies.
@misc{adnan2023adrec, title = {Ad-Rec: Advanced Feature Interactions to Address Covariate-Shifts in Recommendation Networks}, author = {Adnan, Muhammad and Ebrahimzadeh Maboud, Yassaman and Mahajan, Divya and Nair, Prashant J.}, year = {2023}, eprint = {2308.14902}, archiveprefix = {arXiv}, primaryclass = {cs.IR}, }
arXiv
Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training

Muhammad Adnan, Amar Phanishayee, Janardhan Kulkarni, Prashant J. Nair, and Divya Mahajan

2023

In this paper, we present a novel technique to search for hardware architectures of accelerators optimized for end-to-end training of deep neural networks (DNNs). Our approach addresses both single-device and distributed pipeline and tensor model parallel scenarios, the latter being addressed for the first time. The search optimizes accelerators for training-relevant metrics such as throughput/TDP under fixed area and power constraints. However, with the proliferation of specialized architectures and complex distributed training mechanisms, the design space of hardware accelerators is very large. Prior work in this space has tried to tackle this by reducing the search space to either single-accelerator execution, and then only for inference, or tuning the architecture for specific layers (e.g., convolution). Instead, we take a unique heuristic-based, critical-path-based approach to determine the best use of available resources (power and area) either for a set of DNN workloads or each workload individually. First, we perform local search to determine the architecture for each pipeline and tensor model stage. Specifically, the system iteratively generates architectural configurations and tunes the design using a novel heuristic-based approach that prioritizes accelerator resources and scheduling to critical operators in a machine learning workload. Second, to address the complexities of distributed training, the local search selects multiple (k) designs per stage. A global search then identifies an accelerator from the top-k sets to optimize training throughput across the stages. We evaluate this work on 11 different DNN models. Compared to a recent inference-only work, Spotlight, our method converges to a design in, on average, 31x less time and offers 12x higher throughput. Moreover, designs generated using our method achieve a 12% throughput improvement over the TPU architecture.
@misc{adnan2023wham, title = {Workload-Aware Hardware Accelerator Mining for Distributed Deep Learning Training}, author = {Adnan, Muhammad and Phanishayee, Amar and Kulkarni, Janardhan and Nair, Prashant J. and Mahajan, Divya}, year = {2023}, eprint = {2308.14902}, archiveprefix = {arXiv}, primaryclass = {cs.AR}, }

2021

VLDB
Accelerating Recommendation System Training by Leveraging Popular Choices

Muhammad Adnan, Yassaman Ebrahimzadeh Maboud, Divya Mahajan, and Prashant J. Nair

In Proceedings of the 48th International Conference on Very Large Data Bases (VLDB) 2021

Recommender models are commonly used to suggest relevant items to a user for e-commerce and online advertisement-based applications. These models use massive embedding tables to store numerical representations of items’ and users’ categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models is evolving to require increasing data and compute resources. The highly parallel neural network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper deep dives into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10000X more. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the executions of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3X and 1.52X in comparison to XDL CPU-only and XDL CPU-GPU execution while maintaining baseline accuracy.
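The effect of a hot-embedding aware layout under skewed accesses can be illustrated with a small sketch. It is not FAE's profiling or placement procedure: the function names, the `capacity` parameter (a stand-in for the GPU memory budget, counted in embedding rows), and the "most frequently accessed rows are hot" rule are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set


def pick_hot_embeddings(accesses: Iterable[int], capacity: int) -> Set[int]:
    """Return the `capacity` most frequently accessed embedding indices."""
    counts = Counter(accesses)
    return {idx for idx, _ in counts.most_common(capacity)}


def cpu_to_gpu_fetches(accesses: List[int], hot: Set[int]) -> int:
    """Count accesses that miss the GPU-resident hot set."""
    return sum(1 for idx in accesses if idx not in hot)


if __name__ == "__main__":
    # Toy access trace with a skewed distribution: index 7 dominates.
    trace = [7, 7, 7, 7, 3, 3, 9, 7, 3, 1]
    hot = pick_hot_embeddings(trace, capacity=2)
    print(sorted(hot))
    print(cpu_to_gpu_fetches(trace, hot), "of", len(trace), "accesses leave the GPU")
```

The more skewed the trace, the fewer rows need to be GPU-resident to absorb most accesses, which is the property the layout relies on.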
@inproceedings{adnan2022hotembeddings, title = {Accelerating Recommendation System Training by Leveraging Popular Choices}, author = {Adnan, Muhammad and Ebrahimzadeh Maboud, Yassaman and Mahajan, Divya and Nair, Prashant J.}, booktitle = {Proceedings of the 48th International Conference on Very Large Data Bases (VLDB)}, location = {Sydney, Australia}, series = {VLDB 2022}, doi = {10.14778/3485450.3485462}, issn = {2150-8097}, year = {2021}, issue_date = {September 2021}, publisher = {VLDB Endowment}, journal = {Proc. VLDB Endow.}, volume = {15}, number = {1}, }