
Performance modeling and analysis of large language models on distributed systems

Master projects/internships - Leuven

Aligning future system design with the ever-increasing complexity and scale of large language models 

Motivation

The rapid growth of AI models, particularly large language models (LLMs), has dramatically increased computational demands, requiring compute systems to scale in both performance and power efficiency. One promising solution is specialized system architecture or technology tailored to workload characteristics; the architecture of AI models is thus now driving the design of compute infrastructure. Given the vast scale of transformer-based LLMs and the substantial volume of training data, data centers with tens of thousands of GPUs are needed to train these models in a distributed fashion. Finding an optimal setup requires extensive preliminary exploration of network architecture, hyperparameters, and distributed parallelism strategies, making this process highly time- and energy-intensive. An analytical modeling framework, in contrast, enables quick analysis and evaluation of how workloads or algorithms interact with a computing system, creating opportunities for HW-SW co-design. Furthermore, a versatile analytical performance modeling framework can guide the development of next-generation systems and technologies tailored to LLMs.

Project description

In the computer system architecture (CSA) department, we have developed a performance modeling framework for distributed LLM training and inference, named Optimus [1]. This framework is designed for performance prediction and design exploration of LLMs across various compute platforms, such as GPUs and TPUs. Optimus supports cross-layer analysis, spanning from algorithms to hardware technology, and facilitates automated design space exploration and co-optimization across multiple layers of the computational stack. It emulates the task graph of LLM implementations and integrates key features, including various mapping and parallelism strategies (data, tensor, pipeline, sequence), activation recomputation techniques, collective communication patterns, and KV caching.
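
To make the flavor of such analytical modeling concrete, the sketch below estimates the forward-pass time of one transformer layer under tensor parallelism as a roofline-style compute term plus a ring all-reduce communication term. This is not the Optimus API: the function name, the simplified FLOP count, and the hardware numbers (peak throughput, link bandwidth) are all illustrative assumptions.

    # Minimal analytical sketch (assumptions throughout), not the Optimus API.
    def transformer_layer_time(
        batch: int, seq: int, d_model: int, tp_degree: int,
        peak_flops: float = 312e12,   # e.g. A100 BF16 peak, FLOP/s (assumption)
        link_bw: float = 300e9,       # NVLink-class bandwidth, B/s (assumption)
        bytes_per_elem: int = 2,      # BF16 activations
    ) -> float:
        """Rough forward time of one layer: compute + tensor-parallel all-reduces."""
        # Dense forward FLOPs of one layer (attention + MLP) are roughly
        # 24*B*S*d^2, ignoring the S^2 attention-score term; split over TP workers.
        flops = 24 * batch * seq * d_model**2 / tp_degree
        t_compute = flops / peak_flops
        # Megatron-style tensor parallelism performs 2 all-reduces per layer in
        # the forward pass; a ring all-reduce moves ~2*(p-1)/p of the message.
        msg = batch * seq * d_model * bytes_per_elem
        t_comm = 2 * (2 * (tp_degree - 1) / tp_degree) * msg / link_bw
        return t_compute + t_comm

    # Example: GPT-3-scale layer (d_model=12288), 2048 tokens, TP degree 8
    print(f"{transformer_layer_time(1, 2048, 12288, tp_degree=8) * 1e3:.2f} ms/layer")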
 
Built upon the Optimus framework, this internship project will focus on analytical performance modeling of one or more state-of-the-art optimization algorithms, including:

  • FlashAttention [2] to reduce high-bandwidth memory (HBM) accesses (an IO-cost sketch follows this list)
  • Mixture of experts (MoE) [3] for efficient model scaling
  • ZeRO [4] for memory optimization (a memory model sketch also follows)
  • Fully sharded data parallel (FSDP) [5] for distributing the model across data-parallel workers
  • MatMul-free LLMs [6]
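
As a taste of the algorithmic analysis involved, the sketch below reproduces the high-level IO-complexity comparison from the FlashAttention paper [2]: standard attention performs on the order of N*d + N^2 HBM accesses because it materializes the N x N score matrix, while the tiled FlashAttention kernel needs on the order of N^2 * d^2 / M, with M the on-chip SRAM size in elements. Constant factors are dropped and the SRAM figure is an assumption, so treat the output as a scaling trend only.

    # IO-complexity trend from the FlashAttention paper [2]; constants dropped.
    def hbm_accesses(seq_len: int, head_dim: int, sram_elems: int):
        standard = seq_len * head_dim + seq_len**2      # materializes N x N scores
        flash = seq_len**2 * head_dim**2 / sram_elems   # tiled; scores stay on chip
        return standard, flash

    # Example: 8k context, head_dim 128, ~50k elements of SRAM (assumption)
    std, fl = hbm_accesses(8192, 128, sram_elems=50_000)
    print(f"standard/flash HBM-access ratio: ~{std / fl:.1f}x")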

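The ZeRO optimizations [4] lend themselves to a similar back-of-envelope memory model: with mixed-precision Adam, each parameter costs roughly 2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of fp32 optimizer state (master weights, momentum, variance), and each ZeRO stage partitions one more of these across the N data-parallel workers. The sketch below (function name and sizes are illustrative) reproduces that arithmetic.

    # Per-GPU memory for model states under ZeRO stages [4] (bytes per parameter:
    # 2 fp16 weights + 2 fp16 gradients + 12 fp32 Adam state).
    def zero_mem_gb(params_billion: float, n_workers: int, stage: int) -> float:
        psi = params_billion * 1e9
        p, g, o = 2 * psi, 2 * psi, 12 * psi
        if stage >= 1: o /= n_workers   # ZeRO-1: shard optimizer states
        if stage >= 2: g /= n_workers   # ZeRO-2: also shard gradients
        if stage >= 3: p /= n_workers   # ZeRO-3: also shard parameters
        return (p + g + o) / 1e9        # decimal GB

    # Example: 7B-parameter model on 64 data-parallel GPUs
    for s in (0, 1, 2, 3):
        print(f"ZeRO stage {s}: {zero_mem_gb(7, 64, s):.1f} GB per GPU")
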
Research efforts may involve algorithmic analysis, study of the PyTorch implementations of the above features, and detailed profiling to enable performance prediction (see the profiling sketch below).
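
For the profiling part, a minimal starting point with PyTorch's built-in profiler might look as follows; the single encoder layer and tensor shapes are placeholders standing in for a real LLM block.

    import torch
    from torch.profiler import profile, record_function, ProfilerActivity

    # Placeholder workload: one encoder layer standing in for an LLM block.
    layer = torch.nn.TransformerEncoderLayer(d_model=512, nhead=8)
    layer.eval()
    x = torch.randn(128, 8, 512)  # (seq, batch, d_model); batch_first is False

    with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
        with record_function("llm_block_forward"):
            with torch.no_grad():
                layer(x)

    # Per-operator timings like these calibrate the compute terms of an
    # analytical model; add ProfilerActivity.CUDA when profiling on a GPU.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))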
 
Requirements for the ideal candidate:

  • Proficiency in Python
  • Experience with the PyTorch framework
  • Knowledge of hardware microarchitectures (e.g., GPUs)
  • Strong understanding of LLM architectures and implementations

References

[1] J. Kundu et al., Performance Modeling and Workload Analysis of Distributed Large Language Model Training and Inference, IISWC, 2024.
[2] T. Dao et al., FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, NeurIPS, 2022.
[3] D. Lepikhin et al., GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding, ICLR, 2021.
[4] S. Rajbhandari et al., ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Supercomputing, 2020.
[5] Y. Zhao et al., PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel, arXiv:2304.11277, 2023.
[6] R.-J. Zhu et al., Scalable MatMul-free Language Modeling, arXiv:2406.02528, 2024.
 

Type of Project: Internship; Thesis 

Master's degree: Master of Science; Master of Engineering Science 

Duration: 6 - 9 months 

For more information or application, please contact Wenzhe Guo (wenzhe.guo@imec.be) and Joyjit Kundu (joyjit.kundu@imec.be).

 

An imec allowance will be provided.

