RESEARCH STATEMENT
jerrygao.me

Large Language Models (LLMs) such as GPT-4, BERT, and LLaMA have become increasingly popular. They are built on the Transformer architecture, a highly computationally intensive machine learning framework that enables them to analyze and generate human-like text by capturing the context and relationships between words and phrases. To improve these models, ML researchers design more advanced architectures, curate higher-quality datasets, and, most importantly, scale up model size. These trends have made the hardware infrastructure for LLM training, fine-tuning, and inference critical. As LLMs grow in complexity and size, the demand for computational resources escalates rapidly, and general-purpose CPUs, while versatile, lack the specialized capabilities needed to deliver the parallelism and throughput that modern LLMs require. Domain-specific hardware accelerators, such as GPUs and TPUs, are engineered to fill this gap: they offer massive parallelism and are optimized for the matrix and vector operations that are fundamental to machine learning workloads. This not only dramatically reduces the time required for training and inference but also makes it feasible to process larger models and datasets than general-purpose processors could handle. The evolution of these accelerators is therefore pivotal to advancing AI, as it directly determines the feasibility and efficiency of training ever more sophisticated and capable models.

To accelerate the development of hardware accelerators, it is crucial to improve traditional Electronic Design Automation (EDA) approaches so that they better fit the hardware accelerator design workflow. One promising direction is to view hardware design in a software-centric way and apply mature optimization techniques from software engineering to the EDA infrastructure. In Fall 2022, in a collaboration with Dr. Yu Zeng from Princeton University, we applied the idea of compiler optimization to Register-Transfer Level (RTL) model generation and successfully reduced the time needed for cycle-accurate modeling of experimental accelerator designs. More specifically, cycle-accurate simulation of an accelerator's RTL model is typically time-consuming, and a large fraction of that time is spent computing data-path values that do not affect the cycle-level simulation results; we therefore aimed to remove this part of the RTL model during code generation. We implemented this as a pass for Yosys, an open-source and highly extensible RTL synthesis suite, applied during code generation from high-level Verilog into a low-level RTL model. In our experiments, the optimized RTL model cut simulation time roughly in half relative to the baseline model on TSIM, the cycle-accurate simulator provided by VTA, the Versatile Tensor Accelerator proposed by TVM (Moreau et al., 2019).
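The actual pass was written against Yosys's C++ pass interface; purely to illustrate the core idea, the following Python sketch performs the same kind of backward reachability analysis on a toy netlist representation. The cell names, the fan-in encoding, and the choice of timing-relevant sinks here are all hypothetical, not taken from the real pass.

```python
from collections import deque

def cells_needed_for_timing(fanin, timing_sinks):
    """Return the set of cells that must be kept for cycle-accurate simulation.

    fanin:        dict mapping each cell name to the names of the cells driving
                  its inputs (a toy netlist encoding, not Yosys RTLIL).
    timing_sinks: cells whose behavior determines cycle counts, e.g. control
                  FSM state registers, handshake/valid signals, loop counters.

    Cells that are not reachable backwards from any timing sink only compute
    data-path values, so they can be dropped from the generated model without
    changing the cycle-level result.
    """
    keep = set(timing_sinks)
    worklist = deque(timing_sinks)
    while worklist:
        cell = worklist.popleft()
        for driver in fanin.get(cell, ()):
            if driver not in keep:
                keep.add(driver)
                worklist.append(driver)
    return keep

# Toy example (hypothetical cell names): the MAC datapath only feeds the data
# output register, so it is pruned; the counter and FSM logic are kept.
netlist_fanin = {
    "fsm_state_reg":    ["fsm_next_logic"],
    "fsm_next_logic":   ["loop_counter_reg"],
    "loop_counter_reg": ["counter_incr"],
    "counter_incr":     ["loop_counter_reg"],
    "data_out_reg":     ["mac_unit"],
    "mac_unit":         ["input_buffer"],
    "input_buffer":     [],
}
kept = cells_needed_for_timing(netlist_fanin, ["fsm_state_reg", "loop_counter_reg"])
pruned = set(netlist_fanin) - kept   # {'data_out_reg', 'mac_unit', 'input_buffer'}
```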

While designing new hardware accelerators for demanding computational tasks can pay off, hardware-aware algorithms implemented on existing, mature hardware platforms, such as optimized GPU kernels and libraries, are often more practical, as the field of Machine Learning Systems (MLSys) demonstrates. FlashAttention is an iconic example: it takes the GPU memory hierarchy into account in the attention algorithm and optimizes data movement between its levels to improve Transformer performance (Dao et al., 2022). FlashAttention has proven highly effective at improving the efficiency of LLMs and has been adopted into PyTorch. Building on FlashAttention, Dr. Yu Zeng and I started a project motivated by the findings of Li et al. (2021), which propose an inverse relation between the tile size of tiled matrix multiplication and the data movement through the CPU memory hierarchy. We wanted to understand how varying the size of the tiled sub-matrices affects L2 cache utilization on GPUs. Our plan was to modify the source code to change the tiling size and, for each run, profile the L2 cache metrics of an NVIDIA A100 with NVIDIA Nsight Compute. From these results, we intended to build a quantitative analytical model, in the spirit of Li et al., relating tiling size to total data movement, which could then be integrated into FlashAttention as a search space for choosing the best tiling configuration. However, the project did not unfold as expected. We first targeted the original CUDA implementation of FlashAttention and then the Triton version (Triton is a Python-based GPU programming language developed by OpenAI as an alternative to CUDA), but in both cases the profiled cache utilization did not match the theoretical values we derived from the source code. The failure of this project underscored a pivotal realization: despite the widespread adoption of GPUs in research and industry, their internals remain largely opaque compared with the well-studied microarchitecture of CPUs. This demands a tailored approach to profiling, analyzing, and optimizing for GPUs, particularly in balancing computational efficiency, resource allocation, and performance.
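To make the kind of analytical model we were aiming for concrete, here is a minimal first-order sketch. It is not the model from Li et al. (2021) and not the model the project required; it only estimates off-chip traffic for a square-tiled matrix multiplication under an output-stationary schedule, with the element size, cache capacity, and matrix shapes chosen purely for illustration. It does reproduce the qualitative inverse relation between tile size and data movement.

```python
def tiled_matmul_traffic_bytes(M, N, K, T, elem_bytes=2, cache_bytes=40 * 2**20):
    """First-order estimate of off-chip traffic for C[M,N] = A[M,K] @ B[K,N]
    tiled with square T x T tiles under an output-stationary schedule.

    Each T x T tile of C stays resident in cache while T-wide panels of A and
    B are streamed through, so A and B contribute roughly M*N*K/T element
    loads each and C is written back once, giving traffic that scales as 1/T
    until the tiles no longer fit in cache.
    """
    footprint = 3 * T * T * elem_bytes            # one tile each of A, B, C
    if footprint > cache_bytes:
        return None                               # tiling assumption violated
    traffic_elems = 2 * M * N * K / T + M * N     # stream A and B, write C once
    return traffic_elems * elem_bytes

# Sweep tile sizes for an illustrative GEMM shape (fp16 elements, 40 MB cache).
for T in (16, 32, 64, 128, 256):
    traffic = tiled_matmul_traffic_bytes(M=4096, N=4096, K=4096, T=T)
    print(T, "does not fit" if traffic is None else f"{traffic / 2**30:.1f} GiB")
```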

In Fall 2023, I took part in a research internship at Georgia Tech under the mentorship of Prof. Yang (Katie) Zhao in the EIC lab, focusing on a hardware-software co-design project (Yu et al., 2024). The core of the project was to develop a search space for hardware scheduling that matches the specific needs of LLMs running on edge-GPU devices. The search space was designed to evaluate various configurations and find the most efficient setup for fine-tuning and inference tasks. Inspired by insights from FlexGen (Sheng et al., 2023), we built an analytical cost model that identifies optimal configurations by analyzing compute schedules, block sizes, batch sizes, and sparsity ratios. Our work highlights the critical role of hardware-software synergy in maximizing the performance of machine learning systems on specialized hardware. Emerging from earlier setbacks, this project not only deepened my understanding of GPU architecture but also instilled in me resilience and a refined perspective on optimizing GPU-based systems for advanced AI tasks, underscoring the iterative nature of technological progress.
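The actual cost model is described in (Yu et al., 2024); the sketch below is only a simplified illustration of how such an analytical search can be structured. The roofline-style latency formula, the hardware parameters (peak throughput, memory bandwidth, weight size), and the candidate grid are assumptions of mine for illustration, not values or formulas from the project.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Config:
    block_size: int      # block size used by the compute schedule
    batch_size: int      # tokens decoded together per step
    sparsity: float      # fraction of weights skipped (0.0 = dense)

# Illustrative edge-GPU parameters; these are assumptions, not measurements.
PEAK_FLOPS = 20e12       # 20 TFLOP/s of usable compute
MEM_BW = 200e9           # 200 GB/s of memory bandwidth
WEIGHT_BYTES = 8e9       # 8 GB of (compressed) model weights

def per_token_latency(cfg, hidden=4096, layers=32):
    """Roofline-style estimate of decode latency per token (a sketch)."""
    flops = 2 * layers * cfg.batch_size * hidden * hidden * (1 - cfg.sparsity)
    weight_traffic = WEIGHT_BYTES * (1 - cfg.sparsity)   # skipped weights not read
    act_traffic = 2 * layers * cfg.batch_size * hidden * cfg.block_size
    compute_t = flops / PEAK_FLOPS
    memory_t = (weight_traffic + act_traffic) / MEM_BW
    return max(compute_t, memory_t) / cfg.batch_size     # assume compute/memory overlap

# Enumerate the configuration grid and pick the estimated-best setup.
search_space = [Config(b, bs, s)
                for b, bs, s in product((64, 128, 256), (1, 4, 8), (0.0, 0.3, 0.5))]
best = min(search_space, key=per_token_latency)
print(best, f"{per_token_latency(best) * 1e3:.2f} ms/token")
```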

During my PhD studies, my research will concentrate on two pivotal areas: architecture-algorithm co-design and the enhancement of Electronic Design Automation (EDA) flows for hardware accelerators. First, I aim to optimize the performance of large language models (LLMs) on existing hardware architectures such as GPUs and accelerators. By refining scheduling and data-mapping strategies, my goal is to achieve significant efficiency improvements in the execution of complex algorithms. This approach promises not only to elevate computational performance but also to ensure optimal utilization of available hardware resources, thereby reducing the energy footprint of these models. Second, I will focus on advancing the EDA flow for designing hardware accelerators by integrating compiler support tailored to accelerator design. This will involve developing and optimizing high-level synthesis (HLS) approaches and leveraging the LLVM compiler infrastructure to improve the generation of hardware description language (HDL) code targeted at accelerators. Contributing to open-source hardware design frameworks such as Chipyard and PULP will allow me to push the boundaries of current capabilities and foster an ecosystem in which innovation in hardware design is more accessible and efficient. Through these efforts, I aim to bridge the gap between software capabilities and hardware limitations and to drive forward the future of hardware accelerator technology.

References

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems.
Li, R., Xu, Y., Sukumaran-Rajam, A., Rountev, A., & Sadayappan, P. (2021). Analytical characterization and design space exploration for optimization of CNNs. Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 928–942.
Moreau, T., Chen, T., Vega, L., Roesch, J., Yan, E., Zheng, L., Fromm, J., Jiang, Z., Ceze, L., Guestrin, C., & Krishnamurthy, A. (2019). A hardware-software blueprint for flexible deep learning specialization. https://arxiv.org/abs/1807.04188
Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., & Zhang, C. (2023). FlexGen: High-throughput generative inference of large language models with a single GPU. Proceedings of the 40th International Conference on Machine Learning.
Yu, Z., Wang, Z., Li, Y., Gao, R., Zhou, X., Bommu, S. R., Zhao, Y. (Katie), & Lin, Y. (Celine). (2024). EDGE-LLM: Enabling efficient large language model adaptation on edge devices via unified compression and adaptive layer voting. Design Automation Conference.