As LLMs grow in size and complexity, traditional processors struggle to handle their workloads efficiently. This challenge has spurred advances in two critical areas: hardware description languages (HDLs) and EDA infrastructure for designing specialized accelerators, and optimized dataflow mapping onto existing hardware.
To accelerate the development of hardware accelerators, it is crucial to improve traditional Electronic Design Automation (EDA) approaches so that they better meet the requirements of the accelerator design workflow. One promising direction is to view hardware design through a software-centric lens and apply mature optimization techniques from software engineering to the EDA infrastructure. In Fall 2022, during a collaboration with Dr. Yu Zeng from Princeton University, we applied the idea of compiler optimization to Register-Transfer Level (RTL) model generation, successfully reducing the time needed for cycle-accurate modeling of experimental accelerator designs. Specifically, we extended the FIRRTL Codegen feature, improving its Dead-Code Elimination (DCE) with better optimization of memory modules. We also developed an extension for Yosys, an open-source RTL synthesis suite, that performs constraint propagation to generate reduced yet still cycle-accurate RTL models from netlists emitted by the FIRRTL compiler. To evaluate the approach, we compared the simulation times of the original and optimized RTL models across multiple Chisel designs, including CPUs, FPUs, and accelerators; the reduced RTL models cut simulation time by up to 2x. This work contributed to research that led to the publication “Automatic Generation of Cycle-Accurate Timing Models from RTL for Hardware Accelerators” (ICCAD 2024), demonstrating the impact of our approach on the efficiency of hardware accelerator design and simulation.
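To make the idea concrete, below is a minimal sketch of constraint propagation followed by dead-code elimination on a toy netlist. The netlist encoding, cell names, and folding rules are hypothetical simplifications for illustration only; the actual extension operates on Yosys's internal representation rather than a Python dictionary.

```python
# Toy illustration (not the actual Yosys pass): fold constants implied by
# an external constraint through a netlist, then drop cells no live output
# depends on. Each cell maps an output wire to (op, input wires).
netlist = {
    "n1": ("AND", ["cfg_en", "a"]),   # cfg_en will be constrained to 0
    "n2": ("OR",  ["n1", "b"]),
    "n3": ("XOR", ["a", "b"]),        # never reaches an output -> dead
    "out": ("BUF", ["n2"]),
}
outputs = {"out"}

def propagate(netlist, constraints):
    """Constraint propagation: derive constant wires from external pins."""
    known = dict(constraints)  # wire -> constant value
    changed = True
    while changed:
        changed = False
        for wire, (op, ins) in netlist.items():
            if wire in known:
                continue
            vals = [known.get(i, i) for i in ins]
            if op == "AND" and 0 in vals:          # AND with 0 is 0
                known[wire] = 0; changed = True
            elif op == "OR" and 1 in vals:         # OR with 1 is 1
                known[wire] = 1; changed = True
            elif op == "BUF" and not isinstance(vals[0], str):
                known[wire] = vals[0]; changed = True
    return known

def dce(netlist, outputs, known):
    """Keep only cells that a non-constant output transitively uses."""
    live, stack = set(), [o for o in outputs if o not in known]
    while stack:
        w = stack.pop()
        if w in live or w not in netlist:
            continue
        live.add(w)
        stack.extend(i for i in netlist[w][1] if i not in known)
    return {w: cell for w, cell in netlist.items() if w in live}

known = propagate(netlist, {"cfg_en": 0})   # external constraint
reduced = dce(netlist, outputs, known)
print(known)            # {'cfg_en': 0, 'n1': 0}
print(sorted(reduced))  # ['n2', 'out']: n1 folded away, n3 dead
# A real pass would go further, e.g. rewriting n2 = OR(0, b) to just b.
```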
While designing new hardware accelerators can be effective, optimizing algorithms for existing, mature hardware platforms often yields more immediate efficiency gains in Machine Learning Systems (MLSys). A prime example is Flash-Attention, which optimizes data movement across the GPU memory hierarchy for attention mechanisms in Transformers (Dao et al., 2022); it has significantly improved LLM efficiency and has been integrated into PyTorch, demonstrating its impact on the field. Building on Flash-Attention, Dr. Yu Zeng and I started a project motivated by the findings of (Li et al., 2021), which propose an inverse linear relation between the tile size of tiled matrix multiplication and the data movement through the CPU memory hierarchy. We wanted to understand how varying the size of the tiled sub-matrices affects L2 cache utilization on GPUs. Our plan was to modify the source code to vary the tile size and, for each run, profile L2 cache metrics on an NVIDIA A100 with NVIDIA Nsight Compute. From these measurements, we hoped to derive a quantitative analytical model, analogous to that of Li et al., relating tile size to total data movement, which could then be integrated into Flash-Attention as a search space for selecting the best tile size. However, the project did not work out as we expected. We first examined the original CUDA implementation of Flash-Attention and then the Triton version (Triton is a Python-based GPU programming language developed by OpenAI as a CUDA alternative), but in both cases the profiled cache utilization did not match the theoretical values we derived from the source code. The failure of this project underscored a pivotal realization: despite the widespread adoption of GPUs in both research and industry, their internals remain largely opaque, in contrast to the well-studied microarchitecture of CPUs. This distinction necessitates a tailored approach when profiling, analyzing, and optimizing for GPUs, particularly in managing the intricate balance of computational efficiency, resource allocation, and performance.
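The kind of analytical model we had in mind is easy to state in a few lines. The sketch below is a back-of-the-envelope traffic estimate under simplifying assumptions (square tiles of size T, both operand tiles resident in cache, the output streamed once); it is our own illustration of the inverse linear relation, not the exact formulation of Li et al.:

```python
# Back-of-the-envelope memory-traffic model for tiled matmul C = A @ B,
# with A: MxK, B: KxN, square TxT tiles (assumed), operand tiles cached.
def matmul_traffic_elems(M, N, K, T):
    a_reads = M * K * (N // T)   # each A tile re-read per tile-column of B
    b_reads = K * N * (M // T)   # each B tile re-read per tile-row of A
    c_io = 2 * M * N             # C read and written once
    return a_reads + b_reads + c_io

M = N = K = 4096
for T in (32, 64, 128, 256):
    elems = matmul_traffic_elems(M, N, K, T)
    print(f"T={T:4d}  traffic ~ {elems * 4 / 2**30:.1f} GiB (fp32)")
# Doubling T roughly halves the dominant terms: the inverse linear
# relation we hoped to confirm in the A100's L2 counters.
```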
In Fall 2023, I took part in a research internship at Georgia Tech under the mentorship of Prof. Yang (Katie) Zhao in the EIC lab, where we worked on a hardware-software co-design project (Yu et al., 2024). The core of the project was to develop a search space for hardware scheduling tailored to the specific needs of LLMs running on edge-GPU devices. This search space was designed to evaluate various configurations and find the most efficient setup for fine-tuning and inference tasks. Inspired by insights from (Sheng et al., 2023), we built an analytical cost model that identified optimal configurations by analyzing compute schedules, block sizes, batch sizes, and sparsity ratios. Our work highlights the critical role of hardware-software synergy in maximizing the performance of machine learning systems on specialized hardware. Building on the lessons of earlier setbacks, this project not only deepened my understanding of GPU architecture but also instilled in me resilience and a refined perspective on optimizing GPU-based systems for advanced AI tasks, underscoring the iterative nature of technological progress.
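The structure of such a search is simple even though the real model is not. The following skeleton illustrates the idea with a roofline-style latency estimate; the hardware numbers, the cost formula, and the search ranges are hypothetical placeholders for illustration, not the actual model from (Yu et al., 2024):

```python
from itertools import product

# Hypothetical edge-GPU parameters (assumed, not measured).
PEAK_FLOPS = 10e12   # FLOP/s
MEM_BW = 100e9       # bytes/s
N_PARAMS = 7e9       # model parameter count

def traffic_bytes(block, batch):
    # Stand-in traffic model: larger blocks amortize weight reloads.
    weight_io = N_PARAMS * 2 * (128 / block)   # fp16 weights, reload factor
    act_io = batch * 4096 * 2 * 2              # fp16 activations in/out
    return weight_io + act_io

def est_latency(schedule, block, batch, sparsity):
    """Roofline-style estimate: the slower of compute and memory wins."""
    flops = 2 * N_PARAMS * batch * (1 - sparsity)   # pruned weights skipped
    compute_t = flops / PEAK_FLOPS
    memory_t = traffic_bytes(block, batch) / MEM_BW
    overhead = 1.1 if schedule == "recompute" else 1.0   # toy penalty
    return max(compute_t, memory_t) * overhead

space = product(["pipeline", "recompute"],   # compute schedule
                [64, 128, 256],              # block size
                [1, 4, 8],                   # batch size
                [0.0, 0.5])                  # sparsity ratio
# Rank configurations by estimated per-sample latency (cfg[2] is batch).
best = min(space, key=lambda cfg: est_latency(*cfg) / cfg[2])
print("best (schedule, block, batch, sparsity):", best)
```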
During my PhD studies, my research will concentrate on two pivotal areas: architecture-algorithm co-design and the enhancement of Electronic Design Automation (EDA) flows for hardware accelerators. First, I aim to optimize the performance of large language models (LLMs) on existing hardware such as GPUs and accelerators. By refining scheduling and data-mapping strategies, my goal is to achieve significant efficiency improvements in the execution of complex algorithms. This approach promises not only to elevate computational performance but also to ensure optimal utilization of available hardware resources, thereby reducing the energy footprint of these models. Second, I will focus on advancing the EDA flow for designing hardware accelerators by integrating compiler support tailored to accelerating hardware design. This will involve developing and optimizing high-level synthesis (HLS) approaches and leveraging the LLVM compiler infrastructure to improve the generation of accelerator-oriented hardware description language (HDL) code. Contributing to open-source hardware design frameworks such as Chipyard and PULP will allow me to push the boundaries of current capabilities and foster an ecosystem where innovation in hardware design is more accessible and efficient. Through these efforts, I aim to bridge the gap between software capabilities and hardware limitations, driving forward the future of hardware accelerator technology.