GPT-2 Kernel Fusion

Intro

This project focuses on optimizing the inference performance of the GPT-2 Small transformer model by implementing kernel fusion and memory-efficient execution strategies on GPUs. The goal is to reduce runtime overhead by minimizing kernel launches, improving memory locality, and better utilizing GPU parallelism.

The work involves profiling baseline transformer execution, identifying performance bottlenecks in the attention and feed-forward layers, and incrementally fusing operations such as linear projections, softmax, and elementwise transformations into custom GPU kernels. Particular attention is given to the memory hierarchy: shared memory usage, register pressure, and occupancy tradeoffs.

Through this project, I explore how modern transformer models execute at the systems level and how careful kernel design and scheduling decisions can significantly impact end-to-end inference latency. The project emphasizes practical performance engineering over model-level changes, aligning closely with real-world ML infrastructure and deployment concerns.
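As a minimal sketch of the kind of elementwise fusion described above: in the feed-forward layer, an unfused pipeline launches one kernel for the bias add and another for the GELU activation, with a full round trip through global memory in between. A fused kernel does both in a single pass. The kernel below is an illustrative example of this idea (not the project's actual implementation); it uses the tanh approximation of GELU that GPT-2 uses.

```cuda
#include <cuda_runtime.h>
#include <math.h>

// Fused bias-add + GELU: one kernel launch, one global-memory read and
// one write per element, instead of two launches and two round trips.
__global__ void fused_bias_gelu(const float* __restrict__ x,
                                const float* __restrict__ bias,
                                float* __restrict__ out,
                                int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < rows * cols) {
        float v = x[idx] + bias[idx % cols];  // bias broadcast across rows
        // tanh approximation of GELU (as used in GPT-2):
        // 0.5 * v * (1 + tanh(sqrt(2/pi) * (v + 0.044715 * v^3)))
        const float c = 0.7978845608f;        // sqrt(2/pi)
        float t = tanhf(c * (v + 0.044715f * v * v * v));
        out[idx] = 0.5f * v * (1.0f + t);
    }
}

// Launch helper: one thread per element of the (rows x cols) activation.
void launch_fused_bias_gelu(const float* x, const float* bias, float* out,
                            int rows, int cols, cudaStream_t stream) {
    int n = rows * cols;
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    fused_bias_gelu<<<blocks, threads, 0, stream>>>(x, bias, out, rows, cols);
}
```

Even a simple fusion like this removes one kernel launch and one intermediate tensor per feed-forward layer; the same pattern extends to fusing softmax with its surrounding scaling and masking in the attention block.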

Next work
