Kernel Fusion at Scale
The Memory Wall
In modern deep learning, compute is rarely the bottleneck. Memory bandwidth is.
A typical Transformer layer consists of a sequence of operations: LayerNorm, MatMul, Bias, GELU, MatMul, Residual Add. In a naive implementation, each operation reads its input from HBM (High Bandwidth Memory), computes, and writes its result back.
This "round-trip" to memory is expensive. It consumes energy and stalls the compute cores.
Automated Kernel Fusion
Kernel fusion is the technique of combining multiple operations into a single kernel launch. Instead of writing intermediate results to HBM, we keep them in fast on-chip SRAM (shared memory, L1/L2 cache) or in registers.
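As a minimal illustration of what a fused kernel can look like, here is a sketch in Triton that fuses the bias add and GELU into one launch. This is not General Diffusion's generated code; the sigmoid approximation of GELU, the row-per-program layout, and all names are assumptions for the example:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_bias_gelu(x_ptr, bias_ptr, out_ptr, n_cols, BLOCK: tl.constexpr):
    # One program instance handles one row. The intermediate (x + bias) lives
    # only in registers -- it is never written back to HBM.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK)
    mask = cols < n_cols
    x = tl.load(x_ptr + row * n_cols + cols, mask=mask, other=0.0)
    b = tl.load(bias_ptr + cols, mask=mask, other=0.0)
    y = x + b
    y = y * tl.sigmoid(1.702 * y)            # sigmoid approximation of GELU
    tl.store(out_ptr + row * n_cols + cols, y, mask=mask)

def bias_gelu(x: torch.Tensor, bias: torch.Tensor) -> torch.Tensor:
    # x: (n_rows, n_cols) CUDA tensor, bias: (n_cols,)
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    BLOCK = triton.next_power_of_2(n_cols)   # one block spans a full row here
    fused_bias_gelu[(n_rows,)](x, bias, out, n_cols, BLOCK=BLOCK)
    return out
```

Unfused, the same computation would launch two kernels and move the bias-add intermediate through HBM twice: once on the write, once on the read.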
General Diffusion's Graph Partitioner is designed to automate this process at cluster scale.
How It Works
- Subgraph Identification: The system scans the compute graph for "fusable" patterns (e.g., pointwise operations following a MatMul).
- Cost Modeling: It estimates the register pressure and shared memory usage of the fused kernel. If the fusion would cause register spilling (which hurts performance), the system falls back to the next-best configuration; a simplified sketch of these first two steps follows the list.
- Code Generation: It generates a custom CUDA or Triton kernel that executes the fused sequence.
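Below is a simplified sketch of the identification and cost-modeling steps in Python. The node structure, the per-op register estimates, and the register budget are all hypothetical; General Diffusion's actual IR and cost model are more involved:

```python
from dataclasses import dataclass, field

POINTWISE = {"bias", "gelu", "relu", "add", "mul", "dropout"}

@dataclass
class Node:
    name: str
    op: str
    est_regs_per_thread: int = 8             # hypothetical cost-model estimate
    users: list = field(default_factory=list)

def partition(nodes, reg_budget=255):
    """Greedily grow fusion groups: a MatMul anchor absorbs the chain of
    pointwise ops that follows it, as long as the estimated register
    pressure stays under budget (to avoid spilling)."""
    groups, visited = [], set()
    for node in nodes:
        if node.name in visited or node.op != "matmul":
            continue
        group, regs, cur = [node], node.est_regs_per_thread, node
        # Follow single-consumer pointwise chains only.
        while len(cur.users) == 1 and cur.users[0].op in POINTWISE:
            nxt = cur.users[0]
            if regs + nxt.est_regs_per_thread > reg_budget:
                break                         # would spill: stop fusing here
            group.append(nxt)
            regs += nxt.est_regs_per_thread
            cur = nxt
        visited.update(n.name for n in group)
        groups.append(group)
    return groups

# Example: MatMul -> Bias -> GELU collapses into a single fusion group.
gelu = Node("gelu0", "gelu")
bias = Node("bias0", "bias", users=[gelu])
mm = Node("mm0", "matmul", users=[bias])
print([[n.name for n in g] for g in partition([mm, bias, gelu])])
# [['mm0', 'bias0', 'gelu0']]
```

Each resulting group is then handed to the code generator, which emits one CUDA or Triton kernel per group.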
The Impact
By keeping data in SRAM and minimizing global memory accesses, automated kernel fusion addresses the primary bottleneck in large model inference. This approach is essential for scaling models beyond the limits of current memory bandwidth.
This is just one example of how software optimization can unlock "free" performance from existing hardware.
