
Kernel Fusion at Scale

5 min read · Updated Jan 27, 2026

The Memory Wall

In modern deep learning, compute is rarely the bottleneck. Memory bandwidth is.

A typical Transformer layer is a sequence of operations: LayerNorm, MatMul, Bias Add, GELU, MatMul, Residual Add. In a naive implementation, each operation reads its input from HBM (High Bandwidth Memory), computes, and writes the result back.

This "round-trip" to memory is expensive. It consumes energy and stalls the compute cores.

Automated Kernel Fusion

Kernel fusion is the technique of combining multiple operations into a single kernel launch. Instead of writing intermediate results to HBM, we keep them in the fast SRAM (L1/L2 cache) or registers.
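
Continuing the sketch above, a hand-fused version of the same two steps keeps the intermediate value in a register, so it never touches HBM. Again, this is an illustration of the general technique, not output from General Diffusion's system.

```cuda
// Fused version: one launch, one global read and one global write per element.
// The intermediate (biased) value never leaves the register `v`.
__global__ void bias_gelu_fused(float* x, const float* bias, int n, int hidden) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i] + bias[i % hidden];  // intermediate stays on-chip
        x[i] = 0.5f * v * (1.0f + tanhf(0.7978845608f * (v + 0.044715f * v * v * v)));
    }
}
```

Launched with the same grid, the fused kernel reads and writes the tensor once instead of twice, roughly halving the HBM traffic for this pair of ops.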

General Diffusion's Graph Partitioner is designed to automate this process at cluster scale.

How It Works

  1. Subgraph Identification: The system scans the compute graph for "fusable" patterns (e.g., pointwise operations following a MatMul); a minimal matching sketch follows this list.
  2. Cost Modeling: It estimates the register pressure and shared-memory usage of the fused kernel. If a fusion would cause register spilling (which hurts performance), the system falls back to the next-best configuration; a simple resource check is also sketched below.
  3. Code Generation: It generates a custom CUDA or Triton kernel that executes the fused sequence, along the lines of the hand-fused kernel shown earlier.
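
To make steps 1 and 2 a little more concrete, here are two deliberately simplified, hypothetical sketches; neither is General Diffusion's actual implementation. The first models the layer as a linear chain of ops (a real partitioner works on a dataflow graph) and greedily groups each MatMul with the run of pointwise ops that follows it. The `Op` struct and `find_fusion_candidates` helper are made up for illustration.

```cuda
#include <string>
#include <vector>

// Host-side sketch of a toy IR: the layer as a linear chain of ops.
struct Op {
    std::string kind;   // e.g. "MatMul", "BiasAdd", "Gelu", "LayerNorm"
    bool pointwise;     // elementwise ops are cheap to fold into a producer
};

// Greedily group each MatMul with the run of pointwise ops that follows it.
// Every group returned is a fusion candidate to be handed to the cost model.
std::vector<std::vector<int>> find_fusion_candidates(const std::vector<Op>& chain) {
    std::vector<std::vector<int>> groups;
    for (int i = 0; i < (int)chain.size(); ++i) {
        if (chain[i].kind != "MatMul") continue;
        std::vector<int> group = {i};
        int j = i + 1;
        while (j < (int)chain.size() && chain[j].pointwise) group.push_back(j++);
        if (group.size() > 1) groups.push_back(group);  // MatMul + at least one pointwise op
    }
    return groups;
}
```

The second shows one way a cost model can be grounded in real numbers: compile the candidate kernel, then query its resource usage through the CUDA runtime. The thresholds below are illustrative assumptions, not tuned values; local-memory usage is used as a proxy for register spilling because spilled registers land in local memory.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical acceptance test for a fused-kernel candidate: inspect the
// compiled kernel's resource footprint before committing to the fusion.
bool fusion_is_profitable(const void* fused_kernel, int block_size) {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, fused_kernel) != cudaSuccess) return false;

    // Spilled registers end up in local memory, so any local-memory usage
    // is treated here as a sign the fusion is too register-hungry.
    if (attr.localSizeBytes > 0) return false;

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, fused_kernel, block_size, /*dynamicSMemSize=*/0);

    printf("regs/thread=%d  smem=%zu B  resident blocks/SM=%d\n",
           attr.numRegs, attr.sharedSizeBytes, blocks_per_sm);

    // Illustrative threshold: keep at least two blocks resident per SM
    // so the scheduler has enough warps to hide memory latency.
    return blocks_per_sm >= 2;
}

// Usage: pass the kernel symbol from the earlier sketch, e.g.
//   fusion_is_profitable((const void*)bias_gelu_fused, 256);
```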

The Impact

By keeping data in SRAM and minimizing global memory accesses, automated kernel fusion attacks the primary bottleneck in large-model inference, letting models keep scaling even as compute throughput outpaces memory bandwidth.

This is just one example of how software optimization can unlock "free" performance from existing hardware.