
Kernel Fusion at Scale

5 min read · Updated Jan 27, 2026

The Memory Wall

In modern deep learning, compute is rarely the bottleneck. Memory bandwidth is.

A typical Transformer layer is a sequence of operations: LayerNorm, MatMul, Bias Add, GELU, MatMul, Residual Add. In a naive implementation, each operation reads its input from HBM (High Bandwidth Memory), computes, and writes the result back.

This "round-trip" to memory is expensive. It consumes energy and stalls the compute cores.

Automated Kernel Fusion

Kernel fusion is the technique of combining multiple operations into a single kernel launch. Instead of writing intermediate results to HBM, we keep them in the fast SRAM (L1/L2 cache) or registers.
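
Continuing the sketch above, a hand-fused version of the same two steps keeps the intermediate value in a register, so it never touches HBM. Again, this is an illustration of the general technique, not output from General Diffusion's system.

```cuda
// Fused version: one launch, one global read and one global write per element.
// The intermediate (biased) value never leaves the register `v`.
__global__ void bias_gelu_fused(float* x, const float* bias, int n, int hidden) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = x[i] + bias[i % hidden];  // intermediate stays on-chip
        x[i] = 0.5f * v * (1.0f + tanhf(0.7978845608f * (v + 0.044715f * v * v * v)));
    }
}
```

Launched with the same grid, the fused kernel reads and writes the tensor once instead of twice, roughly halving the HBM traffic for this pair of ops.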

General Diffusion's Graph Partitioner is designed to automate this process at cluster scale.

How It Works

  1. Subgraph Identification: The system scans the compute graph for "fusable" patterns (e.g., pointwise operations following a MatMul); a minimal matching sketch follows this list.
  2. Cost Modeling: It estimates the register pressure and shared-memory usage of the fused kernel. If a fusion would cause register spilling (which hurts performance), the system falls back to the next-best configuration; a simple resource check is also sketched below.
  3. Code Generation: It generates a custom CUDA or Triton kernel that executes the fused sequence, along the lines of the hand-fused kernel shown earlier.
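
To make steps 1 and 2 a little more concrete, here are two deliberately simplified, hypothetical sketches; neither is General Diffusion's actual implementation. The first models the layer as a linear chain of ops (a real partitioner works on a dataflow graph) and greedily groups each MatMul with the run of pointwise ops that follows it. The `Op` struct and `find_fusion_candidates` helper are made up for illustration.

```cuda
#include <string>
#include <vector>

// Host-side sketch of a toy IR: the layer as a linear chain of ops.
struct Op {
    std::string kind;   // e.g. "MatMul", "BiasAdd", "Gelu", "LayerNorm"
    bool pointwise;     // elementwise ops are cheap to fold into a producer
};

// Greedily group each MatMul with the run of pointwise ops that follows it.
// Every group returned is a fusion candidate to be handed to the cost model.
std::vector<std::vector<int>> find_fusion_candidates(const std::vector<Op>& chain) {
    std::vector<std::vector<int>> groups;
    for (int i = 0; i < (int)chain.size(); ++i) {
        if (chain[i].kind != "MatMul") continue;
        std::vector<int> group = {i};
        int j = i + 1;
        while (j < (int)chain.size() && chain[j].pointwise) group.push_back(j++);
        if (group.size() > 1) groups.push_back(group);  // MatMul + at least one pointwise op
    }
    return groups;
}
```

The second shows one way a cost model can be grounded in real numbers: compile the candidate kernel, then query its resource usage through the CUDA runtime. The thresholds below are illustrative assumptions, not tuned values; local-memory usage is used as a proxy for register spilling because spilled registers land in local memory.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Hypothetical acceptance test for a fused-kernel candidate: inspect the
// compiled kernel's resource footprint before committing to the fusion.
bool fusion_is_profitable(const void* fused_kernel, int block_size) {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, fused_kernel) != cudaSuccess) return false;

    // Spilled registers end up in local memory, so any local-memory usage
    // is treated here as a sign the fusion is too register-hungry.
    if (attr.localSizeBytes > 0) return false;

    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, fused_kernel, block_size, /*dynamicSMemSize=*/0);

    printf("regs/thread=%d  smem=%zu B  resident blocks/SM=%d\n",
           attr.numRegs, attr.sharedSizeBytes, blocks_per_sm);

    // Illustrative threshold: keep at least two blocks resident per SM
    // so the scheduler has enough warps to hide memory latency.
    return blocks_per_sm >= 2;
}

// Usage: pass the kernel symbol from the earlier sketch, e.g.
//   fusion_is_profitable((const void*)bias_gelu_fused, 256);
```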

The Impact

By keeping data in SRAM and minimizing global memory accesses, automated kernel fusion attacks the primary bottleneck in large-model inference, letting models keep scaling even as compute throughput outpaces memory bandwidth.

This is just one example of how software optimization can unlock "free" performance from existing hardware.