Universal Translator White Paper
Abstract
The fragmentation of the AI hardware ecosystem poses a fundamental barrier to scalability. As new accelerators (TPUs, LPUs, ASICs) emerge, the engineering cost of porting models out of NVIDIA's CUDA ecosystem grows superlinearly with the number of targets.
General Diffusion is developing the Universal Translator, a Just-In-Time (JIT) compilation architecture designed to decouple model definition from hardware execution. By treating the compute graph as a mutable intermediate representation (IR), we aim to enable real-time translation of PyTorch models to hardware-native kernels without manual intervention.
The Problem: Silicon Lock-in
Current AI development is path-dependent. A model trained on NVIDIA GPUs is implicitly optimized for the CUDA memory hierarchy. Porting this model to a TPU v5p or an AMD MI300 requires:
- Kernel Rewriting: Manually translating CUDA kernels to JAX/Pallas or HIP.
- Memory Layout Reorganization: Adapting tensor striding for different interconnect topologies.
- Scheduler Tuning: Re-optimizing dispatch logic for latency vs. throughput.
This process takes months of specialized engineering time, effectively locking organizations into a single hardware vendor.
Our Approach: The JIT Translation Layer
The Universal Translator architecture operates at the graph level, intercepting PyTorch nn.Module calls before they reach the runtime.
1. Graph Capture & Canonicalization
The system captures the dynamic computation graph, converting it into a hardware-agnostic General Intermediate Representation (GIR). GIR abstracts away device-specific constraints, focusing purely on mathematical operations and data dependencies.
# Conceptual GIR representation; llama_3 is an already-loaded nn.Module,
# capture_model and canonicalize are the translator's own entry points
graph = capture_model(llama_3)
gir = canonicalize(graph)
# GIR is now purely mathematical, with no assumptions about memory hierarchy
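For illustration only, today's PyTorch already offers graph capture via torch.fx; the sketch below shows the kind of device-agnostic node list that a capture step could build on. TinyBlock is a stand-in model, and the mapping from an fx graph to GIR is conceptual, not the production capture_model.

# Illustrative capture with torch.fx; not the translator's actual pipeline.
import torch
import torch.fx as fx

class TinyBlock(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 64)

    def forward(self, x):
        return torch.relu(self.linear(x))

traced = fx.symbolic_trace(TinyBlock())  # records call_module / call_function nodes
for node in traced.graph.nodes:
    print(node.op, node.target)          # pure ops and data dependencies, no device info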
2. Hardware Profiling
Simultaneously, a Hardware Profiler scans the available cluster topology. It queries critical physical constraints (a minimal profiling sketch follows the list below):
- Compute throughput (peak FLOPS)
- Memory bandwidth (HBM3e vs. GDDR6)
- Interconnect latency (NVLink vs. Infinity Fabric)
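The sketch below shows a minimal version of this profiling step, restricted to what the CUDA runtime exposes for local GPUs. HardwareProfile and profile_local_gpus are illustrative names; the production profiler would additionally measure memory bandwidth and interconnect latency empirically and cover non-NVIDIA devices.

# Minimal local-GPU profiling sketch (assumption: CUDA devices only).
from dataclasses import dataclass
import torch

@dataclass
class HardwareProfile:
    name: str
    total_memory_gb: float
    sm_count: int
    compute_capability: tuple

def profile_local_gpus() -> list:
    profiles = []
    for idx in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(idx)
        profiles.append(HardwareProfile(
            name=props.name,
            total_memory_gb=props.total_memory / 1e9,
            sm_count=props.multi_processor_count,
            compute_capability=(props.major, props.minor),
        ))
    return profiles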
3. Automated Kernel Fusion
The Graph Partitioner maps GIR subgraphs to the optimal kernels for the target hardware. It identifies opportunities for operator fusion—combining MatMul, Bias, and ReLU into a single kernel launch to minimize global memory round-trips.
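Below is a minimal sketch of such a fusion pass over a linearized GIR. GIRNode, the op names, and fuse_matmul_bias_relu are illustrative placeholders rather than the production partitioner, which operates on the full dependency graph rather than a flat node list.

# Hypothetical fusion pass over a linearized GIR (names are illustrative).
from dataclasses import dataclass, field

@dataclass
class GIRNode:
    op: str                      # e.g. "matmul", "bias_add", "relu"
    inputs: list = field(default_factory=list)

PATTERN = ("matmul", "bias_add", "relu")

def fuse_matmul_bias_relu(nodes):
    """Collapse matmul -> bias_add -> relu chains into a single node so the
    backend emits one kernel launch instead of three."""
    out, i = [], 0
    while i < len(nodes):
        if tuple(n.op for n in nodes[i:i + 3]) == PATTERN:
            bias = nodes[i + 1].inputs[1:]   # assume inputs are [matmul_out, bias]
            out.append(GIRNode("fused_matmul_bias_relu", nodes[i].inputs + bias))
            i += 3
        else:
            out.append(nodes[i])
            i += 1
    return out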
4. Code Generation
Finally, the Code Generator emits optimized device code. Our roadmap includes backends for:
- NVIDIA: Triton kernels tuned for Tensor Cores (an illustrative kernel sketch follows this list).
- AMD: HIP/ROCm primitives.
- CPUs: AVX-512/AMX instructions via Mojo.
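As an illustration of the NVIDIA path, the sketch below is a hand-written Triton kernel that fuses a bias add and ReLU into a single pass over global memory. It shows only an elementwise epilogue, not the full matmul kernel the generator would emit, and bias_relu_kernel / bias_relu are illustrative names, not output of the Code Generator.

# Illustrative fused bias-add + ReLU epilogue in Triton.
import torch
import triton
import triton.language as tl

@triton.jit
def bias_relu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    b = tl.load(bias_ptr + offsets, mask=mask)
    # Fused: bias add and ReLU in one trip through global memory.
    tl.store(out_ptr + offsets, tl.maximum(x + b, 0.0), mask=mask)

def bias_relu(x, bias):
    # Assumes x and bias are contiguous CUDA tensors of equal size.
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)
    bias_relu_kernel[grid](x, bias, out, n, BLOCK=1024)
    return out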
The Economic Implication
The Universal Translator is not just a compiler; it is an economic engine. By making compute fungible, we aim to unlock a liquid market where models flow to the most efficient hardware available, regardless of vendor.
This white paper outlines our core architectural thesis. We are actively collaborating with hardware partners to validate these methodologies.
