
Heterogeneous Compute Physics: A New Science of Computation

Vision · 8 min read · Updated April 2026

The Assumption That Broke

In 2012, AlexNet ran on two NVIDIA GTX 580 GPUs. Every architectural decision that followed — batch normalization, residual connections, attention mechanisms, mixture-of-experts — was shaped by the assumption that the hardware running these algorithms would be identical. CUDA became the universal language. Homogeneous GPU clusters became the universal substrate. The entire AI software ecosystem hardened around this assumption.


That assumption is now structurally false.


Silicon architectures are proliferating beyond general-purpose GPUs into domain-specific accelerators, custom ASICs, tensor processing units, reconfigurable FPGAs, photonic processors, neuromorphic chips, and sovereign national hardware programs. Over 50 nations are building AI infrastructure as strategic assets, and no nation locks into a single vendor. The hardware landscape is fragmenting on a government timeline, accelerated by US export controls, the CHIPS Act, and bilateral compute sovereignty initiatives across the Gulf, EU, and Indo-Pacific.


The software ecosystem built for uniformity cannot be patched to handle this diversity. The assumption of homogeneity is not a feature that can be updated — it is embedded at every layer of the stack, from compiler backends to memory management to scheduling heuristics. It must be replaced by new science, new infrastructure, and new models.

The Mismatch Tax

Making heterogeneous compute work today requires armies of specialists: compiler engineers writing kernels for each processor type, hardware engineers manually profiling performance surfaces, systems engineers hand-tuning execution policies, and ML infrastructure engineers spending months adapting models to new hardware. Every serious heterogeneous deployment requires 8–15 specialist engineers to bridge hardware boundaries.


We call this the Mismatch Tax — the aggregate efficiency lost when computation runs on hardware different from its design target. It includes direct performance loss, engineering labor, opportunity cost of workloads that never deploy, and cluster underutilization. Global AI infrastructure spending exceeded $200B in 2025. Average cluster utilization runs 30–50%. The Mismatch Tax is measured in tens of billions annually — and grows superlinearly with every new chip architecture that enters production.
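
To make "tens of billions" concrete, here is a back-of-envelope sketch of the underutilization component alone. The only inputs taken from this article are the $200B spend figure and the 30–50% utilization range; the 70% target is an assumption for illustration:

```python
# Rough, illustrative estimate of the Mismatch Tax's underutilization
# component. Only the $200B spend figure and the 30-50% utilization range
# come from the text; the 70% target is an assumption.

annual_spend_usd = 200e9        # global AI infrastructure spend, 2025 (from text)
current_utilization = 0.40      # midpoint of the 30-50% range cited above
achievable_utilization = 0.70   # assumed target under intelligent orchestration

# Spend that would deliver the same useful compute at the higher utilization.
equivalent_spend = annual_spend_usd * current_utilization / achievable_utilization
underutilization_tax = annual_spend_usd - equivalent_spend

print(f"Underutilization alone: ${underutilization_tax / 1e9:.0f}B per year")
# -> roughly $86B per year, before counting engineering labor, opportunity
#    cost, or workloads that never deploy
```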


This labor is not scalable. The pool of engineers who can bridge hardware boundaries grows linearly. The demand grows exponentially. The gap is structural and permanent.


General Diffusion eliminates this bottleneck permanently — not by hiring more specialists, but by building the foundational models and autonomous agents that do this work automatically.

The Discipline: Heterogeneous Compute Physics

The prevailing model of computation treats it as substrate-independent — a mathematical abstraction that runs identically regardless of physical implementation. This abstraction, rooted in Turing’s universal machine and Church’s lambda calculus, served well when hardware was uniform.


When hardware is heterogeneous, the abstraction breaks. Computation becomes substrate-dependent: the physical properties of the hardware shape not just the speed of computation but the space of computations that are efficiently achievable. This is why we use the word physics — heterogeneous compute behavior is governed by the physical properties of diverse substrates interacting, not by abstract computational equivalence.


We propose that computation across physically heterogeneous hardware obeys learnable laws. These laws take the form of compositional scaling relationships: the performance of a workload subgraph on a processor class can be predicted from the subgraph’s computational profile, the processor’s behavioral model, and the interaction context. These predictions compose — the performance of a partitioned workload across heterogeneous processors can be assembled from predictions of its parts, analogous to how thermodynamic properties of mixtures can be estimated from properties of components.
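
As a minimal sketch of the shape such a law could take, consider a roofline-style stand-in for the learned models. Everything below, the feature set, the roofline approximation, and the purely additive composition with explicit transfer costs, is an illustrative assumption; the actual laws are what the learned systems would discover:

```python
from dataclasses import dataclass

@dataclass
class SubgraphProfile:
    """Computational profile of one workload subgraph (illustrative features)."""
    flops: float        # arithmetic work required
    bytes_moved: float  # memory traffic required

@dataclass
class ProcessorModel:
    """Stand-in for a learned behavioral model of one processor class."""
    peak_flops: float      # peak compute throughput
    peak_bandwidth: float  # peak memory bandwidth

def predict_runtime(g: SubgraphProfile, p: ProcessorModel) -> float:
    """Predict one subgraph's runtime on one processor (roofline approximation)."""
    return max(g.flops / p.peak_flops, g.bytes_moved / p.peak_bandwidth)

def compose(partition: list[tuple[SubgraphProfile, ProcessorModel]],
            transfer_costs: list[float]) -> float:
    """Assemble a whole-workload prediction from predictions of its parts.
    Assumes a sequential pipeline; interaction context is reduced to
    explicit transfer costs between stages."""
    return sum(predict_runtime(g, p) for g, p in partition) + sum(transfer_costs)
```

In a real system the roofline stand-in becomes a learned behavioral model and the additive composition a learned interaction term; the compositional structure is the claim, not this particular formula.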


This is a scientific claim, not an engineering claim. We are not asserting that heterogeneous compute can be made more efficient through better tools. We are asserting that the behavior of computation across heterogeneous hardware constitutes a lawful physical system that can be modeled, predicted, and controlled by learned systems — and that this constitutes a new scientific discipline.

The Co-Evolution Thesis

models → execution → hardware behavior → learning → improved models

Hardware shapes algorithms. Algorithms shape hardware. This is not a metaphor — it is a feedback loop with measurable dynamics.


Consider a concrete example: the Transformer architecture is itself a product of co-evolution. Attention mechanisms succeeded because GPU architectures made dense matrix multiplication cheap relative to sequential operations. The Transformer did not emerge from a hardware-neutral search over all possible architectures — it emerged from the specific computational affordances of the GPU. Had FPGAs dominated, a fundamentally different architecture — perhaps one exploiting reconfigurable logic for sparse, branching computation — might have emerged instead.


The uniformity of the GPU monoculture did not just accelerate one line of research; it suppressed alternatives. Heterogeneous Compute Physics reopens the space of possible algorithms by reopening the space of possible hardware.

The Fourth Convergence

Independent confirmation of our thesis has emerged from an unexpected direction. In the last twelve months, three research programs demonstrated that dynamic, context-aware routing across computational states produces dramatically superior outcomes compared to static pipelines:

  • Kimi (Muennighoff et al., 2026): Dynamic routing across attention layers — where the model learns which layers to activate for which tokens — produces measurably superior intelligence outcomes. The routing principle validated at the model layer.
  • DeepSeek-V3 (DeepSeek AI, 2024–2025): Mixture-of-Experts with learned routing functions outperforms dense models of equivalent compute budget. The routing principle validated at the expert layer.
  • Learned compiler heuristics (various, 2024–2025): Learned selection among optimization strategies outperforms any fixed heuristic. The routing principle validated at the compiler layer.

General Diffusion’s thesis is that this same principle holds at the physical hardware layer. Dynamic, learned routing of computation across heterogeneous processors will produce superior outcomes to any static allocation. The foundational models that learn this routing define Heterogeneous Compute Physics.
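
By analogy with mixture-of-experts token routing, a hardware-layer router might look like the sketch below. The feature dimensionality, the linear scorer, and the training signal are all assumptions, not General Diffusion's actual architecture:

```python
import torch
import torch.nn as nn

class HardwareRouter(nn.Module):
    """Illustrative hardware-layer router, by direct analogy with
    mixture-of-experts token routing. The feature set, the linear scorer,
    and the training signal are all assumptions."""

    def __init__(self, feature_dim: int, num_processor_classes: int):
        super().__init__()
        self.scorer = nn.Linear(feature_dim, num_processor_classes)

    def forward(self, subgraph_features: torch.Tensor) -> torch.Tensor:
        # Distribution over processor classes (e.g. CPU, GPU, ASIC, FPGA).
        # At execution time the subgraph dispatches to the argmax class;
        # measured runtime then provides the learning signal.
        return torch.softmax(self.scorer(subgraph_features), dim=-1)

router = HardwareRouter(feature_dim=32, num_processor_classes=4)
probs = router(torch.randn(1, 32))  # routing distribution for one subgraph
```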

The Compute Intelligence Stack

Heterogeneous Compute Physics is the science. Compute Intelligence is the engineering discipline that makes it operational.


We are building five foundational AI models, a four-layer safety architecture, two open coordination protocols, and an autonomous agent layer. The five models, together with the runtime safety layer that wraps them, are:

  • HP1 — Hardware Profiler: Graph Neural Network that learns behavioral fingerprints from execution traces. Predicts performance across CPU, GPU, ASIC, and FPGA — including hardware never seen during training.
  • GP1 — Graph Partition Intelligence: Reinforcement learning agent with Graph Transformer architecture. Discovers optimal workload-to-substrate assignments for static, dynamic, and agentic workloads.
  • CM1 — Compute Model: Transformer-based world model predicting system state across three timescales: millisecond hardware events, second-scale workload migrations, minute-scale thermal dynamics.
  • PO1 — Policy Optimization: Model-based safe RL using CM1’s world model. Learns execution policies that optimize resource allocation subject to hard safety constraints.
  • CG1 — Compute Generation: Large language model pretrained on kernel code, fine-tuned with RLHF from hardware execution feedback. Every generated kernel undergoes formal semantic equivalence verification — mathematical proof, not testing.
  • RS1 — Runtime Safety: Circuit breakers nested within PO1. Formal constraint enforcement, continuous runtime validation, automatic escalation on violation.

Each model addresses a genuinely unsolved AI research problem. No single architecture can solve all five. The Compute Intelligence Stack is a compositional architecture where each model’s output feeds the next, and every deployment generates telemetry that improves all five simultaneously.
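
To make that composition concrete, the sketch below chains hypothetical interfaces for the five models. Every function name and signature is an assumption for illustration, mirroring the descriptions above:

```python
# Hypothetical interfaces showing how each model's output feeds the next.
# Every function name and signature below is an assumption for illustration;
# the bodies are deliberately left as stubs.

def hp1_profile(execution_traces): ...                # HP1: traces -> behavioral fingerprints
def gp1_partition(workload_graph, fingerprints): ...  # GP1: -> workload-to-substrate assignment
def cm1_predict(assignment, cluster_state): ...       # CM1: -> predicted system state
def po1_policy(predicted_state, constraints): ...     # PO1: -> safe execution policy
def cg1_generate(assignment): ...                     # CG1: -> formally verified kernels

def run(workload_graph, execution_traces, cluster_state, constraints):
    fingerprints = hp1_profile(execution_traces)
    assignment = gp1_partition(workload_graph, fingerprints)
    predicted = cm1_predict(assignment, cluster_state)
    policy = po1_policy(predicted, constraints)  # RS1 circuit breakers wrap this step
    kernels = cg1_generate(assignment)
    return policy, kernels  # execution telemetry then feeds back into all five models
```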

Open Protocols

Two open protocols define the coordination standard for heterogeneous compute:

  • ACP — Asymmetry Context Protocol (Open Source): Encodes hardware capabilities into a machine-readable format. Any chip vendor implements ACP to participate. Designed for extensibility to hardware architectures that do not yet exist — photonic, neuromorphic, quantum.
  • CEP — Computational Execution Protocol (Open Source): Governs how workloads cross substrate boundaries. Formally encodes execution plans in a verifiable representation. CEP guarantees that partitioning never changes what computation produces — only where it runs.

Together, ACP and CEP form the TCP/IP layer of Compute Intelligence.
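
Neither protocol's schema is published in this article. Purely as an illustration of what a machine-readable capability descriptor means, an ACP entry might look like the following; every field name and unit is an assumption:

```python
# Hypothetical ACP capability descriptor. ACP's real schema is not given in
# this article; every field name and unit below is an illustrative assumption.
acp_descriptor = {
    "acp_version": "0.1",
    "vendor": "example-silicon",
    "processor_class": "asic",       # cpu | gpu | asic | fpga | photonic | ...
    "compute": {"peak_tflops": {"fp16": 400.0, "int8": 800.0}},
    "memory": {"capacity_gb": 64, "bandwidth_gbps": 1600},
    "interconnect": {"links": 8, "link_gbps": 100},
    "extensions": {},                # room for architectures that do not yet exist
}
```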

Experimental Validation

The following findings were demonstrated on physical hardware, not in simulation or as theoretical projections. All experiments were conducted on a rack-scale system incorporating 16 AMD MI300X accelerator cards alongside FPGA, ASIC, and CPU processors.

  • Finding 1: World’s first rack-scale simultaneous orchestration of CPU, GPU, ASIC, and FPGA. ResNet-50 inference distributed across all four processor classes. Third-party confirmed.
  • Finding 2: State-of-the-art inference on non-NVIDIA hardware without code changes. Whisper V3 at 3,507 tok/s on AMD MI300X. Stable Diffusion scaling from 2.05s to 1.30s per image across 1–16 GPUs.
  • Finding 3: Cross-architecture kernel fusion deploying fused kernels across processor class boundaries. The foundation for CG1’s automated kernel generation.
  • Finding 4: Kilobyte-scale network traffic during sustained inference. Zero cloud dependency. Architecturally essential for sovereign compute markets.

These findings constitute the empirical evidence that hardware behavior is structured and learnable — the foundational claim of Heterogeneous Compute Physics.

The Path to Superintelligence

The next era of computing will not be defined by larger models or faster chips. It will be defined by how intelligence interacts with the physical structure of computation.

TCP/IP unlocked the Internet. CUDA unlocked deep learning. Heterogeneous Compute Physics unlocks everything that comes next.


Superintelligence will not emerge from bigger models alone. It will emerge when the compute those models run on is finally as intelligent as the models themselves.


The history of computing is the history of successive abstractions: assembly to high-level languages, bare metal to virtual machines, physical servers to containers. Each abstraction hid the hardware layer beneath it. Heterogeneous Compute Physics reverses this trajectory. Instead of hiding hardware behind ever-thicker abstractions, it makes hardware visible to intelligence — transforming the substrate from an implementation detail into a first-class object of scientific inquiry.


The next great abstraction is not another layer on top. It is intelligence that understands the layers beneath.


Computation across physically heterogeneous hardware obeys learnable laws. We are establishing those laws.