Distributed Systems Architect
San Francisco, CA (In-Person)
General Diffusion is a foundational AI research lab establishing the scientific discipline of Compute Intelligence. We build frontier models that learn the physics of heterogeneous hardware, decoupling intelligence from infrastructure.
About the role
As a Distributed Systems Architect, you will lead the design of RS1 (Resource Scheduler Agent), the "brain" of our OS that orchestrates workloads across fragmented clusters. You will solve NP-hard scheduling problems in real time, managing state across thousands of heterogeneous nodes with varying latency and bandwidth constraints.
What you might work on
- Designing the RS1 global scheduler to optimize for throughput/dollar across spot instances and reserved clusters.
- Building a fault-tolerant distributed state store that can survive node failures without stalling training runs.
- Implementing "live migration" for active inference contexts between different hardware types (e.g., H100 -> TPU v5).
- Optimizing the interconnect protocol to minimize serialization overhead in a multi-cloud environment.
- Creating simulation environments to stress-test the scheduler against network partitions and stragglers.
What we’re looking for
- Experience building large-scale distributed systems (Kubernetes internals, Paxos/Raft, distributed databases).
- Proficiency in Rust or Go for high-performance systems programming.
- Understanding of RDMA, InfiniBand, and modern datacenter networking.
- Experience with cluster schedulers (Slurm, Ray, Borg) is a strong plus.
- Ability to reason about consistency models and distributed consensus.
Our culture
- Compute Intelligence. We are establishing a new scientific discipline.
- Silicon Neutrality. We build foundational models that run on any chip.
- Deep Work. We value long periods of uninterrupted focus over endless meetings.
