Founding Engineer, Training Infrastructure
Engineering · San Francisco
About Principia
Principia is a research lab building foundation models for mathematical reasoning, algorithm design, and AI research: an Einstein in a box.
We're backed by Modern Capital, Pebblebed, and Neo, and by researchers and founders from OpenAI and FAIR. We've contributed core research to DeepSeek V3, FrontierMath, Kimi 1.5, and Tinker. Everyone here has real autonomy to shape what we build.
About the Role
We're looking for someone to own training infrastructure.
You'll own the distributed training framework, the cluster operations around it, and everything in between: the stack that turns research into trained models. You'll work with the pretraining team on model/system co-design. You'll debug runs that fail, and do the work to make sure they stop failing.
This is a founding role. You own training infrastructure for the company. If you want to be one of fifty people doing training infra at a frontier lab, this isn't that. If you want to set direction, write the code, and see your systems train the models Principia ships, it is.
We keep this role open. We'll hire when we meet the right person.
What You'll Do
- Design, build, and operate our distributed training stack
- Own the infrastructure around it: scheduling, data loading, storage, checkpointing, observability
- Work with the research team on model/system co-design: parallelism, communication, memory, data layout
- Debug training failures end-to-end
- Build tooling that makes researchers faster
- Grow and lead the training infrastructure team
What We're Looking For
Required:
- You've personally shepherded large-scale distributed training runs. You know what a bad loss curve looks like at hour 18, and where to look when a run silently stalls.
- Fluency with the modern training stack: PyTorch distributed, plus Megatron-LM or DeepSpeed.
- Systems fundamentals: GPU architecture, interconnects, memory hierarchies, parallelism patterns.
- You bridge systems and ML. You're not primarily a researcher, but you understand what researchers are doing and why it matters for system design.
- You own problems end-to-end and ship through ambiguity.
Strongly preferred (apply even if you meet only some):
- Contributions to open-source training infrastructure: Megatron-LM, TorchTitan, DeepSpeed, Slime, veRL, Ray, or similar.
- Published systems work at MLSys, OSDI, NSDI, ASPLOS, or similar.
- Experience with MoE training.
- Kernel-level experience: Triton, custom CUDA kernels, mixed-precision arithmetic.
- Prior experience building or leading a training infrastructure team.
This Role Is Not
- Pure cluster / Kubernetes platform engineering. We want someone who bridges systems and ML, not a pure cloud SRE.
- Algorithmic ML research. The pretraining team owns research. You make it possible at scale.
- Pure kernel work. We may hire for this separately.
If those are the jobs you want, we're probably not the right fit. If you want the work that sits between all three, we are.
Compensation and Logistics
- Location: San Francisco. In person.
- Base compensation: $600,000
- Equity: Significant. Founding-level.
- Total compensation is competitive with frontier lab offers; the equity makes it materially higher over time.
- Visa sponsorship: Yes. We've sponsored visas for team members before.
- Benefits: Health, dental, vision, 401(k), equipment, relocation support.
How to Apply
Email hiring@principialabs.org. Include:
- Your resume or LinkedIn
- Your GitHub. If you've contributed to open-source training infrastructure, link the PRs you're proud of.
- One paragraph on a hard training systems problem you debugged, and what you learned.
No cover letter. We read every email.