ARCHIVED
This job listing has been archived and is no longer accepting applications.

Senior Distributed Systems Engineer

IFM US

Sunnyvale, CA (permanent)

Posted: March 3, 2026

Quick Summary

This role centers on designing and operating ultra-scale GPU supercomputing systems that train next-generation foundation models, with a focus on performance, fault tolerance, and scalability. The ideal candidate has expertise in distributed systems, computer science, and AI, along with experience designing and optimizing systems for large-scale training workloads.

Job Description

About the Institute of Foundation Models
The Institute of Foundation Models (IFM) designs and operates ultra-scale GPU supercomputing systems to train next-generation foundation models. We believe performance, fault tolerance, and scalability are co-designed across model architecture, communication systems, runtime, and hardware topology.
This role sits at the core of that effort — driving communication performance, distributed reliability, and cross-layer optimization for large-scale training workloads.

The Mission
We are looking for a deeply technical engineer to co-design and optimize the communication stack for large-scale distributed training, including hybrid parallelism and Mixture-of-Experts (MoE) workloads.
This is not a network operations role. This is a systems-level engineering position focused on performance engineering, distributed debugging, and communication-runtime co-design.
· Design and optimize expert-parallel and hybrid-parallel communication patterns
· Drive high-performance hierarchical collectives for MoE workloads
· Co-design runtime orchestration with communication topology awareness
· Reduce tail latency and improve determinism across thousands of GPUs
· Architect fault-tolerant distributed execution under real-world cluster failures

Core Technical Scope
· Communication-compute overlap and topology-aware collective optimization
· Deep debugging of NCCL, RDMA, and custom communication layers
· Hybrid expert parallel strategies in modern large-scale MoE systems
· Elastic and resilient distributed job orchestration concepts
· Congestion analysis and routing optimization across InfiniBand/RoCE fabrics
· Microbenchmarking and performance modeling for communication-heavy workloads

Expected Technical Depth
· Hybrid expert parallel communication for Mixture-of-Experts training
· Scaling behavior under network pressure
· Distributed orchestration for elastic, large-scale training
· Fault detection and recovery in distributed GPU workloads
· Cross-layer bottlenecks: GPU ↔ NIC ↔ PCIe ↔ NVSwitch ↔ Fabric ↔ Scheduler

Required Background
· Experience optimizing distributed training at 1,000+ GPU scale (or equivalent depth)
· Hands-on expertise with RDMA, InfiniBand, RoCE, and GPUDirect RDMA
· Deep familiarity with NCCL and/or UCX internals
· Strong systems programming ability (C/C++, Rust, or Go)
· Strong familiarity with modern model training frameworks such as PyTorch
· Ability to troubleshoot and profile training performance issues related to communication bottlenecks
· Ability to translate research ideas into production-grade optimizations
· Experience debugging distributed hangs, desynchronization, and performance regressions

What We Mean by "Hardcore"
· You can explain why a communication pattern degrades at scale and how to fix it
· You have improved real cluster throughput via communication redesign
· You can trace a distributed hang across ranks and identify the root cause
· You are comfortable working at the boundary between hardware and runtime

Application Requirements
· Include a link to your GitHub (required)
· Provide links to relevant distributed systems, HPC, or large-scale training projects
· Include a list of publications and/or public technical reports (if applicable)
· Describe the hardest distributed debugging problem you solved
· Include measurable performance improvements you have delivered


Visa Sponsorship
This position is eligible for visa sponsorship.

Benefits Include
· Comprehensive medical, dental, and vision benefits
· Bonus
· 401(k) plan
· Generous paid time off, sick leave, and holidays
· Paid parental leave
· Employee Assistance Program
· Life insurance and disability
