ARCHIVED
This job listing has been archived and is no longer accepting applications.

Member of Technical Staff - Training Cluster Engineer

Black Forest Labs

Freiburg (Germany), San Francisco (USA) | Remote | Permanent

Posted: December 4, 2025


Quick Summary

We are looking for a skilled engineer to join our team in Freiburg, Germany or San Francisco. The ideal candidate will have expertise in training cluster engineering: building and operating the large-scale GPU infrastructure that trains our FLUX models, at the company whose founders pioneered Latent Diffusion and Stable Diffusion.

Job Description

What if the difference between a research breakthrough and a failed experiment is whether your GPUs are actually doing what you think they're doing?

Our founding team pioneered Latent Diffusion and Stable Diffusion - breakthroughs that made generative AI accessible to millions. Today, our FLUX models power creative tools, design workflows, and products across industries worldwide.

Our FLUX models are best-in-class not only for their capability, but for ease of use in developing production applications. We top public benchmarks and compete at the frontier - and in most instances we're winning.

If you're relentlessly curious and driven by high agency, we want to talk.

With a team of ~50, we move fast and punch above our weight. From our labs in Freiburg - a university town in the Black Forest - and San Francisco, we're building what comes next.

But here's the reality: those models only exist because someone kept thousands of GPUs running smoothly for weeks at a time. Training runs fail. Nodes go dark. Networks saturate. Your job is to make sure that doesn't stop us from pushing the frontier.

What You'll Pioneer

You'll build and maintain the computational infrastructure that makes frontier AI research possible. This isn't abstract systems work—every decision you make directly impacts whether a multi-week training run succeeds or fails, whether researchers iterate quickly or wait hours for resources, whether we can scale to the next generation of models or hit a wall.

You'll be the person who:

• Designs, deploys, and maintains large-scale ML training clusters running SLURM for distributed workload orchestration—the backbone of everything we train

• Implements comprehensive node health monitoring with automated failure detection and recovery workflows, because at scale, something is always breaking

• Partners with cloud and colocation providers to ensure cluster availability and performance—translating between their abstractions and our requirements

• Establishes and enforces security best practices across the entire ML infrastructure stack (network, storage, compute) without creating friction for researchers

• Builds developer-facing tools and APIs that streamline ML workflows and improve researcher productivity—because infrastructure that's hard to use doesn't get used

• Collaborates directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning decisions
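
The node-health bullet above can be sketched in a few dozen lines. This is an illustrative Python sketch, not Black Forest Labs' actual tooling: it assumes GPU state is sampled with `nvidia-smi --query-gpu=index,temperature.gpu,ecc.errors.uncorrected.volatile.total --format=csv,noheader,nounits`, and that unhealthy nodes are pulled out of scheduling with SLURM's `scontrol`. The thresholds and node name are made up for the example.

```python
# Illustrative thresholds -- real values depend on the hardware and datacenter.
MAX_TEMP_C = 85
MIN_VISIBLE_GPUS = 8  # a node with fewer visible GPUs has likely dropped one


def parse_gpu_report(csv_text):
    """Parse CSV rows of `index, temperature, uncorrected ECC errors`
    as produced by an nvidia-smi query (one row per GPU)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        idx, temp, ecc = (field.strip() for field in line.split(","))
        gpus.append({"index": int(idx), "temp": int(temp), "ecc_errors": int(ecc)})
    return gpus


def node_is_healthy(gpus):
    """A node fails the check if a GPU is missing, too hot,
    or reporting uncorrected ECC errors."""
    if len(gpus) < MIN_VISIBLE_GPUS:
        return False
    return all(g["temp"] <= MAX_TEMP_C and g["ecc_errors"] == 0 for g in gpus)


def drain_command(node, reason):
    """The scontrol invocation that takes an unhealthy node out of
    scheduling without killing jobs already running on it."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]
```

In production the CSV would come from running nvidia-smi (or, better, DCGM's richer health checks) on a timer, and the drain command would be executed rather than returned, with the reason string feeding an alerting pipeline.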

Questions We're Wrestling With

• How do you detect and recover from GPU failures in multi-week training runs without losing days of progress?

• What's the right balance between cluster utilization and researcher flexibility—and how do you enforce it without becoming a bottleneck?

• When a training run is using 1000+ GPUs, which failure modes matter and which can you safely ignore?

• How do you optimize NCCL and interconnect settings for models that don't fit established patterns?

• What does "high availability" actually mean for ML infrastructure, where some downtime is acceptable but data loss never is?

• How do you provide researchers with enough visibility to debug their jobs without overwhelming them with infrastructure complexity?

We're figuring these out in production, where the cost of being wrong is measured in GPU-hours.
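
One common answer to the first question, losing at most one checkpoint interval rather than days, is frequent atomic checkpointing with resume-from-latest. Below is a minimal, framework-free Python sketch of that pattern; a real run would checkpoint sharded model and optimizer state to parallel storage, not pickles on local disk, but the atomic-rename and resume logic is the same idea.

```python
import os
import pickle


def save_checkpoint(ckpt_dir, step, state):
    """Write a checkpoint atomically: temp file first, then rename.
    os.replace is atomic on POSIX, so a crash mid-write never leaves a
    torn checkpoint behind -- the previous one stays valid."""
    os.makedirs(ckpt_dir, exist_ok=True)
    tmp = os.path.join(ckpt_dir, f"step_{step}.pkl.tmp")
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, os.path.join(ckpt_dir, f"step_{step}.pkl"))


def latest_checkpoint(ckpt_dir):
    """Return (step, state) from the newest complete checkpoint, else (0, None)."""
    steps = []
    if os.path.isdir(ckpt_dir):
        steps = sorted(
            int(name[len("step_"):-len(".pkl")])
            for name in os.listdir(ckpt_dir)
            if name.startswith("step_") and name.endswith(".pkl")
        )
    if not steps:
        return 0, None
    with open(os.path.join(ckpt_dir, f"step_{steps[-1]}.pkl"), "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]


def train(ckpt_dir, total_steps, ckpt_every=100):
    """Resume from the last checkpoint and run to total_steps. A node
    failure and restart therefore re-does at most ckpt_every steps."""
    step, state = latest_checkpoint(ckpt_dir)
    state = state or {"loss": float("inf")}
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # stand-in for one real training step
        if step % ckpt_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step, state
```

The interesting engineering is choosing `ckpt_every`: checkpointing too often steals bandwidth from the data pipeline, too rarely and a failure costs hours of GPU time.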

Who Thrives Here

You've managed large-scale compute infrastructure and understand that ML training clusters are their own special kind of challenging. You've been paged at 2am because a training run failed. You've debugged why 512 GPUs are running fine but 1024 aren't. You know the difference between infrastructure that works in theory and infrastructure that works when researchers depend on it.

You likely have:

• Production experience managing SLURM clusters at scale—not just deploying them, but tuning job scheduling policies, resource allocation strategies, and federation setups

• Hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments where performance actually matters

• A proven track record managing GPU clusters, including the unglamorous work of driver management and DCGM monitoring

We'd be especially excited if you:

• Understand distributed training patterns, checkpointing strategies, and data pipeline optimization well enough to help researchers debug performance issues

• Have experience with Kubernetes for containerized workloads, particularly in inference or mixed compute environments

• Know your way around high-performance interconnects (InfiniBand, RoCE) and have tuned NCCL for multi-node training

• Have managed 1000+ GPU training runs and developed deep intuition for failure modes and recovery patterns

• Are familiar with high-performance storage solutions (VAST, blob storage) and understand their performance characteristics for ML workloads

• Have run hybrid training/inference infrastructure with appropriate resource isolation

• Bring strong scripting skills (Python, Bash) and infrastructure-as-code experience
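
To give the NCCL tuning mentioned above a concrete shape: NCCL is configured almost entirely through environment variables set before the job launches. The values below are illustrative assumptions for an InfiniBand fabric, not recommendations; the interface and HCA names are hypothetical, and the right settings depend on the actual topology and should be validated with nccl-tests.

```python
import os

# Illustrative NCCL settings for an InfiniBand fabric (assumptions, not advice).
nccl_env = {
    "NCCL_DEBUG": "WARN",          # bump to INFO when diagnosing hangs
    "NCCL_SOCKET_IFNAME": "eth0",  # hypothetical bootstrap interface name
    "NCCL_IB_HCA": "mlx5",         # restrict to Mellanox HCAs (assumption)
    "NCCL_NET_GDR_LEVEL": "PHB",   # allow GPUDirect RDMA up to the host bridge
}


def apply_nccl_env(env=nccl_env):
    """Export the settings into the environment (as a launcher wrapper
    would do before exec'ing the training process) and echo them back."""
    os.environ.update(env)
    return {k: os.environ[k] for k in env}
```

In practice these lines live in an sbatch prolog or launcher script rather than Python, but the point stands: the knobs are environment variables, so infrastructure code can own them centrally instead of every researcher guessing.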

What We're Building Toward

We're not just maintaining infrastructure—we're building the computational foundation that determines what research is possible. Every hour of cluster downtime prevented is a research experiment that happens faster. Every monitoring system improved is a failure caught before it costs days of training. If that sounds more compelling than keeping existing systems running, we should talk.

Base Annual Salary: $180,000–$300,000 USD

We're based in Europe and value depth over noise, collaboration over hero culture, and honest technical conversations over hype. Our models have been downloaded hundreds of millions of times, but we're still a ~50-person team learning what's possible at the edge of generative AI.


