ARCHIVED
This job listing has been archived and is no longer accepting applications.

Member of Technical Staff - Training Cluster Engineer

Black Forest Labs

Freiburg (Germany), San Francisco (USA) | Remote | Permanent

Posted: December 4, 2025


Quick Summary

We are looking for a skilled engineer to join our team in Freiburg, Germany or San Francisco. The ideal candidate will have expertise in training cluster engineering: building and operating the large-scale GPU infrastructure that trains our FLUX models, at the company whose founders pioneered Latent Diffusion and Stable Diffusion.

Job Description

What if the difference between a research breakthrough and a failed experiment is whether your GPUs are actually doing what you think they're doing?

Our founding team pioneered Latent Diffusion and Stable Diffusion - breakthroughs that made generative AI accessible to millions. Today, our FLUX models power creative tools, design workflows, and products across industries worldwide.

Our FLUX models are best-in-class not only for their capability, but for ease of use in developing production applications. We top public benchmarks and compete at the frontier - and in most instances we're winning.

If you're relentlessly curious and driven by high agency, we want to talk.

With a team of ~50, we move fast and punch above our weight. From our labs in Freiburg - a university town in the Black Forest - and San Francisco, we're building what comes next.

But here's the reality: those models only exist because someone kept thousands of GPUs running smoothly for weeks at a time. Training runs fail. Nodes go dark. Networks saturate. Your job is to make sure that doesn't stop us from pushing the frontier.

What You'll Pioneer

You'll build and maintain the computational infrastructure that makes frontier AI research possible. This isn't abstract systems work—every decision you make directly impacts whether a multi-week training run succeeds or fails, whether researchers iterate quickly or wait hours for resources, whether we can scale to the next generation of models or hit a wall.

You'll be the person who:

• Designs, deploys, and maintains large-scale ML training clusters running SLURM for distributed workload orchestration—the backbone of everything we train

• Implements comprehensive node health monitoring with automated failure detection and recovery workflows, because at scale, something is always breaking

• Partners with cloud and colocation providers to ensure cluster availability and performance—translating between their abstractions and our requirements

• Establishes and enforces security best practices across the entire ML infrastructure stack (network, storage, compute) without creating friction for researchers

• Builds developer-facing tools and APIs that streamline ML workflows and improve researcher productivity—because infrastructure that's hard to use doesn't get used

• Collaborates directly with ML research teams to translate computational requirements into infrastructure capabilities and capacity planning decisions
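
The node-health bullet above can be sketched in a few dozen lines. This is an illustrative Python sketch, not Black Forest Labs' actual tooling: it assumes GPU state is sampled with `nvidia-smi --query-gpu=index,temperature.gpu,ecc.errors.uncorrected.volatile.total --format=csv,noheader,nounits`, and that unhealthy nodes are pulled out of scheduling with SLURM's `scontrol`. The thresholds and node name are made up for the example.

```python
# Illustrative thresholds -- real values depend on the hardware and datacenter.
MAX_TEMP_C = 85
MIN_VISIBLE_GPUS = 8  # a node with fewer visible GPUs has likely dropped one


def parse_gpu_report(csv_text):
    """Parse CSV rows of `index, temperature, uncorrected ECC errors`
    as produced by an nvidia-smi query (one row per GPU)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        idx, temp, ecc = (field.strip() for field in line.split(","))
        gpus.append({"index": int(idx), "temp": int(temp), "ecc_errors": int(ecc)})
    return gpus


def node_is_healthy(gpus):
    """A node fails the check if a GPU is missing, too hot,
    or reporting uncorrected ECC errors."""
    if len(gpus) < MIN_VISIBLE_GPUS:
        return False
    return all(g["temp"] <= MAX_TEMP_C and g["ecc_errors"] == 0 for g in gpus)


def drain_command(node, reason):
    """The scontrol invocation that takes an unhealthy node out of
    scheduling without killing jobs already running on it."""
    return ["scontrol", "update", f"NodeName={node}",
            "State=DRAIN", f"Reason={reason}"]
```

In production the CSV would come from running nvidia-smi (or, better, DCGM's richer health checks) on a timer, and the drain command would be executed rather than returned, with the reason string feeding an alerting pipeline.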

Questions We're Wrestling With

• How do you detect and recover from GPU failures in multi-week training runs without losing days of progress?

• What's the right balance between cluster utilization and researcher flexibility—and how do you enforce it without becoming a bottleneck?

• When a training run is using 1000+ GPUs, which failure modes matter and which can you safely ignore?

• How do you optimize NCCL and interconnect settings for models that don't fit established patterns?

• What does "high availability" actually mean for ML infrastructure, where some downtime is acceptable but data loss never is?

• How do you provide researchers with enough visibility to debug their jobs without overwhelming them with infrastructure complexity?

We're figuring these out in production, where the cost of being wrong is measured in GPU-hours.
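
One common answer to the first question, losing at most one checkpoint interval rather than days, is frequent atomic checkpointing with resume-from-latest. Below is a minimal, framework-free Python sketch of that pattern; a real run would checkpoint sharded model and optimizer state to parallel storage, not pickles on local disk, but the atomic-rename and resume logic is the same idea.

```python
import os
import pickle


def save_checkpoint(ckpt_dir, step, state):
    """Write a checkpoint atomically: temp file first, then rename.
    os.replace is atomic on POSIX, so a crash mid-write never leaves a
    torn checkpoint behind -- the previous one stays valid."""
    os.makedirs(ckpt_dir, exist_ok=True)
    tmp = os.path.join(ckpt_dir, f"step_{step}.pkl.tmp")
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "state": state}, f)
    os.replace(tmp, os.path.join(ckpt_dir, f"step_{step}.pkl"))


def latest_checkpoint(ckpt_dir):
    """Return (step, state) from the newest complete checkpoint, else (0, None)."""
    steps = []
    if os.path.isdir(ckpt_dir):
        steps = sorted(
            int(name[len("step_"):-len(".pkl")])
            for name in os.listdir(ckpt_dir)
            if name.startswith("step_") and name.endswith(".pkl")
        )
    if not steps:
        return 0, None
    with open(os.path.join(ckpt_dir, f"step_{steps[-1]}.pkl"), "rb") as f:
        ckpt = pickle.load(f)
    return ckpt["step"], ckpt["state"]


def train(ckpt_dir, total_steps, ckpt_every=100):
    """Resume from the last checkpoint and run to total_steps. A node
    failure and restart therefore re-does at most ckpt_every steps."""
    step, state = latest_checkpoint(ckpt_dir)
    state = state or {"loss": float("inf")}
    while step < total_steps:
        step += 1
        state["loss"] = 1.0 / step  # stand-in for one real training step
        if step % ckpt_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step, state
```

The interesting engineering is choosing `ckpt_every`: checkpointing too often steals bandwidth from the data pipeline, too rarely and a failure costs hours of GPU time.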

Who Thrives Here

You've managed large-scale compute infrastructure and understand that ML training clusters are their own special kind of challenging. You've been paged at 2am because a training run failed. You've debugged why 512 GPUs are running fine but 1024 aren't. You know the difference between infrastructure that works in theory and infrastructure that works when researchers depend on it.

You likely have:

• Production experience managing SLURM clusters at scale—not just deploying them, but tuning job scheduling policies, resource allocation strategies, and federation setups

• Hands-on experience with Docker, Enroot/Pyxis, or similar container runtimes in HPC environments where performance actually matters

• A proven track record managing GPU clusters, including the unglamorous work of driver management and DCGM monitoring

We'd be especially excited if you:

• Understand distributed training patterns, checkpointing strategies, and data pipeline optimization well enough to help researchers debug performance issues

• Have experience with Kubernetes for containerized workloads, particularly in inference or mixed compute environments

• Know your way around high-performance interconnects (InfiniBand, RoCE) and have tuned NCCL for multi-node training

• Have managed 1000+ GPU training runs and developed deep intuition for failure modes and recovery patterns

• Are familiar with high-performance storage solutions (VAST, blob storage) and understand their performance characteristics for ML workloads

• Have run hybrid training/inference infrastructure with appropriate resource isolation

• Bring strong scripting skills (Python, Bash) and infrastructure-as-code experience
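
To give the NCCL tuning mentioned above a concrete shape: NCCL is configured almost entirely through environment variables set before the job launches. The values below are illustrative assumptions for an InfiniBand fabric, not recommendations; the interface and HCA names are hypothetical, and the right settings depend on the actual topology and should be validated with nccl-tests.

```python
import os

# Illustrative NCCL settings for an InfiniBand fabric (assumptions, not advice).
nccl_env = {
    "NCCL_DEBUG": "WARN",          # bump to INFO when diagnosing hangs
    "NCCL_SOCKET_IFNAME": "eth0",  # hypothetical bootstrap interface name
    "NCCL_IB_HCA": "mlx5",         # restrict to Mellanox HCAs (assumption)
    "NCCL_NET_GDR_LEVEL": "PHB",   # allow GPUDirect RDMA up to the host bridge
}


def apply_nccl_env(env=nccl_env):
    """Export the settings into the environment (as a launcher wrapper
    would do before exec'ing the training process) and echo them back."""
    os.environ.update(env)
    return {k: os.environ[k] for k in env}
```

In practice these lines live in an sbatch prolog or launcher script rather than Python, but the point stands: the knobs are environment variables, so infrastructure code can own them centrally instead of every researcher guessing.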

What We're Building Toward

We're not just maintaining infrastructure—we're building the computational foundation that determines what research is possible. Every hour of cluster downtime prevented is a research experiment that happens faster. Every monitoring system improved is a failure caught before it costs days of training. If that sounds more compelling than keeping existing systems running, we should talk.

Base Annual Salary: $180,000–$300,000 USD

We're based in Europe and value depth over noise, collaboration over hero culture, and honest technical conversations over hype. Our models have been downloaded hundreds of millions of times, but we're still a ~50-person team learning what's possible at the edge of generative AI.


