This job listing has been archived and is no longer accepting applications.

Member of Technical Staff - Large Model Data

Black Forest Labs

Freiburg (Germany), San Francisco (USA) · Remote · Permanent

Posted: November 7, 2025

Quick Summary

Pioneer the creation of high-quality, frontier data systems that drive breakthrough models.

Job Description

What if the bottleneck to better generative models isn't architecture or compute, but the quality and scale of the data we train on?

We're the ~50-person team behind Stable Diffusion, Stable Video Diffusion, and FLUX.1—models with 400M+ downloads. But here's what we've learned: breakthrough models require breakthrough datasets. Not just big datasets—carefully curated, properly processed, deeply understood datasets that push models toward capabilities they couldn't achieve otherwise. That's the infrastructure you'll build.

What You'll Pioneer

You'll create the data systems that make frontier research possible. This isn't traditional data engineering—it's building infrastructure at a scale where billion-image datasets are normal, where video processing pipelines need to run across thousands of GPUs, and where understanding what's in your data is as important as collecting it.

You'll be the person who:

• Develops and maintains scalable infrastructure for acquiring massive-scale image and video datasets—the kind where "large" means billions of assets, not millions

• Manages and coordinates data transfers from licensing partners, turning heterogeneous sources into training-ready pipelines

• Implements and deploys state-of-the-art ML models for data cleaning, processing, and preparation—because at our scale, manual curation isn't an option

• Builds scalable tools to visualize, cluster, and deeply understand what's actually in our datasets (because you can't fix what you can't see)

• Optimizes and parallelizes data processing workflows to handle billion-scale datasets efficiently across both CPUs and GPUs

• Ensures data quality, diversity, and proper annotation—including captioning systems that make training datasets actually useful

• Transforms user preference data and alternative sources into formats that models can learn from

• Works directly in the model development loop, updating datasets as training trajectories reveal what we're missing

Questions We're Wrestling With

• How do you deduplicate billions of images without accidentally removing the edge cases that make models interesting?

• What does "data quality" actually mean when you're training generative models—and how do you measure it at scale?

• How do you caption video data in ways that capture temporal dynamics, not just individual frames?

• Where are the hidden biases in our datasets, and how do we surface them before they become model biases?

• When does adding more data help, and when does it just add noise?

• How do we build data pipelines that adapt as model requirements change mid-training?

These questions don't have textbook answers—we're figuring them out as we go.

Who Thrives Here

You understand that data engineering at research scale is fundamentally different from traditional data engineering. You've built pipelines that broke, debugged them at scale, and emerged with opinions about what works. You know the difference between data that looks good and data that actually trains well.

You likely have:

• Strong proficiency in Python and experience with various file systems for data-intensive manipulation and analysis

• Hands-on familiarity with cloud platforms (AWS, GCP, or Azure) and Slurm/HPC environments for distributed data processing

• Experience with image and video processing libraries (OpenCV, FFmpeg, etc.) and an understanding of their performance characteristics

• Demonstrated ability to optimize and parallelize data workflows across both CPUs and GPUs—because at our scale, inefficient code is unusable code

• Familiarity with data annotation and captioning processes for ML training datasets

• Knowledge of machine learning techniques for data cleaning and preprocessing (because heuristics only get you so far)

We'd be especially excited if you:

• Have built or contributed to large-scale data acquisition systems and understand the operational challenges

• Bring experience with NLP techniques for image/video captioning

• Have implemented data deduplication at billion-record scale and understand the tradeoffs

• Know your way around big data frameworks like Apache Spark or Hadoop

• Have been part of shipping a state-of-the-art model and understand how data decisions impact training outcomes

• Think deeply about ethical considerations in data collection and usage

What We're Building Toward

We're not just processing data—we're building the foundation that determines what our models can learn. Every pipeline optimization makes training faster. Every data quality improvement makes models better. Every new data source opens new possibilities. If that sounds more compelling than maintaining existing systems, we should talk.

Base Annual Salary: $180,000–$300,000 USD

We're based in Europe and value depth over noise, collaboration over hero culture, and honest technical conversations over hype. Our models have been downloaded hundreds of millions of times, but we're still a ~50-person team learning what's possible at the edge of generative AI.