ARCHIVED
This job listing has been archived and is no longer accepting applications.

Member of Technical Staff - Multimodal VLM/LLM

Black Forest Labs

Freiburg, Germany · Remote · Permanent

Posted: February 16, 2026


Quick Summary

We're looking for a skilled engineer to join our team in Freiburg, Germany, to work on developing production applications for our FLUX models.

Job Description

What if the future of generative AI isn't just better images or better text, but models that understand both—and use that understanding to create in ways neither modality could alone?

Our founding team pioneered Latent Diffusion and Stable Diffusion - breakthroughs that made generative AI accessible to millions. Today, our FLUX models power creative tools, design workflows, and products across industries worldwide.

Our FLUX models are best-in-class not only in capability but in ease of use when building production applications. We top public benchmarks and compete at the frontier - and in most instances we're winning.

If you're relentlessly curious and driven by high agency, we want to talk.

With a team of ~50, we move fast and punch above our weight. From our labs in Freiburg - a university town in the Black Forest - and San Francisco, we're building what comes next.

But here's the frontier we're exploring: vision-language models that don't just caption images or generate from prompts, but truly understand the relationship between visual and linguistic information. Models that can enhance prompts intelligently, moderate content contextually, and unlock generative capabilities we haven't imagined yet. That's the research you'll lead.

What You'll Pioneer

You'll run cutting-edge projects in multimodal vision-language and large language models, integrating them into our media generation pipeline in ways that push beyond what either modality could achieve alone. This isn't about implementing existing VLMs—it's about developing novel approaches that make FLUX more powerful, more controllable, and more aligned with what creators actually need.

You'll be the person who:

• Leads the development and training of state-of-the-art multimodal vision-language models within the FLUX technology stack—not just applying existing architectures, but innovating on them

• Designs and implements specialized fine-tuning strategies for VLMs to address specific use cases and performance requirements that general-purpose models can't handle

• Develops and optimizes LLM implementations for prompt enhancement, content moderation, and novel applications that improve how people interact with generative models

• Drives innovation by integrating VLM/LLM capabilities into our media generation pipeline in creative ways that enhance generative capabilities

• Conducts research to creatively combine vision and language models—exploring questions about how these modalities can inform and improve each other

• Maintains cutting-edge knowledge of the latest developments in multimodal AI and LLM research, evaluating emerging models and architectures for potential integration

• Collaborates with cross-functional teams to implement and deploy models at scale, contributing to architectural decisions and technical roadmap planning

• Documents and shares research findings with the broader team, translating breakthroughs into practical improvements
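To make "specialized fine-tuning strategies" concrete: one common VLM pattern is to freeze the pretrained vision encoder and language model and train only a lightweight projector between them. The sketch below uses toy stand-in modules and sizes; none of the names or dimensions reflect the actual FLUX stack.

```python
import torch
import torch.nn as nn

# Toy stand-ins for real components; names and sizes are illustrative only,
# not the actual FLUX architecture.
vision_encoder = nn.Sequential(nn.Linear(224, 512), nn.GELU())  # pretend image-feature encoder
projector = nn.Linear(512, 768)                                 # maps vision features into LM space
language_model = nn.Sequential(nn.Linear(768, 768), nn.GELU(), nn.Linear(768, 100))

# Freeze the pretrained pieces; train only the lightweight projector.
for module in (vision_encoder, language_model):
    for p in module.parameters():
        p.requires_grad = False

optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

images = torch.randn(8, 224)           # fake batch of image features
targets = torch.randint(0, 100, (8,))  # fake next-token targets

for step in range(3):
    with torch.no_grad():
        feats = vision_encoder(images)
    logits = language_model(projector(feats))
    loss = nn.functional.cross_entropy(logits, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

trainable = sum(p.numel() for p in projector.parameters() if p.requires_grad)
frozen = sum(p.numel()
             for m in (vision_encoder, language_model)
             for p in m.parameters())
print(f"trainable params: {trainable}, frozen params: {frozen}")
```

The point of the pattern is the parameter-count asymmetry: only the small projector receives gradient updates, which keeps fine-tuning cheap while adapting the pretrained pieces to a new task.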

Questions We're Wrestling With

• How can vision-language models improve prompt understanding in ways that make generation more controllable and aligned with user intent?

• What's the right architecture for integrating VLMs into diffusion model workflows without creating computational bottlenecks?

• How do you fine-tune vision-language models for specialized creative tasks that weren't in the training data?

• Where can LLMs enhance the generative pipeline—prompt rewriting, content moderation, parameter suggestion—and where would they add more friction than value?

• What novel capabilities emerge when you deeply integrate vision and language understanding into generative workflows?

• How do you evaluate whether multimodal models are actually improving generation quality versus just adding complexity?

These aren't solved problems—they're research directions we're actively exploring.

Who Thrives Here

You've trained and fine-tuned large-scale vision-language models and understand the nuances of multimodal learning. You have strong intuitions about what makes VLMs work well, backed by either publications or practical projects that pushed the field forward. You're comfortable operating at the intersection of research and production, where models need to be both innovative and deployable.

You likely have:

• Demonstrated expertise in training and fine-tuning large-scale vision-language models—not just using pre-trained ones, but developing them

• Strong publication record or practical experience with relevant projects in multimodal AI research that shows you can push the frontier

• Proficiency in PyTorch or similar deep learning frameworks with deep understanding of their capabilities and limitations

• Experience with distributed training systems and large-scale model optimization—because VLMs don't fit on one GPU

• Track record of implementing and scaling AI models in production environments where research meets real-world constraints
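On "large-scale model optimization": two standard memory-saving techniques when models outgrow a single GPU are gradient accumulation (simulate a large batch with several micro-batches) and activation checkpointing (recompute activations during backward instead of storing them). A minimal, toy-sized PyTorch sketch; real training at this scale would also shard the model itself via FSDP or tensor parallelism:

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy model; sizes are placeholders, not a real VLM.
model = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

accum_steps = 4
micro_batches = [torch.randn(2, 64) for _ in range(accum_steps)]

optimizer.zero_grad()
for x in micro_batches:
    # Checkpointing drops intermediate activations and recomputes them on backward.
    y = checkpoint(model, x, use_reentrant=False)
    # Scale each micro-loss so accumulated gradients average over the full batch.
    loss = (y - x).pow(2).mean() / accum_steps
    loss.backward()
optimizer.step()
print("accumulated step done, last micro-loss:", loss.item())
```

Gradients from all micro-batches sum before the single `optimizer.step()`, so the update matches one large-batch step at a fraction of the peak memory.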

We'd be especially excited if you:

• Have experience with diffusion models and generative AI architectures alongside autoregressive modeling—understanding how different paradigms can complement each other

• Bring a background in computer vision that informs your approach to multimodal models

• Contribute to open-source AI projects and understand the community

• Have worked in fast-paced startup environments where iteration speed matters

• Bring strong software engineering practices and system design skills

• Have experience with open-source VLM inference frameworks like vLLM

What We're Building Toward

We're not just adding VLMs to our stack—we're exploring fundamental questions about how vision and language understanding can make generative models more powerful and more aligned with human intent. Every model you train teaches us something about multimodal learning. Every integration reveals new capabilities. Every research finding shapes where the field goes next. If that sounds more compelling than applying existing techniques, we should talk.

We're based in Europe and value depth over noise, collaboration over hero culture, and honest technical conversations over hype. Our models have been downloaded hundreds of millions of times, but we're still a ~50-person team learning what's possible at the edge of generative AI.
