ARCHIVED
This job listing has been archived and is no longer accepting applications.

AI Research Engineer (all genders)

ellamind

Location not specified · Remote · Permanent

Posted: December 8, 2025


Job Description

At ellamind, we build evaluation-first AI infrastructure. Our platform elluminate turns AI evaluation from ad-hoc “vibe checks” into rigorous, repeatable engineering to enable teams to test, measure, and improve LLM applications with confidence.

What you'll do

Advance LLM evaluation research: Design, implement, and validate new benchmarks, metrics, and workflows that measure correctness, robustness, safety, and reliability across languages and modalities.

Build LLM-as-a-judge setups and reward models: Develop rubric-based graders, preference-data pipelines, and reward models, and run DPO/RLHF/RLAIF/RLVF training.
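The rubric-based grading mentioned above can be sketched in a few lines. This is an illustrative outline only, not ellamind's implementation: the rubric criteria, weights, and the `call_judge` stub (which stands in for a real judge-model call) are all assumptions made for the example.

```python
# Minimal rubric-based LLM-as-a-judge sketch. `call_judge` is a stub standing
# in for a real judge-model call; criteria and weights are illustrative.

RUBRIC = {
    "correctness": 0.5,   # does the answer match the reference?
    "groundedness": 0.3,  # is every claim supported by the context?
    "style": 0.2,         # is the answer concise and well-formatted?
}

def call_judge(criterion: str, question: str, answer: str) -> int:
    """Stub for a judge-model call returning a 1-5 score for one criterion.

    A real implementation would prompt a judge LLM with the rubric text
    for this criterion and parse its structured output.
    """
    return 4  # placeholder score

def grade(question: str, answer: str) -> float:
    """Weighted average of per-criterion judge scores, normalized to [0, 1]."""
    total = sum(
        weight * call_judge(criterion, question, answer)
        for criterion, weight in RUBRIC.items()
    )
    return total / 5  # judge scores are on a 1-5 scale

score = grade("What is the capital of France?", "Paris.")
```

In practice the per-criterion scores would feed preference-data pipelines or serve as verifiable rewards during training.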

Generate and curate synthetic data: Create high-quality synthetic datasets for pre-training, post-training and evaluation of LLMs with filtering, deduplication and decontamination to reliably improve model capabilities.
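The filtering, deduplication, and decontamination steps above can be illustrated with a toy stdlib-only pipeline. Real pipelines typically use MinHash/LSH for near-duplicate detection; this sketch shows only the basic shape of exact dedup plus n-gram decontamination against an eval set, with all names chosen for the example.

```python
# Toy cleaning pass: drop exact duplicates and any document sharing an
# n-gram with the evaluation set (decontamination). Stdlib only.

import hashlib

def ngrams(text: str, n: int = 8) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def clean(corpus: list, eval_set: list, n: int = 8) -> list:
    contaminated = set()          # n-grams that occur in the eval set
    for example in eval_set:
        contaminated |= ngrams(example, n)
    seen, kept = set(), []
    for doc in corpus:
        h = hashlib.sha256(doc.encode()).hexdigest()
        if h in seen:
            continue              # exact duplicate: drop
        if ngrams(doc, n) & contaminated:
            continue              # overlaps the eval set: drop
        seen.add(h)
        kept.append(doc)
    return kept
```

Production systems add fuzzy matching, language identification, and quality filters on top, but the dedup-then-decontaminate ordering stays the same.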

Train and adapt open models: Pre-train and fine-tune open-source LLMs. Use LLM training frameworks to run rigorous ablations.

Scale experiments on GPU clusters: Orchestrate large-scale training, inference, and evaluation jobs. Optimize efficiency and ensure reproducibility end-to-end. We are working with thousands of GPUs.

Multilingual data and evaluation: Extend training datasets and eval pipelines to European languages.

Open science & collaboration: Release datasets/tools, publish technical reports, blog posts, and papers, and collaborate with partners (e.g., OpenEuroLLM) to push evaluation standards forward.

Productize research: Turn prototypes into elluminate features—automated eval suites, graders, and data pipelines. Work with platform engineers and product to ship reliable workflows.

You’ll mostly work with a Python-based LLM research stack (Huggingface ecosystem, PyTorch, Megatron-LM/torchtitan, vLLM/SGLang, lm-eval-harness/LightEval, dataframe libraries, SLURM, Ray).

What we're looking for

Must-haves

Strong Python engineering skills: Experience building LLM-centric systems with clean, maintainable code, comprehensive testing, and performance optimization at scale.

LLM operations expertise: You’re comfortable with tokenizers/vocabs, data specs (e.g., Parquet), sampling/decoding configs, and evaluation.
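As a taste of the sampling/decoding configs this bullet refers to, here is a stdlib-only sketch of temperature scaling plus top-p (nucleus) filtering over a toy logit vector. The function and its defaults are illustrative, not tied to any particular inference server.

```python
# Temperature + top-p (nucleus) sampling over a toy token->logit mapping.
# Stdlib only; illustrative defaults.

import math
import random

def sample(logits: dict, temperature: float = 0.8,
           top_p: float = 0.9, seed: int = 0) -> str:
    rng = random.Random(seed)
    # Temperature-scaled softmax (subtract the max for stability).
    scaled = {t: l / temperature for t, l in logits.items()}
    m = max(scaled.values())
    exps = {t: math.exp(v - m) for t, v in scaled.items()}
    z = sum(exps.values())
    probs = sorted(((e / z, t) for t, e in exps.items()), reverse=True)
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for p, t in probs:
        kept.append((p, t))
        cum += p
        if cum >= top_p:
            break
    # Sample from the renormalized nucleus.
    r = rng.random() * sum(p for p, _ in kept)
    for p, t in kept:
        r -= p
        if r <= 0:
            return t
    return kept[-1][1]
```

With a sharply peaked distribution and a small `top_p`, the nucleus collapses to the single most likely token, which is why greedy-looking behavior can emerge from nominally stochastic configs.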

Distributed training & inference literacy: Solid grasp of multi-GPU/multi-node fundamentals (e.g., FSDP/DeepSpeed), scheduling, and monitoring—plus practical debugging of throughput/memory issues.

Experiment design & statistics: You plan ablations, track experiments, and use sound statistical methods (significance testing, uncertainty estimates) to draw reliable conclusions.
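The uncertainty estimates this bullet asks for often come down to something like a percentile bootstrap over per-example scores. A minimal stdlib-only sketch, with illustrative data (72 correct out of 100):

```python
# Percentile-bootstrap confidence interval for a benchmark accuracy,
# computed from per-example 0/1 scores. Stdlib only; data is illustrative.

import random

def bootstrap_ci(scores: list, iters: int = 2000, alpha: float = 0.05,
                 seed: int = 0):
    """Return a (1 - alpha) confidence interval for the mean score."""
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choices(scores, k=n)) / n for _ in range(iters)
    )
    lo = means[int((alpha / 2) * iters)]
    hi = means[int((1 - alpha / 2) * iters) - 1]
    return lo, hi

# e.g. 72 correct out of 100 evaluated examples
scores = [1] * 72 + [0] * 28
lo, hi = bootstrap_ci(scores)
```

Reporting the interval rather than the point estimate is what makes "model A beats model B by 1 point" a checkable claim instead of noise.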

Data hygiene mindset: You care about dataset quality—deduplication, contamination checks, multilingual coverage, and traceable versioning.

Linux comfort: You’re productive on Linux servers—shell workflows, virtual environments, containers, GPU tooling, logs/metrics, and remote development/debugging.

On-site collaboration: 3 days/week in Berlin or Bremen. Travel to our Bremen HQ during onboarding.

Fluency in English: At least B2 level for team collaboration and technical discussions.

Valid EU work authorization.

Nice-to-haves

Experience with LLM evaluation frameworks (lm-eval-harness, LightEval) or a track record of rigorous custom benchmarks and metrics.

Background in preference learning and reward modeling (DPO/RLHF/RLAIF), including rubric design and high-quality preference data pipelines.

Multilingual expertise: building or evaluating models across European languages; data collection, alignment, and cross-lingual transfer.

Comfort with high-throughput inference systems (vLLM, SGLang), latency/memory optimization, and model quantization.

Experience with systems and orchestration (Slurm/Ray/Kubernetes) and containers (Docker/Apptainer) – including GPU observability, scheduling, and performance tuning.

Familiarity with MLOps and reproducibility: experiment tracking (e.g., W&B), dataset/model/prompt versioning, CI for research workflows, and dependable artifact management.

Experience building open-source tools or publishing research artifacts (datasets, models, papers) or strong technical writing.

Experience working directly with partners or customers to validate results and translate research into product impact.

Advanced degree in Computer Science, Machine Learning, Data Science, or a related field (PhD preferred, or equivalent achievements).

What matters most

We prioritize demonstrated excellence in your projects and career. If you’re motivated to build and optimize AI solutions, we want to hear from you—even if you don’t meet every single criterion.

Diversity & inclusion

Different perspectives make us stronger. We welcome applicants from all backgrounds and encourage you to apply.

Why us?

Shape the future of AI research: Influence our research agenda and Europe’s LLM ecosystem—help set evaluation standards and training practices that serious AI teams and institutions rely on.

Technical excellence meets cutting-edge research: Push the frontier of LLM training and evaluation—design multilingual benchmarks, build LLM-as-a-judge and reward models, generate high-quality synthetic data, and run rigorous ablations at scale on large GPU clusters.

Career-defining opportunity: Systematic evaluation is becoming as fundamental to AI as version control is to software. Work at the center of this shift and contribute methods, datasets, and tools that others adopt and build upon.

Ownership and impact: Lead research end-to-end—formulate hypotheses, build datasets and benchmarks, run large-scale experiments, and publish results (papers, technical reports, OSS). Collaborate with top-tier partner labs and see your work shape model behavior and evaluation practices across the industry.

Compute that matches your ambition: Access serious GPU resources.

Open science by default: Freedom to release datasets, models, and tools; backing for conference submissions and travel.

Competitive package with upside: In addition to a competitive salary, we offer a VSOP (Virtual Stock Option Program) to give you a real stake in the company’s success as we grow.

Best-in-class development experience: Fast and streamlined access to all AI technologies that make your life (and development work) easier, plus the latest tools and platforms to maximize your productivity.

Work environment: Our Bremen office features stunning waterfront views, complimentary beverages, smoothies, and a boat. We’re opening our Berlin office at the end of 2025, giving you flexibility as we expand.

Grow with transformative technology: Build deep expertise in LLM evaluation and infrastructure, contribute to open standards, and advance the state of the art alongside a team that values rigor and impact.

About us

We are a cash-flow-positive, Germany-based AI startup building elluminate, the enterprise platform that turns AI evaluation from ad-hoc experiments into rigorous, repeatable workflows. Teams use elluminate to design test suites, benchmark models, track regressions, and ship reliable AI with clear, measurable quality gates. We pair elluminate with custom large-language-model solutions and full on-prem deployment options. Our products have already earned the trust of renowned clients such as Deutsche Telekom, the German Federal Government, and leading health insurers like hkk.

Rooted in Bremen and collaborating with leading organizations, our team has a track record in advanced model and dataset development. We like owning problems end-to-end and shipping pragmatically. We contribute to the open-source community through initiatives like OpenEuroLLM and regularly publish models and tools to accelerate the broader ecosystem.

Compensation Range: €70,000.00 - €110,000.00
