Senior ML Infrastructure Engineer
Ellison Institute of Technology
Posted: December 12, 2025
Quick Summary
Join our team of experts in translating scientific discovery into real-world impact. As a Senior ML Infrastructure Engineer, you'll work on projects that require a deep understanding of ML and infrastructure, with a focus on health, medical science, and sustainable agriculture.
Job Description
At the Ellison Institute of Technology (EIT), we’re on a mission to translate scientific discovery into real-world impact. We bring together visionary scientists, technologists, policy makers, and entrepreneurs to tackle humanity’s greatest challenges in four transformative areas:
• Health, Medical Science & Generative Biology
• Food Security & Sustainable Agriculture
• Climate Change & Managing CO₂
• Artificial Intelligence & Robotics
This is ambitious work - work that demands curiosity, courage, and a relentless drive to make a difference. At EIT, you’ll join a community built on excellence, innovation, tenacity, trust, and collaboration, where bold ideas become real-world breakthroughs. Together, we push boundaries, embrace complexity, and create solutions to scale ideas from lab to society. Explore more at www.eit.org
Requirements:
Our MLOps team
Join our MLOps team to build the cloud and compute foundation that enables scientific breakthroughs. Deliver reliable, secure platforms and self-service guardrails that accelerate experimentation and turn ideas into results—faster, at scale, and with confidence.
Day-to-day, you might:
• Build, operate, and continuously optimise our high-performance GPU training and inference clusters, focusing on robust, high-availability scheduling, isolation, and automated lifecycle management.
• Drive systems design and implementation for high-throughput data paths, optimising I/O, caching, and data locality across compute and storage (including our current Lustre implementation).
• Proactively benchmark, profile, and resolve performance bottlenecks across the compute, network, and orchestration layers to maximise efficiency for distributed training and inference.
• Establish comprehensive observability, resilience, and automated security controls to ensure compliance and robust operation of sensitive research environments.
• Partner with Research, Data, and Applied teams to forecast capacity and cost for GPU and storage needs, setting quotas and streamlining ML experimentation pipelines.
What makes you a great fit:
• Proven experience leading the design, build, and operation of high-performance ML compute clusters at scale
• A proactive, autonomous approach to systems design and the proven ability and desire to ideate, co-create and implement optimal solutions
• Exposure to migrating or transforming ML infrastructure from traditional schedulers to modern, containerised systems
• Expertise with high-throughput storage systems for ML/HPC workloads
• Expert-level understanding of GPU architecture, high-speed networking for distributed training, and performance profiling to resolve bottlenecks
• A solid grasp of IaC and CI/CD practices (e.g., Terraform, Argo CD)
Benefits:
We offer the following salary and benefits:
• Enhanced holiday pay
• Pension
• Life Assurance
• Income Protection
• Private Medical Insurance
• Hospital Cash Plan
• Therapy Services
• Perk Box
• Electric Car Scheme
Why work for EIT:
At the Ellison Institute, we believe a collaborative, inclusive team is key to our success. We are building a supportive environment where creative risks are encouraged, and everyone feels heard. Valuing emotional intelligence, empathy, respect, and resilience, we encourage people to be curious and to have a shared commitment to excellence. Join us and make an impact!