AI Researcher — Inference Optimization
Featherless AI
Posted: January 23, 2026
Quick Summary
Design, evaluate, and deploy high-performance inference systems for large-scale machine learning models, focusing on latency, throughput, and cost efficiency across real-world production environments.
Job Description
Role Overview
We are seeking an AI Researcher with deep experience in inference optimization to design, evaluate, and deploy high-performance inference systems for large-scale machine learning models. You will work at the intersection of model architecture, systems engineering, and hardware-aware optimization, improving latency, throughput, and cost efficiency across real-world production environments.
Key Responsibilities
• Research and develop techniques to optimize inference performance for large neural networks.
• Improve latency, throughput, memory efficiency, and cost per inference.
• Design and evaluate model-level optimizations (quantization, pruning, KV-cache optimization, architecture-aware simplifications); a brief quantization sketch follows this list.
• Implement systems-level optimizations (dynamic batching, kernel fusion, multi-GPU inference, prefill vs. decode optimization).
• Benchmark inference workloads across hardware accelerators.
• Collaborate with engineering teams to deploy optimized inference pipelines.
• Translate research insights into production-ready improvements.
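For illustration, a minimal sketch of one such model-level optimization, post-training dynamic quantization in PyTorch. The toy model and layer sizes below are placeholders for illustration, not a production recipe:

import torch
import torch.nn as nn

# Toy model standing in for a real network; sizes are illustrative.
model = nn.Sequential(
    nn.Linear(4096, 4096),
    nn.ReLU(),
    nn.Linear(4096, 4096),
).eval()

# Quantize Linear weights to int8; activations are quantized on the fly.
# This mainly reduces memory footprint and CPU inference latency.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.inference_mode():
    out = quantized(torch.randn(1, 4096))
print(out.shape)  # torch.Size([1, 4096])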
Required Qualifications
• Strong background in machine learning, deep learning, or AI systems.
• Hands-on experience optimizing inference for large-scale models.
• Proficiency in Python and modern ML frameworks (e.g., PyTorch).
• Experience with inference tooling (e.g., Triton, TensorRT, vLLM, ONNX Runtime); a minimal vLLM example follows this list.
• Ability to design experiments and communicate results clearly.
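To give a flavor of the tooling involved, here is a minimal offline batched-inference sketch with vLLM; the model name is a placeholder and the exact API may vary across vLLM versions:

from vllm import LLM, SamplingParams

prompts = [
    "Explain KV-cache paging in one sentence.",
    "Why does batching improve GPU throughput?",
]
sampling = SamplingParams(temperature=0.0, max_tokens=64)

llm = LLM(model="facebook/opt-125m")  # placeholder; any supported model works
for output in llm.generate(prompts, sampling):
    print(output.outputs[0].text)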
Preferred / Nice-to-Have Qualifications
• Experience deploying production inference systems at scale.
• Familiarity with distributed and multi-GPU inference.
• Experience contributing to open-source ML or inference frameworks.
• Authorship or co-authorship of peer-reviewed research papers in machine learning, systems, or related fields.
• Experience working close to hardware (CUDA, ROCm, profiling tools).
What Success Looks Like
• Measurable gains in latency, throughput, and cost efficiency.
• Optimized inference systems running reliably in production.
• Research ideas successfully translated into deployable systems.
• Clear benchmarks and documentation that inform product decisions; a benchmarking sketch follows this list.
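A minimal sketch of the kind of measurement behind these criteria. run_inference is a hypothetical stand-in for a real model call; GPU timings would additionally need warmup runs and device synchronization (e.g., torch.cuda.synchronize):

import statistics
import time

def run_inference(batch):
    time.sleep(0.005)  # placeholder for a real forward pass
    return [len(x) for x in batch]

batch = ["hello"] * 32
latencies = []
for _ in range(100):
    start = time.perf_counter()
    run_inference(batch)
    latencies.append(time.perf_counter() - start)

p50 = statistics.median(latencies)
p99 = statistics.quantiles(latencies, n=100)[98]
print(f"p50={p50 * 1e3:.2f} ms  p99={p99 * 1e3:.2f} ms  "
      f"throughput~{len(batch) / p50:.0f} items/s")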
Relevant Research Areas (Bonus)
• Long-context inference optimization
• Speculative decoding (a minimal sketch follows this list)
• KV-cache compression and paging
• Efficient decoding strategies
• Hardware-aware inference design
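To make one of these areas concrete, here is a greedy-verification sketch of speculative decoding: a cheap draft model proposes k tokens, the target model scores the whole block in one pass, and generation keeps tokens up to the first disagreement. draft_next and target_logits below are toy stand-ins for real draft and target models, not production code:

import torch

VOCAB = 100

def draft_next(ctx):
    # Toy draft model: a cheap deterministic rule over the last token.
    return (ctx[-1] * 7 + 3) % VOCAB

def target_logits(seq):
    # Toy target model: agrees with the draft rule except on multiples of 5.
    out = torch.zeros(len(seq), VOCAB)
    for i, t in enumerate(seq):
        nxt = (t * 7 + 3) % VOCAB if t % 5 else (t + 1) % VOCAB
        out[i, nxt] = 1.0
    return out

def speculative_step(tokens, k=4):
    # Draft proposes k tokens autoregressively (cheap).
    ctx = list(tokens)
    proposal = []
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)
    # Target scores the whole proposed block in one pass (expensive,
    # but amortized over up to k accepted tokens).
    logits = target_logits(tokens + proposal)
    accepted = []
    for i, t in enumerate(proposal):
        best = int(torch.argmax(logits[len(tokens) + i - 1]))
        if best != t:
            accepted.append(best)  # target's token replaces the first reject
            break
        accepted.append(t)
    return tokens + accepted

print(speculative_step([1, 2, 3]))  # accepts draft tokens until disagreement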