Machine Learning Engineer — Inference Optimization
Featherlessai
Posted: January 22, 2026
Job Description
About the Role
We’re looking for a Machine Learning Engineer to own and push the limits of model inference performance at scale. You’ll work at the intersection of research and production—turning cutting-edge models into fast, reliable, and cost-efficient systems that serve real users.
This role is ideal for someone who enjoys deep technical work, profiling systems down to the kernel/GPU level, and translating research ideas into production-grade performance gains.
What You’ll Do
• Optimize inference latency, throughput, and cost for large-scale ML models in production
• Profile GPU/CPU inference pipelines and pinpoint bottlenecks (memory, kernels, batching, I/O)
• Implement and tune techniques such as the following (a minimal fp16 sketch follows this list):
  • Quantization (fp16, bf16, int8, fp8)
  • KV-cache optimization & reuse
  • Speculative decoding, batching, and streaming
  • Model pruning or architectural simplifications for inference
• Collaborate with research engineers to productionize new model architectures
• Build and maintain inference-serving systems (e.g. NVIDIA Triton Inference Server, custom runtimes, or bespoke stacks)
• Benchmark performance across hardware (NVIDIA/AMD GPUs, CPUs) and cloud setups
• Improve system reliability, observability, and cost efficiency under real workloads
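To make the flavor of this work concrete, here is a minimal sketch of the fp16 item from the techniques list above, assuming PyTorch and a CUDA-capable GPU. The layer, shapes, and iteration counts are illustrative stand-ins, not our production stack.

```python
import time
import torch

# Illustrative stand-in for a real model: a single transformer layer.
layer = torch.nn.TransformerEncoderLayer(
    d_model=1024, nhead=16, batch_first=True
).cuda().eval()
x = torch.randn(8, 512, 1024, device="cuda")  # (batch, seq_len, hidden)

def bench(fn, iters=50, warmup=5):
    """Average latency per call, with warmup and explicit GPU syncs."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / iters

with torch.inference_mode():
    fp32_ms = bench(lambda: layer(x)) * 1e3

layer, x = layer.half(), x.half()  # cast weights and activations to fp16

with torch.inference_mode():
    fp16_ms = bench(lambda: layer(x)) * 1e3

print(f"fp32: {fp32_ms:.2f} ms/iter  fp16: {fp16_ms:.2f} ms/iter")
```

The same harness extends naturally to int8/fp8 paths and to comparing batching strategies.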
What We’re Looking For
• Strong experience in ML inference optimization or high-performance ML systems
• Solid understanding of deep learning internals (attention, memory layout, compute graphs)
• Hands-on experience with PyTorch (or similar) and model deployment
• Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations; see the toy kernel sketch after this list)
• Experience scaling inference for real users (not just research benchmarks)
• Comfortable working in fast-moving startup environments with ownership and ambiguity
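As a taste of the kernel-level tuning referenced above, here is a toy OpenAI Triton kernel (the canonical elementwise add), assuming the triton package and an NVIDIA GPU. Production kernels here would target attention and decode paths, but the write/launch/verify loop is the same.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one BLOCK_SIZE-wide slice of the tensors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.randn(4096, device="cuda")
y = torch.randn(4096, device="cuda")
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK_SIZE"]),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```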
Nice to Have
• Experience with LLM or long-context model inference
• Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton; a minimal vLLM example follows this list)
• Experience optimizing across different hardware vendors
• Open-source contributions in ML systems or inference tooling
• Background in distributed systems or low-latency services
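For the frameworks bullet, a minimal vLLM usage sketch; the model name and prompt are illustrative. vLLM's continuous batching and paged KV-cache cover several of the techniques listed under What You'll Do.

```python
from vllm import LLM, SamplingParams

# Model and prompt are illustrative placeholders; vLLM applies continuous
# batching and paged-attention KV-cache management under the hood.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, max_tokens=64)
for out in llm.generate(["Explain KV-cache reuse in one sentence."], params):
    print(out.outputs[0].text)
```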
Why Join Us
• Real ownership over performance-critical systems
• Direct impact on product reliability and unit economics
• Close collaboration with research, infra, and product
• Competitive compensation + meaningful equity at Series A
• A team that cares about engineering quality, not hype