Inference Optimization Engineer

BentoML

San Mateo, California, United States | Remote | Permanent

Posted: July 15, 2025

Job Description

About BentoML

BentoML is a leading inference platform provider that helps AI teams run large language models and other generative AI workloads at scale. With support from investors such as DCM, enterprises around the world rely on us for consistent scalability and performance in production. Our portfolio includes both open source and commercial products, and our goal is to help each team build its own competitive advantage through AI.

Role

As an Inference Optimization Engineer, you will improve the speed and efficiency of large language models at the GPU kernel level, in the inference engine, and across distributed architectures. You will profile real workloads, remove bottlenecks, and lift each layer of the stack to new performance ceilings. Every gain you unlock will flow straight into open source code and power fleets of production models, cutting GPU costs for teams around the world. By publishing blog posts and giving conference talks, you will become a trusted voice on efficient LLM inference at large scale.

Example projects:

• https://bentoml.com/blog/structured-decoding-in-vllm-a-gentle-introduction

• https://bentoml.com/blog/benchmarking-llm-inference-backends

• https://bentoml.com/blog/25x-faster-cold-starts-for-llms-on-kubernetes

Responsibilities

• Latency & throughput - Identify bottlenecks and optimize inference efficiency in single-GPU, multi-GPU, and multi-node serving setups.

• Benchmarking - Build repeatable tests that model production traffic; track and report the performance of vLLM, SGLang, TRT-LLM, and future runtimes.

• Resource efficiency - Reduce memory use and compute cost with mixed precision, better KV-cache handling, quantization, and speculative decoding.

• Serving features - Improve batching, caching, load balancing, and model-parallel execution.

• Knowledge sharing - Write technical posts, contribute code, and present findings to the open-source community.

Qualifications

• Deep understanding of transformer architecture and inference engine internals.

• Hands-on experience speeding up model serving through batching, caching, and load balancing.

• Experienced with inference engines such as vLLM, SGLang, or TRT-LLM (upstream contributions are a plus).

• Experienced with inference optimization techniques: quantization, distillation, speculative decoding, or similar.

• Proficiency in CUDA and use of profiling tools like Nsight, nvprof, or CUPTI. Proficiency in Triton and ROCm is a bonus.

• Track record of blog posts, conference talks, or open-source projects in ML systems is a bonus.

Why join us

• Direct impact – your optimizations cut real GPU costs for production LLM deployments worldwide.

• Technical scope – operate distributed LLM inference and large GPU clusters worldwide.

• Customer reach – support organizations around the globe that rely on BentoML.

• Influence – mentor teammates, guide open-source contributors, and become a go-to voice on efficient inference in the community.

• Remote work – work from where you are most productive and collaborate with teammates in North America and Asia.

• Compensation – competitive salary, equity, learning budget, and paid conference travel.
