ARCHIVED
This job listing has been archived and is no longer accepting applications.
Site Reliability Engineer

BentoML

Beijing, China | Remote | Permanent

Posted: July 15, 2025

Job Description

About BentoML

BentoML is a leading inference platform provider that helps AI teams run large language models and other generative AI workloads at scale. Backed by investors such as DCM, we are relied on by enterprises around the world for consistent scalability and performance in production. Our portfolio includes both open-source and commercial products, and our goal is to help each team build its own competitive advantage through AI.

Role

Join BentoML as a Senior Site Reliability Engineer and take charge of the infrastructure that delivers large language model and generative AI services worldwide. You will architect and operate Kubernetes clusters across AWS, Google Cloud, and on-premises environments, turning large GPU fleets into responsive inference pools. Your work will span writing clean Terraform code, refining GitOps pipelines, tuning Prometheus, and leading incident response. You will set meaningful service level objectives, guide teammates through complex production challenges, and build processes that keep our platform robust and fast. If you thrive on solving difficult problems at scale and want your decisions to shape how enterprises run AI in production, this role is for you.

Responsibilities

• Kubernetes operations – design, run, and improve large multi-cluster Kubernetes environments on AWS and Google Cloud, plus on-prem clusters; add support for Azure or Oracle Cloud when needed.

• Infrastructure as code – manage everything with Terraform or Pulumi and follow GitOps workflows.

• CI/CD – keep automated build and release pipelines reliable, with safe rollback paths.

• GPU fleet management – manage NVIDIA drivers, MIG partitioning, autoscaling, and firmware updates; extend the same practices to AMD GPUs as they are added to the fleet.

• Observability – operate and scale Prometheus and Grafana, define SLIs/SLOs, and automate capacity tracking.

• Incident response – share an on-call rotation, lead post-incident reviews, and keep runbooks current.

• Mentorship and process building – establish standard SRE processes and teach best practices to the wider engineering team.

Qualifications

• Expert knowledge of Kubernetes internals and large-cluster administration, both cloud and on-prem.

• Hands-on experience with AWS and Google Cloud; familiarity with Azure or Oracle Cloud is a plus.

• Strong skills with Terraform or Pulumi, GitOps tools (Argo CD, Flux, or similar), and CI/CD pipelines.

• Deep understanding of Linux and networking fundamentals.

• Experience managing NVIDIA GPU clusters; AMD/ROCm knowledge is a bonus.

• Familiarity with specialized GPU clouds such as Lambda or Nebius is helpful.

• Solid background with Prometheus and Grafana at scale.

• Clear written and spoken communication and comfort working across time zones.

Why join us

• Remote work – work from where you are most productive and collaborate with teammates in North America and Asia.

• Technical scope – operate distributed LLM inference and large GPU clusters worldwide.

• Customer reach – support organizations around the globe that rely on BentoML.

• Influence – lead SRE practices and infrastructure choices.

• Compensation – competitive salary, equity, learning budget, and paid conference travel.
