Software Engineer (SDE-2) – DevOps, SRE & MLOps Platform Engineering
Location: Bengaluru
Employment Type: Full-time
Team: Platform Engineering / Reliability

About Blue Machines

Blue Machines powers large-scale, real-time Voice AI platforms and Agentic Workflows for global enterprises across the BFSI, healthcare, HRTech and customer experience domains.
Built and scaled from India, our platform has processed 14.5M+ minutes of production-grade AI agent conversations, operating latency-sensitive, always-on voice systems across geographies.

About the Role

We are hiring a hands-on DevOps / SRE engineer who owns platform reliability, observability and automation and grows into MLOps and AI platform engineering.
This role focuses on designing, operating and evolving the infrastructure behind real-time Voice AI systems. You work directly on production systems at global scale, driving uptime, performance and resilience.

Key Responsibilities

Platform Reliability & SRE

• Own 99.9%+ platform uptime for real-time Voice AI workloads.
• Participate in on-call rotations, incident response and post-incident reviews.
• Lead root cause analysis (RCA) and drive permanent reliability improvements.
• Design and implement self-healing systems using automation, retries, circuit breakers and failover strategies.

Kubernetes & Cloud Infrastructure

• Design, operate and scale Kubernetes clusters in public cloud environments.
• Work with managed Kubernetes platforms such as GKE, and apply cloud-native best practices.
• Implement auto-scaling strategies (HPA, VPA, node pools, GPU workloads).
• Manage infrastructure using Infrastructure as Code (Terraform).
• Optimize infrastructure for performance, reliability and cost efficiency.

Observability & Incident Intelligence

• Build and maintain monitoring, logging and alerting systems using Prometheus, Grafana, Loki and OpenTelemetry.
• Define SLIs, SLOs and error budgets for platform and AI workloads.
• Drive signal-based alerting to reduce noise and improve response quality.
• Implement anomaly detection and predictive alerting for infrastructure and AI pipelines.

CI/CD & Platform Automation

• Design and maintain CI/CD pipelines for services and infrastructure.
• Build internal automation tooling for:
    • Progressive and canary deployments
    • Auto-scaling and capacity planning
    • Faster incident diagnosis and recovery
• Enable self-service DevOps workflows for engineering teams.

MLOps & AI Platform Reliability

• Own reliability and performance of STT, TTS and LLM inference pipelines.
• Design provider routing, failover and SLA enforcement mechanisms.
• Deploy, version and roll back AI models and inference services.
• Monitor inference latency, quality and drift in production systems.
• Operate GPU-backed inference workloads where applicable.

Security, Compliance & Resilience

• Enforce DevSecOps practices across build and deploy pipelines.
• Implement network policies, encryption, secrets management and access controls.
• Drive disaster recovery, backup strategies and resilience testing.
• Contribute to SOC 2 / ISO compliance and audits.

Collaboration & Engineering Excellence

• Partner with backend, AI and platform teams on architecture and reliability.
• Influence system design through a reliability-first mindset.
• Mentor junior engineers and raise the overall bar for operational excellence.

Qualifications

Must-Have

• 3–6 years of experience in DevOps, SRE or Platform Engineering roles.
• Strong hands-on experience with Kubernetes and Docker in production environments.
• Familiarity with public cloud platforms and managed Kubernetes services (such as GKE).
• Strong understanding of distributed systems and production debugging.
• Hands-on experience with observability systems.
• Proficiency with Infrastructure as Code (Terraform).
• Strong incident ownership and communication skills.

Good-to-Have

• Experience with MLOps or AI inference platforms.
• Familiarity with LLM pipelines, real-time streaming or telephony systems.
• Experience operating GPU workloads.
• Knowledge of AIOps, anomaly detection or intelligent alerting.
• Cloud cost optimization experience.

Why Blue Machines

• Build global-scale AI infrastructure from India.
• Operate real-time Voice AI systems with 14.5M+ minutes in production.
• Work on low-latency, high-reliability platforms.
• Grow from DevOps/SRE into MLOps and AI platform engineering.
• High ownership, deep technical impact and real production scale.
