MLOps and Platform Engineer (AI Platform Reliability)
Apna
Posted: November 20, 2025
Job Description
Software Engineer (SDE-2) – DevOps, SRE & MLOps Platform Engineering
Location: Bengaluru
Employment Type: Full-time
Team: Platform Engineering / Reliability
About Blue Machines
Blue Machines powers large-scale, real-time Voice AI platforms and Agentic Workflows for global enterprises across BFSI, Healthcare, HRTech and customer experience domains.
Built and scaled from India, our platform has processed 14.5M+ minutes of production-grade AI agent conversations, operating latency-sensitive, always-on voice systems across geographies.
About the Role
We are hiring a hands-on DevOps / SRE engineer who will own platform reliability, observability and automation, and grow into MLOps and AI platform engineering.
This role focuses on designing, operating and evolving the infrastructure behind real-time Voice AI systems. You work directly on production systems at global scale, driving uptime, performance and resilience.
Key Responsibilities
Platform Reliability & SRE
• Own 99.9%+ platform uptime for real-time Voice AI workloads.
• Participate in on-call rotations, incident response and post-incident reviews.
• Lead root cause analysis (RCA) and drive permanent reliability improvements.
• Design and implement self-healing systems using automation, retries, circuit breakers and failover strategies.
Kubernetes & Cloud Infrastructure
• Design, operate and scale Kubernetes clusters in public cloud environments.
• Work with managed Kubernetes platforms such as GKE and apply cloud-native best practices.
• Implement auto-scaling strategies (HPA, VPA, node pools, GPU workloads).
• Manage infrastructure using Infrastructure as Code (Terraform).
• Optimize infrastructure for performance, reliability and cost efficiency.
Observability & Incident Intelligence
• Build and maintain monitoring, logging and alerting systems using Prometheus, Grafana, Loki and OpenTelemetry.
• Define SLIs, SLOs and error budgets for platform and AI workloads.
• Drive signal-based alerting to reduce noise and improve response quality.
• Implement anomaly detection and predictive alerting for infrastructure and AI pipelines.
CI/CD & Platform Automation
• Design and maintain CI/CD pipelines for services and infrastructure.
• Build internal automation tooling for:
  • Progressive and canary deployments
  • Auto-scaling and capacity planning
  • Faster incident diagnosis and recovery
• Enable self-service DevOps workflows for engineering teams.
MLOps & AI Platform Reliability
• Own reliability and performance of STT, TTS and LLM inference pipelines.
• Design provider routing, failover and SLA enforcement mechanisms.
• Deploy, version and roll back AI models and inference services.
• Monitor inference latency, quality and drift in production systems.
• Operate GPU-backed inference workloads where applicable.
Security, Compliance & Resilience
• Enforce DevSecOps practices across build and deploy pipelines.
• Implement network policies, encryption, secrets management and access controls.
• Drive disaster recovery, backup strategies and resilience testing.
• Contribute to SOC 2 / ISO compliance and audits.
Collaboration & Engineering Excellence
• Partner with backend, AI and platform teams on architecture and reliability.
• Influence system design through a reliability-first mindset.
• Mentor junior engineers and raise the overall bar for operational excellence.
Qualifications
Must-Have
• 3–6 years of experience in DevOps, SRE or Platform Engineering roles.
• Strong hands-on experience with Kubernetes and Docker in production environments.
• Familiarity with public cloud platforms and managed Kubernetes services (such as GKE).
• Strong understanding of distributed systems and production debugging.
• Hands-on experience with observability systems.
• Proficiency with Infrastructure as Code (Terraform).
• Strong incident ownership and communication skills.
Good-to-Have
• Experience with MLOps or AI inference platforms.
• Familiarity with LLM pipelines, real-time streaming or telephony systems.
• Experience operating GPU workloads.
• Knowledge of AIOps, anomaly detection or intelligent alerting.
• Cloud cost optimization experience.
Why Blue Machines
• Build global-scale AI infrastructure from India.
• Operate real-time Voice AI systems with 14.5M+ minutes in production.
• Work on low-latency, high-reliability platforms.
• Grow from DevOps/SRE into MLOps and AI platform engineering.
• High ownership, deep technical impact and real production scale.