Lead Site Reliability Engineer
KMS Technology
Posted: March 27, 2026
Quick Summary
We are seeking a Lead Site Reliability Engineer to own the reliability, scalability, and performance of our AI-powered property intelligence platform, which runs on Azure and serves high-throughput Java microservices. The role sits at the intersection of Geospatial AI and Insurance Technology and includes mentoring a team of engineers.
Job Description
At KMS Technology, we are dedicated to delivering cutting-edge solutions and services that empower businesses to achieve their goals. Our team is composed of highly skilled professionals who are passionate about technology and innovation. We provide a dynamic and collaborative work environment where you can grow your career and make a significant impact.
 
We are seeking a Lead Site Reliability Engineer to spearhead the reliability, scalability, and performance of our AI-powered property intelligence platform. Operating at the intersection of Geospatial AI and Insurance Technology, you will be responsible for a mission-critical Azure ecosystem supporting high-throughput Java microservices.
As a Lead, you will bridge the gap between complex AI model inference and enterprise-grade stability. You will own the "Production Excellence" mandate, mentoring a team of engineers and collaborating with Senior Delivery Directors to ensure our global infrastructure stays ahead of our rapid growth.
Key Responsibilities
Strategic Infrastructure & Azure Leadership
• Cloud Architecture: Lead the design of highly available, multi-region architectures on Azure, utilizing AKS (Azure Kubernetes Service), Azure Functions, and Service Bus.
• IaC Governance: Establish and enforce standards for Infrastructure as Code using Terraform or Bicep, ensuring 100% automated provisioning across all environments.
• Java Performance Engineering: Partner with Backend squads to tune JVM performance, garbage collection, and memory management for high-concurrency insurance processing.
Reliability & AI Operations (AIOps)
• Error Budgeting: Define, negotiate, and manage SLIs, SLOs, and SLAs with Product Stakeholders, balancing the velocity of AI feature releases with system stability.
• Advanced Observability: Architect end-to-end monitoring and distributed tracing using Azure Monitor, Application Insights, and ELK/Grafana.
• Incident Commander: Act as the ultimate escalation point for high-priority incidents, leading complex Root Cause Analysis (RCA) and driving long-term remediation tasks.
Security & Industry Compliance
• Data Sovereignty: Ensure the platform adheres to insurance-specific data residency requirements and security frameworks (SOC 2, HIPAA, or ISO 27001).
• Automated Governance: Implement Azure Policy and automated security scanning within CI/CD pipelines to ensure a "Secure by Design" infrastructure.
 
Technical Leadership:
• 7+ years in SRE, DevOps, or Cloud Engineering, with at least 2 years in a Lead or Principal capacity.
• Azure Mastery: Expert-level knowledge of the Azure Well-Architected Framework, particularly networking (VNet/ExpressRoute) and compute.
• Java Ecosystem: Deep proficiency in the Java/Spring Boot stack from an operational perspective (JVM profiling, thread dump analysis).
• Container Orchestration: Mastery of Kubernetes (AKS), including ingress controllers, service mesh (Istio), and cluster security.
Professional Competencies:
• Strategic Mindset: Ability to translate technical debt and reliability risks into a data-driven business case for leadership.
• Automation Advocate: Proven track record of eliminating "Toil" through Python, Go, or Java-based automation tooling.
• Mentorship: Passion for leveling up the engineering organization through workshops, documentation, and pair programming.
• AI-First Integration: Experience leveraging AI for predictive scaling and automated log summarization to reduce Mean Time to Recovery (MTTR).
Perks you enjoy at KMS Mexico
• Statutory benefits under Mexican law.
• 15 days of PTO in year zero; from the first year onward, 3 days per year.
• 5 days of bereavement leave for the death of an immediate family member (negotiable).
• Major Medical Expenses Insurance with coverage for immediate dependents (spouse and children).
• Annual performance bonus (≈10% of annualized salary).
• Annual salary adjustment.
• Employee Referral Bonus.
• Paid Certifications and Courses.
• Coursera License.
• 5% Savings Fund.
• 5% Grocery Vouchers.