Senior Site Reliability Engineer
KMS Technology
Posted: March 26, 2026
Quick Summary
As a Senior Site Reliability Engineer, you will design, develop, and deploy scalable and efficient infrastructure to support our AI-powered property intelligence platform.
Job Description
At KMS Technology Mexico, we are passionate about building innovative software solutions that drive impact. As part of an international tech company, we offer a collaborative and inclusive environment where your ideas matter and your growth is our priority.
We are looking for a Senior SRE to join our core engineering team in building the next generation of AI-powered property intelligence for the insurance industry. In this role, you will be the guardian of the platform’s availability, latency, and performance.
You will work at the heart of a high-demand ecosystem, ensuring that our Node.js microservices and AI/ML pipelines running on Google Cloud Platform (GCP) are resilient, scalable, and secure. This is a "Software Engineering approach to Operations" role, where automation is the default and manual intervention is a last resort.
Key Responsibilities
Infrastructure & Platform Engineering
• Cloud Architecture: Design and manage scalable, multi-regional infrastructure on GCP, leveraging GKE (Kubernetes), Cloud Run, and Pub/Sub.
• Infrastructure as Code (IaC): Maintain and evolve our infrastructure codebase using Terraform or Pulumi, ensuring environment parity across Staging and Production.
• Node.js Optimization: Partner with Fullstack teams to tune Node.js application performance, managing memory limits, event loop bottlenecks, and asynchronous execution in a containerized environment.
Observability & Reliability
• SLO/SLI Definition: Define and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to measure the "health" of our property intelligence engine.
• Advanced Monitoring: Build comprehensive dashboards and alerting systems using Google Cloud Operations Suite (formerly Stackdriver), Prometheus, or Grafana.
• Incident Management: Lead Root Cause Analysis (RCA) for production incidents and implement "Blameless Post-mortems" to prevent recurrence.
AI & Data Operations
• MLOps Integration: Support the scaling of AI models by optimizing GPU/TPU utilization and data ingestion pipelines within GCP.
• Security & Compliance: Ensure the platform meets the rigorous data privacy standards of the insurance industry, including SOC 2 and GDPR compliance.
Technical Requirements:
• 5+ years in an SRE, DevOps, or System Architecture role.
• GCP Expertise: Deep experience with Google Cloud Platform, specifically GKE, IAM, Cloud SQL, and VPC networking.
• Coding Proficiency: Strong experience with Node.js (backend services) and scripting in Python or Go for automation.
• Orchestration: Expert-level knowledge of Kubernetes (GKE), including Helm charts and service meshes (Istio/Anthos).
• CI/CD: Experience building high-frequency deployment pipelines with GitHub Actions, GitLab CI, or Google Cloud Build.
Professional Competencies:
• The "SRE Mindset": A passion for automation and a visceral dislike of repetitive manual tasks ("Toil").
• Strategic Communication: Ability to translate complex infrastructure risks into business impact for Stakeholders and Delivery Directors.
• AI-First Workflow: Proactive use of AI tools for log anomaly detection, predictive scaling, and automated troubleshooting.
Location: Guadalajara, Jalisco, Mexico (Hybrid) 
Benefits and Perks
Perks you enjoy at KMS Mexico
• Mexican law benefits
• 15 days of PTO in year zero (from the first year onwards, 3 days per year).
• 5 days' leave for the death of immediate family members, negotiable.
• Major Medical Expenses Insurance with coverage for immediate dependents (spouse and children).
• Annual performance bonus (≈10% of annualized salary).
• Annual salary adjustment.
• Employee Referral Bonus.
• Paid Certifications / Courses
• Coursera License.
• 5% Savings Fund.
• 5% Grocery Vouchers.