Sr. Site Reliability Engineer

Tiger Analytics Inc.

Washington, District of Columbia, United States (Permanent)

Posted: May 8, 2026

Job Description

Role Overview

We are seeking a high-caliber Site Reliability Engineer (SRE) to join our Forward Engineering team. You will be the guardian of our production ecosystems, ensuring that our complex, data-driven AI platforms remain resilient, scalable, and highly performant. This role is a hybrid of software engineering and systems architecture, with a specialized focus on MLOps—bridging the gap between model development and production-grade reliability.

Key Responsibilities

1. Reliability & Performance Engineering

• SLO/SLI Management: Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Indicators (SLIs) for critical AI/ML services.
• Error Budgeting: Manage error budgets to balance the velocity of feature releases from the ML team with the stability of the production environment.
• Scalability: Architect and manage auto-scaling strategies for Kubernetes (GKE) to handle fluctuating workloads during model training and high-volume inference.

2. MLOps & AI Infrastructure

• Model Serving Reliability: Ensure the high availability of Vertex AI endpoints and custom inference services.
• GPU/TPU Optimization: Monitor and optimize compute resource utilization (accelerators) to ensure cost-efficient performance for Large Language Models (LLMs).
• Pipeline Resilience: Support and stabilize ML pipelines (Vertex AI Pipelines/Kubeflow) to ensure seamless data flow from ingestion to model retraining.

3. Automation & Orchestration (Eliminating "Toil")

• Infrastructure as Code (IaC): Use Terraform or Pulumi to provision and manage consistent, version-controlled cloud environments.
• CI/CD & GitOps: Design and optimize robust deployment pipelines for both application code and ML models using GitHub Actions, Cloud Build, or ArgoCD.
• Task Automation: Develop custom Python or Go scripts to automate repetitive operational tasks, implement self-healing mechanisms, and clean up unused resources.

4. Monitoring, Alerting & Incident Response

• Observability: Build and manage comprehensive dashboards using Prometheus, Grafana, or Google Cloud's operations suite (formerly Stackdriver).
• Incident Management: Act as a primary responder in on-call rotations, leading the technical resolution of production outages.
• Blameless Post-Mortems: Conduct deep-dive root cause analysis (RCA) to ensure systemic issues are identified and permanently remediated through code.

Requirements:
Orchestration: Expert-level knowledge of Kubernetes (K8s) and Docker.

MLOps Stack: Familiarity with tools such as Kubeflow, Vertex AI, MLflow, or DVC.

Scripting: Strong proficiency in Python (for automation) and Bash; knowledge of Go is a plus.

Data Systems: Experience managing the reliability of data-heavy services (BigQuery, Pub/Sub, or Vector Databases like Pinecone/Milvus).

Networking: Solid understanding of VPCs, Load Balancers, DNS, and secure service mesh (Istio/Anthos).

Benefits:
Significant career development opportunities exist as the company grows. The position offers a unique opportunity to be part of a small, fast-growing, challenging and entrepreneurial environment, with a high degree of individual responsibility.

Tiger Analytics provides equal employment opportunities to applicants and employees without regard to race, color, religion, age, sex, sexual orientation, gender identity/expression, pregnancy, national origin, ancestry, marital status, protected veteran status, disability status, or any other basis protected by federal, state, or local law.
