ARCHIVED
This job listing has been archived and is no longer accepting applications.

Staff Machine Learning Operations Engineer - Devops/SRE

Housecall

Brazil · Remote · Permanent

Posted: December 22, 2025

Quick Summary

We are looking for a Staff Machine Learning Operations Engineer - Devops/SRE to join our team in Brazil.

Job Description

TO BE CONSIDERED FOR THIS ROLE, PLEASE SUBMIT AN UPDATED RESUME TRANSLATED TO ENGLISH

Why Housecall Pro?

Help us build solutions that build better lives. At Housecall Pro, we show up to work every day to make a difference for real people: the home service professionals that support America’s 100 million homes. We’re all about the Pro, and dedicate our days to helping them streamline operations, scale their businesses, and—ultimately—save time so they can be with their families and live well. We care deeply about our customers and foster a culture where our company, employees, and Pros grow and succeed together. Leadership is as focused on growing team members’ careers as they expect their teams to be on creating solutions for Pros.

🤜🤛 WHAT’S IN IT FOR YOU?

• 💻🌎 Remote environment: totally built to make you feel that we are all together in one space without leaving your home office!

• 😎🏝 Self-managed PTO: Beach? Mountains? Camping? Discovering new experiences? You are free to take time off as you need!

• ⏰ Flexible work hours: We believe that you can reach your professional and personal goals working with us and encourage you to have a work-life balance!

• 💡 A culture built on innovation that values big ideas: We are always open to new ideas that will improve the life of our Pros!

• 💻 MacBook (or PC if you prefer!) + Setup Fee ($500): What is remote work without the right tools? Here at HCP, you can choose your computer and set up your home office!

We know what you are thinking…WHAT IS THE ROLE AND WHAT WOULD YOU BE DOING? 👀

As a Staff Machine Learning Operations Engineer - Devops/SRE, you’ll anchor operations for our LLM- and ML‑powered services running on AWS, Kubernetes, Snowflake, and Datadog, all built and governed as code with Terraform. As a staff‑level engineer, you’ll combine deep hands‑on expertise with strong communication, project leadership, and architectural judgment to raise the bar on performance, resilience, observability, and maintainability.

Our team is passionate, empathetic, hardworking, and above all else focused on improving the lives of our service professionals (our Pros). Our success is their success.

In your day-to-day, you will:

Day‑to‑day reliability & operations

• Own SRE fundamentals for AI/ML services: define SLIs/SLOs, manage error budgets, triage incidents, lead on‑call, and drive blameless post‑mortems to durable fixes.

• Operate and scale EKS‑based workloads (real‑time inference, batch jobs, data/feature pipelines), including autoscaling (HPA/cluster autoscaler/Karpenter), rollout strategies, and capacity planning.

• Build and maintain proactive observability in Datadog (APM/traces, metrics, logs, synthetics, SLOs, dashboards, alert pipelines) with actionable, low‑noise alerts.

• Keep environments healthy and consistent via Terraform (modular IaC, policies, drift detection), immutable builds, and standardized deployment patterns.

LLMOps & MLOps platform stewardship

• Operate reliable model‑serving stacks (LLMs and traditional models), including traffic shaping/canary releases, versioning, and rollback safety.

• Ensure retrieval/feature pipelines are robust and cost‑efficient end‑to‑end: data sourcing, transformation, validation, scheduling, and monitoring.

• Manage data plane integrations with Snowflake (warehouses, tasks, streams, materialized views), tuning for performance/credits and enforcing governance/roles.

• Instrument model pathways for latency, throughput, token/compute cost, drift, guardrails, and quality/evaluation signals; surface these in Datadog.

Performance & cost (FinOps)

• Continuously reduce p95/p99 latency and variability across services and pipelines.

• Optimize AWS (right‑sizing, spot/adaptive capacity, storage classes), Snowflake (warehouse sizing, auto‑suspend/resume, clustering/partitioning, caching), and Kubernetes (requests/limits hygiene, bin‑packing) for measurable savings.

• Publish cost dashboards and unit‑economics (e.g., cost per 1k requests/tokens/model run) and drive roadmap items that improve both cost and performance.

Architecture, security, and delivery

• Design resilient, multi‑AZ architectures with clear backup/restore, DR, and change‑management guardrails.

• Strengthen least‑privilege access, secrets management, and data protections (IAM/KMS, network boundaries, Snowflake roles/shares).

• Lead projects end‑to‑end: scope, plan, communicate milestones/risks, align stakeholders, and deliver reliably.

We think this role is for you if you have...

• Staff‑level mastery (design + deep hands‑on) with:

    • AWS (EKS, EC2, VPC, IAM/IRSA, ALB/NLB, S3, KMS; comfort with scaling, networking, security boundaries).

    • Kubernetes (workload autoscaling, rollout strategies, Helm/GitOps patterns, capacity & cost optimization).

    • Terraform (modular design, environment separation, policy-as-code, drift control).

    • Datadog (APM/tracing, logs, metrics, synthetics, SLOs; building actionable dashboards and alert pipelines).

    • Snowflake (warehouse sizing and tuning, tasks/streams, performance optimization, cost/credit governance, RBAC).

• Proven experience running LLM/ML production systems (model serving, data/feature pipelines, evaluation, and guardrails).

• Strong communication and stakeholder management; able to lead cross‑functional projects and set architectural direction.

• Track record of improving performance, resiliency, observability, and maintainability in complex, distributed systems.

• Solid incident command, on‑call ownership, and post‑mortem leadership.

What will help you succeed???

• Workflow orchestration for data/ML (Airflow/Dagster/Prefect) and model life‑cycle tooling (e.g., MLflow/Kubeflow/SageMaker).

• Service‑to‑service networking and security hardening (ingress, mTLS, WAF, secrets).

• Streaming and event systems (Kafka/Kinesis), cache/datastores (Redis/Postgres), and queueing patterns.

• GitOps (Argo CD/Flux), container registries, and supply‑chain security (SBOM, image scanning).

• Chaos/gameday practices and progressive delivery (blue/green, canary, feature flags).

✨ Let’s talk numbers! ✨
Our compensation range for this role begins at $7,500 USD per month 💵

Housecall Pro is a fintech company founded in 2013. We built a SaaS platform that helps Home Service Professionals operate their businesses. We created the application for plumbers, electricians, and other Pros in the home improvement/trades industries.

Housecall Pro is a simple, cloud-based field service management software platform aimed at helping companies keep track of jobs, monitor technician activity, and produce invoices easily.

Our core product helps our clients with scheduling, dispatching, job management, invoicing, payment processing, marketing, and more. They used to struggle with a ton of paperwork after hours. Now they can save time and manage their business in one app.

We support more than 27,000 businesses and have over 1,300 ambitious, mission-driven employees in San Diego, Denver, and all over the world (including 200+ talented and innovative Engineers).
