MisuJob - AI Job Search Platform

Site Reliability Engineer

Sitetracker

Canada, Remote, Permanent

Posted: April 30, 2026

Quick Summary

Design, build, and maintain the reliability of our AI platform so that it meets the highest standards of performance and availability.

Job Description

The Opportunity

This is your chance to build a reliability practice from the ground up and establish the engineering standards—including SLOs, error budgets, and observability—that will protect our platform as we scale for enterprise customers and expand our AI capabilities. You’ll have the autonomy to set the strategy and the trust to execute it, ensuring that our AI workloads (Evals, RAG, and LLM processing) meet the highest reliability standards. If you are a proactive problem solver who treats toil as an engineering challenge and wants the agency to decide which technologies to adopt and when, you will find this to be a career-defining role.

What You'll Do

As a Staff or Senior Staff SRE, you’ll hit the ground running by partnering with the engineers currently managing reliability to transition the organization from reactive firefighting to a proactive, disciplined reliability practice. You will lead the deliberate evolution of our infrastructure: recognizing the inflection point for new tooling, and migrating away from manual scripts and templates only when the replacement has earned its keep. Whether you are architecting incident response structures or solving novel reliability problems for AI agents, your work will act as a multiplier that empowers the entire engineering team.

By bringing a consulting mindset to every challenge, you’ll propose technical trade-offs based on evidence rather than reflex, ensuring our roadmap for multi-region or service mesh adoption is built for tomorrow. You won't just be handed tasks; you will own the strategy for production-readiness and deploy safety, building the organizational trust needed to make reliability a core differentiator of our product.

The Skills You'll Have

Deep SRE Expertise


Define SLIs and SLOs for critical user journeys and use them to drive proactive engineering decisions.


Run live production incident response as an Incident Commander and lead blameless postmortems that result in shipped follow-up actions.


Build observability that tells a story (dashboards that explain a system's behavior to someone seeing it for the first time) and pair it with actionable alerts.


Take an organization from reactive firefighting to a working reliability practice with measurable reductions in paging volume.


Design error-budget policies and use them to make data-driven trade-offs between shipping features and maintaining reliability.

Deep Technical Expertise in AWS


Design and operate services on AWS competently: VPC, IAM, compute (ECS/EC2/Lambda), managed data services, and load balancing.


Work effectively within our current setup (CloudFormation and bash scripts run via GitHub Actions) without reflexively reaching for Terraform.


Debug production AWS issues at the network and IAM level without escalating to AWS support as a first step.


Design and roll out production workloads across multiple regions and countries while accounting for data residency and regional failure modes.


Lead high-stakes tooling migrations into established environments and manage the long-term consequences of those architectural choices.

Impact, Leadership & Team Enablement


Mentor engineers through pair debugging, postmortem coaching, and runbook reviews to leave the team more capable.


Define alerts for impactful metrics and write the clear, actionable runbooks that go with them.


Work with engineering teams to gather requirements for new infrastructure and conduct constructive production-readiness reviews.


Teach teams how to build their own observability dashboards, raising the technical floor across the entire organization.


Use AI tooling aggressively, including coding agents and log analysis, to accelerate the delivery of impactful changes.

Communication & Influence


Communicate scheduled downtime and infrastructure changes to stakeholders proactively with clear timing and expected impact.


Write postmortems that both engineers and non-engineers can read, understand, and learn from.


Act as the recognized Subject Matter Expert for AWS-related questions across the engineering organization.


Influence product and engineering roadmap decisions by using data and evidence rather than opinion when reliability is a factor.


Build organizational trust so that teams seek out the SRE practice early in the development cycle to make their work better.


Within 90 Days, You'll:

Fully onboard and partner with the engineers currently managing reliability to review and revise the existing operational plan.


Operationalize high-leverage items to transition the team out of reactive firefighting and into a more stable, proactive state.


Establish a baseline for current system behavior by identifying the most critical user journeys that require immediate SLI/SLO definitions.


Within 180 Days, You'll:

Independently drive the revised reliability plan, ensuring SLIs/SLOs are in place and actively used to guide engineering decisions.


Standardize the incident response structure, including severity definitions, Incident Commander roles, and a cadence for blameless postmortems.


Measurably reduce paging volume and ensure that every alert that pages an engineer is backed by a clear, effective runbook.


Within 365 Days, You'll:

Establish a mature reliability practice where production-readiness reviews and error-budget conversations are default parts of the development lifecycle.


Define a clear, evidence-based tooling roadmap for the next phase of our evolution, such as Terraform, service mesh, or multi-region expansion.


Serve as an organizational multiplier, having built the observability and culture necessary for other engineers to reason about reliability without constant supervision.
