
Observability & Operations Engineer

Confidential

Not specified · Permanent

Posted: March 9, 2026


Quick Summary

The Observability and Operations Engineer is responsible for designing and implementing complex technical solutions for monitoring and operating large-scale cloud infrastructure and internal platforms using AI-powered tooling and automation.

Job Description

Observability & Operations Engineer  

The Observability & Operations Engineer is a key technical contributor who brings an AI-first mindset to maintaining, monitoring, and operating our AWS cloud environment and internal Developer Platform. In this role, you won’t just react to incidents — you’ll leverage AI-powered tooling, intelligent alerting, and automation to get ahead of problems before they impact users. You’ll work deeply across AWS and its PaaS ecosystem, building repeatable, code-first pipelines that treat infrastructure and observability configuration as first-class software. From using AI coding assistants to accelerate runbook development, to applying ML-based anomaly detection across logs and metrics, you’ll be expected to ask “how can AI help here?” as a first instinct. Working within a dedicated platform team, you’ll build the observability foundations that keep our systems fast, reliable, and self-healing.

Primary Duties & Responsibilities:

Design and implement a comprehensive observability strategy (logging, metrics, tracing, alerting) across all AWS environments, leveraging AI-powered tools to detect anomalies and surface insights automatically

Build and manage monitoring platforms such as Datadog, Grafana, Prometheus, and AWS CloudWatch — actively exploring AI-native features within these tools to reduce alert fatigue and improve signal quality

Use AI coding assistants (e.g. GitHub Copilot, Claude) to accelerate development of dashboards, runbooks, and automation scripts

Own the incident management lifecycle — on-call rotations, post-mortems, root cause analysis — and apply AI-assisted log analysis to speed up diagnosis and resolution

Instrument Java, Kotlin, and Node.js-based cloud-native applications to emit structured logs, distributed traces, and metrics; identify opportunities to use ML-based anomaly detection in place of static thresholds

Build repeatable, code-first observability pipelines that treat dashboards, alerts, and runbooks as first-class software — versioned, tested, and deployed through Harness

Leverage AWS PaaS services (Lambda, API Gateway, ECS, RDS, SQS, SNS, and others) to build scalable, automated operational tooling

Collaborate with development teams to embed observability and AI-assisted quality checks into CI/CD pipelines via Harness

Own the FinOps function for our AWS environment — tracking cloud spend, building cost dashboards, identifying waste, and using AI-powered cost analysis tools to surface optimization opportunities and drive accountability across engineering teams

Monitor AWS infrastructure for performance, availability, and cost — partnering with finance and engineering to enforce spend governance

Develop and maintain Infrastructure as Code using Terraform, using AI pair programming to improve quality and consistency

Contribute to architectural decisions with a focus on resilience, automation, and reducing toil through intelligent systems

Adhere to all confidentiality and compliance regulations

Perform other duties as assigned

Minimum Education & Work Experience:

7–10 years of experience in Software Engineering, Cloud Operations, or Site Reliability Engineering

5+ years of hands-on experience with AWS infrastructure and AWS PaaS services; certifications are a plus

Demonstrated experience building repeatable, code-first pipelines and treating operational configuration as first-class software

Experience working with polyglot environments including Java, Kotlin, and Node.js

Demonstrated experience using AI tools (coding assistants, AI-powered observability platforms, or similar) in a professional setting — we’re an AI-first company and expect this to be part of how you work, not something you’re just exploring

Key Skills and Qualifications:

Deep experience with enterprise observability platforms, including AWS-native tooling such as CloudWatch and X-Ray, the vendor-neutral OpenTelemetry standard, or comparable platforms such as Datadog, Grafana, or Prometheus

Proficiency with distributed tracing frameworks and log management platforms (e.g. ELK Stack, Splunk, Fluent Bit); experience mapping these patterns to AWS-native tooling is a strong plus

Strong understanding of SRE principles including SLOs, SLAs, error budgets, and chaos engineering

Hands-on FinOps experience — cloud cost allocation, chargeback modeling, rightsizing, and savings plans optimization across AWS

Strong working knowledge of AWS PaaS services including Lambda, API Gateway, ECS, RDS, SQS, SNS, and IAM — and how to leverage them to build scalable operational tooling

Experience instrumenting polyglot applications (Java, Kotlin, Node.js) and cloud-native microservices for observability

Proven ability to build repeatable, code-first pipelines — treating dashboards, alerts, runbooks, and infrastructure configuration as versioned, testable software

Experience with CI/CD tooling, specifically Harness

Solid understanding of Infrastructure as Code using Terraform

Fluency with AI tools in day-to-day work — whether that’s AI coding assistants, AI-powered monitoring features, or using LLMs to accelerate problem solving; you default to asking “can AI help here?” before doing things the hard way

Ability to lead incident response, facilitate blameless post-mortems, and drive long-term reliability improvements

Strong collaboration skills for working across platform and product engineering teams

Knowledge of containerization technologies and microservices architecture

Physical Demands and Work Environment:

The physical demands described here are representative of those that must be met by an employee to successfully perform the essential functions of this job. Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.

Regularly required to sit at a desk in front of a computer; to use hands to finger, handle, or feel objects, tools, or controls (including a computer keyboard and a telephone); and to lift and/or move up to 10 pounds.

Frequently requires the use of hands and arms for reaching, as well as the ability to walk and communicate effectively through speaking and listening.

Specific vision abilities required by this position include close vision, color vision, and the ability to adjust focus.   

Noise level in the work environment is usually moderate.

Must be able to type on a computer keyboard, look at a computer monitor, and operate a cell phone or a computer-based phone.
