Cloud - Staff Site Reliability Engineer
Bedrock Robotics
Posted: January 14, 2026
Job Description
We are seeking an experienced Staff Site Reliability Engineer to own and evolve our cloud infrastructure, with a focus on scalable design, operational excellence, and system reliability.
The ideal candidate brings a strong production-engineering mindset and a deep commitment to observability, resilience, and well-instrumented distributed systems. They hold a high bar for production readiness and believe no service should ship without meaningful telemetry and safeguards in place.
This role is critical to scaling the infrastructure that underpins our core data pipelines and directly enables our Machine Learning and Robotics engineering teams. If you enjoy tackling complex production challenges and building robust, highly scalable systems, this role offers significant scope and impact.
What you will do
• System Design & Operations: Design, build, and operate highly scalable, reliable systems used by all Bedrock engineering teams.
• Cloud Infrastructure Ownership: Take full ownership of Bedrock’s cloud infrastructure (AWS, GCP, Azure), ensuring best-in-class security, performance, and cost efficiency.
• Observability Stack: Design, implement, and maintain Bedrock’s end-to-end observability stack (including monitoring, logging, and tracing).
• Production Excellence: Pave the road for production engineering by developing and implementing best practices for system reliability, security, on-call rotation, and effective incident response.
• Performance & Cost Optimization: Continuously identify and implement improvements to enhance system performance and optimize cloud resource consumption.
What we are looking for
• Reliability Passion: A deep passion for building and maintaining reliable, fault-tolerant distributed systems.
• Cloud & IaC Expertise: Strong proficiency in major cloud platforms (such as AWS, GCP, or Azure) and Infrastructure as Code (IaC) tools like Terraform.
• Containerization & Orchestration: Proven experience with container technologies and orchestration platforms, particularly Kubernetes.
• Observability: Hands-on experience with observability tools (e.g., Datadog, Prometheus, Splunk) and techniques.
• Technical Foundations: Strong understanding of distributed systems, networking concepts, database technologies, and compute infrastructure.
• Security Best Practices: Strong understanding of, and hands-on experience implementing, security best practices in cloud environments.
• Fast-Paced Environment: Ability to work in a fast-paced, high-growth environment, deal effectively with ambiguity, and take decisive ownership of challenging problems.