Required Skills

Kubernetes, GitOps, Terraform, Helm, Prometheus, Loki, Grafana, Google Cloud, Monitoring, Cloud Security, Incident Response, Security Hardening, Compliance

Job Description

About Us

100ms operates two product lines at scale: a real-time Live Video platform powering latency-sensitive, high-concurrency video experiences, and an AI Agents platform that automates complex patient access workflows in U.S. healthcare.

Both products run on a shared, robust infrastructure foundation. You'll be joining the central platform team responsible for keeping both running reliably, securely, and at scale — serving developers and healthcare operators who depend on us around the clock.

What Will You Do:
• Own and operate production infrastructure across multiple GKE clusters supporting both real-time video workloads and AI agent pipelines, with HA, autoscaling, and full observability tuned to the demands of each.

• Manage GitOps workflows using Argo CD for automated, version-controlled, and auditable deployments across both product lines.

• Maintain and optimize monitoring and alerting stacks built on open-source tools, with product-specific SLOs for low-latency video (jitter, packet loss, stream health) and AI workflow reliability (task throughput, failure rates, retry queues).

• Implement infrastructure as code using Terraform for GCP resources and Helm charts for Kubernetes manifests, with a strong bias toward repeatability and auditability.

• Support the unique infrastructure demands of real-time video, including media server scaling, WebRTC infrastructure, low-latency networking, and high-throughput data paths.

• Support AI agent workloads, including LLM inference infrastructure, async task queues, and integration pipelines with external healthcare systems.

• Lead or support incident response, cluster upgrades, and disaster recovery procedures across both platforms.

• Own the security posture of our infrastructure: enforce least-privilege access controls, manage secrets hygiene, and drive security hardening across clusters and services.

• Implement and maintain compliance-aligned controls relevant to healthcare data environments (e.g., encryption at rest/in transit, audit logging, network segmentation).

• Collaborate with product and engineering teams to embed security early in the development lifecycle, shifting left on vulnerability scanning, dependency audits, and policy enforcement.

Who Can Apply:
• Computer Science / Engineering degree or equivalent practical experience.

• Minimum 3 years of hands-on experience with Kubernetes in a production environment.

• Strong knowledge of CI/CD pipelines and GitOps workflows using Argo CD or similar tools.

• Proficient in infrastructure automation using Terraform and Helm.

• Experience managing open-source monitoring and logging stacks (Prometheus, Loki, Grafana, Alertmanager, etc.).

• Working knowledge of cloud security principles: IAM, network policies, pod security, RBAC, and secrets management.

• Comfortable with Linux systems, shell scripting, and basic networking, including an understanding of UDP/TCP behaviour relevant to real-time media or distributed systems.

Good to Have:
• Prior experience managing large-scale, multi-tenant or mixed-workload infrastructure.

• Exposure to real-time media infrastructure: WebRTC, SFUs, TURN/STUN servers, or media server orchestration.

• Hands-on experience with secrets management tools such as HashiCorp Vault or Sealed Secrets.

• Familiarity with security scanning and policy tools (e.g., Trivy, OPA/Gatekeeper, Falco).

• Experience with GCP and GKE specifically.

• Exposure to compliance frameworks relevant to healthcare or regulated industries (HIPAA awareness is a plus).

• Experience with AI/ML inference workloads or async pipeline infrastructure (queues, workers, orchestrators).

• Experience contributing to open-source projects.

• Strong inclination to stay current with evolving infrastructure, security, and platform engineering practices, and a willingness to share ideas internally or externally.

• Ability to communicate fluently and clearly in English, written and spoken.

Why 100ms:
• You'll work on genuinely varied infrastructure: real-time video at scale and AI-driven healthcare automation are both hard problems with different constraints, and you'll own both.

• You'll be part of a small, high-ownership team at a fast-growing, engineering-first startup with a meaningful mission: powering real-time experiences and helping patients access treatment faster.

• You'll work alongside engineers with deep experience in distributed systems, real-time media, AI infrastructure, and platform engineering at scale.

• You'll have the freedom to grow as an individual contributor or step into a team leadership role, with room to define your own goals and impact.

• Security and infrastructure are first-class concerns here, not support functions; your work directly shapes the trust and reliability our customers depend on.

Additional Information:
• We place a strong emphasis on in-office collaboration to maintain a tight feedback loop and a strong engineering culture.

• Employees are expected to work from the office at least three days a week.

Website:
• https://www.100ms.ai/

• https://www.100ms.live

Platform Engineer — Core Infrastructure
