OpenShift Platform Engineer (GenAI)
Unison Group
Posted: April 8, 2026
Quick Summary
Design and implement scalable and secure infrastructure for large-scale AI applications on OpenShift, focusing on disaster recovery and hybrid cloud environments.
Job Description
Overview
We are seeking an experienced Senior GenAI Platform Engineer / OpenShift SME to lead and manage enterprise-scale infrastructure supporting GenAI applications. This role focuses on OpenShift platform engineering, hybrid cloud environments, disaster recovery (DR), and security for highly scalable and resilient AI platforms.
Requirements
• 10+ years of experience in infrastructure engineering / platform engineering.
• Strong expertise in managing OpenShift (OCP) in enterprise production environments.
• Hands-on experience in infrastructure sizing, capacity planning, and performance tuning for AI workloads.
• Experience supporting Oracle Database from an infrastructure/application standpoint.
• Strong knowledge of certificate management, secrets handling, and key management.
• Experience with CI/CD pipelines and infrastructure automation.
• Solid background in security, vulnerability management, and compliance.
• Proven experience in designing and implementing Disaster Recovery (DR) solutions.
• Experience with AWS cloud services and hybrid cloud environments.
• Strong experience with Docker and Kubernetes.
• Excellent coordination and stakeholder management skills across cross-functional teams.
Key Responsibilities
• Lead and manage end-to-end infrastructure for enterprise GenAI applications hosted on OpenShift (OCP).
• Own capacity planning, sizing, and performance optimization of OpenShift clusters and related infrastructure components.
• Manage and optimize infrastructure including Oracle DB, Redis, Elastic DB, PostgreSQL, Dell ECS storage, and Linux environments (Red Hat/Ubuntu).
• Design and implement Disaster Recovery (DR) strategies ensuring high availability, resilience, and business continuity.
• Lead end-to-end DR setup, including replication, failover, testing, and documentation, in collaboration with infrastructure and network teams.
• Manage certificate lifecycle (TLS/SSL), secrets, and key management across platforms.
• Implement vulnerability management, patching, and remediation across Kubernetes, containers, and infrastructure.
• Support and coordinate penetration testing and address security findings.
• Work with AWS services (EC2, VPC, CloudWatch, Lambda, Bedrock) in hybrid cloud environments.
• Build and maintain infrastructure automation using Terraform and CloudFormation.
• Manage observability (monitoring, logging, and alerting tools) and Control-M schedulers.
• Collaborate with DevOps, Security, and Development teams for platform reliability and performance.
• (Bonus) Work with or support open-weight LLMs for AI/ML use cases.