Senior HPC Software Engineer
NVIDIA
Posted: March 11, 2026
Quick Summary
Join our team as a Senior HPC Software Engineer and be part of shaping the future of computing, collaborating with cross-functional teams to implement innovative solutions that have a significant impact on the world.
Job Description
Join our team as a Senior HPC Software Engineer. At NVIDIA, you'll be part of the team shaping the future of computing and guaranteeing the smooth operation of our brand-new technologies. Our mission is to leverage AI's power to build outstanding and pioneering solutions that have a significant impact on the world.
What you'll be doing:
• Own the solutions you build, collaborating with cross-functional teams to successfully implement them.
• Collaborate with various teams in a fast-paced environment to ensure seamless project completion.
• Continuously improve solution provisioning and management through automation.
• Detect performance issues and recommend solutions to maintain world-class service quality.
• Conduct capacity management and planning to meet ongoing operational needs.
• Participate in incident reviews, assist in root cause identification, and write RCA reports.
• Deliver SRE solutions in a globally distributed, multi-cloud hybrid environment (AWS, GCP, and on-prem).
• Participate in the team's on-call rotation.
What we need to see:
• B.S. degree in Computer Science or a related technical field (or equivalent experience).
• 8+ years of experience building and supporting critical services.
• 5+ years of coding/scripting experience in at least two high-level programming languages such as Python, Go, Ruby, or Groovy.
• Proficiency in Kubernetes administration, modern CI/CD techniques, and Infrastructure as Code (IaC).
• Full-stack AI experience with deep expertise in MCP ecosystems, Carpenter, n8n orchestration, and AI-assisted development via Cursor.
• Expertise with at least one major cloud service provider (AWS, GCP, or Azure).
• Demonstrated proficiency with end-to-end SRE capabilities and observability.
• Proficient in monitoring, metrics gathering, APM, container management, and log collection tools.
• Creative problem solver with excellent debugging skills and great communication and documentation abilities.
Ways to stand out from the crowd:
• Linux certification from a well-known vendor (Red Hat, Oracle, etc.).
• Prior experience managing large-scale Kubernetes deployments in production.
• Strong skills in modern container networking and storage architecture.
• Hands-on background working with FlexLM and license management systems.
• Hands-on experience working with Slurm/LSF environments.