Staff Platform & SRE Engineer - High Performance Computing Platform Management
CSIT
Posted: May 14, 2025
Quick Summary
We are seeking an experienced HPC Staff Engineer to join our team, responsible for managing and optimizing our high-performance computing (HPC) platform infrastructure using cutting-edge technologies such as cloud-based and software-defined networking (e.g. SD-WAN, ACI and NSX).
Job Description
You will be part of a dynamic team responsible for building resilient network infrastructure using cutting-edge technologies such as cloud-based and software-defined networking (e.g. SD-WAN, ACI and NSX). You must have a good understanding of IT infrastructure systems and knowledge of the latest networking technologies and platforms. You will be a technical specialist in the team, and must be keen to take on new challenges and keep abreast of the rapidly evolving technology landscape.
Role:
• We are seeking an experienced HPC Staff Engineer to join our team, responsible for managing and optimizing our HPC infrastructure platform. The successful candidate will have a deep understanding of HPC systems, architectures and technologies, as well as experience with managing large-scale computing environments. The role will involve designing, implementing and maintaining the HPC infrastructure platform, ensuring high availability, scalability and performance.
Responsibilities:
• Lead a team to deliver a resilient, scalable and secure HPC platform, including compute nodes, storage systems, networks and job scheduling systems.
• Lead the design, implementation and management of the HPC infrastructure platform to meet organisational needs.
• Design and implement storage solutions for HPC workloads to ensure efficient data storage and retrieval.
• Design and implement high-performance networking solutions, including InfiniBand, Ethernet, and other interconnects.
• Plan and manage HPC resource capacity, including forecasting, procurement and deployment of new hardware and software.
• Manage HPC clusters, including optimizing, monitoring and troubleshooting cluster performance, as well as managing job scheduling and resource allocation.
• Ensure the security and compliance of the HPC infrastructure platform, including managing access controls, implementing security patches, and conducting regular security checks.
• Collaborate with stakeholders such as data scientists and developers to optimize application performance on the HPC platform, and provide technical support for users of the platform.
Requirements (Minimum Qualifications):
• Background in Computer Science, Computer Engineering, or a related field.
• 8+ years of experience in managing HPC systems, including experience with Linux, Unix, or other operating systems.
• Strong knowledge of HPC architectures, including clusters, grids, and clouds.
• Experience with HPC job scheduling systems, such as Slurm, Torque and LSF.
• Strong understanding of storage systems, including SANs, NAS, and object storage.
• Experience with high-performance networking, including InfiniBand, Ethernet, and other interconnects.
• Experience with cloud computing platforms, such as AWS, Azure, or Google Cloud.
• Experience with scripting languages, such as Python, Perl, or Bash.
• Experience with containerization (Docker, Kubernetes) and proficiency in a range of complementary technologies, including Knative, Run:AI, Grafana, Prometheus, Kyverno, ArgoCD, Rancher and NVIDIA BCM, as well as knowledge of the NVIDIA SuperPOD architecture.
• Experience in leading engineering teams.
Nice to Have:
• Certifications in NVIDIA AI Infrastructure and Operations, and Certified Kubernetes Administrator.
• Experience with machine learning or deep learning frameworks, such as TensorFlow or PyTorch.
• Familiarity with agile development methodologies and version control systems, such as Git.
Why join us?
• The work is purposeful and meaningful
• You will work with the best engineers
• We work with modern technologies and tech stacks
• We have an excellent engineering culture and work-life balance
• We aspire to engineering and operational excellence
• We empower you to innovate
• We grow together as a family
As CSIT is an agency under the Ministry of Defence (Singapore), only Singapore Citizens will be considered.