HPC Specialist
Confidential
Posted: January 30, 2026
Interested in this position?
Create a free account to apply with AI-powered matching
Quick Summary
Design, deploy, and manage high-performance computing systems for scientific research, engineering simulations, and AI/ML workloads.
Required Skills
Job Description
🔬 HPC Specialist – Role Overview
Arcadion is seeking an HPC (High-Performance Computing) Specialist responsible for the design, deployment, optimization, and management of high-performance computing systems, often used in scientific research, engineering simulations, AI/ML workloads, and large-scale data analytics.Â
đź§ Core Responsibilities
Architecture & System Design
Design scalable HPC clusters (on-prem, cloud, or hybrid)
Choose appropriate CPUs, GPUs, interconnects (e.g., InfiniBand), and storage
Configure Slurm, PBS, or OpenHPC job schedulers
Cluster Deployment & Maintenance
Install and manage Linux-based compute nodes
Maintain job schedulers and resource managers
Integrate monitoring tools (Prometheus, Grafana, Nagios)
Performance Tuning & Optimization
Benchmark workloads and tune for performance (e.g., MPI, CUDA, OpenMP)
Optimize I/O and inter-node communication
Ensure efficient job execution and queue handling
Cloud & Hybrid HPC Integration
Deploy and manage cloud-based HPC environments (Azure CycleCloud, AWS ParallelCluster, Google Cloud HPC Toolkit)
Optimize workload portability and orchestration (e.g., Singularity, Kubernetes with KubeFlow or Volcano)
AI/ML & GPU Workload Support
Manage AI pipelines that require HPC acceleration (e.g., LLM training)
Optimize GPU usage (NVIDIA A100/H100, AMD MI300)
Interface with TensorFlow, PyTorch, and HPCML tools
Security & Compliance
Implement security best practices for multi-user environments
Support data governance for sensitive or regulated workloads
Maintain audit trails and role-based access control
User & Application Support
Assist researchers and data scientists with job submissions and optimization
Develop documentation, training materials, and run code validation sessions
🛠️ Technical Skills & Tools Category          Technologies Schedulers Slurm, PBS, Torque, LSF HPC OS & Config RHEL/CentOS, Rocky, Ubuntu Server HPC File Systems Lustre, BeeGFS, GPFS Parallel Computing MPI, OpenMP, CUDA Monitoring/Telemetry Prometheus, Grafana, Ganglia Cloud HPC AWS HPC, Azure CycleCloud, GCP HPC Toolkit Containers Singularity, Apptainer, Docker, Kubernetes AI/ML Support PyTorch, TensorFlow, Horovod, MLFlow DevOps Tools Ansible, Terraform, Git, Jenkins
🎓 Qualifications
Bachelor’s or Master’s in Computer Science, Engineering, Physics, or a related technical field
3–7 years of experience in HPC environments
Experience supporting AI/ML teams is a major asset
Certifications: NVIDIA DLI, AWS Certified HPC Specialist, Linux+ or RHCE