SLURM HPC Architect / Administrator
Posted: February 19, 2026
Quick Summary
Design, deploy, and operate high-performance computing (HPC) clusters supporting AI training, large-scale inference, scientific computing, and enterprise workloads.
Job Description
Location: Remote (Canada, U.S., or Europe preferred)
Company: Cylix Applied Intelligence
Employment Type: Full-Time or Contract
About the Role
Cylix Applied Intelligence is seeking an experienced SLURM HPC Architect / Administrator to design, deploy, and operate high-performance computing (HPC) clusters supporting AI training, large-scale inference, scientific computing, and enterprise workloads.
This role will focus on building and managing enterprise-grade HPC environments powered by GPU and CPU compute clusters, leveraging SLURM as the core workload orchestration and resource scheduling platform.
You will work closely with AI engineers, infrastructure teams, and enterprise clients to deliver scalable, reliable, and high-performance compute environments across on-premise, hybrid, and cloud platforms.
Key Responsibilities
HPC Cluster Architecture and Design
Design and implement SLURM-based HPC cluster architectures
Architect scalable CPU and GPU compute environments
Define cluster topology including compute, storage, login, and management nodes
Design high-availability SLURM controller configurations
Implement cluster segmentation, partitioning, and resource allocation strategies
SLURM Deployment and Administration
Install, configure, and manage SLURM workload manager environments
Configure SLURM partitions, queues, QoS policies, and scheduling policies
Manage job scheduling optimization and fair-share policies
Implement accounting, usage tracking, and reporting systems
Maintain SLURM cluster health, stability, and performance
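As a flavor of the configuration work involved, the sketch below shows an illustrative slurm.conf excerpt defining partitions and scheduling policy, plus a QoS created through sacctmgr. All partition, node, and QoS names here are hypothetical examples, not a prescribed layout.

```
# slurm.conf excerpt (illustrative; partition and node names are hypothetical)
SchedulerType=sched/backfill
PriorityType=priority/multifactor     # enables fair-share and multifactor priority weighting
AccountingStorageType=accounting_storage/slurmdbd

PartitionName=cpu  Nodes=cn[001-064]  Default=YES  MaxTime=24:00:00  State=UP
PartitionName=gpu  Nodes=gpu[01-08]   MaxTime=48:00:00  QOS=gpu-normal  State=UP

# QoS limits are managed in the accounting database, e.g.:
#   sacctmgr add qos gpu-normal Priority=10 MaxTRESPerUser=gres/gpu=4
```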
GPU Cluster and AI Infrastructure Management
Configure GPU scheduling and allocation policies
Support GPU resource management including:
NVIDIA A100, H100, L40, and similar accelerator platforms
MIG partitioning and GPU isolation
Multi-tenant GPU resource allocation
Optimize cluster performance for AI training and inference workloads
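GPU allocation in SLURM is driven by generic resources (GRES). The fragment below is a minimal sketch of how a GPU node might be declared; device paths, counts, and node names are assumed for illustration. With MIG, gres.conf can alternatively use AutoDetect=nvml so SLURM enumerates MIG instances via the NVIDIA management library.

```
# gres.conf on a GPU node (illustrative; device paths are assumptions)
Name=gpu Type=a100 File=/dev/nvidia[0-3]

# Matching node definition in slurm.conf advertising the GPUs
NodeName=gpu01 Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN
```

Users would then request accelerators with, for example, `sbatch --gres=gpu:a100:1 job.sh`.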
Infrastructure Automation and Operations
Automate cluster deployment and configuration using:
Ansible, Terraform, or similar tools
Shell scripting and Python
Implement monitoring, alerting, and performance tracking systems
Support cluster lifecycle management, upgrades, and expansion
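A typical automation pattern is an Ansible play that installs the SLURM daemon and distributes a shared configuration. This is a minimal sketch; the host group, file paths, and package name (slurm-slurmd, as packaged on RHEL-family distributions) are assumptions to adapt to the target environment.

```yaml
# Illustrative Ansible play; group and path names are hypothetical
- name: Configure SLURM compute nodes
  hosts: compute_nodes
  become: true
  tasks:
    - name: Install slurmd
      ansible.builtin.package:
        name: slurm-slurmd
        state: present
    - name: Deploy shared slurm.conf
      ansible.builtin.copy:
        src: files/slurm.conf
        dest: /etc/slurm/slurm.conf
      notify: Restart slurmd
  handlers:
    - name: Restart slurmd
      ansible.builtin.service:
        name: slurmd
        state: restarted
```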
Storage and Filesystem Integration
Integrate HPC clusters with high-performance storage systems including:
NFS
Lustre
BeeGFS
GPFS / Spectrum Scale
Optimize I/O performance and storage architecture
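For context, integrating a parallel filesystem client is often as simple as a mount entry; the example below is a hypothetical Lustre client mount (the MGS NID and filesystem name are placeholders).

```
# /etc/fstab entry for a Lustre client (MGS NID and fsname are hypothetical)
10.0.0.1@o2ib:/lfs01  /mnt/lustre  lustre  defaults,_netdev  0 0
```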
User and Workload Support
Support enterprise and research users with job scheduling and optimization
Troubleshoot job failures and performance issues
Assist engineering teams in optimizing workloads for HPC environments
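Troubleshooting job failures usually starts from SLURM accounting data, e.g. `sacct --parsable2 --format=JobID,State,ExitCode`. As a small sketch of that workflow, the helper below parses pipe-delimited sacct-style output and extracts non-completed jobs; the sample text stands in for real sacct output, and the field names are the standard sacct column headers.

```python
def failed_jobs(sacct_output: str) -> list[tuple[str, str, str]]:
    """Return (job_id, state, exit_code) for jobs not in a healthy state,
    given pipe-delimited output as produced by `sacct --parsable2`."""
    lines = sacct_output.strip().splitlines()
    header = lines[0].split("|")
    idx = {name: i for i, name in enumerate(header)}  # column name -> position
    failures = []
    for line in lines[1:]:
        fields = line.split("|")
        state = fields[idx["State"]]
        if state not in ("COMPLETED", "RUNNING", "PENDING"):
            failures.append((fields[idx["JobID"]], state, fields[idx["ExitCode"]]))
    return failures

# Sample data standing in for real `sacct --parsable2` output
sample = """JobID|State|ExitCode
1001|COMPLETED|0:0
1002|FAILED|1:0
1003|TIMEOUT|0:0"""

print(failed_jobs(sample))  # → [('1002', 'FAILED', '1:0'), ('1003', 'TIMEOUT', '0:0')]
```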
Required Qualifications
3+ years of experience administering HPC clusters
Strong experience with SLURM workload manager
Strong Linux system administration experience (Ubuntu, Rocky Linux, RHEL, or similar)
Experience with HPC cluster architecture and deployment
Experience with shell scripting and automation
Experience with:
Cluster resource management
Multi-node distributed computing environments
SSH, networking, and Linux system internals
Preferred Qualifications
Experience managing GPU-based HPC clusters
Experience supporting AI / ML workloads
Experience with NVIDIA GPU platforms and drivers
Experience with:
CUDA environments
NVIDIA MIG configuration
GPU scheduling optimization
Experience with configuration management tools:
Ansible
Terraform
Puppet or Chef
Experience with monitoring tools such as:
Prometheus
Grafana
Node exporter
SLURM accounting tools
Nice to Have
Experience with large-scale enterprise or cloud HPC environments
Experience deploying HPC environments in cloud platforms such as:
AWS
Azure
Private cloud environments
Experience with containerized HPC workloads:
Docker
Singularity / Apptainer
Experience integrating SLURM with Kubernetes or hybrid orchestration systems
Example Projects You Will Work On
Deployment of enterprise AI GPU clusters
Multi-tenant SLURM cluster architecture design
GPU scheduling optimization for AI workloads
HPC infrastructure for large-scale inference and model training
Hybrid HPC environments spanning data center and cloud
HPC cluster performance optimization and scaling
Technology Environment
You will work with:
SLURM Workload Manager
Linux (Ubuntu, Rocky Linux, RHEL)
NVIDIA GPU platforms (A100, H100, L40)
High-performance storage systems
HPC networking (InfiniBand, high-speed Ethernet)
Automation tools (Ansible, Terraform)
Monitoring tools (Prometheus, Grafana)
Container environments (Docker, Apptainer)
What We Offer
Competitive compensation
Remote-first environment
Opportunity to work with cutting-edge HPC and AI infrastructure
Exposure to enterprise-scale AI and compute environments
Flexible employment structure (Full-Time or Contract)
Opportunity to architect next-generation HPC environments
About Cylix Applied Intelligence
Cylix Applied Intelligence builds enterprise AI infrastructure and high-performance computing environments supporting advanced AI workloads, intelligent automation, and enterprise-scale compute platforms.