SLURM HPC Architect / Administrator
Posted: February 19, 2026
Quick Summary
Design, deploy, and operate high-performance computing (HPC) clusters supporting AI training, large-scale inference, scientific computing, and enterprise workloads.
Job Description
Location: Remote (Canada, U.S., or Europe preferred)
Company: Cylix Applied Intelligence
Employment Type: Full-Time or Contract
About the Role
Cylix Applied Intelligence is seeking an experienced SLURM HPC Architect / Administrator to design, deploy, and operate high-performance computing (HPC) clusters supporting AI training, large-scale inference, scientific computing, and enterprise workloads.
This role will focus on building and managing enterprise-grade HPC environments powered by GPU and CPU compute clusters, leveraging SLURM as the core workload orchestration and resource scheduling platform.
You will work closely with AI engineers, infrastructure teams, and enterprise clients to deliver scalable, reliable, and high-performance compute environments across on-premise, hybrid, and cloud platforms.
Key Responsibilities
HPC Cluster Architecture and Design
Design and implement SLURM-based HPC cluster architectures
Architect scalable CPU and GPU compute environments
Define cluster topology including compute, storage, login, and management nodes
Design high-availability SLURM controller configurations
Implement cluster segmentation, partitioning, and resource allocation strategies
SLURM Deployment and Administration
Install, configure, and manage SLURM workload manager environments
Configure SLURM partitions, queues, QoS policies, and scheduling policies
Manage job scheduling optimization and fair-share policies
Implement accounting, usage tracking, and reporting systems
Maintain SLURM cluster health, stability, and performance
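As a flavor of the configuration work involved, the sketch below shows an illustrative slurm.conf excerpt defining partitions and scheduling policy, plus a QoS created through sacctmgr. All partition, node, and QoS names here are hypothetical examples, not a prescribed layout.

```
# slurm.conf excerpt (illustrative; partition and node names are hypothetical)
SchedulerType=sched/backfill
PriorityType=priority/multifactor     # enables fair-share and multifactor priority weighting
AccountingStorageType=accounting_storage/slurmdbd

PartitionName=cpu  Nodes=cn[001-064]  Default=YES  MaxTime=24:00:00  State=UP
PartitionName=gpu  Nodes=gpu[01-08]   MaxTime=48:00:00  QOS=gpu-normal  State=UP

# QoS limits are managed in the accounting database, e.g.:
#   sacctmgr add qos gpu-normal Priority=10 MaxTRESPerUser=gres/gpu=4
```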
GPU Cluster and AI Infrastructure Management
Configure GPU scheduling and allocation policies
Support GPU resource management including:
NVIDIA A100, H100, L40, and similar accelerator platforms
MIG partitioning and GPU isolation
Multi-tenant GPU resource allocation
Optimize cluster performance for AI training and inference workloads
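GPU allocation in SLURM is driven by generic resources (GRES). The fragment below is a minimal sketch of how a GPU node might be declared; device paths, counts, and node names are assumed for illustration. With MIG, gres.conf can alternatively use AutoDetect=nvml so SLURM enumerates MIG instances via the NVIDIA management library.

```
# gres.conf on a GPU node (illustrative; device paths are assumptions)
Name=gpu Type=a100 File=/dev/nvidia[0-3]

# Matching node definition in slurm.conf advertising the GPUs
NodeName=gpu01 Gres=gpu:a100:4 CPUs=64 RealMemory=512000 State=UNKNOWN
```

Users would then request accelerators with, for example, `sbatch --gres=gpu:a100:1 job.sh`.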
Infrastructure Automation and Operations
Automate cluster deployment and configuration using:
Ansible, Terraform, or similar tools
Shell scripting and Python
Implement monitoring, alerting, and performance tracking systems
Support cluster lifecycle management, upgrades, and expansion
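A typical automation pattern is an Ansible play that installs the SLURM daemon and distributes a shared configuration. This is a minimal sketch; the host group, file paths, and package name (slurm-slurmd, as packaged on RHEL-family distributions) are assumptions to adapt to the target environment.

```yaml
# Illustrative Ansible play; group and path names are hypothetical
- name: Configure SLURM compute nodes
  hosts: compute_nodes
  become: true
  tasks:
    - name: Install slurmd
      ansible.builtin.package:
        name: slurm-slurmd
        state: present
    - name: Deploy shared slurm.conf
      ansible.builtin.copy:
        src: files/slurm.conf
        dest: /etc/slurm/slurm.conf
      notify: Restart slurmd
  handlers:
    - name: Restart slurmd
      ansible.builtin.service:
        name: slurmd
        state: restarted
```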
Storage and Filesystem Integration
Integrate HPC clusters with high-performance storage systems including:
NFS
Lustre
BeeGFS
GPFS / Spectrum Scale
Optimize I/O performance and storage architecture
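For context, integrating a parallel filesystem client is often as simple as a mount entry; the example below is a hypothetical Lustre client mount (the MGS NID and filesystem name are placeholders).

```
# /etc/fstab entry for a Lustre client (MGS NID and fsname are hypothetical)
10.0.0.1@o2ib:/lfs01  /mnt/lustre  lustre  defaults,_netdev  0 0
```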
User and Workload Support
Support enterprise and research users with job scheduling and optimization
Troubleshoot job failures and performance issues
Assist engineering teams in optimizing workloads for HPC environments
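Troubleshooting job failures usually starts from SLURM accounting data, e.g. `sacct --parsable2 --format=JobID,State,ExitCode`. As a small sketch of that workflow, the helper below parses pipe-delimited sacct-style output and extracts non-completed jobs; the sample text stands in for real sacct output, and the field names are the standard sacct column headers.

```python
def failed_jobs(sacct_output: str) -> list[tuple[str, str, str]]:
    """Return (job_id, state, exit_code) for jobs not in a healthy state,
    given pipe-delimited output as produced by `sacct --parsable2`."""
    lines = sacct_output.strip().splitlines()
    header = lines[0].split("|")
    idx = {name: i for i, name in enumerate(header)}  # column name -> position
    failures = []
    for line in lines[1:]:
        fields = line.split("|")
        state = fields[idx["State"]]
        if state not in ("COMPLETED", "RUNNING", "PENDING"):
            failures.append((fields[idx["JobID"]], state, fields[idx["ExitCode"]]))
    return failures

# Sample data standing in for real `sacct --parsable2` output
sample = """JobID|State|ExitCode
1001|COMPLETED|0:0
1002|FAILED|1:0
1003|TIMEOUT|0:0"""

print(failed_jobs(sample))  # → [('1002', 'FAILED', '1:0'), ('1003', 'TIMEOUT', '0:0')]
```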
Required Qualifications
3+ years of experience administering HPC clusters
Strong experience with SLURM workload manager
Strong Linux system administration experience (Ubuntu, Rocky Linux, RHEL, or similar)
Experience with HPC cluster architecture and deployment
Experience with shell scripting and automation
Experience with:
Cluster resource management
Multi-node distributed computing environments
SSH, networking, and Linux system internals
Preferred Qualifications
Experience managing GPU-based HPC clusters
Experience supporting AI / ML workloads
Experience with NVIDIA GPU platforms and drivers
Experience with:
CUDA environments
NVIDIA MIG configuration
GPU scheduling optimization
Experience with configuration management tools:
Ansible
Terraform
Puppet or Chef
Experience with monitoring tools such as:
Prometheus
Grafana
Node exporter
SLURM accounting tools
Nice to Have
Experience with large-scale enterprise or cloud HPC environments
Experience deploying HPC environments in cloud platforms such as:
AWS
Azure
Private cloud environments
Experience with containerized HPC workloads:
Docker
Singularity / Apptainer
Experience integrating SLURM with Kubernetes or hybrid orchestration systems
Example Projects You Will Work On
Deployment of enterprise AI GPU clusters
Multi-tenant SLURM cluster architecture design
GPU scheduling optimization for AI workloads
HPC infrastructure for large-scale inference and model training
Hybrid HPC environments spanning data center and cloud
HPC cluster performance optimization and scaling
Technology Environment
You will work with:
SLURM Workload Manager
Linux (Ubuntu, Rocky Linux, RHEL)
NVIDIA GPU platforms (A100, H100, L40)
High-performance storage systems
HPC networking (InfiniBand, high-speed Ethernet)
Automation tools (Ansible, Terraform)
Monitoring tools (Prometheus, Grafana)
Container environments (Docker, Apptainer)
What We Offer
Competitive compensation
Remote-first environment
Opportunity to work with cutting-edge HPC and AI infrastructure
Exposure to enterprise-scale AI and compute environments
Flexible employment structure (Full-Time or Contract)
Opportunity to architect next-generation HPC environments
About Cylix Applied Intelligence
Cylix Applied Intelligence builds enterprise AI infrastructure and high-performance computing environments supporting advanced AI workloads, intelligent automation, and enterprise-scale compute platforms.