ARCHIVED
This job listing has been archived and is no longer accepting applications.
Senior HPC & GPU Infrastructure Engineer

Sciforium

San Francisco, California, United States (permanent)

Posted: January 7, 2026

Quick Summary

We are seeking a Senior HPC & GPU Infrastructure Engineer to take full ownership of the health, reliability, and performance of our GPU compute cluster.

Job Description

Sciforium is an AI infrastructure company developing next-generation multimodal AI models and a proprietary, high-efficiency serving platform. Backed by multi-million-dollar funding and direct sponsorship from AMD, with hands-on support from AMD engineers, the team is scaling rapidly to build the full stack powering frontier AI models and real-time applications.

About the role

We are seeking a Senior HPC & GPU Infrastructure Engineer to take full ownership of the health, reliability, and performance of our GPU compute cluster. You will be the primary custodian of our high-density accelerator environment and the linchpin between hardware operations, distributed systems, and machine learning workflows. This role spans everything from hands-on Linux systems engineering and GPU driver bring-up to maintaining the ML software stack (CUDA/ROCm, PyTorch, JAX, vLLM). If you love squeezing every bit of performance out of hardware, enjoy debugging GPUs at scale, and want to build world-class AI infrastructure, this role is for you.

What you'll do

1. System Health & Reliability (SRE)

• On-Call Response: Act as the primary responder for system outages, GPU failures, node crashes, and cluster-wide incidents. Minimize downtime by resolving issues rapidly.

• Cluster Monitoring: Implement and maintain monitoring for GPU health, thermal behavior, PCIe/NVLink topology issues, memory errors, and overall system load.

• Vendor Liaison: Coordinate with data center staff, hardware vendors, and on-site technicians for repairs, RMA processing, and physical maintenance of the cluster.

2. Linux & Network Administration

• OS Management: Install, patch, and maintain Linux distributions (Ubuntu / CentOS / RHEL). Ensure consistent configuration, kernel tuning, and automation for large node fleets.

• Security & Access Controls: Configure VPNs, iptables/firewalls, SSH hardening, and network routing to secure our compute infrastructure.

• Identity & Storage Management: Manage LDAP/FreeIPA/AD for user identity, and administer distributed file systems such as NFS, GPFS, or Lustre.

3. GPU & ML Stack Engineering

• Deployment & Bring-Up: Lead deployment of new GPU nodes, including BIOS configuration, NUMA tuning, GPU topology validation, and cluster integration.

• Driver & Kernel Management: Build and optimize kernel modules, maintain GPU drivers and runtime stacks for both NVIDIA (CUDA) and AMD (ROCm).

• Software Stack Maintenance: Maintain and optimize ML frameworks and libraries, including PyTorch, JAX, the CUDA toolkit, cuDNN, ROCm, NCCL, and supporting runtime systems.

• Advanced Debugging: Troubleshoot complex interactions involving GPUs, compilers, ML frameworks, and distributed training runtimes (e.g., vLLM compilation failures, CUDA memory leaks, ROCm kernel crashes).

Ideal candidate profile

• 5+ years of experience in HPC, GPU cluster operations, Linux systems engineering, or similar roles.

• Bachelor’s or Master’s degree in Computer Science, Computer Engineering, Electrical Engineering, or a related technical field.

• Strong expertise with NVIDIA (H100/B200) or AMD (MI325X/MI355X) GPUs, including driver and kernel-level debugging.

• Deep understanding of Linux internals, kernel modules, hardware bring-up, and systems performance tuning.

• Experience with network security, including VPNs, iptables/firewalld, SSH, and identity management (LDAP/FreeIPA/AD).

• Proficiency in Bash and Python for scripting, automation, and workflow tooling.

• Familiarity with ML software stacks: CUDA toolkit, cuDNN, NCCL, ROCm, and JAX/PyTorch runtime behavior.

• Deep debugging experience with NVLink/NVSwitch fabrics and RDMA networking.

Nice-to-have

• Experience with job schedulers such as Slurm, Kubernetes, or Run:AI.

• Exposure to vLLM, model serving optimizations, or inference systems.

• Hands-on experience with configuration management tools (Ansible, SaltStack, Terraform).

• Previous experience supporting ML research teams in a startup or research-heavy environment.

Benefits include

• Medical, dental, and vision insurance

• 401k plan

• Daily lunch, snacks, and beverages

• Flexible time off

• Competitive salary and equity

Equal opportunity

Sciforium is an equal opportunity employer. All applicants will be considered for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, veteran status, or disability status.
