ARCHIVED
This job listing has been archived and is no longer accepting applications.

HPC Solutions Architect

Lavendo

San Francisco, California, United States | Remote | Permanent

Posted: February 6, 2026

Quick Summary

The HPC Solutions Architect designs and implements high-performance computing solutions for large-scale AI and simulation workloads.

Job Description

About the Company

Our client is building the kind of infrastructure most engineers only read about. They run an AI‑centric cloud that combines huge GPU clusters, high‑speed networks, and cloud‑native tooling into a platform used by enterprises, fast‑growing startups, and advanced research teams. The focus is simple: make it possible to train and run serious AI and simulation workloads without every customer having to build their own supercomputer.

They’re publicly traded and growing quickly, with R&D hubs across North America, Europe, and the Middle East. The culture is very engineering‑driven: low on bureaucracy, high on ownership, and built around people who like hard infrastructure problems and seeing their work show up in real customer workloads. You’ll be working with colleagues who care about doing things properly at scale, not just shipping another dashboard.

The Opportunity – HPC Specialist Solutions Architect (Remote from the US)

You’ll be the person customers turn to when they want to stand up or scale out serious GPU and HPC environments in the cloud: multi‑rack clusters, fast interconnects, complex scheduling, and demanding SLAs around throughput and latency.

As an HPC Specialist Solutions Architect, you’ll design and tune next‑generation platforms for AI training, large simulations, and data‑heavy workloads. You’ll work directly with NVIDIA’s latest hardware (Hopper, Blackwell, and successors), NVLink/NVSwitch topologies, and InfiniBand/RoCE fabrics, and you’ll have a real say in how the platform and reference architectures evolve. If you enjoy going from “here’s the workload” to “here’s the cluster and how we squeeze the last 20–30% out of it,” this will feel like home.

What You’ll Work On

• Design real clusters: Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm. You’ll think about everything from node types and GPU topology to queues, partitions, and failure modes.

• Shape GPU‑accelerated infrastructure: Integrate NVIDIA Hopper and Blackwell‑class GPUs with NVLink/NVSwitch and InfiniBand/RoCE, making sure the hardware layout actually matches the communication patterns of the workloads you run.

• Automate GPU and network lifecycle: Deploy and manage GPU Operator and Network Operator so that drivers, CUDA, firmware, and high‑speed networking are consistent and automated across large fleets, not managed box by box.

• Make the cloud behave like a supercomputer: Design and validate cloud‑native HPC environments that still deliver low latency, high bandwidth, and predictable scheduling. You’ll analyze utilization, preemption, and fragmentation, then tune the environment to recover the performance they cost.

• Set the standard for AI/HPC architectures: Define and document reference architectures for AI model training, data pipelines, and MLOps, including observability and CI/CD. When customers ask “how should we do this?”, your work will be what “good” looks like.

• Work directly with vendors and partners: Collaborate with NVIDIA and other partners to evaluate new GPU generations, interconnects, and software stacks. You’ll help decide what is ready for prime time and under which conditions.

• Debug the hard problems: Benchmark performance, track down bottlenecks across compute, network, and storage, and recommend concrete changes that move the needle—not just check a box.

• Be a trusted voice to customers: Lead design sessions, architecture reviews, and operational excellence check‑ins with customers who care a lot about performance and reliability. You’ll translate between “this job keeps timing out” and “here’s what we’ll change in the topology and scheduler.”

What You Bring

• A Bachelor’s or Master’s in Computer Science, Engineering, or a related field (PhD is a plus).

• 3+ years actually building or running HPC or large GPU clusters—on‑prem, cloud, or hybrid. You’ve owned outcomes, not just submitted jobs.

• Strong Linux background, plus Kubernetes and container runtimes (containerd, CRI‑O, Docker) in real environments, with CI/CD in the loop.

• A solid handle on HPC networking and RDMA: InfiniBand, RoCE, NVLink/NVSwitch. You understand why topology and fabric design matter, and you’ve seen what happens when they’re wrong.

• Experience with storage and I/O for big workloads: Ceph, Lustre, NFS at scale, GPUDirect Storage, or similar systems where throughput, latency, and contention actually matter.

• Comfort with Terraform, Ansible, Helm, and GitOps‑style workflows to keep configurations reproducible and sane.

• Good scripting skills in Python or Bash; you’re happy to automate checks, glue systems together, or prototype tooling.

• You write and speak clearly, can lead a design review without losing the room, and can keep both engineers and non‑technical stakeholders on the same page.

• Legal authorization to work in the U.S. on a full-time basis without visa sponsorship.

Nice to Have

• Hands‑on with the NVIDIA ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight, and managing CUDA stacks across production clusters.

• Experience with MLflow, Kubeflow, NeMo, or similar for AI/ML pipelines, or with distributed training frameworks like PyTorch DDP, DeepSpeed, or Megatron.

• Time spent with Slurm, LSF, PBS, or similar on real clusters, not just in a lab.

• Experience with multi‑tenant GPU environments or “AI training farms.”

• Familiarity with observability stacks for HPC: Prometheus, DCGM Exporter, Grafana, and NGC tools.

• Any open‑source work in HPC, CUDA, or Kubernetes is a strong plus.

Who This Role Suits

• You like understanding a workload deeply, then designing a cluster and config that fits it like a glove.

• You’re comfortable saying, “This is fast, but we can make it faster—and here’s how,” and then proving it with numbers.

• You enjoy working directly with customers and partners, but you still want to stay close to the technology.

• You prefer a low‑ego, high‑ownership environment where people care more about doing the right thing than about title.

Why You Might Want This Job

• Serious compensation: OTE in the $225,000–$315,000 range, plus equity, calibrated to your experience and location.

• Real benefits: 100% employer‑paid medical, dental, and vision for you and your family; 4% 401(k) match with immediate vesting; company‑paid short‑ and long‑term disability and life insurance.

• Time for life: 20 weeks paid parental leave for primary caregivers, 12 weeks for secondary.

• Remote‑first: Work from where you are in the US, with support for your home office (mobile + internet stipend).

• Hardware you actually want to work on: H200, B200, GB200‑class GPUs, NVLink/NVSwitch, InfiniBand/RoCE, and clusters that are genuinely in “top of the market” territory.

• Impact: The platforms you design will be used to train cutting‑edge models and run workloads that actually push the limits of current hardware.

Interview Process

• Step 1 – HR screen

• Step 2 – Hiring manager interview

• Step 3 – Technical assignment / challenge

• Step 4 – Leadership meeting

• References & background check

• Offer

We are proud to be an equal opportunity workplace and are committed to equal employment opportunity regardless of race, color, religion, national origin, age, sex, marital status, ancestry, physical or mental disability, genetic information, veteran status, gender identity or expression, sexual orientation, or any other characteristic protected by applicable federal, state, or local law.
