Quick Summary

The AI Performance Engineer is responsible for designing and implementing scalable, efficient solutions for AI and HPC workloads.

Required Skills

Performance Tuning, Distributed Systems, Benchmarking, Optimization, Serving Architectures, RDMA, GPUDirect RDMA, Inference Optimization, Experiment Planning, Visualization

Job Description

Cornelis Networks delivers the world’s highest-performance scale-out networking solutions for AI and HPC datacenters. Our differentiated architecture seamlessly integrates hardware, software, and system-level technologies to maximize the efficiency of GPU, CPU, and accelerator-based compute clusters at any scale. Our solutions drive breakthroughs in AI and HPC workloads, empowering our customers to push the boundaries of innovation. Backed by top-tier venture capital and strategic investors, we are committed to innovation, performance, and scalability, solving the world’s most demanding computational challenges with our next-generation networking solutions.

We are a fast-growing, forward-thinking team of architects, engineers, and business professionals with a proven track record of building successful products and companies. As a global organization, our team spans multiple U.S. states and six countries, and we continue to expand with exceptional talent in onsite, hybrid, and fully remote roles.

We’re seeking an AI Performance Engineer who will optimize training and multi-node inference across next-gen networking silicon and systems: adapters, switches, and the software stack that ties it all together. You’ll partner with architecture, firmware, software, and lighthouse customers to turn lab results into field-proven wins, with an emphasis on distributed serving architectures and P99-aware optimizations.

Key Responsibilities:

Own end-to-end performance for distributed AI workloads (training + multi-node inference) across multi-node clusters and diverse fabrics (Omni-Path, Ethernet, InfiniBand).

Benchmark, characterize, and tune open-source & industry workloads (e.g., Llama, Mixtral, diffusion, BERT/T5, MLPerf) on current and future compute, storage, and network hardware, including vLLM/TensorRT-LLM/Triton serving paths.

Design and optimize distributed serving topologies (sharded/replicated, tensor/pipeline parallel, MoE expert placement), continuous/adaptive batching, KV-cache sharding/offload (CPU/NVMe) and prefix caching, and token streaming with tight p99/p999 SLOs.

Optimize inference: validate RDMA/GPUDirect RDMA, congestion control, and collective vs. point-to-point tradeoffs during inference.

Design experiment plans to isolate scaling bottlenecks (collectives, kernel hot spots, I/O, memory, topology) and deliver clear, actionable deltas with latency-SLO dashboards and queuing analysis.

Build crisp proof points that compare Cornelis Omni-Path to competing interconnects; translate data into narratives for sales/marketing and lighthouse customers, including cost-per-token and tokens/sec-per-watt for serving.

Instrument and visualize performance (Nsight Systems, ROCm/Omnitrace, VTune, perf, eBPF, RCCL/NCCL tracing, app timers) plus serving telemetry (Prometheus/Grafana, OpenTelemetry traces, concurrency/queue depth).

Evangelize best practices through briefs, READMEs, and conference-level presentations on distributed inference patterns and anti-patterns.

Minimum Qualifications:

B.S. in CS/EE/CE/Math or a related field.

5–7+ years running AI/ML at cluster scale.

Proven ability to set up, run, and analyze AI benchmarks; deep intuition for message passing, collectives, scaling efficiency, and bottleneck hunting for both training and low-latency serving.

Hands-on with distributed training beyond single-GPU (DP/TP/PP, ZeRO, FSDP, sharded optimizers) and distributed inference architectures (replicated vs sharded, tensor/KV parallel, MoE).

Practical experience across AI stacks & comms: PyTorch, DeepSpeed, Megatron-LM, PyTorch Lightning; RCCL/NCCL, MPI/Horovod; Triton Inference Server, vLLM, TensorRT-LLM, Ray Serve, KServe.

Comfortable with compilers (GCC/LLVM/Intel oneAPI) and MPI stacks; Python + shell power user.

Familiarity with network architectures (Omni-Path/OPA, InfiniBand, Ethernet/RDMA/RoCE) and Linux systems at the performance-tuning level, including NIC offloads, CQ moderation, pacing, ECN/RED.

Excellent written and verbal communication: turn measurements into persuasion with SLO-driven narratives for inference.

Preferred Qualifications:

M.S. in CS/EE/CE/Math or a related field.

Scheduler expertise (SLURM, PBS) and multi-tenant cluster ops.

Hands-on profiling & tracing of GPU/comm paths (Nsight Systems, Nsight Compute, ROCm tools/rocprof/roctracer/omnitrace, VTune, perf, PCP, eBPF).

Experience with NeMo, DeepSpeed, Megatron-LM, FSDP, and collective ops analysis (AllReduce/AllGather/ReduceScatter/Broadcast).

Background in HPC performance engineering or storage (BeeGFS, Lustre, NVMe-oF) for data and checkpoint pipelines.

Location: This is a remote position for employees residing within the United States.

We offer a competitive compensation package that includes equity, cash, and incentives, along with health and retirement benefits. Our dynamic, flexible work environment provides the opportunity to collaborate with some of the most influential names in the semiconductor industry.

At Cornelis Networks, your base salary is only one component of your comprehensive total rewards package. Your base pay will be determined by factors such as your skills, qualifications, experience, and location relative to the hiring range for the position. Depending on your role, you may also be eligible for performance-based incentives, including an annual bonus or sales incentives.

In addition to your base pay, you’ll have access to a broad range of benefits, including medical, dental, and vision coverage, as well as disability and life insurance, a dependent care flexible spending account, accidental injury insurance, and pet insurance. We also offer generous paid holidays, 401(k) with company match, and Open Time Off (OTO) for regular full-time exempt employees. Other paid time off benefits include sick time, bonding leave, and pregnancy disability leave.

Cornelis Networks does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. Cornelis Networks is an equal opportunity employer, and all qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity or expression, pregnancy, age, national origin, disability status, genetic information, protected veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants’ needs under the respective laws throughout all stages of the recruitment and selection process.
