Deep Learning Kernel Software Performance Architect
NVIDIA
Posted: February 6, 2026
Job Description
NVIDIA is seeking Software Performance Architects to optimize GPU kernel performance for state-of-the-art data-center platforms. We build automated, data-driven workflows to detect, explain, and prevent performance regressions across key deep learning workloads, partnering closely with kernel developers, compiler teams, infrastructure, and architecture/performance groups.
What you'll be doing:
• Performance analysis + debugging
• Validate and analyze performance of GPU-accelerated kernels and key deep learning building blocks.
• Debug performance issues end-to-end: reproduce, isolate root causes, propose fixes or mitigation paths, and drive closure with the owning teams.
• Build performance narratives using structured evidence: baselines, controlled comparisons, and regression attribution.
• Automation + regression infrastructure (Python-heavy)
• Develop and maintain Python-based automation for performance testing and analysis, using modern AI-assisted developer tools (e.g., Cursor/Claude Code/Copilot) to accelerate scripting while keeping code maintainable and reviewable.
• Design and operate performance test workflows: coverage definition, test/workload generation, automated large-scale execution (CI/nightly/on-demand), rerun rules, and reproducibility standards.
• Convert raw run outputs into actionable insight: statistics, noise control, post-processing, visualization, and large-scale result mining.
• Cross-team collaboration and operating model
• Work with kernel developers and compiler/rotation teams to ensure performance checks are practical, scalable, and aligned to release needs.
• Partner with SWQA and infrastructure teams for execution at scale and reliable pipelines/dashboards.
• Contribute to clear ownership/triage/routing rules so regressions close quickly and consistently.
• Follow general software engineering best practices, including support for regression testing and CI/CD flows.
What we need to see:
• Master's or PhD degree, or equivalent experience, in Computer Science, Computer Engineering, Applied Math, or a related field
• Strong programming ability in Python plus C/C++ (performance-oriented code reading/debugging)
• Solid fundamentals in computer architecture and performance reasoning (latency/throughput, memory hierarchy, parallelism).
• Experience with performance analysis workflows: profiling, measurement methodology, reproducibility, and regression triage.
• Comfortable working across teams and driving issues to decision/closure with clear communication
• Demonstrated strong C++ programming and software design skills, including debugging, performance analysis, and test design
• Experience with performance-oriented parallel programming, even if it’s not on GPUs (e.g., with OpenMP or pthreads)
• Solid understanding of computer architecture and some experience with assembly programming
• Ability to identify bottlenecks, optimize resource utilization, and improve throughput
Ways to stand out from the crowd:
• Experience with high-performance kernels or math libraries (e.g., GEMM/attention, CUTLASS-like concepts)
• Experience building CI/nightly regression systems, dashboards, or large-scale performance analytics
• GPU programming/perf experience (CUDA or equivalent parallel programming)
• Strong ML/DL workload understanding (training/inference shapes, precision modes, perf bottlenecks)
• Familiarity with simulators/analytical modeling or performance characterization methodology