Senior Site Reliability Engineer
Sustainable Talent
Posted: November 28, 2025
Job Description
Join the Sustainable Talent team supporting NVIDIA as a Senior Site Reliability Engineer in the Infrastructure, Planning, and Process organization. This is a full-time W-2 contract based onsite in Santa Clara, CA. We offer competitive pay of $75 to $90/hr, based on factors such as experience, education, and location, and provide full benefits, PTO, and an amazing company culture!
As an SRE, you will troubleshoot and manage our client's on-premises infrastructure to support software engineering teams company-wide. Keen attention to detail, strong problem-solving abilities, and a solid knowledge base are essential.
What you’ll be doing:
• Work on systems deployed in NVIDIA's internal cloud, making them available and reliable for our end users.
• Monitor system performance and troubleshoot issues related to CPU, memory, disk, and network utilization.
• Provide high-quality user support.
• Monitor KPIs and make sure the team’s SLAs are met.
• Manage and maintain production Kubernetes clusters.
• Drive automation of monitoring to gain more insight into application and system health.
• Craft and implement critical metrics using various analytics methods and dashboards.
• Reuse AI techniques to extract useful signals about machines and jobs from the data generated.
What we need to see:
• Proven SRE experience in an L1 support role with on-call responsibilities, ideally 5+ years.
• Proficiency in troubleshooting Linux OS issues, such as SSH connectivity and performance problems.
• Experience troubleshooting networking issues such as DNS and DHCP, and familiarity with networking principles and protocols, including TCP/IP and VLANs.
• Hands-on experience with monitoring and alerting tools such as Prometheus, Grafana, Elastic, or similar.
• Strong understanding and practical experience with REST API calls.
• Proficiency in basic scripting; familiarity with Python or a similar programming language is a plus.
• Knowledge of Ansible roles and playbooks, Jenkins CI/CD processes, and deployment experience with Kubernetes.
• Experience with the Kickstart process for automated Linux installations.
• Experience managing and troubleshooting Linux systems, as well as managing systems in data centers, using tools like BMC (Redfish), KVM, and IPMI.
• Background in SQL databases such as MySQL and time-series databases such as Prometheus.
• Experience with data analytics and visualization tools like Kibana, Grafana, and Splunk.
• Proficient with source code management and binary repository systems like GitLab, GitHub, Artifactory, and Perforce.
• Advanced knowledge of standard methodologies related to security.
• Bachelor’s degree in Computer Science, Information Technology, or related field, or equivalent experience.
Ways to stand out from the crowd:
• Working knowledge of OpenStack.
• Previous experience managing NVIDIA hardware such as GPUs and Tegras.
• Prior experience with large scale operations teams.
• Experience managing Windows server infrastructure.
• Outstanding interpersonal skills and ability to communicate effectively with all levels of management.
• Ability to analyze complex problems, design simple systems that function efficiently with minimal support, and thrive in a multi-tasking environment with evolving priorities.
Sustainable Talent is an M/F+, disabled, and veteran equal employment opportunity and affirmative action employer.