Senior AI Tools Engineer, SRE Operations - GeForce NOW

NVIDIA | Locations: Remote, CA, US; Santa Clara, CA, US


No Relocation

Posted: May 26, 2026

Job Description

NVIDIA has been transforming computer graphics, PC gaming, and accelerated computing for more than 25 years. It’s a unique legacy of innovation that’s fueled by great technology—and amazing people. Today, we’re tapping into the unlimited potential of AI to define the next era of computing. An era in which our GPU acts as the brains of computers, robots, and self-driving cars that can understand the world. Doing what’s never been done before takes vision, innovation, and the world’s best talent. As an NVIDIAN, you’ll be immersed in a diverse, supportive environment where everyone is inspired to do their best work. Come join the team and see how you can make a lasting impact on the world.

We are seeking a passionate AI Tools Engineer to join the Site Reliability Engineering (SRE) Data Team. Applicants with SRE or equivalent experience are encouraged.

What you will be doing:

You will build and deploy sophisticated AI-powered tools and products. These tools support the operation and optimization of GeForce NOW, a critical global production service. This role is central to transforming extensive production data streams (such as signals, metrics, and logs) into actionable intelligence that automates root cause analysis for incidents and predicts future service trends and patterns.

  • Build and implement robust AI/ML tools that analyze production data to identify the root causes of complex incidents and forecast operational trends.

  • Lead the development of brand-new LLM- and Agent-based systems to improve operational efficiency.

  • Establish and maintain excellent data management practices, including building pipelines to transform and handle large-scale data sources vital for model development.

  • Own and enhance LLM-based pipelines, applying a strong grasp of current LLM progress to product development.

  • Act as a resident authority on AI Frameworks, recommending the best platforms, toolsets, and architectural approaches to ensure the long-term technical sustainability of the product.

What we need to see:

  • B.S. in Computer Science, Statistics, or Engineering (or equivalent experience), and 5+ years of experience.

  • Strong proficiency in Python; familiarity with Go or other systems languages is a plus.

  • Practical experience building, optimizing, and deploying AI tools.

  • Strong knowledge of the AI space and current developments, including understanding how LLM-based platforms are built, optimized, and which platforms work best.

  • Hands-on experience with container orchestration (Kubernetes) and cloud environments (AWS).

  • Active engagement with developments in the AI field and the ability to distinguish meaningful advances from noise when making technical decisions.

  • Expertise in automation and handling large-scale data pipelines.

  • Experience applying monitoring and visualization tools, such as Grafana, to interact with data.

  • Excellent ability to handle data sources and pipelines to transform and manage data.

Ways to stand out in a crowd:

  • Understanding of SRE principles and experience managing production environments.

  • Strong experience with LLM improvement pipelines and a solid grasp of recent developments in LLM training.

  • Excellent knowledge of LLMs and AI models, with the ability to reason about and recommend an approach that sustains the team and product long term, helping to prevent grave mistakes such as choosing the wrong platform.

  • Understanding of SRE concepts and experience managing production environments, as well as experience with Kubernetes, AWS, and other cloud technologies.

  • Proficiency in automation.

With a competitive salary package and benefits, NVIDIA is widely considered to be one of the technology world’s most desirable employers. We have some of the most forward-thinking and hardworking people in the world working for us. Are you a creative and autonomous AI Tools Engineer who loves challenges? Do you have a genuine passion for advancing the state of Site Reliability Engineering across a variety of industries? If so, we want to hear from you.

Your base salary will be determined based on your location, experience, and the pay of employees in similar positions. The base salary range is 144,000 USD - 230,000 USD.

You will also be eligible for equity and benefits.

Applications for this job will be accepted at least until May 30, 2026.

This posting is for an existing vacancy. 

NVIDIA uses AI tools in its recruiting processes.

NVIDIA is committed to fostering a diverse work environment and proud to be an equal opportunity employer. As we highly value diversity in our current and future employees, we do not discriminate (including in our hiring and promotion practices) on the basis of race, religion, color, national origin, gender, gender expression, sexual orientation, age, marital status, veteran status, disability status or any other characteristic protected by law.
