Sr Manager, AI Systems Quality & Reliability, Annapurna AI Servers and Systems
Amazon • Austin, Texas, United States
No Relocation
Posted: July 1, 2026
Additional Content
Description
AWS Annapurna Labs is seeking a Senior Manager of Quality & Reliability Engineering to lead the QnR function within the Trainium Manufacturing, Quality and Reliability organization. You will own quality and reliability outcomes for all Trainium AI server products, from component qualification through fleet performance, leading an engineering team across multiple concurrent chip and system generations. This role defines reliability strategy for liquid-cooled and air-cooled platforms at rapidly scaling volumes, builds quality systems across a multi-supplier global manufacturing base, drives fleet failure investigations to root cause, and establishes the reliability characterization capabilities required for next-generation technologies.
Key job responsibilities
- Lead and grow a QnR engineering team: hire, develop, and retain top reliability and quality engineering talent.
- Set technical direction for component qualification, reliability testing (HALT, HTOL, thermal cycling, QRV), DFMEA, and vendor quality standards across all Trainium programs.
- Own quality and reliability outcomes end-to-end, from DFM input during design through fleet reliability performance.
- Drive component-specific manufacturing process quality improvements in partnership with Manufacturing Engineering, establishing incoming quality requirements and process controls at all supplier sites.
- Build and maintain the reliability prediction and monitoring infrastructure, ensuring that fleet performance is tracked against predictions, degradation trends are identified early, and corrective actions are data-driven.
- Establish systematic failure analysis processes that connect field failures back to manufacturing history, supplier data, and component-level root cause for rapid containment.
- Scale qualification processes to keep pace with multi-supplier, multi-generation production, including automation of qualification workflows and standardization of test methodologies across vendors.
About the team
Annapurna Labs is a wholly owned subsidiary of AWS, focused on developing custom silicon and servers including the Nitro, Graviton, and Trainium families of processors. Machine Learning Annapurna (MLA) functions as a vertically integrated team including software, firmware, hardware, and silicon design in a single organization. We are the Trainium Servers and Systems organization under MLA, focused on Hardware Development, Software Development, Fleet Ops Systems, and Manufacturing, Quality, and Reliability. This position leads the Quality and Reliability Engineering function within the Manufacturing, Quality and Reliability team.
Basic Qualifications
- Experience in root cause analysis and error correction, identifying changes to procedures and systems to implement long-term fixes and avoid repeating issues
- Bachelor's degree in Reliability Engineering, Electrical Engineering, Mechanical Engineering, Materials Science, Physics, or related field
- 10+ years of reliability or quality engineering experience with server compute platforms, semiconductor packaging, or high-volume electronics manufacturing
- 5+ years of people management experience leading reliability, quality, or hardware engineering teams
- Experience establishing quality management systems and reliability programs across multiple manufacturing vendors or sites