OpenAI's Hardware Health team is dedicated to ensuring the optimal performance and reliability of custom-built hyperscale supercomputers. The team focuses on maximizing supercomputing capacity for research and ensuring minimal impact of hardware faults on researchers. As a software engineer on the Hardware Health team, you will maintain a sophisticated suite of hardware health tests and collaborate with researchers and the Supercomputing team on root-causing and reproducing newly discovered problems. The role requires 7+ years of industry experience in software engineering, excellent abilities in Python and shell scripting, and comfort with data analysis using SQL, PromQL, and Pandas. The ideal candidate should have a strong sense of ownership, attention to detail, and prior team lead experience. The team operates within the broader Platform organization, incubated inside OpenAI's Research team, supporting the engineering and research required to train large-scale AI models. OpenAI offers a fast-paced environment with high autonomy and the opportunity to affect change in cutting-edge AI research infrastructure.
Key Responsibilities:
Required Skills:
Bonus Skills:
OpenAI is committed to diversity, equal opportunity, and providing reasonable accommodations to applicants with disabilities. The company offers a competitive compensation range of $295K – $440K for this role.