Software Engineer, Hardware Health

AI research and deployment company dedicated to ensuring general-purpose artificial intelligence benefits all of humanity.
$295,000 - $440,000
Backend
Senior Software Engineer
In-Person
7+ years of experience
AI

Description For Software Engineer, Hardware Health

OpenAI's Hardware Health team is dedicated to ensuring the optimal performance and reliability of custom-built hyperscale supercomputers. The team focuses on maximizing supercomputing capacity for research and ensuring minimal impact of hardware faults on researchers. As a software engineer on the Hardware Health team, you will maintain a sophisticated suite of hardware health tests and collaborate with researchers and the Supercomputing team on root-causing and reproducing newly discovered problems. The role requires 7+ years of industry experience in software engineering, excellent abilities in Python and shell scripting, and comfort with data analysis using SQL, PromQL, and Pandas. The ideal candidate should have a strong sense of ownership, attention to detail, and prior team lead experience. The team operates within the broader Platform organization, incubated inside OpenAI's Research team, supporting the engineering and research required to train large-scale AI models. OpenAI offers a fast-paced environment with high autonomy and the opportunity to affect change in cutting-edge AI research infrastructure.

Key Responsibilities:

  • Maintain and develop hardware health tests
  • Collaborate on root-causing and reproducing hardware problems
  • Analyze data using SQL, PromQL, and Pandas
  • Develop reproducible analyses and build dashboards/visualizations
  • Monitor outcomes of deployed updates

Required Skills:

  • 7+ years of industry experience in software engineering
  • Strong Python and shell scripting abilities
  • Data analysis skills with SQL, PromQL, and Pandas
  • Experience in building dashboards and visualizations
  • Attention to detail in identifying data anomalies
  • Prior team lead experience (2+ years)

Bonus Skills:

  • Expertise in low-level hardware components and protocols
  • Experience with Linux tooling
  • Visualization skills for large data centers and networks

OpenAI is committed to diversity, equal opportunity, and providing reasonable accommodations to applicants with disabilities. The company offers a competitive compensation range of $295K – $440K for this role.

Last updated a month ago

Responsibilities For Software Engineer, Hardware Health

  • Maintain a sophisticated suite of hardware health tests
  • Collaborate with researchers and Supercomputing team on root-causing and reproducing problems
  • Analyze data using SQL, PromQL, and Pandas
  • Develop reproducible analyses and build dashboards/visualizations
  • Monitor outcome of deployed updates

Requirements For Software Engineer, Hardware Health

Python
Linux
  • 7+ years of industry experience in software engineering
  • Excellent abilities developing in Python and shell scripting
  • High degree of comfort with SQL, PromQL, and Pandas
  • Experience developing reproducible analyses / building dashboards and visualizations
  • Strong attention to detail
  • Prior TL experience (2+ years)

Interested in this job?

Jobs Related To OpenAI Software Engineer, Hardware Health

Sr ECAD Application Engineer, Project Kuiper Satellites

Senior ECAD Tools Application Engineer position at Amazon's Project Kuiper, focusing on satellite constellation development and ECAD tool management.

System Development Engineer, Private Pricing Product Management (3PM)

Senior Systems Development Engineer role at AWS focusing on Private Pricing Product Management, building scalable solutions and tools using modern technologies.

Senior Product Manager - Tech

Lead Amazon's Buy Now checkout experience as Senior Product Manager, driving innovation in e-commerce with competitive compensation and comprehensive benefits.

Senior Software Development Engineer, AWS Alameda

Senior Software Engineer role at AWS Alameda, focusing on control plane development and distributed systems with 5+ years of experience required.

Software Dev Engineer (L5), Global Talent Management & Compensation

Senior Software Engineer role at Amazon's Edinburgh office, building scalable talent management solutions using AWS technologies.