Software Engineer, Reliability

AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.
$245,000 - $385,000
Site Reliability
Senior Software Engineer
In-Person
501 - 1,000 Employees
5+ years of experience

Description For Software Engineer, Reliability

OpenAI is seeking a Software Engineer, Reliability to join its Applied Engineering team in San Francisco. This role is crucial for ensuring the reliability, scalability, and performance of OpenAI's systems as the company expands its AI technology.

As a reliability expert, you'll be at the forefront of maintaining and enhancing the stability, scalability, and performance of OpenAI's rapidly evolving infrastructure. You'll work closely with cross-functional teams, including software engineers, product managers, and data scientists, to build and maintain resilient systems that can handle a growing user base and workload.

Key responsibilities include:

  • Designing and implementing solutions to ensure infrastructure scalability
  • Collaborating with development teams to enhance system reliability
  • Implementing and managing proactive monitoring systems
  • Developing and maintaining SLOs and SLIs
  • Implementing fault-tolerant and resilient design patterns
  • Building automation tools to improve system reliability
  • Partnering with various teams to bring new features and research capabilities to the world
  • Participating in an on-call rotation for critical incident response

The ideal candidate will have:

  • A Bachelor's degree in Computer Science or related field (or equivalent experience)
  • Proven experience as a reliability engineer in a fast-paced, scaling environment
  • Strong proficiency in cloud infrastructure and programming languages
  • Experience with Kubernetes, containerization, and IaC tools
  • Excellent problem-solving and communication skills
  • Experience with observability tools and microservices architecture

OpenAI offers a competitive salary range of $245K – $385K, along with generous equity and benefits. These include comprehensive health insurance, mental health support, a 401(k) with 50% matching, unlimited time off, paid parental leave, and an annual learning stipend.

Join OpenAI in its mission to ensure that general-purpose artificial intelligence benefits all of humanity. The company is committed to creating a diverse, equitable, and inclusive culture where all employees feel welcome and empowered to contribute. If you're passionate about AI technology and want to be part of a team that's shaping the future of AI while prioritizing safety and reliability, this could be the perfect opportunity for you.

Last updated 4 months ago

Responsibilities For Software Engineer, Reliability

  • Design and implement solutions to ensure the scalability of our infrastructure to meet rapidly increasing demands
  • Collaborate with development teams to make the systems they design and operate more reliable
  • Implement and manage monitoring systems to proactively identify issues and anomalies in our production environment
  • Develop and maintain service level objectives (SLOs) and service level indicators (SLIs) to measure and ensure system reliability
  • Implement fault-tolerant and resilient design patterns to minimize service disruptions
  • Build and maintain automation tools to streamline repetitive tasks and improve system reliability
  • Partner with researchers, engineers, product managers, and designers to bring new features and research capabilities to the world
  • Participate in an on-call rotation to respond to critical incidents and ensure 24/7 system availability

Requirements For Software Engineer, Reliability

Kubernetes
Linux
  • Bachelor's degree in Computer Science, Information Technology, or a related field (or equivalent work experience)
  • Proven experience as a reliability engineer or a similar role in a fast-paced, rapidly scaling company
  • Strong proficiency in cloud infrastructure
  • Proficiency in programming/scripting languages
  • Experience with containerization technologies and container orchestration platforms like Kubernetes
  • Knowledge of IaC tools such as Terraform or CloudFormation
  • Excellent problem-solving and troubleshooting skills
  • Strong communication and collaboration skills
  • Experience with observability tools such as Datadog, Prometheus, Grafana, Splunk, and the ELK stack
  • Experience with microservices architecture and service mesh technologies
  • Knowledge of security best practices in cloud environments

Benefits For Software Engineer, Reliability

  • Medical, dental, and vision insurance for you and your family
  • Mental health and wellness support
  • 401(k) plan with 50% matching
  • Unlimited time off and 13 company holidays per year
  • Paid parental leave (20 weeks) and family-planning support
  • Annual learning & development stipend ($1,500 per year)
