Reliability Engineer

AI research company focused on creating reliable, interpretable, and steerable AI systems for safe and beneficial use.
$323,850 - $494,300
Site Reliability
Senior Software Engineer
Hybrid
101 - 500 Employees
5+ years of experience
AI

Description For Reliability Engineer

Anthropic, a pioneering AI research company, is seeking a Reliability Engineer to join its mission of creating safe and beneficial AI systems. This role combines traditional Site Reliability Engineering with the unique challenges of AI infrastructure. It offers an opportunity to work on cutting-edge AI technologies while keeping them running reliably and efficiently.

The role involves developing and maintaining critical infrastructure for AI model serving and training, implementing monitoring systems for availability, latency, and other metrics, and managing high-availability deployments across multiple regions. You'll be responsible for defining and meeting reliability targets for both internal and external products, while using modern AI capabilities to advance site reliability engineering practice.

The ideal candidate brings deep expertise in distributed systems, an understanding of AI infrastructure, and strong experience with SLO/SLA frameworks. You'll work in a collaborative environment with researchers, engineers, and policy experts, contributing to Anthropic's mission of beneficial AI development. The company offers competitive compensation (£255,000 - £390,000), comprehensive benefits, and a flexible hybrid work arrangement in San Francisco.

What makes this role unique is the opportunity to work on large-scale AI infrastructure (more than 1,000 GPUs) and contribute to groundbreaking AI research. You'll be part of a team that values impact over incremental improvements and treats AI research as an empirical science. The company's commitment to diversity and inclusion, along with its focus on the ethical implications of AI systems, makes this an ideal position for those who want to make a meaningful impact in the field of AI reliability.

Last updated 3 months ago

Responsibilities For Reliability Engineer

  • Develop Service Level Objectives for large language model serving and training systems
  • Design and implement monitoring systems for availability, latency and other metrics
  • Design and implement high-availability language model serving infrastructure
  • Develop and manage automated failover and recovery systems across multiple regions
  • Lead incident response for critical AI services
  • Build and maintain cost optimization systems for large-scale AI infrastructure

Requirements For Reliability Engineer

  • Kubernetes
  • Linux
  • Extensive experience with distributed systems observability and monitoring at scale
  • Understanding of AI infrastructure operations
  • Proven experience with SLO/SLA frameworks
  • Experience with both traditional and AI-specific metrics
  • Experience with chaos engineering and resilience testing
  • Ability to bridge ML engineers and infrastructure teams
  • Excellent communication skills

Benefits For Reliability Engineer

  • Medical insurance
  • Visa sponsorship
  • Parental leave
  • Competitive compensation and benefits
  • Optional equity donation matching
  • Generous vacation and parental leave
  • Flexible working hours
  • Office space for collaboration
