Reliability Engineer

AI research company focused on creating reliable, interpretable, and steerable AI systems for safe and beneficial use.
$323,850 - $494,300
Site Reliability
Senior Software Engineer
Hybrid
101 - 500 Employees
5+ years of experience
AI

Description For Reliability Engineer

Anthropic, an AI research company, is seeking a Reliability Engineer to join its mission of creating safe and beneficial AI systems. This role combines traditional Site Reliability Engineering with the challenges specific to AI infrastructure. The position offers an opportunity to work on advanced AI systems while ensuring they operate reliably and efficiently.

The role involves developing and maintaining critical infrastructure for AI model serving and training, implementing monitoring systems, and managing high-availability deployments across multiple regions. You'll be responsible for defining and meeting reliability metrics for both internal and external products, and for applying modern AI capabilities to site reliability engineering itself.

The ideal candidate brings deep expertise in distributed systems, an understanding of AI infrastructure, and strong experience with SLO/SLA frameworks. You'll work in a collaborative environment with researchers, engineers, and policy experts, contributing to Anthropic's mission of beneficial AI development. The company offers competitive compensation (£255,000 - £390,000), comprehensive benefits, and a flexible hybrid work arrangement in San Francisco.

The role stands out for the opportunity to work on large-scale AI infrastructure (more than 1,000 GPUs) and contribute to AI research. You'll be part of a team that values impact over incremental improvements and treats AI research as an empirical science. The company's commitment to diversity and inclusion, along with its focus on the ethical implications of AI systems, makes this a good fit for those who want to make a meaningful impact in the field of AI reliability.

Last updated a day ago

Responsibilities For Reliability Engineer

  • Develop Service Level Objectives for large language model serving and training systems
  • Design and implement monitoring systems for availability, latency and other metrics
  • Design and implement high-availability language model serving infrastructure
  • Develop and manage automated failover and recovery systems across multiple regions
  • Lead incident response for critical AI services
  • Build and maintain cost optimization systems for large-scale AI infrastructure

Requirements For Reliability Engineer

  • Kubernetes
  • Linux
  • Extensive experience with distributed systems observability and monitoring at scale
  • Understanding of AI infrastructure operations
  • Proven experience with SLO/SLA frameworks
  • Experience with both traditional and AI-specific metrics
  • Experience with chaos engineering and resilience testing
  • Ability to bridge ML engineers and infrastructure teams
  • Excellent communication skills

Benefits For Reliability Engineer

  • Medical insurance
  • Visa sponsorship
  • Parental leave
  • Competitive compensation and benefits
  • Optional equity donation matching
  • Generous vacation and parental leave
  • Flexible working hours
  • Office space for collaboration
