Reliability Engineer

Anthropic

AI research company focused on creating reliable, interpretable, and steerable AI systems for safe and beneficial use.

San Francisco, CA, USA

$323,850 - $494,300

Site Reliability

Senior Software Engineer

Hybrid

101 - 500 Employees

5+ years of experience

Description For Reliability Engineer

Anthropic, an AI research company, is seeking a Reliability Engineer to join its mission of creating safe and beneficial AI systems. The role combines traditional Site Reliability Engineering with the challenges specific to AI infrastructure: keeping model serving and training systems reliable and efficient.

The role involves developing and maintaining critical infrastructure for AI model serving and training, implementing sophisticated monitoring systems, and managing high-availability deployments across multiple regions. You'll be responsible for establishing and achieving reliability metrics for both internal and external products, while leveraging modern AI capabilities to innovate in the field of site reliability engineering.

The ideal candidate brings deep expertise in distributed systems, an understanding of AI infrastructure, and strong experience with SLO/SLA frameworks. You'll work in a collaborative environment with researchers, engineers, and policy experts, contributing to Anthropic's mission of beneficial AI development. The company offers competitive compensation ($323,850 - $494,300), comprehensive benefits, and a flexible hybrid work arrangement in San Francisco.

What makes this role unique is the opportunity to work on large-scale AI infrastructure (>1000 GPUs) and contribute to AI research. You'll be part of a team that values impact over incremental improvements and treats AI research as an empirical science. The company's commitment to diversity and inclusion, along with its focus on the ethical implications of AI systems, makes this a good fit for those who want to make a meaningful impact in the field of AI reliability.

Last updated 3 months ago

Responsibilities For Reliability Engineer

Develop Service Level Objectives for large language model serving and training systems
Design and implement monitoring systems for availability, latency and other metrics
Design and implement high-availability language model serving infrastructure
Develop and manage automated failover and recovery systems across multiple regions
Lead incident response for critical AI services
Build and maintain cost optimization systems for large-scale AI infrastructure

Requirements For Reliability Engineer

Kubernetes

Linux

Extensive experience with distributed systems observability and monitoring at scale
Understanding of AI infrastructure operations
Proven experience with SLO/SLA frameworks
Experience with both traditional and AI-specific metrics
Experience with chaos engineering and resilience testing
Ability to bridge ML engineers and infrastructure teams
Excellent communication skills

Benefits For Reliability Engineer

Medical Insurance

Visa Sponsorship

Parental Leave

Competitive compensation and benefits
Optional equity donation matching
Generous vacation and parental leave
Flexible working hours
Office space for collaboration
