Reliability Engineer

Anthropic

Anthropic creates reliable, interpretable, and steerable AI systems, focusing on safe and beneficial AI development through research and engineering.

San Francisco, CA, USA

$320,000 - $485,000

Site Reliability

Senior Software Engineer

Hybrid

5+ years of experience

Description For Reliability Engineer

Anthropic, a pioneering AI research company, is seeking a Senior Reliability Engineer to join their mission of creating safe and beneficial AI systems. This role is crucial for defining and achieving reliability metrics for both internal and external products and services.

The position offers an exciting opportunity to work at the intersection of Site Reliability Engineering and AI systems, focusing on maintaining and improving the infrastructure that powers large language models. You'll be responsible for developing Service Level Objectives, implementing monitoring systems, and managing high-availability infrastructure capable of serving millions of customers.

The ideal candidate brings extensive experience in distributed systems observability, understanding of AI infrastructure challenges, and proven expertise in implementing SLO/SLA frameworks. Strong candidates may have additional experience with large-scale model training infrastructure (>1000 GPUs), ML hardware accelerators, and AI-specific observability tools.

Anthropic offers a competitive compensation package ranging from $320,000 to $485,000 USD, along with benefits including equity options, visa sponsorship, generous vacation time, and flexible working hours. The position is hybrid and based in San Francisco, with at least 25% in-office presence required.

The company operates as a public benefit corporation and values diversity and inclusion, encouraging applications from candidates of all backgrounds. They work as a cohesive team on large-scale research efforts, prioritizing impact and collaborative research discussions. This role presents an opportunity to contribute to groundbreaking AI technologies while ensuring their safe and reliable deployment for the benefit of humanity.

Working at Anthropic means joining a team that views AI research as an empirical science, combining elements of physics, biology, and computer science. The company's research builds on significant prior work such as GPT-3, circuit-based interpretability, and AI safety, making this an ideal position for those passionate about advancing AI reliability while maintaining high standards of safety and ethics.

Last updated 3 months ago

Responsibilities For Reliability Engineer

Develop Service Level Objectives for large language model serving and training systems
Design and implement monitoring systems for availability, latency and other metrics
Design and implement high-availability language model serving infrastructure
Develop and manage automated failover and recovery systems across multiple regions
Lead incident response for critical AI services
Build and maintain cost optimization systems for large-scale AI infrastructure

Requirements For Reliability Engineer

Kubernetes

Extensive experience with distributed systems observability and monitoring at scale
Understanding of the challenges of operating AI infrastructure
Proven experience implementing and maintaining SLO/SLA frameworks
Experience with both traditional and AI-specific metrics
Experience with chaos engineering and systematic resilience testing
Ability to bridge the gap between ML engineers and infrastructure teams
Excellent communication skills

Benefits For Reliability Engineer

Visa Sponsorship

Equity

Competitive compensation and benefits
Optional equity donation matching
Generous vacation and parental leave
Flexible working hours
Office space for collaboration

Jobs Related To Anthropic Reliability Engineer

Reliability Engineer

Anthropic

Senior Site Reliability Engineer position at Anthropic, focusing on maintaining and optimizing large-scale AI infrastructure while ensuring reliable and safe AI system operations.

Site Reliability Developer (JoinOCI-Ns2)

Oracle

Senior Site Reliability Developer position at Oracle, focusing on cloud infrastructure and distributed systems, requiring TS/SCI clearance and 5+ years of experience.

Site Reliability Engineer

AION

Senior Site Reliability Engineer role at AION, building and maintaining infrastructure for a decentralized AI cloud platform with focus on automation and reliability.

Senior Software Developer, Site Reliability Engineering, Google Cloud

Google

Senior Software Developer role in Site Reliability Engineering at Google Cloud, focusing on building and maintaining large-scale distributed systems with emphasis on reliability and automation.

Senior Software Developer, Site Reliability Engineering, Google Cloud

Google

Senior SRE role at Google Cloud focusing on building and maintaining large-scale distributed systems with competitive compensation and comprehensive benefits.