Anthropic, a pioneering AI research company, is seeking a Reliability Engineer to join its mission of creating safe and beneficial AI systems. This role combines traditional Site Reliability Engineering with the unique challenges of AI infrastructure. The position offers an opportunity to work on cutting-edge AI technologies while ensuring that they operate reliably and efficiently.
The role involves developing and maintaining critical infrastructure for AI model serving and training, implementing sophisticated monitoring systems, and managing high-availability deployments across multiple regions. You'll be responsible for establishing and achieving reliability metrics for both internal and external products, while leveraging modern AI capabilities to innovate in the field of site reliability engineering.
The ideal candidate brings deep expertise in distributed systems, an understanding of AI infrastructure, and strong experience with SLO/SLA frameworks. You'll work in a collaborative environment with researchers, engineers, and policy experts, contributing to Anthropic's mission of beneficial AI development. The company offers competitive compensation (£255,000 - £390,000), comprehensive benefits, and a flexible hybrid work arrangement in San Francisco.
What makes this role unique is the opportunity to work on large-scale AI infrastructure (more than 1,000 GPUs) and contribute to groundbreaking AI research. You'll be part of a team that values impact over incremental improvements, treating AI research as an empirical science. The company's commitment to diversity and inclusion, along with its focus on the ethical implications of AI systems, makes this an ideal position for those who want to make a meaningful impact in the field of AI reliability.