Anthropic is seeking an Interpretability Research Engineer to join its mission of creating safe and beneficial AI systems. The role focuses on reverse engineering how trained models work, specifically through mechanistic interpretability, which aims to discover how neural network parameters map to meaningful algorithms. The team recently extracted millions of interpretable features from the Claude 3 Sonnet model and created "Golden Gate Claude."
The position offers a unique opportunity to work at the intersection of AI research and engineering, collaborating with teams across Anthropic, including Alignment Science and Societal Impacts. The work involves implementing research experiments, optimizing large-scale systems, and building tools to improve model safety.
The ideal candidate will have 5-10+ years of software development experience, strong programming skills (particularly in Python), and experience contributing to AI research projects. They should be comfortable with fast-paced, collaborative work and have a genuine interest in machine learning research and its ethical implications.
Anthropic operates as a public benefit corporation and emphasizes big science and cohesive teamwork, valuing high-impact work over smaller, isolated puzzles and approaching AI research as an empirical science. The company offers competitive compensation ($315,000-$560,000), comprehensive benefits, and a collaborative work environment in San Francisco.
The role requires at least 25% office presence and includes opportunities to work with cutting-edge AI technology, particularly in model interpretability and safety. Anthropic actively encourages applications from diverse backgrounds and provides visa sponsorship support. The position offers a chance to contribute to significant AI research while focusing on making advanced systems safe and beneficial for society.