Infrastructure Talent Pool (Storage Engineer, Site Reliability Engineer, MLOps)

Cohere

Training and deploying frontier models for developers and enterprises building AI systems for content generation, semantic search, RAG, and agents.

Toronto, ON, Canada • New York, NY, USA • San Francisco, CA, USA…

Site Reliability

Staff Software Engineer

Hybrid

5+ years of experience

Description For Infrastructure Talent Pool (Storage Engineer, Site Reliability Engineer, MLOps)

Cohere is at the forefront of AI technology, training and deploying frontier models for developers and enterprises. The Infrastructure team plays a crucial role in building the foundation that supports all of Cohere's technical operations. This role combines Site Reliability Engineering, Storage Engineering, and MLOps expertise to maintain and scale highly available distributed systems.

The position requires extensive experience with production infrastructure at scale, particularly working with Kubernetes and GPU workloads. You'll be responsible for designing and managing complex Linux-based distributed computing environments, while working closely with ML Engineers and data scientists. The role involves participating in a 24x7 on-call rotation (with compensation).

Cohere offers a flexible, remote-friendly environment with offices in major tech hubs including Toronto, New York, San Francisco, and London. The company values diversity and fosters an inclusive culture where teams collaborate across different time zones. Benefits include comprehensive health coverage, parental leave, personal enrichment allowances, and generous vacation time.

The ideal candidate brings 5+ years of engineering experience, strong problem-solving abilities, and adaptability. Specific expertise in areas such as distributed filesystems, cloud provider integration, or analytics and observability tools is highly valued. You'll be joining a team of world-class professionals working on cutting-edge AI technology, making a direct impact on the company's mission to scale intelligence to serve humanity.

Last updated 3 months ago

Responsibilities For Infrastructure Talent Pool (Storage Engineer, Site Reliability Engineer, MLOps)

Build world-class infrastructure critical to Cohere's success
Focus on stability, scalability, and observability
Design and manage distributed systems
Support and troubleshoot complex computing environments

Requirements For Infrastructure Talent Pool (Storage Engineer, Site Reliability Engineer, MLOps)

Kubernetes

Linux

5+ years of engineering experience running production infrastructure at a large scale
Experience working with and supporting MLEs or data scientists
Experience designing large, highly available distributed systems on Kubernetes, including running GPU workloads on those clusters
Experience designing, deploying, supporting, and troubleshooting complex Linux-based distributed computing environments
Willingness to participate in a compensated 24x7 on-call rotation

Benefits For Infrastructure Talent Pool (Storage Engineer, Site Reliability Engineer, MLOps)

Weekly lunch stipend, in-office lunches & snacks
Full health and dental benefits
Mental health budget
100% Parental Leave top-up for 6 months
Personal enrichment benefits for arts, culture, fitness, and workspace improvement
Co-working stipend
6 weeks of vacation
