Infrastructure Talent Pool (Storage Engineer, Site Reliability Engineer, MLOps)

Training and deploying frontier models for developers and enterprises building AI systems for content generation, semantic search, RAG, and agents.
Site Reliability
Staff Software Engineer
Hybrid
5+ years of experience
AI

Description For Infrastructure Talent Pool (Storage Engineer, Site Reliability Engineer, MLOps)

Cohere is at the forefront of AI technology, training and deploying frontier models for developers and enterprises. The Infrastructure team plays a crucial role in building the foundation that supports all of Cohere's technical operations. This role combines Site Reliability Engineering, Storage Engineering, and MLOps expertise to maintain and scale highly available distributed systems.

The position requires extensive experience with production infrastructure at scale, particularly working with Kubernetes and GPU workloads. You'll be responsible for designing and managing complex Linux-based distributed computing environments, while working closely with ML Engineers and data scientists. The role involves participating in a 24x7 on-call rotation (with compensation).

Cohere offers a flexible, remote-friendly environment with offices in major tech hubs including Toronto, New York, San Francisco, and London. The company values diversity and fosters an inclusive culture where teams collaborate across different time zones. Benefits include comprehensive health coverage, parental leave, personal enrichment allowances, and generous vacation time.

The ideal candidate brings 5+ years of engineering experience, strong problem-solving abilities, and adaptability. Specific expertise in areas such as distributed filesystems, cloud provider integration, or analytics and observability tools is highly valued. You'll be joining a team of world-class professionals working on cutting-edge AI technology, making a direct impact on the company's mission to scale intelligence to serve humanity.

Last updated 3 months ago

Responsibilities For Infrastructure Talent Pool (Storage Engineer, Site Reliability Engineer, MLOps)

  • Build world-class infrastructure critical to Cohere's success
  • Focus on stability, scalability, and observability
  • Design and manage distributed systems
  • Support and troubleshoot complex computing environments

Requirements For Infrastructure Talent Pool (Storage Engineer, Site Reliability Engineer, MLOps)

Kubernetes
Linux
  • 5+ years of engineering experience running production infrastructure at large scale
  • Experience working with and supporting MLEs or data scientists
  • Experience designing large, highly available distributed systems on Kubernetes and running GPU workloads on those clusters
  • Experience designing, deploying, supporting, and troubleshooting complex Linux-based distributed computing environments
  • Willingness to participate in a compensated 24x7 on-call rotation

Benefits For Infrastructure Talent Pool (Storage Engineer, Site Reliability Engineer, MLOps)

Dental Insurance
Medical Insurance
Mental Health Assistance
Parental Leave
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits
  • Mental health budget
  • 100% Parental Leave top-up for 6 months
  • Personal enrichment benefits for arts, culture, fitness, and workspace improvement
  • Co-working stipend
  • 6 weeks of vacation

