Cohere is at the forefront of AI technology, training and deploying frontier models for developers and enterprises. The Infrastructure team plays a crucial role in building the foundation that supports all of Cohere's technical operations. This role combines Site Reliability Engineering, Storage Engineering, and MLOps expertise to maintain and scale highly available distributed systems.
The position requires extensive experience with production infrastructure at scale, particularly with Kubernetes and GPU workloads. You'll be responsible for designing and managing complex Linux-based distributed computing environments while working closely with ML Engineers and data scientists. The role involves participating in a compensated 24/7 on-call rotation.
Cohere offers a flexible, remote-friendly environment with offices in major tech hubs including Toronto, New York, San Francisco, and London. The company values diversity and fosters an inclusive culture where teams collaborate across different time zones. Benefits include comprehensive health coverage, parental leave, personal enrichment allowances, and generous vacation time.
The ideal candidate brings 5+ years of engineering experience, strong problem-solving abilities, and adaptability. Specific expertise in areas such as distributed filesystems, cloud provider integration, or analytics and observability tools is highly valued. You'll be joining a team of world-class professionals working on cutting-edge AI technology, making a direct impact on the company's mission to scale intelligence to serve humanity.