Staff Software Engineer, ML Ops and Infrastructure

Training and deploying frontier models for developers and enterprises building AI systems for content generation, semantic search, RAG, and agents.
DevOps
Staff Software Engineer
Remote
5+ years of experience
AI

Description For Staff Software Engineer, ML Ops and Infrastructure

Cohere is at the forefront of AI development, training and deploying frontier models for enterprises and developers. As a Staff Software Engineer in ML Ops and Infrastructure, you'll be instrumental in building the foundation that powers Cohere's AI systems. The role demands expertise in large-scale infrastructure management, with a focus on Kubernetes and GPU workloads. You'll be working in EMEA, joining a team that values technical excellence and collaborative problem-solving.

The position requires strong experience with one or more cloud platforms (GCP, Azure, AWS, OCI) and with Linux environments. You'll be responsible for developing self-service systems and custom Kubernetes operators, and for ensuring robust observability. The role includes participation in a 24x7 on-call rotation (with compensation) and requires 5+ years of engineering experience.

Cohere offers an inclusive work environment with benefits that include comprehensive health coverage, parental leave, and flexible remote work options. The company maintains offices in Toronto, New York, San Francisco and London and provides 6 weeks of vacation. Cohere values diversity and encourages applications from all backgrounds, providing accommodations as needed during recruitment.

The ideal candidate will have proven production experience with Kubernetes, hands-on coding experience in Go, and a passion for building systems that enhance team productivity. You'll work with cutting-edge AI technology while contributing to open-source solutions and mentoring team members. The role offers an opportunity to shape the future of AI infrastructure while working with some of the best talent in the field.

Responsibilities For Staff Software Engineer, ML Ops and Infrastructure

  • Build self-service systems that automate managing, deploying and operating services
  • Build custom Kubernetes operators that support language model deployments
  • Automate environment observability and resilience
  • Ensure defined SLOs are met, including participation in a 24x7 on-call rotation
  • Build strong relationships with internal developers
  • Influence the Infrastructure team's roadmap
  • Develop the team through knowledge sharing and an active review process

Requirements For Staff Software Engineer, ML Ops and Infrastructure

Go
Kubernetes
Linux
  • 5+ years of engineering experience running production infrastructure at large scale
  • Experience designing large, highly available distributed systems with Kubernetes and GPU workloads
  • Experience working with GCP, Azure, AWS and/or OCI
  • Experience designing, deploying, supporting, and troubleshooting complex Linux-based computing environments
  • Excellent collaboration and troubleshooting skills
  • The grit and adaptability to solve complex technical challenges

Benefits For Staff Software Engineer, ML Ops and Infrastructure

Dental Insurance
Medical Insurance
Mental Health Assistance
Parental Leave
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits
  • Mental health budget
  • 100% Parental Leave top-up for 6 months (Canada, US, and UK)
  • Personal enrichment benefits
  • Remote-flexible work
  • Co-working stipend
  • 6 weeks of vacation
  • Offices in Toronto, New York, San Francisco and London
