Staff Software Engineer, ML Ops and Infrastructure

Training and deploying frontier models for developers and enterprises building AI systems for content generation, semantic search, RAG, and agents.
DevOps
Staff Software Engineer
Remote
5+ years of experience
AI

Description For Staff Software Engineer, ML Ops and Infrastructure

Cohere is at the forefront of AI development, training and deploying frontier models for enterprises and developers. As a Staff Software Engineer in ML Ops and Infrastructure, you'll be instrumental in building the foundation that powers Cohere's AI systems. The role demands expertise in large-scale infrastructure management, with a focus on Kubernetes and GPU workloads. You'll be working in EMEA, joining a team that values technical excellence and collaborative problem-solving.

The position requires strong experience with cloud platforms (GCP, Azure, AWS, OCI) and Linux environments. You'll be responsible for developing self-service systems, custom Kubernetes operators, and ensuring robust observability. The role includes participation in a 24x7 on-call rotation (with compensation) and requires 5+ years of engineering experience.

Cohere offers an inclusive work environment and benefits that include comprehensive health coverage, parental leave, and flexible remote work options. The company maintains offices in major tech hubs and provides 6 weeks of vacation. It values diversity and encourages applications from all backgrounds, providing accommodations as needed during recruitment.

The ideal candidate will have proven production experience with Kubernetes, hands-on coding experience in Go, and a passion for building systems that enhance team productivity. You'll be working with cutting-edge AI technology while contributing to open-source solutions and mentoring team members. The role offers an opportunity to shape the future of AI infrastructure while working with some of the best talent in the field.

Last updated 2 months ago

Responsibilities For Staff Software Engineer, ML Ops and Infrastructure

  • Build self-service systems that automate managing, deploying and operating services
  • Build custom Kubernetes operators that support language model deployments
  • Automate environment observability and resilience
  • Ensure defined SLOs are met, including participation in a 24x7 on-call rotation
  • Build strong relationships with internal developers
  • Influence the Infrastructure team's roadmap
  • Develop the team through knowledge sharing and an active review process

Requirements For Staff Software Engineer, ML Ops and Infrastructure

Go
Kubernetes
Linux
  • 5+ years of engineering experience running production infrastructure at large scale
  • Experience designing large, highly available distributed systems with Kubernetes and GPU workloads
  • Experience working with GCP, Azure, AWS and/or OCI
  • Experience in designing, deploying, supporting, and troubleshooting in complex Linux-based computing environments
  • Excellent collaboration and troubleshooting skills
  • The grit and adaptability to solve complex technical challenges

Benefits For Staff Software Engineer, ML Ops and Infrastructure

Dental Insurance
Medical Insurance
Mental Health Assistance
Parental Leave
  • Weekly lunch stipend, in-office lunches & snacks
  • Full health and dental benefits
  • Mental health budget
  • 100% Parental Leave top-up for 6 months (Canada, US, and UK)
  • Personal enrichment benefits
  • Remote-flexible work
  • Co-working stipend
  • 6 weeks of vacation
  • Offices in Toronto, New York, San Francisco and London
