Operations Engineer, Fleet Reliability

CoreWeave is a company specializing in high-performance computing and cloud infrastructure, with a focus on GPU-accelerated workloads and supercomputing clusters.
Roseland, NJ, USA · Brooklyn, NY, USA · Sunnyvale, CA, USA
$80,000 - $110,000
DevOps
Mid-Level Software Engineer
Hybrid
2+ years of experience
AI · Enterprise SaaS

Description For Operations Engineer, Fleet Reliability

The Fleet Reliability Operations team at CoreWeave is responsible for the day-to-day provisioning, management, and uptime of CoreWeave's ever-expanding fleet of server nodes. This team plays a central role in CoreWeave's growth strategy, working on the front line of configuration, updates, and remote troubleshooting for its highest tier of supercomputing clusters and the networking, delivery platforms, and tooling those clusters depend on.

Key Responsibilities:
  • Configure and maintain large-scale, high-performance supercomputing clusters running state-of-the-art GPUs
  • Troubleshoot hardware and software issues; escalate and coordinate as needed with data center, network, hardware, and platform teams to drive resolution
  • Monitor and analyze system performance and take appropriate remediation actions for cloud health
  • Approach work with flexibility and optimism, anticipating shifting business and technical priorities
  • Create and maintain documentation of team processes, knowledge, and best practices for system management
  • Think critically about day-to-day work and collaborate to improve team processes and efficiency

Required Skills and Experience:
  • 2 or more years of experience troubleshooting or administering data center or on-prem infrastructure (servers, storage, network or a mix)
  • Strong understanding of Linux system administration and networking concepts
  • Ability to troubleshoot hardware and software issues and perform system maintenance tasks consistently and reliably
  • Bachelor's degree in a related field or equivalent experience

Ideal candidates may also have experience in:
  • Software development or scripting languages (Bash, Python, PowerShell, etc.)
  • Grafana, Prometheus, PromQL queries, or similar observability platforms
  • Data center environments, including server racks, HVAC systems, and fiber trays
  • Kubernetes administration
  • HPC: administering GPU-related workloads

CoreWeave offers a competitive salary range of $80,000-$110,000, based on factors such as market location, job-related knowledge, skills, and experience. They also provide a comprehensive benefits package including medical, dental, and vision insurance, life insurance, disability insurance, 401(k) with employer match, flexible PTO, and various other perks and support programs.

CoreWeave operates as a hybrid workplace, offering employees flexibility in structuring their time between in-office and remote work. They prioritize fostering connections, collaboration, and creativity within their office culture while allowing employees to tailor their work-life balance to individual preferences.

Last updated 2 months ago

Responsibilities For Operations Engineer, Fleet Reliability

  • Configure and maintain large-scale, high-performance supercomputing clusters running state-of-the-art GPUs
  • Troubleshoot hardware and software issues; escalate and coordinate as needed with data center, network, hardware, and platform teams to drive resolution
  • Monitor and analyze system performance and take appropriate remediation actions for cloud health
  • Approach work with flexibility and optimism, anticipating shifting business and technical priorities
  • Create and maintain documentation of team processes, knowledge, and best practices for system management
  • Think critically about day-to-day work and collaborate to improve team processes and efficiency

Requirements For Operations Engineer, Fleet Reliability

Linux
Kubernetes
  • 2 or more years of experience troubleshooting or administering data center or on-prem infrastructure (servers, storage, network or a mix)
  • Strong understanding of Linux system administration and networking concepts
  • Ability to troubleshoot hardware and software issues and perform system maintenance tasks consistently and reliably
  • Bachelor's degree in a related field or equivalent experience

Benefits For Operations Engineer, Fleet Reliability

  • Medical insurance
  • Dental insurance
  • Vision insurance
  • Life Insurance
  • Short and long-term disability insurance
  • Flexible Spending Account
  • Tuition Reimbursement
  • Mental Wellness Benefits through Spring Health
  • Family-Forming support provided by Carrot
  • Paid Parental Leave
  • Flexible, full-service childcare support with Kinside
  • 401(k) with a generous employer match
  • Flexible PTO
  • Catered lunch each day in office and data center locations
