The Fleet Reliability Operations team at CoreWeave is responsible for the day-to-day provisioning, management, and uptime of CoreWeave's ever-expanding fleet of server nodes. This team plays a central role in CoreWeave's growth strategy, working on the front line of configuration, updates, and remote troubleshooting for the company's highest tier of supercomputing clusters and the networking, delivery platforms, and tooling those clusters depend on.
Key Responsibilities:
• Configure and maintain large-scale, high-performance supercomputing clusters running state-of-the-art GPUs
• Troubleshoot hardware and software issues; escalate and coordinate as needed with data center, network, hardware, and platform teams to drive resolution
• Monitor and analyze system performance and take appropriate remediation actions for cloud health
• Create and maintain documentation of team processes, knowledge, and best practices for system management
• Think critically about day-to-day work and collaborate to improve team processes and efficiency
Required Skills and Experience:
• 2 or more years of experience troubleshooting or administering data center or on-prem infrastructure (servers, storage, network, or a mix)
• Strong understanding of Linux system administration and networking concepts
• Ability to troubleshoot hardware and software issues and perform system maintenance tasks consistently and reliably
• Bachelor's degree in a related field or equivalent experience
Ideal candidates may also have experience in:
• Software development or scripting languages (Bash, Python, PowerShell, etc.)
• Grafana, Prometheus, PromQL queries, or similar observability platforms
• Data center environments, including server racks, HVAC systems, and fiber trays
• Kubernetes administration
• HPC, including administering GPU-related workloads
CoreWeave offers a competitive salary range of $80,000-$110,000 for this position, along with a comprehensive benefits package including medical, dental, and vision insurance, life insurance, disability insurance, 401(k) with employer match, flexible PTO, and various other perks.
The company operates as a hybrid workplace, offering employees flexibility in structuring their time between in-office and remote work. Remote work may be considered for candidates with strongly aligned skills and experience who do not live within 30 miles of an office.