MLOps/DevOps Engineer - ML Platform

Qualcomm is a leading technology company specializing in wireless telecommunications products and services.
Cork, Ireland
DevOps
Mid-Level Software Engineer
Hybrid
2+ years of experience
AI

Description For MLOps/DevOps Engineer - ML Platform

We are seeking a highly skilled and experienced MLOps/DevOps Engineer to join our team and contribute to the development and maintenance of our ML and Data platform, both on premises and on AWS Cloud. As an MLOps Engineer, you will be responsible for architecting, deploying, and optimizing the ML & Data platform that supports the training of machine learning models on NVIDIA DGX clusters and Kubernetes, using technologies such as Helm, ArgoCD, Argo Workflows, Prometheus, and Grafana. Your expertise in AWS services such as EKS, EC2, VPC, IAM, S3, and EFS will be crucial in ensuring the smooth operation and scalability of our ML infrastructure.

You will work closely with cross-functional teams, including data scientists, software engineers, and infrastructure specialists. Your expertise in MLOps and DevOps, together with your knowledge of GPU clusters, will be vital in enabling efficient training and deployment of ML models.

Responsibilities will include:

  • Architect, develop, and maintain the ML & Data platform to support training and inference of ML models.
  • Design and implement scalable and reliable infrastructure solutions for NVIDIA clusters, both on premises and on AWS Cloud.
  • Collaborate with data scientists and software engineers to define requirements and ensure seamless integration of ML and Data workflows into the platform.
  • Optimize the platform's performance and scalability, considering factors such as GPU resource utilization, data ingestion, model training, and deployment.
  • Monitor and troubleshoot system performance, identifying and resolving issues to ensure the availability and reliability of the ML platform.
  • Implement and maintain CI/CD pipelines for automated model training, evaluation, and deployment using technologies such as ArgoCD and Argo Workflows.
  • Implement and maintain a monitoring stack using Prometheus and Grafana to ensure the health and performance of the platform.
  • Manage AWS services including EKS, EC2, VPC, IAM, S3, and EFS to support the platform.
  • Implement logging and monitoring solutions using AWS CloudWatch and other relevant tools.
  • Stay up to date with the latest advancements in MLOps, distributed computing, and GPU acceleration technologies, and proactively propose improvements to the ML platform.
