Meta is seeking a Systems Engineer to join our Release to Production (RTP) team, working on the Meta Training and Inference Accelerator (MTIA) program as part of the AI/ML initiatives supporting large-scale AI training and inference. The RTP team is responsible for the end-to-end hardware lifecycle of all Meta servers, including prototyping, debugging, and stress testing. We are looking for a candidate to work on scale-up and scale-out network technologies for the MTIA systems powering Meta's AI advancements.
Responsibilities:
- Support new MTIA platform introduction
- Create experiments and tooling for hardware/firmware/software health issues
- Develop understanding of AI workload traffic
- Contribute to enablement hacks for exploring future AI technologies
- Troubleshoot and diagnose system failures
- Develop visibility through data visualization
- Drive continuous product quality improvement
Requirements:
- Bachelor's degree in Engineering or Computer Science
- 6+ years of work experience in relevant domains
- Knowledge of server architecture and components
- Experience with Linux, TCP/IP, and iperf
- Hands-on troubleshooting and debugging experience
Preferred Qualifications:
- Experience with Network Interface Cards (NICs)
- Experience with RDMA/RoCE
- Experience with full server systems, including PCIe
- Experience with large-scale deployments
Join Meta to shape the future of social technology beyond 2D screens, pushing the boundaries of augmented and virtual reality.