Production Systems Engineer, Fleet AI Systems
Meta
- Interface with external vendors and internal partners, including but not limited to hardware, mechanical, power, thermal, manufacturing, and software engineers, to understand system architecture and to develop and execute test suites for various architectures.
- Proactively create experiments and tooling to detect and diagnose hardware, firmware, and software issues with system health.
- Develop a test framework for large-scale test automation inside the fleet, both during product development and after mass production.
- Implement remediation across software and hardware stacks, according to plans, while keeping a thorough procedural record and data log.
- Develop and publish updates on resolutions and communicate findings internally.
- Troubleshoot, diagnose, and root-cause system failures. Isolate the components and failure scenarios while working with internal and external stakeholders.
- Develop visibility through data visualization and implement systemic solutions to hardware systems health issues.
- Drive discussions with external and internal teams on test specifications and methodologies to continuously improve test quality.
- Currently has, or is in the process of obtaining, a Bachelor's degree in Computer Science, Computer Engineering, or a relevant technical field, or equivalent practical experience. Degree must be completed prior to joining Meta.
- 2+ years of experience in hardware server system support, troubleshooting server architecture and components, and analyzing, triaging, and solving systems-level issues.
- Expertise with Linux and scripting (Python or similar).
- 2+ years of experience in changing system configurations and measuring change impact, working through full lifecycle progressions of computer systems products.
- 4+ years of experience in production support at scale (e.g., 10K storage servers and over 100K HDDs), working through full system technologies.
- 3+ years of experience in hyperscale post-production environments, delivering solutions to complex systems issues.
- 2+ years of experience supporting AI, HPC, or related systems at scale.
- Experience working in a matrix organization.