Senior Software Engineer - AI Hardware
Bloomberg
- Design, build, and maintain highly reliable, scalable, and efficient infrastructure platforms that support our engineering teams and business needs.
- Participate in system design discussions and contribute to architectural decisions
- Ensure code quality through standard methodologies, code reviews, and adherence to clean code principles
- Be able to produce clear and consumable documentation for a wide audience
- Communicate effectively across diverse teams
- Be willing to participate in on-call rotations as arranged
- Be a self-starter, manage priorities, and work independently
- Stay up to date with the latest infrastructure technologies and industry-standard processes, and evaluate their potential impact on existing and future solutions
- Hold yourself to high standards
- Exude our ambitious, collaborative, and empathetic values
- A self-starter mentality with an eagerness to solve previously unsolved problems
- Excellent collaboration skills and openness to giving and receiving critical feedback across teams
- Scalability and reliability are hardwired into your DNA
- You have publicly available writing samples, blog posts, demos, or recordings of presentations on technical topics
- A unique opportunity to be part of a rapidly growing team in one of the most exciting engineering groups at Bloomberg.
- An inclusive and supportive work culture that fosters learning and growth.
- Continuous professional development, product training, and career pathing
- Intra-departmental mentor and buddy program for in-house networking
- An inclusive company culture, with the ability to join our Community Guilds
- 4+ years of proficiency in Kubernetes environments (deployments, storage, services, jobs, ingress, egress, etc.)
- BA, BS, MS, or PhD in Computer Science, Electrical Engineering, or a related field
- Hands-on management of GPU-based systems, including kernel and driver management, and developing software tooling to automate provisioning and maintenance of these systems.
- Design, implement, and maintain system software that enables communication between GPUs, CPUs, and storage in scale-out AI and HPC systems
- Oversee the ongoing monitoring, support, and maintenance of our HPC/AI clusters, ensuring peak performance and reliability
- Drive system upgrades, customization, and seamless integration with software developers, network operations, and data center teams
- Manage and maintain a diverse range of computer systems and application software, ensuring they meet the highest standards of functionality and efficiency
- Develop and maintain expertise in low-latency, high-bandwidth interconnected infrastructure (including InfiniBand, Ethernet, RDMA/RoCE, and others)
- Monitor and evaluate the efficiency and effectiveness of infrastructure service delivery methods and procedures
- Partner with internal teams to develop prioritization, metrics, and processes around capacity planning and infrastructure availability. Periodically present capacity planning and performance reports to senior leaders
- Benchmark, analyze, and make recommendations for improvement of IT infrastructure
- Expertise with Kubernetes design patterns (operators, Helm charts, Kustomize, etc.)
- Experience with data center planning, including rack elevations, cabling plans, and cables/transceivers
- Experience with data center operations and management