Principal Software Engineer, ML
Splunk
Responsibilities
- Design and implement high-performance backend architectures that integrate cleanly with AI-powered products: modular, fault-tolerant, and efficient services that support large-scale AI workloads while ensuring low-latency interactions between data pipelines, inference engines, and enterprise applications.
- Develop robust model-serving APIs and containerized microservices that enable real-time AI inference and batch processing with high throughput and low latency.
- Implement end-to-end monitoring, logging, and alerting solutions to ensure AI systems operate reliably at scale.
- Improve scalability by designing distributed systems that efficiently handle AI workloads and inference pipelines.
- Own Kubernetes-based deployments by developing and maintaining Helm charts, Kubernetes operators, and cloud-native workflows to streamline AI model deployment.
- Automate infrastructure management using Infrastructure as Code (IaC) tools like Terraform or CloudFormation.
- Optimize CI/CD pipelines for AI applications, ensuring smooth model retraining, testing, and deployment cycles.
- Improve security and compliance by implementing best practices in access control, container security, and vulnerability management.
- Partner closely with AI/ML teams to ensure seamless model integration into production environments.
- Lead architecture discussions and provide strategic technical guidance on AI platform evolution.
- Mentor and guide engineers to enhance team skills in backend development, DevOps, and cloud technologies.
Requirements
- Strong backend development experience in Python (preferred) or Java, with expertise in building RESTful APIs, microservices, and event-driven architectures.
- Deep understanding of Kubernetes and container orchestration, with experience in deploying AI/ML workloads at scale.
- Expertise in DevOps and CI/CD pipelines, including experience with Jenkins, GitHub Actions, ArgoCD, or similar tools.
- Cloud expertise (AWS/GCP/Azure), including hands-on experience with cloud-native services for AI workloads (e.g., S3, Lambda, EKS/GKE/AKS, DynamoDB, and RDS).
- Experience in performance tuning and system optimization for large-scale AI/ML workloads.
- Proven ability to collaborate with ML engineers, data scientists, data engineers, and product teams to deliver AI-powered solutions efficiently.
- Experience in technical leadership, driving architectural decisions, and mentoring engineers.
- Strong problem-solving skills, with the ability to balance trade-offs between scalability, maintainability, and performance.
Preferred Experience
- Prior experience working with AI/ML pipelines, model-serving frameworks, or distributed AI workloads.
- Experience in AI observability, monitoring model drift, and optimizing inference latency.
- Understanding of cybersecurity, observability, or related domains and how they can enhance AI-driven decision-making.
Splunk, a Cisco company, is an Equal Opportunity Employer and all qualified applicants will receive consideration for employment without regard to race, color, religion, gender, sexual orientation, national origin, genetic information, age, disability, veteran status, or any other legally protected basis.