Reliability Engineer - Distributed Open-Source Systems
Two Sigma
- Lead engineering and operational support for multiple large distributed open-source software applications (Elasticsearch, Kafka, and ZooKeeper), including much of the foundational infrastructure used by the Engineering and Research functions at Two Sigma
- Improve all aspects of software reliability, including monitoring, alerting, and documentation
- Collaborate across infrastructure and development teams to ensure strategic priorities are aligned, fix priority support issues, and improve vital software, tools, and processes
- Collect and analyze metrics from operating systems and applications to assist in performance tuning and fault finding
- Participate in a 24x7 on-call rotation for our hosted services
- Minimum 1 year of experience required; 3-10 years of experience preferred in a similar Site Reliability Engineering (SRE), DevOps, Platform Engineering, Systems Engineering/Administration, or related function
- BS in Computer Science or another highly technical, scientific field
- Ability to apply open-source systems (Elasticsearch, Kafka, and ZooKeeper) and utilities to provision production systems in a variety of domains, especially for multi-tenant use
- Ability to program (structured and object-oriented) in one or more high-level languages (such as Python, Java, C/C++, or Go), with a proven track record of automation and an algorithmic approach to solving problems
- In-depth knowledge of and experience with on-prem (Linux/Unix) and cloud-based (GCP, AWS, etc.) systems
- Experience with automated configuration management tools such as Ansible, Chef, Puppet, or SaltStack