Staff Systems Engineer, Kubernetes
job description
The SRE (Site Reliability Engineering) team is responsible for the availability, reliability, performance, and monitoring of applications, for emergency response, and for reducing manual work by applying SRE principles and practices. The team works directly with Development, Operations, Product, and other teams to deploy new features and changes and to maintain infrastructure, operations, CI/CD, and IaC so that SLOs and SLAs are protected. We use a variety of DevOps automation tools, such as Ansible, Docker, Kubernetes, Terraform, and Jenkins. The Senior SRE engineer is capable of implementing observability, SLOs, SLIs, SLAs, and disaster recovery and backup plans.
Responsibilities:
Design, engineer, and implement large-scale distributed systems that process high volumes of observability tracing data from container-based and non-container-based applications, focusing on latency, scalability, resiliency, self-service, and fault tolerance.
Design, develop, and implement open-source-based software components, libraries, and auto-instrumentation code to enable complete observability across application tracing, metrics, and logs.
Ensure the availability and reliability of distributed systems.
Help the Tier 1 team resolve client infrastructure and system issues, escalations, alerts, tickets, and queries.
Work as a bridge between development, operations, and other teams to build and maintain resilient systems.
Conduct, coordinate, and oversee post-incident root cause analyses and reviews.
Build and maintain documentation for all assigned projects.
Apply DevOps, Agile methodology, and ITIL disciplines (Event, Incident, Problem, and Change Management) and standards in day-to-day work.
Propose and adopt automation of repetitive tasks to reduce or eliminate toil.
Implement and troubleshoot using observability tools such as Prometheus and Grafana.
Plan and implement disaster recovery and backup plans for the platform.
Proactively work on efficiency and capacity planning.
Proactively identify problems, areas for improvement, and performance bottlenecks.
Liaise and work closely with Tier 1 on-call support, Development, and Operations teams.
Drive availability and reliability by defining and implementing SLIs, SLOs, error budgets, observability, disaster recovery, and backups to detect and mitigate issues.
Work independently and mentor junior developers.
This is a hybrid position. Hybrid employees alternate between working remotely and working in the office. Employees in hybrid roles are expected to work from the office 2-3 set days a week (determined by leadership/site), with a general guidepost of being in the office 50% or more of the time based on business needs.