Site Reliability Engineer - Remote

PayNearMe - Santa Clara
new offer (28/06/2024)

job description

What you'll be responsible for:
Infrastructure Management:
Design, implement, and maintain scalable and resilient infrastructure using Terraform for infrastructure as code, ensuring high availability and performance.
Kubernetes and Containers:
Deploy, manage, and optimize Kubernetes clusters and containerized applications using Docker. Implement best practices for container orchestration and management.
Systems and Application Monitoring/Observability:
Develop and maintain comprehensive monitoring and observability solutions using Datadog. Ensure detailed visibility into system performance and application health.
SLO and SLA Management:
Define, monitor, and maintain Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to ensure reliable and consistent service delivery.
Incident Response and Troubleshooting:
Respond to incidents, perform root cause analysis, and implement solutions to prevent recurrence. Participate in post-incident reviews and contribute to blameless postmortems.
Reliability and Production Environment Management:
Ensure the reliability and stability of our production environments. Continuously assess and improve system reliability, identifying and addressing potential points of failure.
Automation and Scripting:
Develop automation scripts and tools to reduce manual intervention and improve system reliability using Python, Bash, or Go. Implement and improve CI/CD pipelines.
CI/CD Pipeline Management:
Enhance and maintain continuous integration and continuous deployment pipelines using GitLab CI. Ensure seamless and reliable deployment processes.
Capacity Planning and Scaling:
Assist in capacity planning and ensure that systems are scalable to meet future demands. Implement auto-scaling strategies where applicable.
Security and Compliance:
Implement security best practices and ensure compliance with industry standards. Regularly review and update security policies and procedures.
Collaboration and Support:
Work closely with development teams to ensure reliability and scalability of new features and services. Provide technical support and guidance on infrastructure-related issues.
Software Engineering for Operations:
Develop and maintain internal tools and services that enhance the efficiency and reliability of our operations.
On-Call Rotation:
Participate in an on-call rotation to address production issues and collaborate on incident response efforts.
