Last updated: June 20, 2025
Senior Site Reliability Engineer - Midnight
Via Workable
About
What the role involves:
- Infrastructure & Automation:
- Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
- Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
- Leverage GitOps principles to automate deployments and manage container orchestration.
- Implement and manage CI/CD pipelines that ensure seamless, high-quality deployments: find and remove bottlenecks, improve performance, and work alongside teams to refine feedback loops and automate toil away.
- Develop automation tools and scripts to improve operational efficiency.
- Monitoring & Incident Response:
- Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
- Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
- Collaborate with development teams to define and implement SLOs/SLIs.
- Problem Solving & Communication:
- Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
- Communicate technical solutions and incident retrospectives effectively to both technical and non-technical stakeholders.
- Innovation & Continuous Improvement:
- Evaluate and adopt new technologies to keep our systems at the cutting edge; blockchain experience is a particular advantage here.
- Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
- Balance effective delivery of goals with a measurably high standard for the results. Always apply a layer of polish and due diligence when delivering.
Who you are:
- 7+ years of experience in SRE, DevOps, or a related role.
- Understanding of SRE best practices, architectures, and methods.
- Good knowledge of resiliency patterns and cloud security.
- Strong programming proficiency in Python, Golang, or JavaScript.
- Rust experience is advantageous.
- Demonstrated experience with AWS and modern cloud architectures.
- Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and ArgoCD.
- Hands-on experience with Kubernetes/EKS and GitOps methodologies.
- Proven track record with monitoring tools such as Prometheus and OpenTelemetry, plus familiarity with the LGTM stack or comparable tools.
- Blockchain experience is advantageous, offering a unique perspective on distributed systems and security.
- Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
- Ability to engage in technical discussions and be part of the decision-making process.
- Strong problem-solving skills and the capability to work on complex systems.
- Experience working in an Agile environment.
- Experience working with a distributed team.
- Strong communication and collaboration abilities to work seamlessly across different teams.
- A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.