Last updated: June 20, 2025

Senior Site Reliability Engineer - Midnight

🌍 100% Remote 💬 English ✈️ International position 🧓🏽 Senior

Via Workable

About

What the role involves:

  • Infrastructure & Automation:
    • Design, build, and maintain scalable and highly available systems, primarily on AWS, using best practices.
    • Manage and optimize Kubernetes clusters for high availability and performance, extending them when it makes sense to expand functionality.
    • Leverage GitOps principles to automate deployments and manage container orchestration.
    • Implement and manage CI/CD pipelines that ensure seamless, high-quality deployments: find and remove bottlenecks, improve performance, and work alongside teams to refine feedback loops and automate away toil.
    • Develop automation tools and scripts to improve operational efficiency.
  • Monitoring & Incident Response:
    • Implement robust monitoring solutions with Prometheus and related tooling to ensure system health and performance.
    • Participate in on-call rotations and lead incident response efforts, turning challenges into learning opportunities.
    • Collaborate with development teams to define and implement SLOs/SLIs.
  • Problem Solving & Communication:
    • Take vague or loosely defined problems, work closely with cross-functional teams, and distill them into clear, actionable plans.
    • Communicate technical solutions and incident retrospectives effectively across both technical and non-technical stakeholders.
  • Innovation & Continuous Improvement:
    • Evaluate and adopt new technologies to keep our systems at the cutting edge; blockchain experience is a particular advantage here.
    • Document processes and best practices, ensuring that knowledge is shared across the team and continuously improved.
    • Balance effective delivery of goals with a measurably high standard of quality, always applying a layer of polish and due diligence to what you deliver.

Who you are:

  • 7+ years of experience in SRE, DevOps, or a related role.
  • Understanding of SRE best practices, architectures, and methods.
  • Good knowledge of resiliency patterns and cloud security.
  • Strong programming proficiency in Python, Golang, or JavaScript.
  • Rust experience is advantageous.
  • Demonstrated experience with AWS and modern cloud architectures.
  • Proficiency in Helm, Terraform, and CI/CD tools like GitHub Actions and ArgoCD.
  • Hands-on experience with Kubernetes/EKS and GitOps methodologies.
  • Proven track record with monitoring tools such as Prometheus and OpenTelemetry, and familiarity with the LGTM stack or comparable tools.
  • Blockchain experience is advantageous, offering a unique perspective on distributed systems and security.
  • Exceptional problem-solving skills with a knack for translating vague requirements into clear, strategic plans.
  • Ability to engage in technical discussions and take part in the decision-making process.
  • Strong problem-solving skills and the capability to work on complex systems.
  • Experience working in an Agile environment.
  • Experience working with a distributed team.
  • Strong communication and collaboration abilities to work seamlessly across different teams.
  • A proactive and innovative mindset, with a passion for continuous improvement and operational excellence.
