Last updated: May 2, 2025

Site Reliability Engineer Lead

🌍 100% Remote 💬 English ✈️ International position

Via Workable

About

What the role involves:

As Site Reliability Engineer Lead at IOG, you will bring strong programming and operations skills. As part of our Service Reliability team, you will work closely with geographically diverse experts and the Research & Development teams to ensure high-quality, stable environments for our customers.

  • Working on ‘build and deployment cycles’ across all development environments
  • Supporting the build, deployment, and configuration management for multi-tier applications
  • Helping build tools and processes to support the infrastructure
  • Improving and maintaining tooling and scripts for automation purposes
  • Developing tooling for internal and external users to monitor and maintain production systems
  • Supporting our teams to write software that is simple and flexible to configure and deploy
  • Collaborating with agile teams to establish and maintain automated regression suite infrastructure and performance testing infrastructure
  • Building capabilities to allow development teams to be self-sufficient

Who you are:

  • Bachelor’s Degree or higher in Computer Science, Software Engineering, or related technical field, or equivalent practical experience
  • 5+ years of professional experience in SRE, DevOps, Platform Engineering, or Infrastructure roles
  • 2+ years in a technical leadership or senior engineering capacity
  • Proven track record of building and operating highly available, distributed, fault-tolerant systems
  • Strong foundation in Linux system internals, networking (TCP/IP, DNS, HTTP), and systems programming
  • Demonstrated experience in open-source contribution is highly desirable
  • Experience leading incident responses, writing post-mortems, and driving reliability improvements
  • Experience working with Agile, Kanban, or similar development methodologies
  • You work well both independently and as part of a team
  • You value cooperation and collaboration above all, and are not afraid to ask for clarification or help when needed
  • You are kind and respectful of others’ opinions, and you are open and act with integrity when engaging in academic or technical discussions
  • Strong scripting and programming skills: Bash, Python, Go, or Rust preferred
  • Extensive experience with Git: branching strategies, GitOps workflows, code review best practices
  • Experience with CI/CD systems, such as GitHub Actions, GitLab CI, Jenkins, Buildkite, or equivalent
  • Cloud platform proficiency: AWS, GCP, or Azure, including compute, storage, networking, and IAM
  • Containerization and orchestration: deep experience with Docker, Kubernetes (k8s), and Helm
  • Infrastructure as Code (IaC): using Terraform, Pulumi, or similar tools
  • Configuration management: Ansible, Chef, or SaltStack (with preference for declarative approaches)
  • Monitoring, logging, and observability: Prometheus, Grafana, Loki, OpenTelemetry, Datadog, or similar
  • Security best practices: secrets management (Vault, SOPS), least privilege, security incident handling
  • Incident Management and Root Cause Analysis (RCA): strong ownership in production reliability
  • Automated testing and validation: unit testing, integration testing, chaos engineering exposure
  • Experience managing large-scale Linux-based systems: operational excellence in Ubuntu, Debian, or NixOS environments
  • Advocate of DevOps/SRE culture: focus on reducing toil, Service Level Objectives (SLOs), error budgets
  • Strong communication skills: written and verbal, capable of collaborating across distributed teams

