Last updated: October 27, 2025

Site Reliability Engineer - Data Platform

🌍 100% Remote ✈️ International position 💬 English 🧓🏽 Senior

Via Ashbyhq

About

The team

Join our Data Infrastructure team and play a pivotal role in upholding the reliability, scalability, and efficiency of our data platform. As a Senior Site Reliability Engineer (SRE) specializing in data infrastructure, you will collaborate closely with cross-functional teams to design, build, and oversee the foundational data infrastructure that powers our applications and services.

The opportunity

Implement self-service data infrastructure solutions that support the needs of 10+ business units and more than 100 engineers and data analysts
Utilize Infrastructure as Code (IaC) principles to design, provision, and manage both on-premises and cloud (AWS) infrastructure components using tools such as Terraform
Develop and maintain automation scripts using bash/shell scripting to automate operational tasks and deployments
Enhance and manage CI/CD pipelines to facilitate consistent software deployments across the data infrastructure
Implement robust data monitoring and alerting solutions to proactively detect anomalies and performance issues
Manage and implement role-based access control (RBAC) and permissions for a multitude of user groups and machine workflows across different environments
Manage and maintain real-time streaming data architecture using technologies like Kafka and Debezium Change Data Capture (CDC)
Ensure the timely and accurate processing of streaming data, enabling data analysts and engineers to gain insights from up-to-date information
Utilize Kubernetes to manage containerized applications within the data infrastructure, ensuring efficient deployment, scaling, and orchestration
Implement effective incident response procedures and participate in on-call rotations
Collaborate with data analysts, engineers, and cross-functional teams to understand requirements and implement appropriate solutions
Document architecture, processes, and best practices to enable knowledge sharing and support continuous improvement
Support AI/ML teams with their infrastructure requests

Skills you should HODL

Proven experience (5+ years) working as a Site Reliability Engineer, Infrastructure Engineer, Data Infrastructure Engineer, or in a similar role, with a focus on data infrastructure and security
Experience maintaining real-time data processing technologies, such as Kafka and Flink clusters and Debezium instances
Working experience managing hybrid multi-tenant cloud systems, particularly on AWS
Experience with Infrastructure as Code tools such as Terraform, Terragrunt, and Atlantis
Experience with containerization and orchestration tools, particularly Kubernetes, Nomad, and Docker
Solid understanding of bash/shell scripting and proficiency in at least one programming language (preferably Python or JVM languages)
Experience maintaining data-related technologies: Apache Airflow, Apache Spark, DBs, BI tooling
Experience solving data access management issues in a large-scale data lake
Familiarity with CI/CD deployment pipelines and related tools
Strong problem-solving skills and the ability to troubleshoot complex systems
Experience with data-related technologies (databases, data lakes, Airflow, Spark) is a plus
