Last updated: October 27, 2025
Site Reliability Engineer - Data Platform
Via Ashbyhq
About
The team
Join our Data Infrastructure team and play a pivotal role in upholding the reliability, scalability, and efficiency of our robust Data platform. As a Senior Site Reliability Engineer (SRE) specialized in Data Infrastructure, you will collaborate closely with diverse cross-functional teams to conceive, execute, and oversee the foundational data infrastructure that empowers our array of applications and services.
The opportunity
- Implement self-service data infrastructure solutions that support the needs of 10+ business units and over 100 engineers and data analysts
- Utilize Infrastructure as Code (IaC) principles to design, provision, and manage both on-premises and cloud (AWS) infrastructure components using tools such as Terraform
- Develop and maintain automation scripts using bash/shell scripting to automate operational tasks and deployments
- Enhance and manage CI/CD pipelines to facilitate consistent software deployments across the data infrastructure
- Implement robust data monitoring and alerting solutions to proactively detect anomalies and performance issues
- Manage and implement role-based access control (RBAC) and permissions for a multitude of user groups and machine workflows across different environments
- Manage and maintain real-time streaming data architecture using technologies like Kafka and Debezium Change Data Capture (CDC)
- Ensure the timely and accurate processing of streaming data, enabling data analysts and engineers to gain insights from up-to-date information
- Utilize Kubernetes to manage containerized applications within the data infrastructure, ensuring efficient deployment, scaling, and orchestration
- Implement effective incident response procedures and participate in on-call rotations
- Collaborate with data analysts, engineers, and cross-functional teams to understand requirements and implement appropriate solutions
- Document architecture, processes, and best practices to enable knowledge sharing and support continuous improvement
- Support AI/ML teams with their infrastructure requests
Skills you should HODL
- Proven experience (5+ years) working as a Site Reliability Engineer, Infrastructure Engineer, Data Infrastructure Engineer, or in a similar role, with a focus on data infrastructure and security
- Experience with maintaining real-time data processing technologies, such as Kafka and Flink clusters and Debezium instances
- Working experience managing hybrid, multi-tenant cloud systems, particularly on AWS
- Experience with Infrastructure as Code tools such as Terraform, Terragrunt, and Atlantis
- Experience with containerization and orchestration tools, particularly Kubernetes, Nomad, and Docker
- Solid understanding of bash/shell scripting and proficiency in at least one programming language (preferably Python or JVM languages)
- Experience maintaining data-related technologies: Apache Airflow, Apache Spark, databases, BI tooling
- Experience solving data access management issues in a large-scale data lake
- Familiarity with CI/CD deployment pipelines and related tools
- Strong problem-solving skills and the ability to troubleshoot complex systems
- Experience with data-related technologies (databases, data lakes, Airflow, Spark) is a plus