Last updated: June 14, 2024

AI DevOps / SRE Engineer

🌍 100% Remote 💬 English ✈️ International position

Via EPAM

Compensation

To be negotiated

About

Responsibilities

  • Implement and maintain CI/CD pipelines for AI and machine learning projects, ensuring robust deployment strategies and continuous integration
  • Monitor and ensure the reliability, availability, and performance of AI applications, particularly those involving large language models (LLMs) and retrieval-augmented generation (RAG)
  • Collaborate with AI research teams to operationalize machine learning models and systems efficiently
  • Develop and enforce best practices for version control, configuration management, and testing of AI-driven software solutions
  • Utilize MLOps tools such as Kubeflow, MLflow, or TensorFlow Extended (TFX) to streamline the machine learning lifecycle from experimentation to production
  • Implement monitoring solutions that track both system metrics and model performance to facilitate proactive issue resolution
  • Participate in on-call rotations to support the operational health of critical systems, employing SRE principles to meet service-level objectives (SLOs) and reduce downtime

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related field
  • Proven experience as a DevOps Engineer or SRE, with a strong background in software development and automation
  • Experience deploying and managing LLMs, including techniques such as RAG
  • Proficient in CI/CD tools (e.g., Jenkins, GitLab CI, CircleCI) and infrastructure as code (e.g., Terraform, Ansible)
  • Knowledge of containerization and container orchestration technologies (e.g., Docker, Kubernetes)
  • Familiarity with MLOps tools and practices to support machine learning lifecycle management
  • Strong problem-solving skills and ability to work in a dynamic, fast-paced environment

Nice to have

  • Experience with cloud services (AWS, GCP, Azure), particularly in AI/ML deployments
  • Background in monitoring tools such as Prometheus, Grafana, and the ELK stack
  • Knowledge of Python, particularly in data science and machine learning contexts
  • Certification in Kubernetes, AWS/GCP/Azure, or similar technologies

Benefits

  • Language courses;
  • Health & life insurance;
  • Occupational Risk Insurance (ART);
  • Paid time off;
  • Sick & exceptional leave;
  • Stable full-time workload;
  • Unlimited access to LinkedIn Learning solutions;
  • Certification opportunities.

Other Information

We have selected the key information about this position. To see the full description, click "Access".
