Turing
Job Description

MLOps Engineer - Remote
Get ahead: position offered by a Jobbol partner:
Turing is looking for an MLOps Engineer to join our growing AI research engineering team.
Your primary responsibility will be to manage and optimize our Ray clusters on GCP/GKE, which we use for multi-node, multi-GPU fine-tuning, inference, and reinforcement learning with large language models (LLMs).
In addition, you'll help streamline our experimental workflows by maintaining reproducible environments, resolving dependency issues, and automating key parts of the infrastructure.
This role is ideal for someone who is excited about working closely with AI researchers and helping scale the infrastructure behind cutting-edge LLM training and experimentation.
Key Responsibilities
Manage and maintain Ray clusters deployed on GCP/GKE to support distributed LLM training and inference.
Optimize multi-node, multi-GPU workloads for both fine-tuning and inference pipelines using Ray, Kubernetes, and GCP services.
Assist the research team with environment debugging, dependency management, and containerization (e.g., CUDA/PyTorch/Flash-Attn stacks).
Build and maintain reusable infrastructure templates (e.g., Terraform modules, Helm charts) for reproducible research environments.
Monitor system performance and optimize cluster resource allocation and autoscaling.
Support CI/CD workflows for experiment tracking and deployment pipelines.
Collaborate with research engineers to improve the usability, reliability, and scalability of our training infrastructure.
Requirements
6+ years of experience in DevOps/MLOps roles with a focus on machine learning infrastructure.
Solid hands-on experience with Ray, Kubernetes (GKE preferred), and multi-GPU orchestration.
Proficiency with GCP services (Compute Engine, GCS, IAM, VPC, etc.).
Strong working knowledge of Python and shell scripting.
Experience managing CUDA-based environments for training and inference with PyTorch.
Familiarity with containerization (Docker) and environment isolation (Conda, virtualenv).
Experience with IaC tools (Terraform, Helm).
Strong troubleshooting skills in distributed environments (networking, storage, job failures, etc.).
Nice to Have
Experience with LLM training, LoRA fine-tuning, or RLHF pipelines.
Familiarity with FlashAttention, DeepSpeed, FSDP, or other large-scale model optimization techniques.
Knowledge of CI/CD tools (GitHub Actions, ArgoCD) and experiment tracking (e.g., MLflow, Weights & Biases).
Exposure to event-driven compute or serverless functions on GCP.
Ability to write clean internal tooling (e.g., dashboards, CLI utilities).
If the MLOps Engineer - Remote position (29378213488) in Porto Alegre / RS matches your expectations, submit your résumé now.