Senior MLOps Engineer
Location: Brussels
Preferred Start Date: ASAP
Language: English
Employment Type: Full-time
Work Arrangement: Hybrid (2 days on-site per week, preferably Monday and Thursday)
Key Responsibilities:
Engineering & Technical Leadership
- Lead technical decisions for AI operational workloads, including deployment, automation, monitoring, drift detection, and retraining pipelines.
- Ensure adherence to modern software engineering best practices, including code quality, testing, CI/CD, observability, and security-by-design.
- Review and guide Python codebases to ensure maintainability and performance.
- Provide mentorship to engineers and support continuous improvement within the squad.
Operational Excellence & Reliability
- Oversee the full operational lifecycle of AI models, including deployment, monitoring, performance tracking, drift detection, and retraining cycles.
- Ensure live models meet their availability and reliability targets.
- Participate in the on-call rotation and drive incident response, mitigation, and root cause analysis (RCA) improvements.
Security, Risk & Compliance
- Ensure all squad deliverables comply with operational and security standards.
- Enforce secure practices for deployment, secret management, access control, and data handling.
- Collaborate with governance, security, and platform teams to align on internal policies and controls.
Automation & Infrastructure Management
- Lead the squad’s implementation of automation across the stack, including Jenkins CloudBees, CDRO, Ansible, and Helm-based deployments into OpenShift.
- Manage secrets via Vault and object storage on S3 Cloudian.
- Monitor workloads using Ansible, dashboards, and operational metrics.
- Identify opportunities to increase automation maturity, such as fully automated retraining and automated drift detection.
Cross-Squad Collaboration
- Work closely with platform/infrastructure squads, data science squads, and governance, security, responsible AI, and architecture teams.
- Translate tribe-level OKRs into actionable technical plans for the squad.
Required Skills & Experience:
Bachelor’s or Master’s degree in Computer Science, Software Engineering, Information Security, Data Engineering, or a related technical field.
Core Technical Skills:
- Strong hands-on experience in Python for production-grade systems.
- In-depth knowledge of Kubernetes/OpenShift and Helm deployment patterns.
- Experience with CI/CD pipelines using Jenkins CloudBees and CDRO.
- Strong understanding of Ansible for automation and orchestration.
- Experience with Vault for secrets management and secure configuration.
- Solid understanding of object storage (S3/Cloudian) and file-based AI workloads.
Security & Compliance:
- Proven experience applying secure engineering standards in production environments.
- Knowledge of operational risk controls, incident processes, and audit/traceability requirements.
AI/ML Lifecycle (Nice to Have but Strongly Appreciated):
- Experience working with data science workflows, including model training, inference, and validation.
- Understanding of drift detection, model retraining triggers, and automated pipelines for the ML model lifecycle.
- Exposure to MLOps principles and production practices.
Leadership & Collaboration:
- Demonstrated experience providing technical leadership within a squad or engineering team.
- Strong communication skills to interact across architecture, data science, security, and Ops layers.
- Ability to translate complex technical topics into actionable squad-level initiatives.