DevOps/Platform Engineer

Brussels, Brussels

Platform Engineer

Start Date: 03/08/2026

End Date: 31/12/2027

Location: Brussels (2 days onsite per week)

Work Regime: Full-time

Role Summary:

The Platform Engineer is responsible for the reliable, secure, and stable operation of a high-availability cloud platform, built on Kubernetes and composed of multiple in-house platform components. The role focuses on platform lifecycle management, day-2 operations, incident response, and operational excellence, ensuring that customer-facing Web UIs and APIs remain available, performant, and secure 24/7. The Platform Engineer acts as a technical custodian of the platform, providing a stable foundation on which service teams can safely deploy and operate their workloads.

Primary Objectives:

Maintain platform availability and reliability in accordance with SLOs/SLAs.
Ensure operational readiness of all environments (DEV / TEST / ACC / PROD).
Provide 24/7 operational coverage for critical platform services (via on-call).
Ensure the platform is observable, secure, well-controlled, and documented.
Execute platform changes, upgrades, and maintenance in a predictable and low-risk manner.

Key Responsibilities:

Kubernetes & Runtime Operations

Operate Kubernetes primitives and platform add-ons: Ingress controllers, service discovery, workload identity.
Troubleshoot Kubernetes-related failures: pod lifecycle issues, networking problems, resource starvation.
Perform controlled rollouts with rollback plans.

Reliability & 24/7 Incident Response

Participate in the 24/7 on-call rotation for critical services (incident responder).
Lead or contribute to incident triage and mitigation, Root Cause Analysis (RCA), and post-incident action tracking and follow-up.
Maintain and improve runbooks and operational procedures.

Observability & Monitoring

Operate and use the open-source observability platform.
Ensure effective observability across the platform: metrics, logs, and distributed traces.
Maintain actionable alerts and reduce false positives.
Support incident analysis through correlation and telemetry inspection.

Change, Release & Maintenance Management

Plan and execute platform changes.
Follow structured change management practices.
Communicate planned and ongoing changes to stakeholders.
Ensure platform changes are documented and auditable.

Security & Compliance (Operational Focus)

Operate platform security controls: RBAC, network boundaries, secret management.
Apply security updates and patches to platform components.
Support vulnerability remediation efforts.
Provide operational evidence for audits and security reviews.

Automation & Operational Improvement

Automate repetitive operational tasks where appropriate.
Reduce operational risk through standardization and documented procedures.
Apply a Platform-as-Code approach (GitOps).

Requirements:

Technical Skills

Kubernetes (Deep Production Expertise)
Multi-cluster architecture & lifecycle management
RBAC & least-privilege design
Network policies & traffic segmentation
Stateful workloads & storage strategy (CSI, PV/PVC)
Autoscaling (HPA/VPA) & resource tuning
Pod Security Standards
Admission controllers
Performance & reliability troubleshooting
Cluster-level debugging (networking, DNS, scheduling, OOM, crash loops)

GitOps & Continuous Delivery

ArgoCD (advanced usage)
App-of-Apps pattern
Sync waves & hooks
Drift detection & reconciliation
Multi-environment promotion workflows
Git-based deployment strategy with version management
Declarative platform design with PR-driven changes
YAML-based CI/CD pipelines with Harness.io
Secure secret handling in CI/CD (with HashiCorp Vault)

Packaging & Configuration

Helm (advanced chart authoring)
Reusable library charts
OCI-based registries
Values layering strategy
Kustomize overlays for multi-environment isolation and strategic patches

Container & Artifact Management

Docker (secure multi-stage builds, optimization)
Harbor (RBAC, replication, vulnerability scanning)
JFrog Artifactory (Docker & Helm registry management)
Artifact versioning & promotion strategy

Secrets & Security

HashiCorp Vault for dynamic secrets with CSI integration
Image vulnerability scanning integration
Supply chain security awareness
TLS & certificate lifecycle management
RBAC governance

Observability & Reliability

OpenTelemetry (metrics, logs, traces)
Prometheus or VictoriaMetrics (recording rules, HA setup)
Loki (log aggregation & LogQL)
Tempo (distributed tracing)
Grafana (advanced dashboards & alerting)
SLI/SLO design & error budget thinking
Alert noise reduction strategy

Networking (Advanced)

TCP/IP & DNS fundamentals
TLS & mTLS concepts
Kubernetes Services, Ingress & Reverse Proxy concepts
East-west vs north-south traffic
API routing & traffic management
Network Policies implementation

Automation

Advanced Bash scripting
Infrastructure automation mindset

Nice to Have

Kong API Gateway (API routing, plugins, authentication, rate limiting)
Redis (operational knowledge: deployment, persistence, clustering, backups)
PostgreSQL (migrations, backups, HA basics, Kubernetes deployment patterns)
MongoDB (replica sets, backups, Kubernetes deployment patterns)
Kargo on top of ArgoCD for release orchestration

Operational Skills

Proven experience in production operations or platform support roles
Ability to work calmly and methodically under pressure
Strong troubleshooting skills across distributed systems
Clear written and verbal communication during incidents and changes
Flexibility to balance daily operations with long-term changes

Ways of Working

Structured, risk-aware, and detail-oriented
Comfortable with operational responsibility and accountability
Strong collaboration with development, security, and product teams
Documentation-first mindset for operational knowledge

Positioning vs Other Roles:

Not a pure SRE role: focus is stability and operations, not reliability engineering.
Not a pure DevOps engineer embedded in product teams.
The role is the operational owner of the platform across all environments, ensuring that each one runs safely and predictably.