For one of our clients, we are looking for a Cloud-Native Platform Technical Lead (Kubernetes/AWS).
Overview
This service provides a senior technical lead to design, operate, and continuously improve our cloud-native platform. The Technical Lead is accountable for the reliability, security, and scalability of Kubernetes-based workloads on AWS and on premises, while partnering with the Product Owner (PO) and cross-functional teams to deliver an agreed roadmap.
Key Responsibilities
• Operate and maintain Kubernetes clusters: day-to-day administration, upgrades, patching, capacity planning, backup/restore, and lifecycle management of cluster components and add-ons.
• Design and implement cloud-native infrastructure: build and evolve AWS-based platform capabilities using infrastructure-as-code and automation to enable secure, repeatable environments.
• Monitoring and troubleshooting: implement/operate observability (metrics, logs, traces), respond to incidents, perform root-cause analysis, and drive problem management to prevent recurrence.
• Security and compliance best practices: apply secure configuration baselines, identity and access controls, network segmentation, vulnerability management, and audit-ready operational practices.
• Documentation: write and maintain technical documentation (architecture, runbooks, SOPs) and end-user documentation (how-to guides, onboarding, FAQs).
• Networking & platform integration: support Kubernetes networking (CNI, including Cilium; services/ingress; DNS), connectivity to enterprise services, and performance troubleshooting across the stack.
• Collaboration: partner with application teams, security, networking, and operations; provide technical guidance, reviews, and enablement.
• Roadmap delivery: work with the PO to define and deliver a prioritized roadmap (features, technical debt reduction, risk remediation), including effort sizing and milestones.
• Continuous improvement: establish and evolve platform standards, templates, and best practices to improve developer experience, reliability, and cost efficiency.
Required Qualifications
• Proven hands-on experience leading and operating production Kubernetes environments (cluster operations, upgrades, troubleshooting, and performance tuning).
• Hands-on experience administering Rancher-managed Kubernetes, including RKE2 cluster provisioning, upgrades, and lifecycle operations.
• Strong Linux systems administration skills (networking, storage, security hardening, scripting/automation).
• Strong understanding of the networking stack (TCP/IP, DNS, TLS/certificates, load balancing, firewalls, proxying, routing) and how it applies to Kubernetes.
• Experience with AWS services used in cloud-native architectures (e.g., IAM, VPC, EC2, EKS, Elastic Load Balancing, Route 53, CloudWatch) and with operating within enterprise guardrails.
• Ability to implement and maintain monitoring/alerting and to lead incident response and root-cause analysis.
• Experience with GitHub for version control and GitHub Actions for CI/CD.
• Demonstrated ability to produce high-quality technical and end-user documentation.
• Strong communication skills and comfort collaborating with cross-functional stakeholders; able to translate between technical and non-technical audiences.
Preferred Qualifications
• Experience with configuration/infrastructure-as-code and automation (e.g., Terraform, Crossplane, Puppet) and CI/CD practices (e.g., GitHub Actions).
• Experience with GitOps and Kubernetes application delivery tooling (e.g., Fleet, Argo CD, Flux) and Helm/Kustomize.
• Experience with container security, policy enforcement, and compliance tooling (e.g., image scanning, OPA/Gatekeeper/Kyverno, secrets management).
• Background with SRE practices (SLIs/SLOs, error budgets) and post-incident learning.
• Relevant certifications (e.g., CKA/CKAD/CKS, AWS certifications) are a plus.
Key Deliverables / Success Measures
• Stable, well-maintained Kubernetes platform with documented upgrade/patch cadence and reduced operational risk.
• Monitoring/alerting coverage with clear on-call procedures and runbooks, and measurable improvements in incident response and time-to-recovery.
• Implemented security and compliance controls (access management, vulnerability remediation process, audit-ready documentation).
• Clear platform architecture and operational documentation (runbooks, SOPs, onboarding guides) kept current.
• Enhanced platform user experience (developers and end users) through streamlined onboarding, clearer self-service documentation, and reduced friction for common workflows.
• A jointly managed roadmap with the PO, delivered to agreed milestones and including technical debt and risk remediation.
Details
• Expected start date: negotiable, preferably around June 1st
• Work location: Zwijnaarde – Ghent, with a minimum of 2 days on-site
• Reporting/working relationship: works day-to-day with the Product Owner; collaborates with Platform Engineering and Application teams.