Medior Site Reliability Engineer
Start Date: 13/04
End Date: 31/12/2026
Location: Brussels
Regime: Full-time
Application Deadline: 26/03
Service Description:
We are looking for a Site Reliability Engineering service within our Engineering chapter team. The goal is to ensure the reliability, scalability, monitoring, and performance of our on-premises services in the ERA product organization. Responsibilities include designing and managing our infrastructure and implementing best practices. The role involves working within cross-functional teams to improve systems and processes and to ensure uptime and efficiency.
Responsibilities:
- Design and maintain monitoring infrastructure
- Create custom dashboards, alerts, and visualization solutions
- Implement distributed tracing and log aggregation systems
- Establish monitoring best practices and SLI/SLO frameworks
- Maintain security compliance for on-premises monitoring tools
- Automate deployment and configuration management
- Collaborate with development teams on application instrumentation
- Participate in on-duty rotations
Requirements:
Core Technologies:
- Advanced knowledge of Grafana, Prometheus (PromQL), OpenTelemetry, and Elasticsearch
Infrastructure:
- Linux administration, networking, on-premises security
Programming:
- Python, Bash, or Go for automation
Experience:
- 3+ years in monitoring/observability
- 2+ years with Grafana/Prometheus in production
- Strong Linux system administration experience
- Proven track record with on-premises infrastructure solutions
Security:
- Enterprise security practices and compliance requirements
Additional Skills:
- Ability to balance technical trade-offs with business needs and prioritize effectively
- Participation in on-duty rotations (24/7 incident support)
Key Deliverables:
- Reduced MTTD/MTTR through effective monitoring
- Comprehensive observability across all systems
- Automated monitoring, deployment, and management
- Security-compliant monitoring practices
Languages:
- English (C1)
- Extra Languages: German, French, Dutch