We are seeking a highly experienced Platform Engineer who specialises in Observability, with a primary focus on the open-source Grafana observability stack. In this role, you will be instrumental in managing the lifecycle of our observability platform, ensuring robust monitoring, logging, tracing and profiling for our applications running on Kubernetes. You will contribute to the architecture, implementation, and continuous improvement of our observability pipeline, enabling teams to monitor and optimise system performance efficiently.
Implement OpenTelemetry within application codebases and manage OTel tooling and services.
Architect, implement, and manage an observability stack based on Grafana, Prometheus, Loki, Mimir, Tempo, and other related technologies within a Kubernetes environment.
Ensure comprehensive monitoring, logging, and tracing coverage for microservices and Kubernetes clusters.
Collaborate with development and platform teams to create meaningful dashboards, alerts, and automated incident responses.
Continuously improve the observability platform for scalability, multi-tenancy, and reliability.
Support and mentor teams in adopting best practices for instrumentation and monitoring.
Implement automation and infrastructure-as-code practices for managing observability infrastructure using Terraform, Helm, and CI/CD pipelines.
Integrate observability tooling with other cloud services and on-premise infrastructure as required.
Ensure security and compliance standards are met, focusing on auditability and data integrity within the observability stack.
You will have a strong passion for observability and a “customer first” mentality, and you will be comfortable assisting developers of all levels. You will also have excellent problem-solving and troubleshooting skills.
Extensive experience working with Kubernetes, particularly in managing observability for containerised applications.
Deep knowledge of the open-source Grafana stack, including Mimir, Loki, Tempo, and Beyla.
Experience building and managing observability pipelines in a cloud environment (AWS, GCP, or Azure).
Experience using SaaS-based observability platforms such as New Relic.
Strong automation skills and experience with IaC tools such as Terraform and Helm.
Proficiency in scripting and programming languages such as JavaScript (Node.js), Python, Go, or shell.
A customer-first mentality, with strong problem-solving and troubleshooting skills.
Experience supporting development teams with production monitoring and root cause analysis.
AWS, Azure, or GCP certifications are highly regarded.