K8s multicluster observability solution

Development of a multicluster observability solution for a dynamic set of Kubernetes clusters.

Technical Details

The orchestrator was composed of services and Kubernetes operators built in Go. It managed the configuration and lifecycle of per-customer, per-cluster agents and services responsible for metrics collection, aggregation, alerting, and visualisation through dashboards.
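
As a rough sketch of that pattern rather than the original code, here is a minimal controller-runtime reconcile loop in Go; using a plain Deployment as the watched resource and the `customer`/`cluster` labels are assumptions:

```go
package main

import (
	"context"
	"os"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/log"
)

// agentReconciler keeps per-customer, per-cluster agent Deployments in
// their desired state: whenever one drifts or disappears, the real
// orchestrator would re-render its manifests and re-apply them.
type agentReconciler struct {
	client.Client
}

func (r *agentReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	logger := log.FromContext(ctx)

	var agent appsv1.Deployment
	if err := r.Get(ctx, req.NamespacedName, &agent); err != nil {
		// Not found: the agent was deleted out of band and would be
		// re-created here from the rendered manifests.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	// Assumed labels identifying which customer and cluster the agent serves.
	logger.Info("reconciling agent",
		"customer", agent.Labels["customer"],
		"cluster", agent.Labels["cluster"])
	return ctrl.Result{}, nil
}

func main() {
	scheme := runtime.NewScheme()
	_ = clientgoscheme.AddToScheme(scheme)

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{Scheme: scheme})
	if err != nil {
		os.Exit(1)
	}
	if err := ctrl.NewControllerManagedBy(mgr).
		For(&appsv1.Deployment{}).
		Complete(&agentReconciler{Client: mgr.GetClient()}); err != nil {
		os.Exit(1)
	}
	_ = mgr.Start(ctrl.SetupSignalHandler())
}
```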

Collection agents built on top of Prometheus were automatically generated and deployed whenever a new Kubernetes cluster was created. Each cluster was pre-configured via a configuration service so that the agents could connect and scrape services running inside it through the Kubernetes API proxy.
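
That scrape route through the API proxy can be exercised directly with client-go, as in the sketch below; the `monitoring` namespace, `example-app` service, and `metrics` port name are hypothetical:

```go
package main

import (
	"context"
	"fmt"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from pre-configured cluster credentials.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}
	// Fetch a service's /metrics through the API server proxy, i.e. via
	// /api/v1/namespaces/monitoring/services/http:example-app:metrics/proxy/metrics,
	// the same route an external Prometheus agent can scrape through.
	body, err := cs.CoreV1().Services("monitoring").
		ProxyGet("http", "example-app", "metrics", "/metrics", nil).
		DoRaw(context.Background())
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s\n", body)
}
```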

Metrics from the individual Prometheus instances were aggregated by automatically configured and provisioned Cortex instances, then filtered, further aggregated, and exposed to per-customer Grafana instances for visualisation and querying.
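
In configuration terms, each per-cluster Prometheus would carry a remote_write section pointing at Cortex, with the customer as the Cortex tenant. Here is a sketch of generating that snippet in Go; the distributor endpoint and tenant name are assumptions:

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// remoteWrite mirrors the relevant slice of a Prometheus configuration:
// each per-cluster instance pushes its series into the shared Cortex,
// scoped to the owning customer via Cortex's tenant header.
type remoteWrite struct {
	URL     string            `yaml:"url"`
	Headers map[string]string `yaml:"headers,omitempty"`
}

type promConfig struct {
	RemoteWrite []remoteWrite `yaml:"remote_write"`
}

func main() {
	cfg := promConfig{RemoteWrite: []remoteWrite{{
		// Hypothetical in-cluster Cortex distributor endpoint.
		URL: "http://cortex-distributor.monitoring.svc:8080/api/v1/push",
		// X-Scope-OrgID is Cortex's multi-tenancy header.
		Headers: map[string]string{"X-Scope-OrgID": "customer-a"},
	}}}
	out, _ := yaml.Marshal(cfg)
	fmt.Print(string(out))
}
```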

Each cluster also received a dedicated Alertmanager deployment so that issues could be spotted and reacted to quickly. Per-customer alert aggregation dashboards based on Karma were automatically maintained as well, providing better visibility of issues across fleets of clusters; this was especially useful for customers running multiple clusters.
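
A sketch of how such a per-customer Karma configuration could be assembled, one Alertmanager upstream per cluster; the cluster names and Alertmanager addressing scheme are assumptions:

```go
package main

import (
	"fmt"

	"gopkg.in/yaml.v3"
)

// Karma is configured with the list of Alertmanager upstreams it should
// aggregate; one entry per cluster owned by the customer.
type amServer struct {
	Name string `yaml:"name"`
	URI  string `yaml:"uri"`
}

type karmaConfig struct {
	Alertmanager struct {
		Servers []amServer `yaml:"servers"`
	} `yaml:"alertmanager"`
}

func main() {
	// Hypothetical cluster names and internal addressing scheme.
	clusters := []string{"cluster-a", "cluster-b"}
	var cfg karmaConfig
	for _, c := range clusters {
		cfg.Alertmanager.Servers = append(cfg.Alertmanager.Servers, amServer{
			Name: c,
			URI:  fmt.Sprintf("http://alertmanager.%s.internal:9093", c),
		})
	}
	out, _ := yaml.Marshal(cfg)
	fmt.Print(string(out))
}
```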

A service inside the orchestrator was responsible for generating configuration, manifests, and code for the Prometheus workers, alerts, and dashboards by rendering templates stored and maintained in a number of Git repositories. Generic templates were maintained in one place, with per-customer or even per-cluster customisation possible via additional sources.
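
A trimmed-down illustration of that rendering step using Go's text/template; the rule group, metric, and threshold are placeholders rather than real definitions:

```go
package main

import (
	"os"
	"text/template"
)

// A generic alerting-rule template, of the kind kept in a shared Git
// repository; rule name, metric, and threshold here are placeholders.
const ruleTmpl = `groups:
- name: {{ .Customer }}-availability
  rules:
  - alert: HighErrorRate
    expr: job:request_errors:rate5m{customer="{{ .Customer }}"} > {{ .Threshold }}
    for: 10m
    labels:
      severity: page
`

func main() {
	t := template.Must(template.New("rule").Parse(ruleTmpl))
	// Per-customer parameters would come from the additional Git sources.
	_ = t.Execute(os.Stdout, map[string]any{
		"Customer":  "customer-a",
		"Threshold": 0.05,
	})
}
```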

Grafana dashboards were also generated with custom statistics tracking, providing visibility into which dashboards were used often and which rarely by customers and SRE teams.
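
One hypothetical way to approximate that kind of tracking (not necessarily the original mechanism) is to tally views per dashboard UID from HTTP access logs, given that Grafana serves dashboards under `/d/<uid>/<slug>` paths:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"regexp"
)

// Grafana serves dashboards under /d/<uid>/<slug>, so per-dashboard view
// counts can be tallied from plain HTTP access logs fed via stdin.
var dashPath = regexp.MustCompile(`GET /d/([A-Za-z0-9_-]+)/`)

func main() {
	counts := map[string]int{}
	sc := bufio.NewScanner(os.Stdin)
	for sc.Scan() {
		if m := dashPath.FindStringSubmatch(sc.Text()); m != nil {
			counts[m[1]]++
		}
	}
	for uid, n := range counts {
		fmt.Printf("%s\t%d\n", uid, n)
	}
}
```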

Challenges & Solutions

One of the challenges encountered was that some clients' Prometheus workers would frequently run out of memory and crash, even though they were already running on dedicated hosts with 128 GB of RAM.

Initial mitigation attempts optimised metrics and managed resources more carefully, but these only helped temporarily. The root cause was that the initial agent design did not scale: as clients added more clusters and services to be observed, the memory consumption of the Prometheus workers kept growing until it hit the hosts' limits.

The solution was to shard the workers, dividing responsibility for individual clusters and services across separate Prometheus instances. This was implemented in the orchestrator service that generated the workers' configuration and managed their lifecycle.
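
A minimal sketch of the assignment logic behind such sharding, assuming a simple hash-based split:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor deterministically assigns a cluster/service pair to one of n
// Prometheus shards, so each worker holds only a bounded slice of the
// total series and memory grows per shard instead of per fleet.
func shardFor(cluster, service string, n uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(cluster + "/" + service))
	return h.Sum32() % n
}

func main() {
	targets := []struct{ cluster, service string }{
		{"cluster-a", "api"}, {"cluster-a", "db"}, {"cluster-b", "api"},
	}
	for _, t := range targets {
		fmt.Printf("%s/%s -> prometheus-shard-%d\n",
			t.cluster, t.service, shardFor(t.cluster, t.service, 4))
	}
}
```

Prometheus relabelling offers a comparable building block natively via the hashmod action, which can split scrape targets across shard instances at the configuration level.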

Results & Impact

  • Accelerated Cluster Deployment: significantly reduced the time required to bring new clusters online.
  • Enhanced Resiliency: the agent-based architecture and the sharded Prometheus setup made the system markedly more resilient.
  • Scalable Monitoring Solution: templating kept alert definitions and dashboards scalable and maintainable across diverse environments.
  • Reactive System Architecture: the Kubernetes-native design and API enabled seamless integration with other services and rapid response to dynamic events such as provisioning requests or configuration changes.

Skills & Technologies