K8s platform provisioning
Design and implementation of a Kubernetes platform provisioning system using ArgoCD, ClusterAPI, and GitLab to streamline cluster management and deployment.
Highlights
- K8s platform provisioning using ClusterAPI
- a hierarchy of Helm charts to install sets of essential services
- security practices to enable both dedicated and multi-tenant clusters
- AWS integration (IRSA, EFS, EBS, IAM, S3, Glue, etc.)
- self-service cluster management for customer teams
- maximum automation
Technical Details
The system followed the standard ClusterAPI architecture: Management Clusters (MC), each managing the lifecycle of multiple Workload Clusters (WC).
Each Management Cluster ran all the necessary ClusterAPI controllers and providers, as well as ArgoCD.
Cluster applications containing the ClusterAPI resource manifests for the Workload Clusters were installed on the Management Clusters. These manifests defined details such as cloud provider resource configurations, machine templates, and node pool settings.
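For illustration, the general shape of such manifests is sketched below; all names, the namespace, the Kubernetes version, and the CIDR are assumptions for the example, not values from the actual platform:

```yaml
# Sketch of a minimal ClusterAPI Workload Cluster definition on AWS.
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: example-wc
  namespace: org-example
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1beta1
    kind: KubeadmControlPlane
    name: example-wc-control-plane
  infrastructureRef:
    apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
    kind: AWSCluster
    name: example-wc
---
# A node pool: a MachineDeployment referencing an AWSMachineTemplate.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: example-wc-workers
  namespace: org-example
spec:
  clusterName: example-wc
  replicas: 3
  template:
    spec:
      clusterName: example-wc
      version: v1.28.5
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: example-wc-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta2
        kind: AWSMachineTemplate
        name: example-wc-workers
```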
Cluster applications also included a cluster autoscaler running in ClusterAPI mode so that the Workload Clusters could scale up or down depending on usage.
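With the autoscaler's ClusterAPI provider, scaling bounds are expressed as annotations on the node pool resources. A minimal sketch (resource names and bounds are illustrative):

```yaml
# The ClusterAPI autoscaler provider reads these annotations from a
# MachineDeployment to determine the node group's scaling range.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: example-wc-workers
  annotations:
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-min-size: "2"
    cluster.x-k8s.io/cluster-api-autoscaler-node-group-max-size: "20"
# The autoscaler itself runs with the ClusterAPI cloud provider, e.g.:
#   --cloud-provider=clusterapi
#   --node-group-auto-discovery=clusterapi:clusterName=example-wc
```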
They also included applications for managed system components (e.g. ingress controller, prometheus, storage controllers, secrets controllers, CNI).
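As an illustrative sketch (the chosen component, chart version, and cluster name are assumptions), such a managed component could be declared as an ArgoCD Application on the MC whose destination is the Workload Cluster registered in ArgoCD:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: example-wc-ingress-nginx
  namespace: argocd
spec:
  project: system
  source:
    # Helm chart source for the managed component.
    repoURL: https://kubernetes.github.io/ingress-nginx
    chart: ingress-nginx
    targetRevision: 4.10.0
  destination:
    # The WC is registered as a named cluster in the MC's ArgoCD.
    name: example-wc
    namespace: ingress-nginx
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```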
From these manifests, the ClusterAPI controllers on the Management Clusters reconciled the corresponding cloud resources (such as VM instances) into the desired state, bootstrapped them, and joined them into the cluster.
Once a Workload Cluster was up, the Management Cluster's ArgoCD installed the system components onto the WC, including a WC-local instance of ArgoCD.
This WC ArgoCD was preconfigured to watch the application repositories of the cluster owner, which allowed them to deploy their own services into the cluster.
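A sketch of what such a preconfigured Application might look like on the WC ArgoCD — the repository URL, paths, and names here are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: team-services
  namespace: argocd
spec:
  project: default
  source:
    # The cluster owner's application repo in GitLab.
    repoURL: https://gitlab.example.com/some-team/deployments.git
    targetRevision: main
    path: manifests
  destination:
    # The WC ArgoCD deploys into its own cluster.
    server: https://kubernetes.default.svc
    namespace: team-services
  syncPolicy:
    # Merged changes are picked up and applied automatically.
    automated:
      prune: true
      selfHeal: true
```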
The maintenance and operation of the platform followed the GitOps workflow: every configuration change, deployment, or upgrade was performed in code first, submitted as a merge request, approved, merged, and then picked up by ArgoCD and applied automatically.
Challenges & Solutions
One of the main challenges was automating the bootstrapping of services on new clusters, especially when they needed inputs or secrets that could only be created once the service was already running. One example was signing the CA certificate of an in-cluster Vault with the parent intermediate CA: Vault had to be started first to generate a CSR, which was then sent off for signing, and the signed certificate imported back into Vault.
In many cases such issues could be alleviated by separating out secrets and pre-generating some configuration with Terraform; in the remaining cases there was no choice but to write supporting scripts or tools that were run manually as part of bootstrapping.
After the initial setup, though, deploying applications with secrets could be fully automated by utilising the in-cluster Vault, the External Secrets Operator, and AWS IRSA.
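For example, an application secret could then be declared as an ExternalSecret that the External Secrets Operator resolves from Vault; the `vault-backend` store and all key names below are assumptions for the sketch:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: app-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    # Assumed ClusterSecretStore configured against the in-cluster Vault.
    name: vault-backend
    kind: ClusterSecretStore
  target:
    # Kubernetes Secret the operator creates and keeps in sync.
    name: app-credentials
  data:
    - secretKey: password
      remoteRef:
        key: secret/data/app
        property: password
```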
Results & Impact
The provisioning system significantly reduced the time spent on cluster maintenance and upgrades, to the point where it became possible to keep up with upstream Kubernetes releases, whereas previously such upgrades had been performed only once every couple of years.
It also drastically improved the Kubernetes team's capacity to handle requests. The platform configuration was well documented, which allowed many changes, such as node pool reconfigurations, to be self-service: the customer team submitted a merge request in GitLab, and an authorised member of the Kubernetes team only had to review and approve it.