After spending more than a year in a small operations team in the Health Tech industry running self-managed production Kubernetes clusters, I learned three important lessons:
- Certain tools require a great deal of energy – typically an entire team of operators – to run according to best practices, maintain effectively, and use to their full potential.
- Organizational silos between your operations team and other development teams create complex handoffs when it is time to deploy.
- If your operations team is already backlogged in maintaining a complex tool, troublesome handoffs will slow down your team even further, thus hindering your organization’s ability to deliver software quickly.
In order to prevent other organizations from facing the scenario outlined in lesson #3, I want to propose a Kubernetes maturity model which achieves organization-wide adoption of Kubernetes by making strategic organizational and technical changes at different points in time. The stages of this maturity model are as follows:
- Bootstrapping and Provisioning
- Team-Wide Adoption
- Project Onboarding
- Developer Enablement
- Cross-Team Adoption
Throughout each stage of this maturity model, I’ll discuss the technical, cultural, and procedural changes required within the organization, and also outline the tools which will help us achieve these changes.
1. Bootstrapping and Provisioning
Managed or Self-managed Cluster?
This is the starting point for most organizations: an ambitious individual in the Operations Team is spearheading the adoption of Kubernetes (referred to as k8s from here on) in some capacity. It’s at this point that the organization must decide between a self-managed or managed installation of k8s. Historically, managed installations provisioned and configured the k8s control plane instances on your behalf, allowing you to focus on bootstrapping the worker nodes and actually using k8s. Now, however, all of the managed k8s offerings from AWS, Azure, and GCP can provision worker nodes as well, meaning that after you provision an EKS, AKS, or GKE cluster, you can simply connect to the k8s control plane and start creating k8s objects.
Maintaining self-managed k8s clusters can prove to be complex due to the various control plane components that need to be managed, so it is prudent to opt for a managed cluster whenever possible. That being said, some organizations may have compliance stipulations that require all of their instances to pass certain hardening benchmarks and thus be self-managed, or perhaps these organizations are limited to running k8s within their own datacenter’s private cloud.
Managed and Self-Managed: Infrastructure-as-Code
In either case, Infrastructure-as-Code (IaC) is a best practice as it allows for versioning the cloud resources which constitute the k8s cluster. Terraform can be seen as the industry standard in terms of a cloud-agnostic IaC tool, because it has Providers – essentially pluggable API binaries – which support a large variety of Public and Private Cloud platforms. Some organizations still opt to use the IaC tool provided by their Public Cloud Provider (such as AWS CloudFormation in the case of AWS), because in some cases it may be a better business decision to use a tool that is maintained and supported by the Public Cloud Provider rather than a third-party tool.
Self-Managed: Configuration Management and Kubernetes Installers
With a self-managed cluster, you are responsible for provisioning and configuring the instances which will run the components that constitute a k8s cluster. For example, you will deploy control-plane instances (etcd instances, k8s control plane instances) and worker nodes using Terraform. These instances may then have some baseline compliance configurations done using a Configuration Management tool such as Ansible after they are provisioned (or alternatively these baseline changes can be created beforehand as part of the virtual machine image using HashiCorp Packer). Then, a k8s installation tool such as Rancher Kubernetes Engine can install all of the k8s components on the newly-provisioned instances and bootstrap the k8s cluster.
Other Kubernetes installation tools such as kops take this a step further and actually deploy these instances without the need for an Infrastructure as Code tool (although kops can integrate with Terraform using its Terraform target).
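To make the kops approach concrete, the following is a minimal sketch of a kops cluster spec. The cluster name, region, version, and CIDR ranges are all hypothetical and would be tailored to your environment:

```yaml
# Hypothetical kops cluster spec (name, zones, and CIDRs are illustrative)
apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  name: prod.k8s.example.com
spec:
  kubernetesVersion: 1.18.0
  cloudProvider: aws
  networkCIDR: 10.0.0.0/16
  subnets:
    - name: us-east-1a
      zone: us-east-1a
      cidr: 10.0.1.0/24
      type: Private
```

A spec like this is what kops materializes into cloud instances – or, via its Terraform target, into a Terraform configuration that your IaC workflow can own.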
Cultural and Procedural Implications
At this point, there is only one individual contributor inside the Operations Team performing ad-hoc iterative development for provisioning the k8s clusters for the organization. There could be more than one contributor, but likely still just a small subset of the operations team, where operational knowledge of k8s is not yet widespread, and a formal collaboration process has not yet been established.
2. Team-Wide Adoption
Continuous Delivery: Versioning and Deploying k8s Manifests
Once the initial k8s clusters have been provisioned, the rest of the Operations Team can collaborate with their creator in order to make them production-ready.
It’s important that a structured, singular path to pushing changes to the k8s clusters is established, and that at the same time a source of truth for the k8s objects running on the clusters is created in the form of a git repository. A repository should be created on the Version Control System (VCS) provider used by the organization, and a CI/CD pipeline to deploy these manifests should also be created.
Tools such as Spinnaker provide first-class support for deploying manifests or Helm charts to k8s, and also compensate for deployment strategies that k8s lacks natively, such as Blue-green Deployments. That being said, Spinnaker is a CD tool and not a CI/CD tool, and will need to be used in conjunction with a tool capable of CI – such as CircleCI or Jenkins – in order to build Docker images in the later maturity stages.
With VCS and CI/CD set up, the Operations Team can make Pull Requests against their newly-created repositories, iteratively deploying manifests / Helm charts via their newly-created CD pipelines, in order to make their k8s clusters production-ready.
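As a concrete example, the kind of object versioned in such a repository might look like the following Deployment manifest. The service name, namespace, and image are hypothetical:

```yaml
# A hypothetical Deployment manifest versioned in the cluster's git repository
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payments-api          # hypothetical service name
  namespace: payments
spec:
  replicas: 3
  selector:
    matchLabels:
      app: payments-api
  template:
    metadata:
      labels:
        app: payments-api
    spec:
      containers:
        - name: payments-api
          image: registry.example.com/payments-api:1.4.2  # hypothetical registry and tag
          ports:
            - containerPort: 8080
```

Because every change to this file flows through a PR and a CD pipeline, the repository remains the reviewable record of what should be running on the cluster.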
Building a Production-Ready k8s Cluster
At this point, the Operations Team is cultivating their operational knowledge of k8s and working together iteratively in order to create monitoring and logging solutions. These aspects of a k8s cluster are essential if the cluster is to house a service of any business value. A monitoring solution built on the time-series tool Prometheus and its graphing and alerting counterpart Grafana can be used to create alerts on both availability and performance. A centralized logging solution built on top of an EFK stack (Elasticsearch, Fluentd, and Kibana) is critical for being able to capture and assess application logs in production, but a centralized logging solution can also be used for creating a k8s API audit log and Service Mesh logs later in the final maturity stage.
Pull-based or Push-based Deployments?
Helm charts are more or less the industry standard when it comes to templating, packaging, and deploying manifests to k8s. Helm’s integrated deployment mechanism for Charts is push-based: whether you are deploying Helm Charts using Spinnaker or have a CircleCI step which combines Helm’s templating functionality with kubectl (or kubectl’s integrated kustomize support) to apply the rendered manifests, your deployments are push-based. This means that if the CD step fails after the PR with the manifest changes is merged to master, the git repository is no longer a source of truth for the cluster.
While not a huge concern for most organizations, those concerned with maintaining a git repository which does not diverge from the actual state of their k8s cluster may be interested in using Weave Cloud, which achieves this using a convergence mechanism combined with pull-based deployments.
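To make the push-based pattern concrete, here is a sketch of a CD job in CircleCI syntax that renders a chart and pushes the result to the cluster. The executor image, chart path, and release name are all hypothetical:

```yaml
# Hypothetical CircleCI job illustrating a push-based deployment:
# the pipeline itself pushes rendered manifests to the cluster.
version: 2.1
jobs:
  deploy:
    docker:
      - image: example/helm-kubectl:latest   # hypothetical image containing helm and kubectl
    steps:
      - checkout
      - run:
          name: Render chart and apply manifests
          command: |
            helm template my-service ./charts/my-service | kubectl apply -f -
```

If the `kubectl apply` in this job fails after the merge, the cluster and the repository have diverged – exactly the gap that pull-based, convergence-driven tooling is designed to close.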
Cultural and Procedural Implications
At this point, the Operations Team has established how their k8s clusters are going to be managed by choosing appropriate CI and CD solutions. The team has also collaboratively implemented a centralized logging solution and also a monitoring solution. Throughout this process, they have built up their own knowledge on k8s and are ready to begin onboarding developers who will eventually deploy their services on these k8s clusters.
3. Project Onboarding
Continuous Integration: Containerizing Applications
Kubernetes is a container orchestration tool, so naturally one of the very first tasks undertaken during project onboarding is to containerize the services which are intended to be run on k8s in production. At this point in time, the Operations Team still possesses more knowledge on k8s than any other team. A repository specific to the service (or a monorepo specific to the project) should be created, where developers are given the opportunity to submit PRs for their project’s manifests and Dockerfiles; this encourages them to develop these k8s and Docker image configurations themselves rather than delegate the work entirely to the Operations Team. The Operations Team will still likely be responsible for setting up the CI pipelines corresponding to the project – for example, a CI pipeline which builds these Docker images and pushes them to a registry, and another which runs helm package against a Helm Chart and publishes it to AWS S3, so that a Spinnaker CD pipeline can later install the Helm Chart on a k8s cluster.
Organizations may opt to use a Docker registry service provided by their Public Cloud Provider for simplicity, or alternatively use a cloud-agnostic artifact store with an integrated Docker registry. The latter is appropriate when the organization is already using artifact stores such as Sonatype Nexus to host non-Docker artifacts such as Maven packages.
Enabling Developers to Develop k8s Manifests and Container Images
In order to facilitate learning and reduce trial and error when developers are building their Docker images and k8s manifests for the first time, local k8s cluster tools such as minikube and kind can be leveraged for initial manifest development. These tools can also be used to help developers become familiar with the k8s control plane and the different types of k8s objects that their manifests are going to be representing.
While Docker is the standard tool for building OCI-compliant container images, other tools such as podman and buildah are compatible with Docker but contain additional features which may give more flexibility in the initial experimentation and building phase. These tools are not necessarily intended for Continuous Development – which we will describe in the next maturity stage – but they are highly appropriate for in-depth learning and experimentation.
Contrary to popular belief, developers new to k8s cannot simply google every problem they run into when building Docker images or k8s manifests and click the first Stack Overflow link. The foremost educational k8s resource is the k8s documentation itself, which ranges from simple YAML examples for the Deployment resource all the way to complex topics such as how k8s Service networking is implemented using kube-proxy and iptables. Often overlooked are interactive resources such as Katacoda, which help interactively teach topics such as Liveness and Readiness health checks – k8s concepts which developers need to know when creating manifests for their services.
Cultural and Procedural Implications
At this point, services are deployed to production, and the organization has direct business value invested in k8s. The Operations Team is no longer the sole team with knowledge on how to manage a project on k8s, and the development team is on the right path towards being able to manage the project autonomously, albeit with oversight from the Operations Team. The newly established process formed by the CI and CD pipelines requires internal reviews before integration, and external reviews from the Operations Team before deployment. This review process will reduce risk, and at the same time, the shared knowledge of the tools being used and the collaborative nature of the project will reduce complex handoffs that could potentially slow down software delivery.
4. Developer Enablement
Adopting Kubernetes in Local Development
In this maturity stage, developers are beginning to take advantage of the k8s ecosystem and are adopting it even in local development. Tools such as Skaffold and DevSpace streamline local development by building a project’s containers and deploying its manifests to a local k8s cluster, without the need to set up a container registry. DevSpace even enables hot reloading of applications without the need to rebuild containers.
Improving Application Insights using Tracing
The k8s ecosystem includes tools which are native to k8s and can be used by developers to gain insight into their software as it runs in staging or production. This includes k8s-native Application Tracing tools such as Jaeger. This tool is deployed to k8s, and, once applications are instrumented using Jaeger’s provided libraries, can give insights relating to issue root cause analysis and performance, among other things.
Static Vulnerability Scanning
More often than not, the base Docker image of an onboarded project may not be updated since its onboarding. For example, a Docker image with a base image of Alpine 3.6 will work perfectly in production, but will accumulate unaddressed CVEs if not maintained by having its base image replaced with a more recent one. While containers are arguably smaller attack vectors than virtual machines, this may prove to be an issue for compliance audits which may stipulate static vulnerability scanning even on containers. More important for this maturity stage, however, is encouraging developers to continuously maintain their Docker images by placing static vulnerability scanners at the end of the project’s CI jobs. Tools such as Clair are able to detect the aforementioned CVEs on the base image and will encourage developers to update the base image to the latest Alpine image, for example. Tools such as trivy are able to assess application dependencies for vulnerabilities, which will also encourage continuous maintenance and prevent Docker images in the project from becoming “stale”.
Cultural and Procedural Implications
At this point, the development team in the organization has acquired enough practical knowledge to independently manage their project in k8s and onboard new projects, while maintaining operational and security oversight from the Operations Team.
In addition, the development team has started to form a culture which is enthusiastic about the k8s ecosystem and is able to adopt tools from that space in order to improve the software they are building.
5. Cross-Team Adoption
Governing Service to Service Traffic
As multiple development teams gain expertise in k8s and begin to embrace best-practices in the k8s sphere, a new cross-team standard of operational excellence and culture of adhering to this standard emerges in the organization. One of the most evident signs of widespread k8s adoption within an organization is the presence of services from multiple development teams residing together all on one cluster or on a set of federated clusters.
When this happens, it becomes prudent to implement a service mesh in order to catalogue these services properly, gain visibility into the flow of network traffic between services (east-west traffic), and also govern which services can communicate with one another. Tools such as Linkerd and Istio enable the creation of multi-cluster k8s service meshes. Consul Connect, which has traditionally existed outside of the k8s space, now has a k8s integration. This means that one can use Consul Connect to create a Service Mesh that spans services outside of k8s as well as services residing on k8s, although Istio can also catalogue services outside of k8s through the use of Istio Proxy.
Maintaining the Service Mesh would ultimately be the Operations Team’s responsibility, however development teams would be incentivized to contribute to the repository containing the Service Mesh manifests through Pull Requests, otherwise their services would not be able to connect to other existing services! This enforces the organization-wide culture of collaborating in order to uphold a cross-team operational standard.
Most development teams which use static secrets (static database credentials, static IAM credentials) are wise enough to realize that what they are doing is not ideal: These credentials will eventually need to be rotated, even though the process of doing so is a burden and the incentive to do so is very small. Furthermore, anyone with read access to these credentials could abuse them if they chose to do so. There tends to be an illusion that these are problems that need to be solved in the future and not in the present, and in the meantime, everyone on the team simply has to be careful. With every team in the organization thinking this way, there is going to be a large number of unrotated, poorly secured static credentials, and almost no incentive to rectify this situation.
Use of centralized Secrets Management Solutions such as HashiCorp Vault is not only more secure than the use of static credentials, but also more convenient. Sure, static credentials stored as k8s Secrets objects are convenient in the sense that nothing needs to be done once these objects are created. However, their rotation is troublesome and their security implications – long-lived secrets exposed to humans who have access to see them – are serious. Vault requires an authentication and retrieval method for the dynamic secrets to exist before they can be used in the first place. Once this method is established, secrets are retrieved and rotated transparently. Because they pose less operational overhead than static secrets, there is a larger incentive to use them.
HashiCorp provides a Vault Agent Sidecar injector in the form of a Kubernetes Mutation Webhook Controller, and Vault itself can be deployed onto k8s using the Vault Helm Chart. The Operations Team should be responsible for managing the Authentication Methods, Secrets Engines, and Vault Policies. All of these configurations are manageable by the Terraform Vault provider, which means that a Terraform configuration repository and a CD pipeline can be created by the Operations Team, such that development teams can make PRs in order to request the secrets that they require for their services.
Organization-wide adherence to runtime security using a tool native to k8s is a sign that an organization is highly mature in their adoption of k8s. While not providing any immediate benefit to development teams who run their services on k8s, tools such as Falco provide organizations with audit logs which they can leverage when undergoing security compliance audits and in their day-to-day security monitoring.
Cultural and Procedural Implications
In this last stage of the k8s maturity model, the organization leverages the operational knowledge established within each development team in the previous stage, and expands on it by upholding an organization-wide standard of operational excellence that all teams strive towards.
Every organization is different. There may not be a single Operations Team: there may be an SRE team, a Security Team, a DevOps team, and a Networking team all working together. The original attempt to bring k8s into an organization may not come from a single member of the Operations Team – it may come from two members, or it may come from a development team instead. What this maturity model aims to highlight is generally true regardless: knowledge must be shared within the organization, collaboration must be encouraged, each team must have an incentive to use the tools in question, and each tool must elevate the team rather than encumber it. If these conditions are met, the culture that started with one individual can spread to the rest of their team, then to another team, and finally to the entire organization, which can collectively achieve a culture of operational excellence built on the plethora of tools birthed out of the k8s ecosystem and the organizational processes established while gradually adopting those tools. This culture ultimately enables the organization to deliver software in a low-risk, high-velocity fashion, and is the epitome of what the DevOps movement attempts to achieve.