The Power Of Many: ReplicaSets
The ReplicaSet is a very useful basic building block in Kubernetes that other objects, like the Deployment object, rely on. As a kind of Pod manager running in your cluster, a ReplicaSet makes sure the desired number and type of a certain Pod is always up and running. Its functionality is based on the notion of desired vs. observed state, so it also provides a fantastic opportunity to talk about the basics of reconciliation loop awesomeness.
If you have looked at some of the manifests files used in the previous blog posts (such as this one, for example), you’ll no doubt have noticed that the object employed to run the sample workload is the Deployment object. Because a Pod template is baked into it, the Deployment may seem to manage these Pods directly, but that’s not the case: the Deployment manages and configures a ReplicaSet, and it is the ReplicaSet that manages the Pods. In Kubernetes, the ReplicaSet is a basic building block for running and managing workloads that other, higher-level objects, such as the Deployment object, rely upon. To lay the foundation for covering the latter in future content, the following sections introduce the ins and outs of the ReplicaSet object: the problem it solves, how it works, its specification, and how to interact with it.
- Limits Of Pod Manifests
- The Pod Manager You Can Trust
- The ReplicaSet Spec
- Working With ReplicaSets
- Cleaning Up
Limits Of Pod Manifests
Imagine you had a microservice doing some kind of search for images (not an OCI container image but, you know, an image to look at) and you wanted to run five instances of it. We’ve learned previously that Pods are the basic unit for running workloads in Kubernetes, so you might be tempted to write a PodSpec like the following…
    apiVersion: v1
    kind: Pod
    metadata:
      name: image-search-1
    spec:
      containers:
        - name: awesome-image-search-service
          # Pretend this were an image ref to your awesome image search service
          image: antsinmyey3sjohnson/hello-container-service:1.0
          ports:
            - name: http
              containerPort: 8081
          resources:
            limits:
              cpu: 200m
              memory: 128Mi
… and submit it to the API server, which would give you one instance of your workload. You could then add four more specs just like the one above, modify the Pod name so it becomes unique in the namespace, and then submit those four, too. Voilà, five instances of your microservice!
Of course, this way of creating the desired number of instances – or replicas – has its drawbacks:
- It’s tedious. Imagine you had to create 20 replicas for 10 microservices – manually creating (and updating!) those PodSpecs would get old pretty quickly even with a powerful text editor.
- It’s error-prone as copy-paste issues would quickly make their appearance.
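- The Pods created in this fashion won’t be replaced automatically. The kubelet restarts crashed containers according to the Pod’s restartPolicy, but if a Pod is deleted or its node fails, nothing creates a new one.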
This last point is the most noteworthy. Pods are one-off singletons, and Kubernetes treats them as ephemeral and disposable units (unless governed by a StatefulSet, which we won’t look at in this blog post). When you created those Pods manually by submitting a PodSpec to the API server, there wouldn’t be a kind of manager making sure the desired number of replicas is always running – in fact, there wouldn’t even be a way to tell Kubernetes about what desired number means in that case to begin with. Essentially, that manager part would be yourself – constantly monitoring your herd of Pods and making sure the desired number is always up and running.
Wouldn’t it be just great to have a cluster-internal entity lifting that burden from your shoulders?
The Pod Manager You Can Trust
This is where the ReplicaSet comes in: it continuously monitors your set of Pods and makes sure the number of Pod replicas running always matches the desired number of replicas. A ReplicaSet can thus be seen as a kind of “Pod manager” that watches over your Pods and creates or deletes some of them whenever the observed number of replicas deviates from the desired number in either direction. For example, if a Pod managed by a ReplicaSet crashes or gets torn down, the ReplicaSet submits a request to the Kubernetes API server to create a new one in order to re-align desired and current state. This is the reason why a ReplicaSet should be used even if you want to run only a single Pod.
Desired And Observed State
If you were given the task of making sure a certain number of Pods of a certain type is always running, and of adjusting that number if need be, you’d do so by simply comparing the number of Pod replicas currently running with the number you were told is desired. ReplicaSets work in exactly this way: they observe the current state and compare it with the desired state, which, in the case of ReplicaSets, is the type of Pod plus the number of replicas running.
The notion of desired vs. observed or current state is expressed in Kubernetes in the form of so-called reconciliation loops. Because those loops and the concept they embody are so central to the self-healing capabilities of Kubernetes, they will be explored in depth in their own blog post, but here’s the basic idea: A Controller implements a reconciliation loop that constantly compares the desired state (which is the state provided to the Kubernetes API by the user or a machine-based actor) to the current state (which is the state currently observed in the cluster). There are many different Controllers that fulfil different responsibilities (and users can even write their own), so what precisely said state comprises can vary significantly between Controllers. What they have in common is that they are constantly running and trying to align the current state with the desired state, if possible.
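To make the basic idea a bit more tangible, here is a minimal Python sketch of a reconciliation loop. It is not how any real Controller is implemented: the State class, the reconcile and run_loop functions, and the action strings are all made up for illustration, and state is reduced to a single replica count.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    """Hypothetical, heavily simplified state: just a replica count."""
    replicas: int


def reconcile(desired: State, observed: State) -> list[str]:
    """Return the actions needed to move the observed state toward the desired one."""
    diff = desired.replicas - observed.replicas
    if diff > 0:
        return ["create pod"] * diff
    if diff < 0:
        return ["delete pod"] * -diff
    return []  # Observed state already matches desired state


def run_loop(desired: State, observed: State, max_iterations: int = 10) -> State:
    """Call reconcile() repeatedly and pretend every action succeeds immediately."""
    for _ in range(max_iterations):
        actions = reconcile(desired, observed)
        if not actions:
            break
        delta = sum(1 if action == "create pod" else -1 for action in actions)
        observed = State(replicas=observed.replicas + delta)
    return observed


if __name__ == "__main__":
    print(reconcile(State(3), State(1)))
    print(run_loop(State(3), State(5)))
```

The important property is that reconcile only ever looks at the difference between the two states, so it does not matter why the observed state drifted in the first place.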
How A ReplicaSet Finds Its Pods
A previous blog post introduced labels and label selectors as an important concept in Kubernetes that enables its objects to form meaningful relationships while maintaining loose coupling. One use case we’ve already explored is how a Service finds the Pods it should forward traffic to, and you’ve probably guessed that the relationship between ReplicaSets and the Pods they manage is expressed by means of labels, too. Beyond that, the metadata.ownerReferences property adds the notion of ownership, which somewhat increases the coupling between owner and owned object (here: ReplicaSet and Pods, respectively). On the other hand, one might argue the ownership reference is merely a technicality: ownership is still established through the same label mechanism, so it does not really increase coupling.
Coupling discussions aside, let’s take a look at a small example. As always, there’s a little manifests file waiting for you over on my GitHub. We’ll take a more detailed look at its contents a bit further down the line; for now, it’s sufficient to know the file defines a Namespace and a simple ReplicaSet. Let’s submit the file to the API server and take a look at the results:
    # As always...
    alias k=kubectl

    # Apply manifests file
    $ k apply -f https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/kubernetes/replicasets/simple-replicaset-with-namespace.yaml

    # View created objects
    $ k -n replicaset-example get all
    NAME                         READY   STATUS    RESTARTS   AGE
    pod/hello-replicaset-8wp8k   1/1     Running   0          3s
    pod/hello-replicaset-ptw7c   1/1     Running   0          3s
    pod/hello-replicaset-p9ljr   1/1     Running   0          3s

    NAME                               DESIRED   CURRENT   READY   AGE
    replicaset.apps/hello-replicaset   3         3         3       3s
The DESIRED and CURRENT columns the output displays for the ReplicaSet reflect the work of the ReplicaSet Controller, which has aligned the current state with the desired state. With the three Pods in place, we can take a look at the ReplicaSet’s label selector and use it to find its Pods:
    # Query label selector
    $ k -n replicaset-example get rs hello-replicaset -o custom-columns=NAME:.metadata.name,LABELS:.spec.selector.matchLabels
    NAME               LABELS
    hello-replicaset   map[app:hello-replicaset]

    # Use label selector to retrieve Pods under management by ReplicaSet
    $ k -n replicaset-example get po --selector="app=hello-replicaset"
    NAME                     READY   STATUS    RESTARTS   AGE
    hello-replicaset-8wp8k   1/1     Running   0          86s
    hello-replicaset-ptw7c   1/1     Running   0          86s
    hello-replicaset-p9ljr   1/1     Running   0          86s
ReplicaSets find their Pods in a very similar way: they query the API server for a list of all Pods in the current namespace, remove inactive Pods from the returned list, and then filter for Pods whose labels match their label selector. In addition, the ReplicaSet will try to claim all Pods thus identified: it acquires every matching Pod that doesn’t have a Controller-type owner reference in its metadata.ownerReferences list yet by inserting itself into this list. Although the ownership reference is established through the label mechanism, it is ultimately the ownership link the ReplicaSet uses to monitor the acquired set of Pods, rather than the label-based querying itself. Beyond this, the Controller-type owner reference is important in the context of Garbage Collection and for preventing different Controller implementations from fighting over the same or an overlapping set of Pods.
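The filter-then-acquire steps described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the Controller’s actual code: the Pod and OwnerRef classes and the claim_pods function are hypothetical, “inactive” is reduced to the Succeeded and Failed phases, and owners are identified by name where the real system relies on UIDs.

```python
from dataclasses import dataclass, field


@dataclass
class OwnerRef:
    kind: str
    name: str
    controller: bool = False


@dataclass
class Pod:
    name: str
    namespace: str
    labels: dict
    phase: str = "Running"
    owner_references: list = field(default_factory=list)


def is_active(pod: Pod) -> bool:
    # Simplification: only finished Pods count as inactive here
    return pod.phase not in ("Succeeded", "Failed")


def matches(selector: dict, labels: dict) -> bool:
    # matchLabels semantics: every key/value pair of the selector must be on the Pod
    return all(labels.get(key) == value for key, value in selector.items())


def controller_of(pod: Pod):
    """Return the Controller-type owner reference of a Pod, or None."""
    return next((ref for ref in pod.owner_references if ref.controller), None)


def claim_pods(rs_name: str, namespace: str, selector: dict, pods: list) -> list:
    """Return the Pods the ReplicaSet owns after trying to acquire unowned matches."""
    claimed = []
    for pod in pods:
        if pod.namespace != namespace or not is_active(pod):
            continue
        if not matches(selector, pod.labels):
            continue
        owner = controller_of(pod)
        if owner is None:
            # Nobody controls this Pod yet: insert ourselves as its controller
            pod.owner_references.append(OwnerRef("ReplicaSet", rs_name, controller=True))
            claimed.append(pod)
        elif owner.kind == "ReplicaSet" and owner.name == rs_name:
            claimed.append(pod)  # Already ours
        # Otherwise another Controller owns the Pod, so we leave it alone
    return claimed
```

Note how a Pod that already carries a Controller-type owner reference is skipped even though its labels match, which is what keeps two Controllers from fighting over it.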
The ReplicaSet Spec
As we’ve previously uncovered, the preferred way to create and modify objects in Kubernetes is by submitting a plain-text specification of their desired state to the API server, and the ReplicaSet is no exception to this rule. Like all specifications, a ReplicaSet specification requires the apiVersion, kind, and metadata.name properties, as well as a spec section. Within the latter, the spec.selector and spec.template properties are mandatory. Thus, a minimal ReplicaSet spec could look like the following (this manifest is not identical to the one in the file you’ve previously applied: the latter additionally defines spec.replicas as well as resource limits for the container, but those properties are not mandatory, so they do not appear in the “minimal” ReplicaSet shown below):
    apiVersion: apps/v1
    kind: ReplicaSet
    metadata:
      name: hello-replicaset
      namespace: replicaset-example
    spec:
      selector:
        matchLabels:
          app: hello-replicaset
      template:
        metadata:
          labels:
            app: hello-replicaset
        spec:
          containers:
            - name: hello-service
              image: antsinmyey3sjohnson/hello-container-service@sha256:e9de17b4fbfc6a1f52d07609e285dab3818412016200f189c68e441bf23effb3
              ports:
                - name: http
                  containerPort: 8081
The two mandatory properties in the spec section perform the following tasks:
- spec.selector: Means for the ReplicaSet to find and acquire its Pods for monitoring and managing them.
- spec.template: One dimension of what the ReplicaSet Controller understands as state. If the observed number of replicas is lower than the desired number of replicas, new Pods will be created based on this template, which therefore defines the type of Pod this ReplicaSet manages.
You might be wondering why spec.replicas is not among the mandatory properties – after all, it defines the second and remaining dimension of state as understood by the ReplicaSet Controller. Of course, a ReplicaSet needs a well-defined number of replicas (even if that number is zero), so if omitted, the value will default to one.
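The rules for the spec section can be summed up in a small Python sketch. The function normalize_replicaset_spec is hypothetical and only mimics what the API server does on your behalf: it insists on the two mandatory properties and fills in the default for spec.replicas. It also checks that the labels in the Pod template satisfy the selector, because the API server rejects a ReplicaSet whose own template would produce Pods it cannot select.

```python
def normalize_replicaset_spec(spec: dict) -> dict:
    """Validate the mandatory properties of a ReplicaSet spec and apply defaults."""
    for key in ("selector", "template"):
        if key not in spec:
            raise ValueError(f"spec.{key} is required")

    selector = spec["selector"].get("matchLabels", {})
    template_labels = spec["template"].get("metadata", {}).get("labels", {})
    if not all(template_labels.get(k) == v for k, v in selector.items()):
        raise ValueError("spec.template.metadata.labels must satisfy spec.selector")

    # An explicit 0 is kept as-is; only a missing value defaults to 1
    return {**spec, "replicas": spec.get("replicas", 1)}


if __name__ == "__main__":
    minimal = {
        "selector": {"matchLabels": {"app": "hello-replicaset"}},
        "template": {"metadata": {"labels": {"app": "hello-replicaset"}}},
    }
    print(normalize_replicaset_spec(minimal)["replicas"])
```

The sketch only covers matchLabels; selectors that use matchExpressions are left out to keep it short.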
Working With ReplicaSets
In the following sections, we’re going to interact with the previously created ReplicaSet in order to explore its characteristics and capabilities (in case you haven’t created the ReplicaSet yet: k apply -f https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/kubernetes/replicasets/simple-replicaset-with-namespace.yaml).
The kubectl describe command provides us with a lot of useful details about a ReplicaSet:
    # Get list of ReplicaSets
    $ k -n replicaset-example get rs
    NAME               DESIRED   CURRENT   READY   AGE
    hello-replicaset   3         3         3       43s

    # Retrieve ReplicaSet details
    $ k -n replicaset-example describe rs hello-replicaset
    Name:         hello-replicaset
    Namespace:    replicaset-example
    Selector:     app=hello-replicaset
    Labels:       <none>
    Annotations:  <none>
    Replicas:     3 current / 3 desired
    Pods Status:  3 Running / 0 Waiting / 0 Succeeded / 0 Failed
    Pod Template:
      # ... omitted for brevity
    Events:
      Type    Reason            Age   From                   Message
      ----    ------            ----  ----                   -------
      Normal  SuccessfulCreate  62s   replicaset-controller  Created pod: hello-replicaset-8lm9n
      Normal  SuccessfulCreate  62s   replicaset-controller  Created pod: hello-replicaset-nsbwp
      Normal  SuccessfulCreate  62s   replicaset-controller  Created pod: hello-replicaset-mzcdp
Among other things, this output shows the state of all replicas managed by this ReplicaSet, as well as the label selector it uses to acquire its Pods. What’s also interesting is the list of events, which describes the actions of the ReplicaSet. Above, right after the ReplicaSet was created, it created three replicas, so there are three Pod creation events. If you delete a Pod acquired by this ReplicaSet, it will notice it’s down one replica and spawn a new one, in which case the list of events will contain another Pod creation event:
    # Delete Pod
    $ k -n replicaset-example delete pod hello-replicaset-nsbwp
    pod "hello-replicaset-nsbwp" deleted

    # Query ReplicaSet description again
    $ k -n replicaset-example describe rs hello-replicaset
    # ... other output omitted
    Events:
      Type    Reason            Age    From                   Message
      ----    ------            ----   ----                   -------
      Normal  SuccessfulCreate  5m51s  replicaset-controller  Created pod: hello-replicaset-8lm9n
      Normal  SuccessfulCreate  5m51s  replicaset-controller  Created pod: hello-replicaset-nsbwp
      Normal  SuccessfulCreate  5m51s  replicaset-controller  Created pod: hello-replicaset-mzcdp
      Normal  SuccessfulCreate  83s    replicaset-controller  Created pod: hello-replicaset-vw9p8
Similarly, all errors the ReplicaSet faces will be listed here.
The central capability of a ReplicaSet – enabled by its reconciliation loop, which understands the number of desired and currently available replicas as one dimension of its state – is that it can spawn new replicas or delete existing ones in case the number of available replicas deviates from the number of desired replicas. Thus, to scale a ReplicaSet up or down, the only thing we have to do is adjust the number of desired replicas, and there are two ways for achieving this: imperative scaling and declaratively adjusting the ReplicaSet spec.
The kubectl client offers the scale function for all objects that know the notion of a number of replicas (Deployment, ReplicaSet, ReplicationController, and StatefulSet). In our case, we want to scale a ReplicaSet:
    # Scale ReplicaSet from 3 to 5 replicas
    $ k -n replicaset-example scale rs hello-replicaset --replicas=5
    replicaset.apps/hello-replicaset scaled

    # View list of Pods
    $ k -n replicaset-example get po --selector="app=hello-replicaset"
    NAME                     READY   STATUS    RESTARTS   AGE
    hello-replicaset-mzcdp   1/1     Running   0          17m
    hello-replicaset-8lm9n   1/1     Running   0          17m
    hello-replicaset-vw9p8   1/1     Running   0          13m
    hello-replicaset-6tfq8   1/1     Running   0          11s
    hello-replicaset-6g476   1/1     Running   0          11s
As you can see, the ReplicaSet has spawned two new replicas based on the Pod template defined by the spec.template property in its specification (and, accordingly, the event list the kubectl describe command yields for this ReplicaSet contains two new successful Pod creation events).
Scaling a ReplicaSet in this way is quick and very straightforward, but its imperative nature can be problematic. For example, let’s say you had two Kubernetes clusters, one of which is supposed to mirror the state of the other such that the two are exact copies of one another (one of my current clients does this with four clusters). Let’s further assume a GitOps approach is employed: declarative descriptions of desired cluster state are checked into version control, and a continuous delivery tool like Argo CD makes sure the cluster state always mirrors the declarative state description. If your ReplicaSet on cluster A is hit with a sudden increase in load and you use the imperative kubectl scale command to quickly spawn more Pods, the state on cluster A deviates not only from the state on cluster B, but also from the state described in version control. So, even if the load persists, your ReplicaSet will be scaled down again as soon as the next synchronization takes place. The much safer bet is therefore to perform scaling declaratively.
Declarative scaling simply means that the number of replicas is provided in the ReplicaSet specification, which is the declarative, plain-text description of the ReplicaSet’s desired state. The previously introduced manifests file defines three replicas, so after having imperatively scaled the ReplicaSet to five above, applying the file again scales it back down to three:
    # Apply manifests file
    $ k apply -f https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/kubernetes/replicasets/simple-replicaset-with-namespace.yaml
    namespace/replicaset-example unchanged
    replicaset.apps/hello-replicaset configured

    # Retrieve Pod list
    $ k -n replicaset-example get po --selector="app=hello-replicaset"
    NAME                     READY   STATUS    RESTARTS   AGE
    hello-replicaset-mzcdp   1/1     Running   0          57m
    hello-replicaset-8lm9n   1/1     Running   0          57m
    hello-replicaset-vw9p8   1/1     Running   0          53m
This means if you wanted to scale the ReplicaSet by changing its declarative state description, you’d simply have to adjust the number of replicas in the manifests file and apply it again:
    # Download manifests file
    $ curl https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/kubernetes/replicasets/simple-replicaset-with-namespace.yaml > replicaset.yaml

    # Increase number of replicas to 5
    $ sed 's/replicas: 3/replicas: 5/' replicaset.yaml > adjusted-replicaset.yaml

    # Apply adjusted manifests file
    $ k apply -f adjusted-replicaset.yaml
    namespace/replicaset-example unchanged
    replicaset.apps/hello-replicaset configured

    # Check number of Pods
    $ k -n replicaset-example get po --selector="app=hello-replicaset"
    NAME                     READY   STATUS    RESTARTS   AGE
    hello-replicaset-mzcdp   1/1     Running   0          71m
    hello-replicaset-8lm9n   1/1     Running   0          71m
    hello-replicaset-vw9p8   1/1     Running   0          67m
    hello-replicaset-tmdkb   1/1     Running   0          21s
    hello-replicaset-fqjxj   1/1     Running   0          21s
Coming back to our previous example where two Kubernetes clusters needed to be kept in sync: If you modified the declarative description of the desired number of replicas, updated it in version control, and let Argo CD synchronize the state, there wouldn’t be any room for errors such as updating one cluster and forgetting about the other, or an imperatively created state deviation getting overwritten by an Argo CD synchronization. Updating the declarative state description is therefore a lot more elegant and should be preferred over making changes imperatively.
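The two-cluster scenario is easy to play through with a toy model in Python. Everything here is hypothetical: clusters and version control are plain dictionaries mapping a ReplicaSet name to its replica count, imperative_scale stands in for kubectl scale, and gitops_sync stands in for a synchronization run. How and when a real tool such as Argo CD reverts drift depends on its sync configuration.

```python
def imperative_scale(cluster: dict, name: str, replicas: int) -> None:
    """Stand-in for kubectl scale: changes the live state of one cluster only."""
    cluster[name] = replicas


def gitops_sync(git_state: dict, cluster: dict) -> list[str]:
    """Overwrite live state with the state in version control.

    Returns the names of all objects whose live state had drifted.
    """
    drifted = [name for name, replicas in git_state.items() if cluster.get(name) != replicas]
    cluster.update(git_state)
    return drifted


if __name__ == "__main__":
    git_state = {"hello-replicaset": 3}
    cluster_a = dict(git_state)
    cluster_b = dict(git_state)

    # Quick fix under load: cluster A now differs from cluster B and from Git
    imperative_scale(cluster_a, "hello-replicaset", 5)
    print(cluster_a, cluster_b)

    # The next sync silently undoes the imperative change
    print(gitops_sync(git_state, cluster_a), cluster_a)

    # Declarative scaling: change Git once, then sync every cluster
    git_state["hello-replicaset"] = 5
    for cluster in (cluster_a, cluster_b):
        gitops_sync(git_state, cluster)
    print(cluster_a, cluster_b)
```

With the declarative route, there is exactly one place to change the number, and both clusters end up identical without anyone having to remember the second one.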
This is a good opportunity to answer a question you might have had in the context of scaling: Which replicas get deleted if I scale down a ReplicaSet? The general response to this is that the ReplicaSet’s reconciliation loop is based on the assumption that the Pods it manages are literally replicas, i.e. one Pod out of this set is entirely the same as any other from this set with regards to its state (which, in turn, is the reason why a ReplicaSet works great for stateless or nearly stateless applications, but is not suited to running stateful applications). Thus, a ReplicaSet can delete any of those Pods when the number of currently running replicas is higher than the number of desired replicas. However, the implementation of the ReplicaSet Controller suggests that younger replicas are preferred for deletion, i.e. younger replicas will be deleted first (for Pod deletion, see this function in the Controller’s source code).
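As a rough illustration of the “younger replicas go first” preference, the following Python sketch picks deletion candidates purely by creation time. This is a deliberate simplification: the function pods_to_delete and the dictionary layout are made up, the timestamps are arbitrary, and the real Controller ranks Pods on more criteria than age alone (for example, whether they are scheduled and ready).

```python
from datetime import datetime, timedelta


def pods_to_delete(pods: list[dict], desired: int) -> list[str]:
    """Return the names of surplus Pods, preferring the most recently created ones."""
    surplus = len(pods) - desired
    if surplus <= 0:
        return []
    youngest_first = sorted(pods, key=lambda pod: pod["created"], reverse=True)
    return [pod["name"] for pod in youngest_first[:surplus]]


if __name__ == "__main__":
    now = datetime(2024, 1, 1, 12, 0, 0)  # Arbitrary reference time
    pods = [
        {"name": "hello-replicaset-mzcdp", "created": now - timedelta(minutes=17)},
        {"name": "hello-replicaset-8lm9n", "created": now - timedelta(minutes=17)},
        {"name": "hello-replicaset-vw9p8", "created": now - timedelta(minutes=13)},
        {"name": "hello-replicaset-6tfq8", "created": now - timedelta(seconds=11)},
        {"name": "hello-replicaset-6g476", "created": now - timedelta(seconds=11)},
    ]
    print(pods_to_delete(pods, desired=3))
```

Fed with the ages from the scaling example earlier in this post, the sketch selects the two Pods that were only a few seconds old, which matches the Pods that disappeared when the ReplicaSet was scaled back from five to three.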
The fact that ReplicaSets acquire their Pods based on their labels enables two interesting use cases: pod adoption and pod quarantining.
I’ve prepared another manifests file defining a Namespace, a ReplicaSet, and a Pod. In particular, the number of desired replicas is three. We’re going to look at this file in just a minute, but let’s apply it to our cluster first and see what happens:
    # Apply manifests file
    $ k apply -f https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/kubernetes/replicasets/pod-adoption-example.yaml
    namespace/pod-adoption-example created
    replicaset.apps/adopting-replicaset created
    pod/adopted-pod created

    # View list of Pods
    $ k -n pod-adoption-example get po
    NAME                        READY   STATUS        RESTARTS   AGE
    adopting-replicaset-qcwws   1/1     Running       0          6s
    adopting-replicaset-swjtt   1/1     Running       0          6s
    adopting-replicaset-x6zwz   1/1     Terminating   0          6s
    adopted-pod                 1/1     Running       0          6s
That’s odd! The manifests file defines a Pod and a ReplicaSet having three replicas, so four Pods should be running at all times – right? Then why is one Terminating immediately after the manifests file has been applied? Let’s check the Pod list again:
    $ k -n pod-adoption-example get po
    NAME                        READY   STATUS    RESTARTS   AGE
    adopting-replicaset-qcwws   1/1     Running   0          2m7s
    adopting-replicaset-swjtt   1/1     Running   0          2m7s
    adopted-pod                 1/1     Running   0          2m7s
As you’d expect, the Terminating Pod has, well, terminated, and there are now only three Pods running. Why is this the case? Let’s take a look at the ReplicaSet’s label selector:
    $ k -n pod-adoption-example get rs adopting-replicaset -o custom-columns=NAME:.metadata.name,SELECTOR:.spec.selector.matchLabels
    NAME                  SELECTOR
    adopting-replicaset   map[app:hello-service]
Next, which labels are present on adopted-pod?
    $ k -n pod-adoption-example get po adopted-pod --show-labels
    NAME          READY   STATUS    RESTARTS   AGE     LABELS
    adopted-pod   1/1     Running   0          4m10s   app=hello-service
This explains what happened: The ReplicaSet first spawns three new Pods, then runs another filter on the Pod list within the current namespace, and notices there are four Pods matching its label selector. It then acquires the Pod thus far not under its management, which is possible because the Pod explicitly created by means of the PodSpec in the manifests file doesn’t carry any Controller-type owner references in its metadata.ownerReferences list (after being freshly created, it doesn’t carry any references in there, for that matter), so no other Controller blocks taking ownership.
We can verify this by taking a look at said list:
    $ k -n pod-adoption-example get po adopted-pod -o yaml
    apiVersion: v1
    kind: Pod
    metadata:
      # ...
      labels:
        app: hello-service
      name: adopted-pod
      namespace: pod-adoption-example
      ownerReferences:
      - apiVersion: apps/v1
        blockOwnerDeletion: true
        controller: true
        kind: ReplicaSet
        name: adopting-replicaset
        uid: b5a6dc5d-469f-4617-9fde-3505e1bc7c35
      resourceVersion: "78648"
      uid: bd693099-0bdc-401c-8182-1a0909597b5c
    spec:
      # ...
Since the ReplicaSet is now one replica over the desired number, it terminates one of the existing Pods in order to get down to three replicas, which gives us the final Pod list.
A word of caution: The behavior demonstrated above means you have to be careful with labels on manually created Pods (or, more generally, all Pods not carrying a Controller-type owner reference). Existing ReplicaSets (or indeed all Controllers exhibiting the filter-then-acquire kind of behavior) will adopt such Pods if the Pods’ labels satisfy their selector, that is, if every label the selector requires is present on the Pod with the expected value. Additional labels on the Pod do not prevent adoption.
What we’ve done above also works the other way around: We can disassociate a Pod from its owning ReplicaSet by changing or removing the labels the ReplicaSet’s selector matches, so that the Pod no longer satisfies the selector.
Let’s disassociate the adopted-pod from the ReplicaSet:
    # Perform in-place edit of Pod labels ('hello-service' -> 'quarantined-hello-service')
    $ k -n pod-adoption-example edit po adopted-pod
    # ...
      labels:
        app: quarantined-hello-service
    # ...
    pod/adopted-pod edited

    # Retrieve Pod list
    $ k -n pod-adoption-example get po --show-labels
    NAME                        READY   STATUS    RESTARTS   AGE   LABELS
    adopting-replicaset-qcwws   1/1     Running   0          25m   app=hello-service
    adopting-replicaset-swjtt   1/1     Running   0          25m   app=hello-service
    adopted-pod                 1/1     Running   0          25m   app=quarantined-hello-service
    adopting-replicaset-trzfc   1/1     Running   0          33s   app=hello-service
The ReplicaSet noticed there are only two Pods left matching its label selector, so it spawned a new one to re-align the observed with the desired state. Meanwhile, the now disassociated Pod simply keeps running and is no longer under management by the ReplicaSet, which we can verify by taking a look at its metadata.ownerReferences list:
    $ k -n pod-adoption-example get po adopted-pod -o custom-columns=OWNERS:.metadata.ownerReferences
    OWNERS
    <none>
This can be useful, for example, when a Pod misbehaves – disassociating it from its ReplicaSet will cause the latter to spawn a new, healthy one, but the misbehaving Pod will still be available for live investigation and troubleshooting.
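The quarantining mechanics can be modeled in a few lines of Python. As before, this is a sketch under simplifying assumptions rather than the Controller’s real logic: Pods are plain dictionaries, ownership is a single owner field instead of a list of owner references, and the function release_and_count_missing is hypothetical.

```python
def matches(selector: dict, labels: dict) -> bool:
    # Every key/value pair of the selector must be present on the Pod
    return all(labels.get(key) == value for key, value in selector.items())


def release_and_count_missing(rs_name: str, selector: dict, desired: int, pods: list) -> int:
    """Release owned Pods that no longer match and return how many Pods to create."""
    still_owned = 0
    for pod in pods:
        if pod["owner"] != rs_name:
            continue
        if matches(selector, pod["labels"]):
            still_owned += 1
        else:
            pod["owner"] = None  # Release: the Pod keeps running, just unmanaged
    return max(desired - still_owned, 0)


if __name__ == "__main__":
    selector = {"app": "hello-service"}
    pods = [
        {"name": "adopting-replicaset-qcwws", "labels": {"app": "hello-service"}, "owner": "adopting-replicaset"},
        {"name": "adopting-replicaset-swjtt", "labels": {"app": "hello-service"}, "owner": "adopting-replicaset"},
        {"name": "adopted-pod", "labels": {"app": "hello-service"}, "owner": "adopting-replicaset"},
    ]

    # Quarantine the Pod by changing its label
    pods[2]["labels"]["app"] = "quarantined-hello-service"
    print(release_and_count_missing("adopting-replicaset", selector, 3, pods))
    print(pods[2]["owner"])
```

The relabeled Pod is released rather than deleted, and the ReplicaSet is left one replica short, which is why a replacement Pod shows up in the Pod list.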
Cleaning Up
You can delete all resources created and used in the previous sections by running the following two commands:
    $ k delete -f https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/kubernetes/replicasets/simple-replicaset-with-namespace.yaml
    namespace "replicaset-example" deleted
    replicaset.apps "hello-replicaset" deleted

    $ k delete -f https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/kubernetes/replicasets/pod-adoption-example.yaml
    namespace "pod-adoption-example" deleted
    replicaset.apps "adopting-replicaset" deleted
    pod "adopted-pod" deleted
A ReplicaSet is responsible for managing a set of Pods. It does so by using a reconciliation loop that constantly compares the desired and the observed state of the Pods in question.
Reconciliation loops are a very powerful concept in Kubernetes crucial to the system’s self-healing capabilities, and what each so-called Controller implementing a reconciliation loop understands as state can vary significantly depending on its responsibilities. The ReplicaSet Controller understands the number of replicas and the type of Pod as state, and whenever the currently running number of replicas deviates from the desired number of replicas, the Controller will spawn new ones of the desired type, or delete existing ones. Both dimensions of state are provided in the ReplicaSet’s manifest using the spec.replicas and spec.template properties, respectively.
A ReplicaSet finds the set of Pods it should be managing by means of labels. Additionally, whenever a ReplicaSet finds new Pods matching its label selector, it will try to acquire them and insert itself into the metadata.ownerReferences list of all Pods in question, thus establishing an ownership link that is used to monitor the Pods it has acquired. Acquiring a Pod in this way will only work if its metadata.ownerReferences list does not contain a Controller-type item yet, so in addition to Pod monitoring, the ownership link prevents different Controllers from fighting over the same Pod.
ReplicaSets can be scaled either by using the kubectl scale command or by updating the ReplicaSet’s declarative state description. Because kubectl scale is imperative, using it for scaling operations invites errors such as modifying the number of desired replicas of the ReplicaSet on one Kubernetes cluster and forgetting to make the same modification on another Kubernetes cluster in an environment where both clusters should be exact copies of each other. For this reason, and because of the general advantages of declarative state descriptions, a ReplicaSet should be scaled declaratively by updating the spec.replicas property in its manifest rather than imperatively.
The Deployment object is a higher-level abstraction frequently employed in Kubernetes that builds on ReplicaSets. In the next blog post, we’ll take a look at how Deployments leverage ReplicaSets and how they complement their functionality.