Chaos Monkey

In case you’ve always wanted a pet monkey, this one is definitely not what you’re looking for, because this particular specimen is one noisy fellow indeed. Designed to wreak havoc among the unsuspecting members of your Hazelcast cluster, it will let you test your cluster’s resilience before your production environment gets a chance to do it.

The concept of a Chaos Monkey was first introduced in the context of Hazeltest by the previous blog post as a means to “spice up” the primary job of Hazeltest’s runners, namely, to generate load on the Hazelcast cluster under test. The Chaos Monkey was described as an automated actor within Hazeltest whose goal is to deliberately wreak havoc among Hazelcast cluster members in order to test the cluster’s resilience towards member failures.

Since the last blog post was published, I’ve worked on the first iteration of the Chaos Monkey feature, and in the following sections, you’ll learn everything you need to know in order to successfully operate the first of the available Chaos Monkeys, the member killer monkey.

Chaos, Please!

We’ve talked about the concept of load dimensions within Hazeltest a few times now, most recently during the Hazeltest live stream. The term refers to the different dimensions or “vectors” along which load can be generated on a Hazelcast cluster. Another way to describe this would be to say that from Hazelcast’s perspective, “load” can have multiple sources. Currently, Hazeltest can create load along four dimensions:

  1. Number of entries in data structures
  2. Size of entries in those data structures
  3. Number of data structures in the cluster
  4. Number of clients accessing the cluster

In a world without Murphy’s Law, if a Hazelcast cluster were able to successfully handle load along all four of these dimensions – insert whatever definition you have for “successful” in your specific use case here –, then it would be safe to ship the configuration defining this cluster to production. But, alas, the aforementioned law exists in this world and things do indeed go wrong from time to time, and sometimes such things seem to have the inconvenient habit of going wrong in production environments. As a complex distributed system, Hazelcast is susceptible to network outages and member failures, so it’s up to you to make sure the Hazelcast cluster can continue to function properly (at least to a certain degree, which is again defined by your specific use case) even when some of its members are disconnected from the cluster or outright failing.

This is where the new Chaos Monkey feature comes in: Its goal is to simulate these issues in a safe environment, far away from production, so that flaws in the cluster’s configuration regarding the resilience toward such issues can be spotted, analyzed, and corrected.

The License To Kill… Hazelcast Members

The member killer monkey is the first Chaos Monkey, and its job is to simulate failures of Hazelcast cluster members. We’re going to approach the “How” of operating the member killer monkey from a general perspective in order to then look at example configurations and what impact each configuration has on the Hazelcast cluster, before concluding with a short demo.

The Idea

To begin, we’re going to look at the member killer monkey from a general angle: How does it need to behave, and what configuration options does it need to offer, in order for the feature to generate value for you, the user?

First of all, you’ll probably want some kind of mechanism to determine how long the monkey is supposed to run and when it should stop, and for the configuration to be consistent with Hazeltest’s runners, it would make sense to simply use a configurable number of runs as an upper boundary for the monkey’s runtime.

The next thing you may want to configure in the member killer monkey is the degree of “harshness” or the “difficulty setting” it will impose on the Hazelcast cluster with its actions. This difficulty setting can be described in terms of three questions, namely, how often the monkey will strike, how many members it will kill when it strikes, and how much time it will give the members to shut down gracefully (if any). The current iteration of the member killer monkey lets you configure the frequency of action as well as the member grace period, but does not yet support killing more than one member per run.

Finally, since Hazeltest is designed to work best on Kubernetes, the member killer monkey assumes the target Hazelcast cluster runs on Kubernetes, too. In this environment, there are two possible ways to interact with the cluster, the more obvious being the in-cluster access, meaning the Hazeltest instance running the monkey resides in the same Kubernetes cluster. The other possible access mode is out-of-cluster access, and while this may be the slightly less obvious mode since you’ll usually want to run Hazeltest inside Kubernetes next to the Hazelcast cluster it tests, it can still be useful for testing and debugging purposes, e.g. to make sure a configuration yields the desired results before deploying it to Kubernetes.

With this short dry-run in mind, let’s take a look at what configuration the thoughts outlined above yield.

The Configuration

The following configuration structure is a manifestation of the thoughts outlined above:

chaosMonkeys:
  memberKiller:
    enabled: true
    numRuns: 100
    chaosProbability: 0.5
    memberAccess:
      mode: k8sInCluster
      targetOnlyActive: true
      k8sOutOfCluster:
        kubeconfig: default
        namespace: hazelcastplatform
        labelSelector: app.kubernetes.io/name=hazelcastimdg
      k8sInCluster:
        labelSelector: app.kubernetes.io/name=hazelcastimdg
    sleep:
      enabled: true
      durationSeconds: 60
      enableRandomness: false
    memberGrace:
      enabled: true
      durationSeconds: 30
      enableRandomness: true

One notable aspect of this configuration is the use of chaosMonkeys in the plural form, indicating that the Chaos Monkey mechanism was built for extensibility – the memberKiller monkey is the first of potentially many more Chaos Monkeys that could be added in the future. Moving on to the memberKiller config: with the theory from the previous section in mind, the elements of this configuration probably won’t strike you as particularly surprising. Yet, each property comes with background information that naming alone cannot convey, so let’s take a short look at each one to uncover everything you might want to know in order to successfully operate the member killer monkey:

  • enabled: Allows you to enable or disable the monkey. The implication of this is that you could, for example, define a set of 10 Hazeltest instances that generate load, and use a dedicated “killer instance” with no runner, but only the member killer monkey activated. This might be useful, for example, when you want the load-generating instances to be run inside the Kubernetes cluster along with the Hazelcast members, but execute the killer instance locally in your IDE for testing purposes.
  • numRuns: The monkey will perform the number of runs stored here, and terminate once it has performed the last run. It’s worth noting that once all runners and the member killer monkey have performed their runs, the Hazeltest instance will still stay active, so you can query its status endpoint, /status.
  • chaosProbability: In each run, the monkey will choose and kill one Hazelcast cluster member with the probability specified by this property (the sketch following this list illustrates the per-run logic). Note that the criterion for the monkey to terminate is indeed the number of runs it has performed, rather than the number of members killed, and there is currently no configuration option to specify termination behavior based on the number of members killed.
  • memberAccess.mode: The string provided for this property determines how the member killer monkey accesses the members of the target Hazelcast cluster. In addition, this property tells the monkey which sub-config should be loaded and applied. We’re going to look at each member access mode and its corresponding sub-config below.
  • memberAccess.targetOnlyActive: A Hazelcast member can be alive without being active yet. In Kubernetes, this distinction is made in terms of liveness vs. readiness, and I admit that since Hazeltest and its features are designed to work best on Kubernetes, it might have been more precise to simply call the property targetOnlyReady, so by the time you’re reading this, I may have already renamed the property. Either way, though, the purpose of the property is the same: to allow you to specify whether the member killer monkey should only consider members that have become active (ready, in Kubernetes speak), or kill them even prior to that (“spawn-killing” them, if you will).
  • memberAccess.k8sOutOfCluster.*: As you’ve probably guessed based on its naming, this object holds the properties for the out-of-cluster Hazelcast member access, therefore acting as the sub-configuration for this access mode – that is, if you’ve set memberAccess.mode to k8sOutOfCluster, the member killer monkey will attempt to access the Hazelcast members from outside the Kubernetes cluster by means of the configuration the memberAccess.k8sOutOfCluster object defines. (A sketch of how both access modes can translate into Kubernetes API calls follows a bit further below.)
  • memberAccess.k8sOutOfCluster.kubeconfig: To access Hazelcast members from outside the Kubernetes cluster, the member killer monkey needs to know the address of the cluster’s API server and how to authenticate to it. Both pieces of information are typically stored in a file known as the Kubeconfig, and the memberAccess.k8sOutOfCluster.kubeconfig property tells the monkey where to find this file. In the example above, the property contains the value default, which means the monkey will look for the file in $HOME/.kube/config, where it is usually stored. (On a side note, the necessity to use a Kubeconfig in this access mode is the reason you’d typically want to use it for out-of-cluster access – though possible, it would be highly unusual to mount this file into a Pod and have Hazeltest communicate with the Kubernetes cluster’s API server from within the cluster.)
  • memberAccess.k8sOutOfCluster.namespace: Allows you to specify the namespace in the Kubernetes cluster that contains the target Hazelcast member Pods.
  • memberAccess.k8sOutOfCluster.labelSelector: With Kubernetes cluster access and the target namespace sorted out, you can use this property to inform the member killer monkey which Pods within that namespace to consider for killing. This is necessary because the namespace in question could be populated by non-Hazelcast Pods, too (the most obvious example would be the Hazelcast Management Center running along with the member Pods), so the set of Pods it considers must be limited to only the Hazelcast member Pods.
  • memberAccess.k8sInCluster.*: Being the sub-configuration for the second of the two currently supported access modes, this object contains all properties (only one, really) related to the in-cluster access mode, which is the mode you’ll typically want to use when running Hazeltest along with the Hazelcast member Pods in the same namespace of the same Kubernetes cluster. While this access mode is very simple in terms of its configuration, there are two things at work behind the curtains, and it’s useful to keep them in mind:
    • Firstly, to be allowed to delete Pods via the Kubernetes API server, the in-cluster access mode relies on Kubernetes-native RBAC artifacts (a ServiceAccount, a Role, and a RoleBinding to connect the two). If you use the Hazeltest Helm chart, you’ll have nothing to worry about except making sure the features.useDeletePodsServiceAccount property in the chart’s values.yaml file is set to true (which is the default).
    • Secondly, there is the matter of figuring out the Hazelcast member Pods’ namespace (as you may have noticed above, the sub-config for in-cluster access does not define a namespace property). The in-cluster access mode assumes that the Hazeltest Pod and the target Hazelcast Pods run in the same namespace (support for cross-namespace access might be added in the future). The namespace information can be obtained from either the /var/run/secrets/kubernetes.io/serviceaccount/namespace file (which Kubernetes mounts into the Pod file system by default) or the POD_NAMESPACE environment variable. If neither of these methods succeeds, the member killer monkey will “fail fast” by reporting an error and terminating.
  • memberAccess.k8sInCluster.labelSelector: Same as for the out-of-cluster mode.
  • sleep.*: In case you’re already familiar with how to configure sleeps for Hazeltest’s runners, the sleep.* object might ring a bell – indeed, both structure and purpose are very similar to the runners’ sleeps. As you might have guessed based on its name, this config object is used to configure how long the member killer monkey will sleep between each run (if you want it to sleep at all, that is). Note that the chaosProbability property explained above is evaluated per run, so providing a sleep duration of, say, 60 seconds won’t necessarily mean the monkey will kill a Hazelcast member every 60 seconds.
  • sleep.enabled: Allows you to configure whether the member killer monkey should sleep at all.
  • sleep.durationSeconds: If sleep.enabled has been set to true, the member killer monkey will sleep for the given number of seconds between each run. (Note the small but important difference to Hazeltest’s runners, which interpret this number in terms of milliseconds.)
  • sleep.enableRandomness: Similar to the runners’ sleep config, randomness can be enabled for this sleep config. If set to true, the number of seconds to sleep will be determined randomly in the closed interval [0, <durationSeconds>]. You’ll probably want to enable randomness for this sleep in most cases because the additional randomness in the monkey’s behavior increases the realism of the simulated member outages (after all, in reality, outages have the inconvenient habit of not occurring with static frequency).
  • memberGrace.*: Allows you to configure whether the target Hazelcast member Pod the member killer monkey has identified during one run should be terminated gracefully, and, if so, which grace period to apply. (And yes, in case the structure of this config object looks familiar, you’re right: The Golang struct this config is read into is the same one used for the monkey’s sleep config.)
  • memberGrace.enabled: Whether to enable graceful member termination. If this is set to false, the monkey will use a grace period of zero seconds to override Kubernetes’ default behavior for Pod deletion, which grants a grace period of 30 seconds if nothing else is specified. Thus, setting this to false is equivalent to providing true in combination with memberGrace.durationSeconds: 0.
  • memberGrace.durationSeconds: The duration, in seconds, to grant as grace period. Configuring this to be, say, 60 seconds is the equivalent of running kubectl delete pod awesome-pod --grace-period=60.
  • memberGrace.enableRandomness: Enables or disables randomness for the grace period. Similar to the monkey’s sleep configuration, I recommend setting this to true in most cases because, in reality, the issues that cause Hazelcast members to crash and/or terminate just won’t do the members the favor of granting them enough time to off-load all data and tasks in scope of a graceful shutdown.
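
To make the interplay of chaosProbability, sleep, and memberGrace a bit more tangible, here is a minimal sketch of what the monkey’s per-run logic could look like in Go. Keep in mind this is an illustration under assumptions, not Hazeltest’s actual implementation – the type and function names (sleepConfig, killMember, and so on) are hypothetical:

package chaos

import (
    "math/rand"
    "time"
)

// sleepConfig mirrors the structure shared by the sleep and memberGrace
// config objects -- hypothetical names, not necessarily Hazeltest's types.
type sleepConfig struct {
    Enabled          bool
    DurationSeconds  int
    EnableRandomness bool
}

// seconds resolves the configured duration, honoring the randomness flag:
// if randomness is enabled, a value is drawn from the closed interval
// [0, DurationSeconds]. A disabled config resolves to zero seconds, which
// matches both "no sleep" and "no grace period".
func (s sleepConfig) seconds() int {
    if !s.Enabled {
        return 0
    }
    if s.EnableRandomness {
        return rand.Intn(s.DurationSeconds + 1)
    }
    return s.DurationSeconds
}

// run performs the configured number of runs; in each run, the per-run
// probability check decides whether a member gets killed, and the resolved
// grace period is handed to the actual kill operation.
func run(numRuns int, chaosProbability float64, sleep, grace sleepConfig, killMember func(gracePeriodSeconds int)) {
    for i := 0; i < numRuns; i++ {
        if rand.Float64() < chaosProbability {
            killMember(grace.seconds())
        }
        time.Sleep(time.Duration(sleep.seconds()) * time.Second)
    }
}

With this shape, disabling memberGrace naturally degenerates to a zero-second grace period, which is exactly the semantics described above.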

Explanations for these properties, though geared a bit more towards brevity, are also available for your reference in Hazeltest’s default config file.
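
Speaking of the member access modes: if you’re curious how the two modes, the default Kubeconfig lookup, and the in-cluster namespace detection described above can be realized, the following sketch shows one way to do it using Kubernetes’ client-go library. Again, this is a hedged illustration with hypothetical function names (newClientset, inClusterNamespace, killPod), not a copy of Hazeltest’s source:

package chaos

import (
    "context"
    "errors"
    "os"
    "path/filepath"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
)

// newClientset builds a Kubernetes client for the given member access mode.
func newClientset(mode, kubeconfig string) (*kubernetes.Clientset, error) {
    var cfg *rest.Config
    var err error
    switch mode {
    case "k8sInCluster":
        // Relies on the ServiceAccount token and CA certificate that
        // Kubernetes mounts into the Pod file system.
        cfg, err = rest.InClusterConfig()
    case "k8sOutOfCluster":
        path := kubeconfig
        if path == "default" {
            // The value 'default' resolves to the usual Kubeconfig location.
            path = filepath.Join(os.Getenv("HOME"), ".kube", "config")
        }
        cfg, err = clientcmd.BuildConfigFromFlags("", path)
    default:
        err = errors.New("unknown member access mode: " + mode)
    }
    if err != nil {
        return nil, err
    }
    return kubernetes.NewForConfig(cfg)
}

// inClusterNamespace resolves the target namespace for in-cluster access:
// first from the file Kubernetes mounts into the Pod, then from the
// POD_NAMESPACE environment variable, failing fast if neither works.
func inClusterNamespace() (string, error) {
    if b, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/namespace"); err == nil {
        return string(b), nil
    }
    if ns := os.Getenv("POD_NAMESPACE"); ns != "" {
        return ns, nil
    }
    return "", errors.New("unable to determine namespace for in-cluster member access")
}

// killPod deletes the given Pod with the provided grace period -- the
// programmatic equivalent of 'kubectl delete pod <name> --grace-period=<n>'.
func killPod(cs *kubernetes.Clientset, namespace, name string, gracePeriodSeconds int64) error {
    return cs.CoreV1().Pods(namespace).Delete(context.TODO(), name,
        metav1.DeleteOptions{GracePeriodSeconds: &gracePeriodSeconds})
}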

The Result

Now that you’ve seen which configuration properties are available for the member killer monkey and how each property works, let’s put this knowledge to good use and find out how certain combinations influence how the monkey behaves. We’ll start by taking a look at three sample configs.

Sample Configs

The following three sample configs exhibit different “difficulty settings”, if you will, and are provided from easiest to hardest. Note that memberAccess.mode is set to k8sInCluster in all examples, so the config for k8sOutOfCluster is omitted.

Level 1

Let’s kick this off with the easiest difficulty setting:

chaosMonkeys:
  memberKiller:
    enabled: true
    numRuns: 500
    chaosProbability: 0.2
    memberAccess:
      mode: k8sInCluster
      targetOnlyActive: true
      k8sInCluster:
        labelSelector: app.kubernetes.io/name=hazelcastimdg
    sleep:
      enabled: true
      durationSeconds: 600
      enableRandomness: false
    memberGrace:
      enabled: true
      durationSeconds: 600
      enableRandomness: false

Thus configured, the member killer monkey will be pretty easy on Hazelcast: Every 600 seconds or 10 minutes, it will kill a Hazelcast member with a probability of only 20 % (so, on average, one kill every 3,000 seconds, or 50 minutes), granting the member a termination grace period of 600 seconds (which is the default for Hazelcast’s hazelcast.graceful.shutdown.max.wait property, and in most cases, this wait time will be sufficient for the member to off-load all its data and tasks). The combination of numRuns and chaosProbability shown above will result in approximately 100 kills (500 × 0.2) before the monkey terminates. Note that both sleeps will be static since their enableRandomness properties have been set to false, hence, with the chaosProbability being the only element introducing some randomness, the simulated outages won’t be very realistic.

Let’s amp up our game, then, and give the target Hazelcast cluster a more worthy challenge.

Level 2

In comparison to the config shown above, the following config increases the “difficulty level”:

chaosMonkeys:
  memberKiller:
    enabled: true
    numRuns: 500
    chaosProbability: 0.8
    memberAccess:
      mode: k8sInCluster
      targetOnlyActive: true
      k8sInCluster:
        labelSelector: app.kubernetes.io/name=hazelcastimdg
    sleep:
      enabled: true
      durationSeconds: 360
      enableRandomness: false
    memberGrace:
      enabled: true
      durationSeconds: 600
      enableRandomness: true

So, what’s changed? The number of runs is still the same – it’s not a criterion for the difficulty level, really –, but the probability of actually killing a target member has increased to 80 %. In addition to that, the monkey is active more often – it will sleep for only 360 seconds or 6 minutes between runs –, and enableRandomness has been set to true on the member grace config, so in some cases, the target member won’t have enough time to off-load its data and tasks before being shut down, which, depending on the configuration of the data structures for the cluster, might cause a bit of data loss.

All in all, though, given the cluster and its data structures have been set up correctly, the config above still shouldn’t cause too much trouble, so let’s take a look at one that will.

Level 3

The third and final of the sample configs is very tough on the Hazelcast cluster. Here it is:

chaosMonkeys:
  memberKiller:
    enabled: true
    numRuns: 500
    chaosProbability: 1.0
    memberAccess:
      mode: k8sInCluster
      targetOnlyActive: false
      k8sInCluster:
        labelSelector: app.kubernetes.io/name=hazelcastimdg
    sleep:
      enabled: true
      durationSeconds: 120
      enableRandomness: true
    memberGrace:
      enabled: false
      durationSeconds: 600
      enableRandomness: true

This instance of the member killer monkey will be a very active fellow indeed – not only will it strike in every run due to the chaosProbability being set to 1.0, but we’ve also told it it’s okay to kill Hazelcast members that haven’t become active or ready yet, delaying the overall time for one particular member to become ready again if it was unfortunate enough to be selected multiple times in a row. The monkey’s sleep config also contains changes; we’ve set the durationSeconds property to only 120 seconds and enabled randomness, meaning that in some cases the monkey will sleep for only a couple of seconds before engaging again (which is where the targetOnlyActive: false setting will make a difference). Lastly, the member grace has been disabled altogether by setting memberGrace.enabled to false, meaning Hazelcast members will be killed instantly without waiting for them to terminate gracefully. This significantly increases the risk of data loss in the cluster.

It would be easy to further increase the difficulty level – the most straightforward way would be to simply decrease the value for sleep.durationSeconds –, so level 3’s config is by no means the harshest config imaginable (in fact, we’ll work with harsher configs in scope of the live demo down below). Yet, I hope this three-level approach highlighted which properties are involved in controlling the difficulty level, and how they affect it.

Live Demo

Having looked at what properties the member killer monkey’s config consists of and what sample configs can look like, let’s do a short live demo next (as live as it can get on a blog using only text and screenshots, anyway). The following sections will walk you through two scenarios, each acting as an example for how the member killer monkey can be used. We’re going to start with a simple scenario.

As always, Helm charts and their corresponding values.yaml files are provided in case you would like to follow along:

  • You can find the Helm charts for deploying Hazelcast and Hazeltest in the resources/charts directory of the Hazeltest repository on GitHub. The easiest way to follow along is to simply git clone the entire repository and work in a shell session from inside the resources/charts directory. (All helm commands you’ll encounter down the line assume you are in this directory.)
  • Just like in the previous blog posts, the helm upgrade --install commands we’re going to employ to set up Hazelcast and Hazeltest make use of the contents of values.yaml files in a sub-directory of my blog-examples repository.

Unlike in the previous blog posts, however, the Helm commands will refer directly to the values.yaml files by means of the -f flag – this will make the commands longer, but on the plus side, you’ll be able to simply copy-paste them without setting up your local values.yaml file first, so it will be easier for you to run the commands.

Finally, in order to run the scenarios, you’ll need a reasonably juicy Kubernetes cluster. In case you don’t have one at your disposal yet, earlier posts on this blog (or sections therein, respectively) may help you get one set up.

Scenario 1

First, we’re going to need a Hazelcast cluster at which to unleash the rage of our member killer monkey. The following excerpt of a values.yaml file (full version here) shows the most important bits of our Hazelcast cluster’s configuration:

platform:
  deploy: true
  # ...
  cluster:
    # ...
    members:
      image: hazelcast/hazelcast:5.2.1
      imagePullPolicy: IfNotPresent
      count: 3
      # ...
      gracefulShutDown:
        maxWaitSeconds: 600
      containerResources:
        requests:
          cpu: "2"
          memory: "6G"
        limits:
          cpu: "2"
          memory: "6G"
      jvmResources:
        xmx: 4800m
        xms: 4800m
      # ...
  config:
    hazelcast:
      map:
        default:
          # ...
        ht_*:
          backup-count: 1
          # ...
        ht_no_backup_*:
          backup-count: 0
          # ...

So, this cluster will consist of three members, each requesting, and limited to, 2 CPUs and 6 GB of RAM, meaning your Kubernetes cluster should have at least 6 CPUs and 18 GB of RAM available, plus some extra for the Hazeltest instance. Concerning the map configurations, the ht_* map pattern is set up to use one backup, whereas the ht_no_backup_* pattern is so named because – surprise – it doesn’t have backups. We’re going to configure our Hazeltest instance so its runners write to maps corresponding to both patterns, as you’ll see below.

Run the following command to install a Hazelcast cluster with this configuration:

$ helm upgrade --install hazelcastwithmancenter ./hazelcastwithmancenter --namespace=hazelcastplatform --create-namespace -f https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/hazeltest/chaos_monkey/member_killer_monkey/scenario_1/01_hazelcast_values.yaml

Once all members have come up, it’s time to install Hazeltest. To keep things simple, we’re going to use only one Hazeltest instance for now, which will have two jobs: to create load, and to run the member killer monkey. The following excerpt from the values.yaml file we’re going to use in scope of deploying Hazeltest shows the monkey’s configuration:

config:
  chaosMonkeys:
    memberKiller:
      enabled: true
      numRuns: 500
      chaosProbability: 1.0
      memberAccess:
        mode: k8sInCluster
        targetOnlyActive: true
        k8sInCluster:
          labelSelector: app.kubernetes.io/name=hazelcastimdg
      sleep:
        enabled: true
        durationSeconds: 60
        enableRandomness: false
      memberGrace:
        enabled: false
        durationSeconds: 600
        enableRandomness: false

The difficulty level this config expresses is similar to the sample level-3 config you encountered in the previous section – while the sleep.durationSeconds property contains a smaller value, the monkey’s sleeps are static thanks to sleep.enableRandomness: false, so, at least, the Hazelcast members will have enough time to become ready (though this depends on how long the Hazelcast member’s startup process takes on your Kubernetes cluster – adjust the sleep time if needed). Apart from that and targetOnlyActive, which is back to true here, the properties remain unchanged compared to the level-3 config from the previous section.

In terms of the runners, the values.yaml file disables both queue runners and enables both map runners. Here’s an excerpt of their config, showing only the detail necessary for this scenario to work:

config:
  # ...
  mapTests:
    pokedex:
      enabled: true
      # ...
      mapPrefix:
        enabled: true
        prefix: "ht_no_backup_"
    load:
      enabled: true
      # ...
      mapPrefix:
        enabled: true
        prefix: "ht_"

Unsurprisingly, the map prefixes match the two prefixes of the Hazelcast map configuration deployed earlier on.

Armed with this configuration, you can install Hazeltest like so:

$ helm upgrade --install hazeltest ./hazeltest --namespace=hazelcastplatform -f https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/hazeltest/chaos_monkey/member_killer_monkey/scenario_1/02_hazeltest_values.yaml

Let the Hazeltest Pod run for a couple of minutes so the two map runners and the member killer monkey can perform their respective tasks. After the monkey has killed some members, we can check the logs of the single Hazeltest Pod. Because the monkey was configured to kill Hazelcast members without letting them shut down gracefully, what we expect to see is data loss in ht_no_backup_* maps, indicated by warning messages saying that attempts to read data from these maps failed. So, when viewing the Hazeltest Pod’s logs, you should see something akin to the following (truncated horizontally for better readability):

# As always...
$ alias k=kubectl
$ k -n hazelcastplatform logs --selector="app.kubernetes.io/name=hazeltest" -f | grep "failed to read data"
{..., "msg":"failed to read data from map 'ht_no_backup_pokedex-4' in run 53: value retrieved from hazelcast for key '5921ed95-b20c-48be-ba2f-a0f6e0d49a7f-4-1' was nil -- value might have been evicted or expired in hazelcast","time":"2023-03-15T21:02:34Z"}
{..., "msg":"failed to read data from map 'ht_no_backup_pokedex-9' in run 49: value retrieved from hazelcast for key '5921ed95-b20c-48be-ba2f-a0f6e0d49a7f-9-1' was nil -- value might have been evicted or expired in hazelcast","time":"2023-03-15T21:02:35Z"}
{..., "msg":"failed to read data from map 'ht_no_backup_pokedex-5' in run 49: value retrieved from hazelcast for key '5921ed95-b20c-48be-ba2f-a0f6e0d49a7f-5-1' was nil -- value might have been evicted or expired in hazelcast","time":"2023-03-15T21:02:35Z"}

(Side note: Currently, the test loops of Hazeltest’s map runners work in terms of operation batches – they first write all data from their configured data set, then read all data, and then delete some of the data. Depending on which operation batch was running whenever the monkey killed a member, reads may not have been affected by this action. So, you might have to wait a little before the aforementioned warning messages appear.)
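
To illustrate the side note above, here’s a minimal sketch of such a batch-style test loop – purely illustrative, with hypothetical helper functions handed in as parameters, and not Hazeltest’s actual map runner code:

// batchLoop sketches the batch-style test loop described above: write the
// whole data set, read everything back, then delete a portion of the data.
// A member kill only surfaces as failed reads if it happened before or
// during the read batch of the current run.
func batchLoop(numRuns int, keys []string, write, read, remove func(key string)) {
    for run := 0; run < numRuns; run++ {
        for _, k := range keys {
            write(k)
        }
        for _, k := range keys {
            read(k) // warnings about failed reads would originate here
        }
        for i, k := range keys {
            if i%2 == 0 { // "some of the data" -- the selection here is arbitrary
                remove(k)
            }
        }
    }
}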

What you should not see, on the other hand, are warning messages concerning the ht_* maps – since the current setup of the member killer monkey will only cause one Hazelcast member to fail at a time and the ht_ map pattern was configured to have one backup and even allows backup reads, the runners in Hazeltest writing to the corresponding maps should not have any trouble reading their values:

$ k -n hazelcastplatform logs --selector="app.kubernetes.io/name=hazeltest" -f | grep -v "ht_no_backup_" | grep "failed to read data" 
<no results>

This delivers proof that the config for the ht_ map pattern is sufficiently robust to safely retain its state in the face of one single member failing at a time. Thus, to cause some data loss in ht_ maps, we should invite another raging baboon to the party!

Scenario 2

In this second scenario, our goal is to use the member killer monkey to prove that the ht_ map configuration is not sufficiently robust to prevent data loss whenever two or more Hazelcast members fail at the same time. To do so, we need a Hazelcast cluster, of course, some data stored in its maps, and at least two active member killer monkeys.

The Hazelcast cluster configuration we’re going to employ hasn’t changed since scenario 1, so in case your cluster is still up and running, you’re good to go. In case it isn’t, here’s the Helm command to deploy the cluster for your convenience (the -f flag points to a values.yaml file in the scenario_2 sub-folder in order to make the installation process more straightforward in case you just joined for scenario 2, but the content is the same as for scenario 1):

$ helm upgrade --install hazelcastwithmancenter ./hazelcastwithmancenter --namespace=hazelcastplatform --create-namespace -f https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/hazeltest/chaos_monkey/member_killer_monkey/scenario_2/01_hazelcast_values.yaml

Next up, we’re going to need two Hazeltest instances – one generating load and running the member killer monkey, and another one running only a member killer monkey. You’ve already met the configuration for the first Hazeltest instance – it’s the same we’ve used in the previous scenario –, but you haven’t seen the config for the member killer monkey of the second instance yet. Here it is (full file here):

config:
  chaosMonkeys:
    memberKiller:
      enabled: true
      numRuns: 500
      chaosProbability: 1.0
      memberAccess:
        mode: k8sInCluster
        targetOnlyActive: true
        k8sInCluster:
          labelSelector: app.kubernetes.io/name=hazelcastimdg
      sleep:
        enabled: true
        durationSeconds: 120
        enableRandomness: false
      memberGrace:
        enabled: false
        durationSeconds: 600
        enableRandomness: false
  # ...

The only property that has changed in this config compared to the first one is the value for sleep.durationSeconds, which is 120 here rather than 60. Thus configured, the two member killer monkeys will kill two Hazelcast members at a time every 120 seconds (given they don’t strike precisely at the same time, that is, as this would entail the risk of both monkeys selecting the same member).

To make this work, however, these Hazeltest instances have to be deployed one right after the other (a couple of seconds’ delay doesn’t matter, but the second monkey should strike while the member killed by the first one hasn’t been re-integrated into the Hazelcast cluster yet). Because of this, even though the config for the first Hazeltest instance is identical to the config used in scenario 1, I suggest you uninstall and then re-install the Helm chart in order to get the timing right.

After the uninstallation, you can re-install the first Hazeltest instance using the following command (note the release name is now hazeltest-load-and-monkey rather than just hazeltest):

$ helm upgrade --install hazeltest-load-and-monkey ./hazeltest --namespace=hazelcastplatform -f https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/hazeltest/chaos_monkey/member_killer_monkey/scenario_2/02_hazeltest_values.yaml

Immediately after that, use the following command to deploy the second Hazeltest instance:

$ helm upgrade --install hazeltest-only-monkey ./hazeltest --namespace=hazelcastplatform -f https://raw.githubusercontent.com/AntsInMyEy3sJohnson/blog-examples/master/hazeltest/chaos_monkey/member_killer_monkey/scenario_2/03_hazeltest_values.yaml

This should cause two Hazelcast members to get killed every 120 seconds, and when this happens, your list of Hazelcast Pods should look roughly like the following:

$ watch kubectl -n hazelcastplatform get po --selector="app.kubernetes.io/name=hazelcastimdg"
NAME              READY   STATUS    RESTARTS   AGE
hazelcastimdg-0   0/1     Running   0          13s
hazelcastimdg-1   1/1     Running   0          3m13s
hazelcastimdg-2   0/1     Running   0          33s

Now, if you move ahead and check the logs of the two Hazeltest instances (the single Pod of the hazeltest-load-and-monkey release would be sufficient, in fact), you should see the expected warning messages, informing us about failed operations both in the ht_no_backup_pokedex* and in the ht_load* maps (again horizontally truncated for better readability):

$ k -n hazelcastplatform logs --selector="app.kubernetes.io/name=hazeltest" -f | grep "failed to"
{..., "msg":"failed to read data from map 'ht_load-0' in run 21: value retrieved from hazelcast for key '8fa26eb1-8bed-438f-9ca6-3d57093c880b-0-3' was nil -- value might have been evicted or expired in hazelcast","time":"2023-03-17T19:43:24Z"}
{..., "msg":"failed to ingest data into map 'ht_no_backup_pokedex-0' in run 1410: EOF: EOF","time":"2023-03-17T19:43:29Z"}
{..., "msg":"failed to read data from map 'ht_no_backup_pokedex-2' in run 1328: value retrieved from hazelcast for key '8fa26eb1-8bed-438f-9ca6-3d57093c880b-2-98' was nil -- value might have been evicted or expired in hazelcast","time":"2023-03-17T19:43:29Z"}
{..., "msg":"failed to read data from map 'ht_no_backup_pokedex-5' in run 1402: value retrieved from hazelcast for key '8fa26eb1-8bed-438f-9ca6-3d57093c880b-5-76' was nil -- value might have been evicted or expired in hazelcast","time":"2023-03-17T19:43:29Z"}

This proves in a reproducible manner that the configuration for the ht_ map pattern, while resilient towards one member failure, is not robust enough to prevent data loss in the face of two or more members failing at the same time.

Querying Monkey Status

This wraps up the live demo for this blog post, but before we move on to perform a short clean-up, I’d like to show you another thing you might find useful when working with the member killer monkey. From previous blog posts – or from having played around with Hazeltest yourself for a little while –, you may be familiar with Hazeltest’s status endpoint. While the application’s reporting capabilities are still rather limited and offer potential for being significantly extended – a potential I’d like to realize in scope of this ticket, at least for the runners –, today’s status endpoint does offer basic information about the member killer monkey.

The Hazeltest Pods themselves don’t come with curl or a similar tool installed to query their own endpoints because the Hazeltest image is designed to be as lightweight as possible, but fortunately, there is just the right container image for us to use in this case, which comes with curl and jq (the latter for parsing the JSON output so it’s easier to read). Run the following command to launch the Pod that will help us query the status endpoint:

$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: curly-mc-curlface
  namespace: hazelcastplatform
  labels:
    name: curly-mc-curlface
spec:
  containers:
  - name: curly-mc-curlface
    image: badouralix/curl-jq:alpine
    command: ['sh', '-c', 'echo "so sleepy..." && sleep infinity']
EOF

Your namespace should now contain the following Pods:

$ k -n hazelcastplatform get po
NAME                                         READY   STATUS    RESTARTS   AGE
curly-mc-curlface                            1/1     Running   0          5m38s
hazelcastimdg-0                              1/1     Running   0          97s
hazelcastimdg-1                              1/1     Running   0          3m2s
hazelcastimdg-2                              1/1     Running   0          41s
hazelcastimdg-mancenter-5f694d66d6-7kz55     1/1     Running   0          35m
hazeltest-load-and-monkey-5484556464-kk9kr   1/1     Running   0          27m
hazeltest-only-monkey-7d558db488-kd5pn       1/1     Running   0          27m

Because the Hazeltest chart does not roll out a Service object, we need to query the status endpoint using the IP of one of the Hazeltest Pods from the curly-mc-curlface Pod. The following command helps you quickly identify the IP addresses:

# The output will, of course, likely look different in your case
$ k -n hazelcastplatform get po --selector="app.kubernetes.io/name=hazeltest" --output jsonpath='{range .items[*]}{.status.podIP}{"\n"}{end}'
10.42.0.83
10.42.0.84

Next, let’s exec into the curly-mc-curlface Pod:

$ k -n hazelcastplatform exec -it curly-mc-curlface -- /bin/sh

From within the Pod, use whichever of the previously queried IP addresses suits you best in the following command:

$ while true; do curl -s -o - <hazeltest_pod_ip>:8080/status | jq . && echo; sleep 3; done

This should give you roughly the following output:

{
  "chaosMonkeys": {
    "memberKiller": {
      "finished": false,
      "numMembersKilled": 24,
      "numRuns": 500
    }
  },
  "testLoops": {
    "maps": {
      "loadRunner": {
        "finished": false,
        "numMaps": 10,
        "numRuns": 10000,
        "totalNumRuns": 100000
      },
      "pokedexRunner": {
        "finished": false,
        "numMaps": 10,
        "numRuns": 10000,
        "totalNumRuns": 100000
      }
    },
    "queues": {}
  }
}

The numbers you’re seeing will likely vary a bit, of course, but the point is you can use the /status endpoint to find out how many Hazelcast members the member killer monkey has killed so far, and whether it’s already finished.
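
If you’d rather consume this information programmatically – say, as a first building block for the kind of test automation mentioned further below –, a small Go program can poll the endpoint and parse the JSON. The following is a sketch assuming the JSON shape shown above; the Pod IP is a placeholder you’d replace with one queried earlier:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "time"
)

// statusResponse mirrors the chaosMonkeys.memberKiller portion of the JSON
// shown above; the field names are taken from that sample output.
type statusResponse struct {
    ChaosMonkeys struct {
        MemberKiller struct {
            Finished         bool `json:"finished"`
            NumMembersKilled int  `json:"numMembersKilled"`
            NumRuns          int  `json:"numRuns"`
        } `json:"memberKiller"`
    } `json:"chaosMonkeys"`
}

func main() {
    // Placeholder -- insert the IP of one of your Hazeltest Pods here.
    const statusURL = "http://10.42.0.83:8080/status"
    for {
        resp, err := http.Get(statusURL)
        if err != nil {
            fmt.Println("status query failed:", err)
            time.Sleep(3 * time.Second)
            continue
        }
        var s statusResponse
        decodeErr := json.NewDecoder(resp.Body).Decode(&s)
        resp.Body.Close()
        if decodeErr != nil {
            fmt.Println("decoding status response failed:", decodeErr)
        } else {
            mk := s.ChaosMonkeys.MemberKiller
            fmt.Printf("member killer monkey: %d members killed, finished: %t\n", mk.NumMembersKilled, mk.Finished)
            if mk.Finished {
                return
            }
        }
        time.Sleep(3 * time.Second)
    }
}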

Cleanup

With their missions achieved, let’s give our monkeys as well as the Hazelcast Pods and good ol’ curly-mc-curlface some rest, and do a short clean-up. The most straightforward way to achieve a clean state is to simply delete the entire namespace we’ve worked in:

$ k delete ns hazelcastplatform

Next Steps

One of the next steps I’d like to implement – not only in the context of Chaos Monkeys, but for Hazeltest in the wider sense – is to add more information to the application’s status endpoint. While it might be tempting to get started with this first because more information contributes to the application’s reporting capabilities, which are, in turn, the foundation for building test automation on top of Hazeltest, there is something that needs attention first, and it’s related to the way the map runners’ test loop implementation currently works.

Right now, the map runners’ test loops first write all data, then read all data, and then delete some of the data. This rather primitive test loop implementation was sufficient in the beginning in order to address the most urgent load generation use cases I had in my client’s team, but it’s not particularly realistic – how many real-world client applications do you know that first write all their data in order to then read all data back in? Rather, client applications will usually have a mix of operations going on at the same time. Because load generation is the bread-and-butter use case Hazeltest was meant to address, I’d like to improve this aspect first before moving on to adding more information to the application’s status endpoint.

Summary

The Chaos Monkey concept is the newest feature of Hazeltest, designed to deliberately cause chaos in the Hazelcast cluster under test. The member killer monkey is the first of potentially many more inhabitants in this rather crazy zoo – its goal is, no surprise, to kill Hazelcast members so you can test the resilience of both the cluster itself and the data structures it holds towards such failures. In this way, any flaws in the cluster and data structure configuration with regards to member failure resilience can be identified and corrected in a safe environment, far away from production.

As a feature that allows you to test cluster resilience, the member killer monkey does not create load on the cluster in and of itself, at least not along the four load dimensions along which Hazeltest’s runners work; rather, it amplifies existing load by making the cluster handle member failures in addition to its usual operations.

Just like Hazeltest’s runners, the member killer monkey is flexibly configurable so you can fine-tune it to your use cases, and in the preceding sections, we’ve taken a look at five sample configs in total, three in a “dry-run” kind of way and two in scope of the live demo’s two scenarios. The live demo provided simple examples of how the member killer monkey can be used to identify the degree to which given data structures, based on their configuration, can handle member failures. In scope of the live demo, you’ve also seen how you can query a Hazeltest Pod’s status endpoint to find out, among other things, how many members the monkey has already killed.

Reporting is the foundation on which to build higher levels of automation – for example, your very own evaluation logic of how well your Hazelcast cluster configuration was able to deal with a certain kind of load in the face of random member failures –, but the amount of information currently available in the status endpoint does not allow for very sophisticated evaluation logic just yet, and for this reason, I’d like to enrich and add to that information. In order to improve the realism of the load generated by Hazeltest’s map runners and prevent false positives in the runners’ logs informing about failed reads, however, the map runner test loop implementation needs some attention first, and that’s what I’m going to address next.

So, stay tuned for more news on Hazeltest!