This post describes what I learned about introducing Chaos Engineering to an organisation. It is based on my experience of re-architecting a monolithic application into a Microservices-based architecture. The Microservices architecture style structures an application as a collection of loosely coupled services. It has many benefits, such as independent development and deployment of services, no long-term commitment to a single technology stack, and specialized services built by small teams. One of its drawbacks is that it increases the surface area for failures: you now have to deal with failures in the interactions between services and at system boundaries. Our client was struggling to keep their distributed application running in a steady state. The issues we faced were:
- Communication failure between services. There was no clear strategy on how to handle network failure between services and how to give proper feedback to the customers of the application.
- Difficulty in understanding why the whole application became unavailable when only a single service was down. Is there any single point of failure? These types of issues were not visible with usual testing.
- System becoming partially unavailable when the network gets choked.
- Unwanted local state leading to system unavailability when one instance of the service dies.
- Out of memory errors in production services leading to complete or partial unavailability of the system.
- Possible data loss, as data replication and backup strategies had never been tested under real workloads.
The issues mentioned above can’t be identified by the usual testing techniques. I have always been an ardent believer in automated testing. We write unit, integration, and functional tests for the applications we build, but these forms of automated testing can’t give you confidence against the issues mentioned above. I am always looking for the latest developments in my field, and Chaos Engineering is one such emerging discipline in the DevOps automation space that I have been following closely.
So, after a couple of knowledge-sharing sessions with the client, we decided to apply Chaos Engineering principles. This helped us overcome the issues mentioned earlier and made the application more resilient and fault tolerant.
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production.
Process of Introducing Chaos Engineering in an Organization
The process that we used to systematically introduce Chaos Engineering in the client’s environment is described below:
- Make the client comfortable with the concept and principles of Chaos Engineering. This is essential because Chaos Engineering is not well understood, which leads to confusion over what it means and how it can benefit them.
- Define the steady state of the application in terms of the key metric for the organization. This is important so that we can objectively tell whether the application is working properly. As the application was in the advertising domain, we chose the number of ads served per second as our key metric. We monitored the system for a few days to learn how many ads it served per second in its steady state and to become comfortable with the metric. This is an essential step in carrying out Chaos Engineering, and it took many discussions and meetings to define this metric.
- Define the framework to systematically introduce faults in the system. We started by manually introducing faults, such as killing processes and shutting down virtual machines, in the test environment. Chaos Engineering recommends running experiments in a production environment, but we believe that, to gain confidence, it is important to start by running them in test and production-like environments. You need to do your homework before you can run these experiments in production. We used tools like Muxy and Wiremock to introduce faults in our test environments. We also looked at services like Gremlin, but at that time it was not mature enough to meet our needs.
Next, we identified the list of faults that we wanted to introduce. For us, the main faults were:
- Terminate virtual machine instances using the API
- Slow down services by introducing latency between different services
- Limit the memory of the instance to force memory errors
- Kill the service process
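To make two of the faults listed above concrete, here is a minimal Python sketch of latency injection and killing a service process. It is only an illustration and not what Muxy or Wiremock do internally. The helper names (`with_injected_latency`, `start_dummy_service`, `kill_process`) are hypothetical, and the "service" is assumed to be a child process that simply sleeps.

```python
import random
import subprocess
import sys
import time


def with_injected_latency(func, delay_seconds, probability=1.0, rng=None):
    """Wrap func so that a call is delayed by delay_seconds with the given probability."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if rng.random() < probability:
            time.sleep(delay_seconds)  # stand-in for a slow network hop
        return func(*args, **kwargs)

    return wrapper


def start_dummy_service():
    """Start a stand-in 'service': a child Python process that just sleeps (assumption)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def kill_process(proc):
    """Terminate a running child process and return its exit code."""
    proc.terminate()  # SIGTERM on POSIX, TerminateProcess on Windows
    return proc.wait(timeout=10)
```

In an experiment you would wrap the client call to a downstream service with `with_injected_latency` and observe how callers behave, or call `kill_process` on a service instance and watch whether the key metric recovers.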
As mentioned above, we ran Chaos Engineering experiments in test environments to find the first level of issues that we could resolve. This gave us time to build the level of automation and confidence required to run experiments in production. This step depends on the maturity of the enterprise. If systems are well automated and monitored, then you can think about testing in production. Otherwise, it is better to start by carrying out experiments in test environments.
Next, we carried out experiments against the steady state, measuring how much the system deviated from it when failures were injected.
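One simple way to express this comparison is sketched below in Python. It assumes the key metric (ads served per second) is available as a list of samples, and it treats "steady state" as the baseline mean plus or minus a few standard deviations. That tolerance rule and the function names are assumptions for illustration, not the exact method we used.

```python
from statistics import mean, stdev


def steady_state_band(baseline, tolerance_sigmas=3.0):
    """Return (low, high) bounds for the key metric, derived from baseline samples."""
    centre = mean(baseline)
    spread = stdev(baseline)
    return centre - tolerance_sigmas * spread, centre + tolerance_sigmas * spread


def deviation_report(baseline, during_experiment, tolerance_sigmas=3.0):
    """Compare the metric observed during fault injection with the baseline."""
    low, high = steady_state_band(baseline, tolerance_sigmas)
    base = mean(baseline)
    observed = mean(during_experiment)
    return {
        "baseline_mean": base,
        "observed_mean": observed,
        "relative_change": (observed - base) / base,
        "within_steady_state": low <= observed <= high,
    }


# Hypothetical samples of ads served per second
baseline = [1000, 1010, 990, 1005, 995]
report = deviation_report(baseline, [700, 720, 680])
```

A report with `within_steady_state` set to `False` signals that the injected failure pushed the system outside its normal range and the weakness is worth investigating.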
Finally, we decided to automate these tests using Chaos Monkey, an open source project from Netflix. The problem was that Chaos Monkey only works with Spinnaker, a continuous delivery tool also from Netflix. Our client was using Jenkins for Continuous Delivery, so we had to convince them to move to Spinnaker. After a month of working with Spinnaker, we could run Chaos Monkey in our test environments. We did not run experiments in production, but by combining virtual load on our test environments with Chaos Monkey, we were able to reproduce many failure scenarios. This gave us enough confidence that our production services would handle error scenarios better. The plan was to use this approach for the time being and, if required, move to running experiments on the production system.
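The general idea behind this kind of automation, randomly choosing instances to terminate on each run, can be sketched in a few lines of Python. This is not Chaos Monkey's actual implementation or configuration. The group names, instance ids, and the `pick_victims` function are hypothetical and exist only to illustrate the technique.

```python
import random


def pick_victims(groups, probability, rng, opted_out=()):
    """Pick at most one instance per group to terminate in this run.

    groups: mapping of group name -> list of instance ids (hypothetical names).
    probability: chance that a given group loses an instance in this run.
    opted_out: group names that must never be touched.
    """
    victims = {}
    for name, instances in sorted(groups.items()):
        if name in opted_out or not instances:
            continue
        if rng.random() < probability:
            victims[name] = rng.choice(instances)
    return victims


# Hypothetical test-environment inventory
groups = {"ad-server": ["i-1", "i-2"], "billing": ["i-3"]}
victims = pick_victims(groups, 1.0, random.Random(7), opted_out={"billing"})
```

Passing a seeded `random.Random` keeps a run reproducible, which helps when you want to replay an experiment that exposed a problem.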
Benefits of Chaos Engineering
It took us six months to go through the whole cycle. The benefits reaped by the client are:
- Increased confidence in the system. By carrying out these experiments we identified various issues with the system. We identified that some of the services keep local state before writing to persistent storage. This could lead to data loss. Chaos Engineering experiments made such issues visible so that the development team can fix them. Application availability has improved substantially after introducing Chaos Engineering.
- Automation became a primary focus for the client. At times, it is difficult to convince a client to spend on automation. By carrying out these experiments, we were able to pinpoint many automation opportunities for the client. At the same time, we invested heavily to make sure the client could run these experiments after we were gone. This is why we pushed them to use Spinnaker, so that they could introduce experiments as part of the CD pipeline.
- Another positive side effect of this exercise was that monitoring became a central part of the architecture. We worked with the client to define key metrics and introduced an application monitoring tool like AppDynamics to improve the observability of the system. If you can’t measure, you can’t improve.
We learned many lessons by carrying out Chaos Engineering for our client. These are mentioned below:
- It is not easy to convince the client to carry out experiments in the production environment. It takes time before the client can see the benefits of these experiments. So, you need to create a well-defined strategy to carry out experiments in test or production-like environments. Automation becomes key in such environments. This requires a complete change in the mindset and you will have to convince multiple people and teams to make Chaos Engineering successful in an organization.
- Issues are not just technical in nature. Many times, issues are caused by people and processes, so you really need to look for them and define ways to mitigate them. Usually several causes are at play behind a single problem.
- Chaos experiments need to be integrated with the CD tool early so that they can be run whenever required. It took us time to get buy-in from the client to use Spinnaker, which we needed because we were planning to use Chaos Monkey. There are not many good open source solutions on the market, so you either have to create your own tool or convince your client to adopt a tool that requires changes in the way they deliver software. This could be an issue with clients. We were fortunate that our client agreed to it, but it is something to keep in mind.
- Create a proper schedule for running experiments. You will not be running Chaos experiments on every push, so you need to publish the dates and schedule in advance so that people are prepared. Carrying out these experiments requires upfront planning.