This post describes what I learned about introducing Chaos Engineering to an organisation. It is based on my experience of re-architecting a monolithic application into a Microservices-based architecture.

The Microservices architecture style structures an application as a collection of loosely coupled services. It has many benefits, among them independent development and deployment of services, no long-term commitment to a single technology stack, and specialized services built by small teams. One of its drawbacks is that it increases the surface area for failures. You now have to deal with failures in the interactions between services and at system boundaries.

Our client was struggling to keep their distributed application running in a steady state. The issues we faced were:
- Communication failures between services. There was no clear strategy for handling network failures between services or for giving proper feedback to the customers of the application.
- Difficulty in understanding why the whole application became unavailable when only a single service was down. Was there a single point of failure? Issues of this kind were not visible in our usual testing.
- The system becoming partially unavailable when the network was congested.
- Unwanted local state held in service instances, leading to system unavailability when one instance of a service died.
- Out-of-memory errors in production services, leading to complete or partial unavailability of the system.
- Possible data loss, as data replication and backup strategies had never been tested under real workloads.
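
To make the first two issues concrete, here is a minimal sketch of how a single failing dependency can take down a whole request, and how the same failure looks once the caller degrades gracefully. It is not code from the client's system. The `recommendations` service, the two `render_page_*` functions, and the `flaky` fault injector are all hypothetical names invented for illustration, and the failure is simulated in-process rather than over a real network.

```python
import random


class ServiceUnavailable(Exception):
    """Raised when a (simulated) downstream call fails."""


def flaky(failure_rate, rng):
    """Hypothetical fault injector: make a function fail with the given probability."""
    def decorator(func):
        def wrapper(*args, **kwargs):
            if rng.random() < failure_rate:
                raise ServiceUnavailable(func.__name__)
            return func(*args, **kwargs)
        return wrapper
    return decorator


def make_recommendations(failure_rate, rng):
    """Build a stand-in for a downstream service with injected failures."""
    @flaky(failure_rate, rng)
    def recommendations(user_id):
        return ["item-1", "item-2"]
    return recommendations


def render_page_fragile(user_id, recommendations):
    # No failure handling: one failing dependency fails the whole page.
    return {"user": user_id, "recommendations": recommendations(user_id)}


def render_page_resilient(user_id, recommendations):
    # Degrade gracefully: serve the page without recommendations.
    try:
        recs = recommendations(user_id)
    except ServiceUnavailable:
        recs = []
    return {"user": user_id, "recommendations": recs}


def availability(render, failure_rate, requests=1000, seed=42):
    """Fraction of requests that render successfully under injected failures."""
    rng = random.Random(seed)
    recommendations = make_recommendations(failure_rate, rng)
    ok = 0
    for user_id in range(requests):
        try:
            render(user_id, recommendations)
            ok += 1
        except ServiceUnavailable:
            pass
    return ok / requests
```

With the dependency failing on every call (`failure_rate=1.0`), `availability(render_page_fragile, 1.0)` returns `0.0`, while `availability(render_page_resilient, 1.0)` returns `1.0`. Ordinary functional tests, which run against a healthy dependency, would pass for both versions. That is why this class of problem stayed invisible to us until failures were injected deliberately.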