As introduced at the beginning of this chapter, injecting faults into the production environment (for example, with a “Chaos Monkey”) is one way to enhance recoverability. This section describes the practice of rehearsing and injecting failures to confirm that the system has been designed and built correctly, so that faults occur only in specific, controlled ways. By running these tests regularly (or even continuously), we ensure the system fails gracefully.

Michael Nygard, author of “Release It! Design and Deploy Production-Ready Software,” puts it this way: “Just as cars have crumple zones that absorb the energy of a collision to protect passengers, you can decide which system functions are essential and build in failure modes that keep hazards away from those functions. If you do not design your failure modes, you will get whatever unpredictable, and often dangerous, ones happen to emerge.”

Recoverability requires us first to define failure modes and then to test that those failure modes behave as designed. One approach is to inject faults into the production environment and rehearse large-scale failures. This gives us confidence that the system can recover on its own when incidents occur, ideally without affecting customers.

The complete outage of Netflix and AWS in the US East region in 2012 is just one example. There is an even more interesting case of Netflix’s recoverability. During the “Great Amazon EC2 Reboot of 2014,” nearly 10% of Amazon’s EC2 servers had to be rebooted to apply an emergency Xen security patch. Christos Kalantzis, a cloud database engineer at Netflix, recalled: “When we received the notification about the emergency EC2 reboot, our jaws dropped. When we got the list of affected Cassandra nodes, I felt very worried.” But, Kalantzis continued, “Then I remembered all the ‘Chaos Monkey’ drills we had gone through. My reaction was: ‘Bring it on!’”

The result was once again remarkable. Of the more than 2,700 Cassandra nodes in production, 218 were rebooted and 22 failed to come back up. Kalantzis and Bruce Wong of Netflix’s chaos engineering team wrote: “Netflix had zero downtime that weekend. Regular failure drills, even at the persistence (database) layer, should be part of every company’s recoverability planning. If the Cassandra team had not participated in the ‘Chaos Monkey’ drills, this story would have ended very differently.”

Even more surprisingly, not only did no one have to work overtime because of the Cassandra node incident, no one was even in the office: they had all gone to Hollywood for a party celebrating a milestone acquisition. The example shows, from another angle, that a proactive focus on recoverability lets a company handle, as routine and normal, events that would trigger a crisis in most organizations.
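To make the practice of controlled fault injection more concrete, here is a minimal sketch of a Chaos-Monkey-style fault injector. It is not Netflix’s actual implementation: the instance list, the terminate call, and the business-hours window are illustrative assumptions, and a real tool would read its inventory from the cloud provider or a service registry and call its termination API.

```python
import random
from datetime import datetime

# Hypothetical inventory of production instances; in practice this would
# come from a cloud API or service registry, not a hard-coded list.
INSTANCES = ["web-01", "web-02", "api-01", "api-02", "cassandra-01"]


def in_business_hours(now: datetime) -> bool:
    """Only inject faults when engineers are around to observe and respond."""
    return now.weekday() < 5 and 9 <= now.hour < 17


def terminate(instance_id: str) -> None:
    """Placeholder for a real termination call (e.g. a cloud provider API request)."""
    print(f"[chaos] terminating {instance_id}")


def run_chaos_round(probability: float = 0.1) -> None:
    """Randomly kill a small fraction of instances in one scheduled round.

    The system is expected to detect the loss and recover automatically;
    anything that does not recover reveals a missing or broken failure mode.
    """
    if not in_business_hours(datetime.now()):
        return
    victims = [i for i in INSTANCES if random.random() < probability]
    for instance_id in victims:
        terminate(instance_id)


if __name__ == "__main__":
    run_chaos_round()
```

Run on a scheduler and limited to working hours so engineers can watch the system recover, a script like this turns the failure modes defined on paper into failure modes that are exercised regularly in the real environment.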