
Injecting Application Failures in Production

09 Apr 2015

Tags: continuous delivery, testing

At Pipeline recently, Benji and I gave a talk on Testing in Production. One of the topics we talked about was the importance of failure injection in production, which I’m going to elaborate on a bit in this post.

Application Failures vs Infrastructure Failures

Netflix gets a lot of interest and press from its Simian Army of “chaos monkeys” - applications that deliberately sabotage the infrastructure layer of their production systems by shutting off virtual machines or isolating datacentres. By opting to fail deliberately in this way, Netflix learns from its production infrastructure and makes itself more resilient.

While this is interesting to us, we are more concerned with how we inject failures at the application level. In particular, we have specific concerns related to our knowledge domain - real-time ad serving.

Real-time Ad Auctions in Less than a Minute

What happens when a user loads a page with an ad-placement? A number of things:

  1. The ad-unit, usually a small JavaScript tag, makes a request to an ad exchange with data about the page/user.
  2. The ad exchange formats this data into a standards-compliant Bid Request and sends it out to a pool of bidders.
  3. Once all the bidders have responded (or the auction timeout has elapsed), a winning bid is selected.
  4. The ad-markup contained in the winning bid is returned to the ad-unit and is rendered out.

We conform to OpenRTB, an open standard for real-time ad auctions, which enables easy integration with third parties.
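
To make the flow above concrete, here is a minimal Java sketch of the fan-out step, assuming a blocking Bidder interface and an in-memory Bid type (both invented for the example; in reality the exchange speaks OpenRTB-formatted JSON over HTTP). The shape is what matters: fan out to every bidder in parallel, enforce the auction timeout, and select the highest bid from whoever answered in time.

```java
import java.util.*;
import java.util.concurrent.*;

// Hypothetical sketch of the auction fan-out: send the bid request to every
// bidder in the pool, wait no longer than the auction timeout, then pick the
// highest bid among those that responded in time.
public class AuctionSketch {

    interface Bidder {
        Bid bid(String bidRequestJson) throws Exception; // a blocking HTTP POST in practice
    }

    static class Bid {
        final String bidderId;
        final double price;
        final String adMarkup;
        Bid(String bidderId, double price, String adMarkup) {
            this.bidderId = bidderId;
            this.price = price;
            this.adMarkup = adMarkup;
        }
    }

    static Optional<Bid> runAuction(List<Bidder> bidders, String bidRequestJson,
                                    long timeoutMillis) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(bidders.size());
        try {
            List<Callable<Bid>> calls = new ArrayList<>();
            for (Bidder bidder : bidders) {
                calls.add(() -> bidder.bid(bidRequestJson));
            }
            // invokeAll cancels any call still outstanding when the timeout
            // elapses, so a slow or silent bidder cannot hold up the auction.
            List<Future<Bid>> responses =
                    pool.invokeAll(calls, timeoutMillis, TimeUnit.MILLISECONDS);

            Bid winner = null;
            for (Future<Bid> response : responses) {
                if (response.isCancelled()) continue;   // bidder missed the timeout
                try {
                    Bid bid = response.get();
                    if (winner == null || bid.price > winner.price) {
                        winner = bid;
                    }
                } catch (ExecutionException e) {
                    // a bidder that errored is simply excluded from this auction
                }
            }
            return Optional.ofNullable(winner);
        } finally {
            pool.shutdownNow();
        }
    }
}
```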

Monkeys in the Pool

The core of our business is predicated on being able to run these auctions effectively, so we added our own malfunctioning bidders to each exchange’s pool during weekly production load-tests.

Why? Badly-behaved third-party integrations or errors from our own internal bidders have the potential to break our publishers’ pages if we can’t load an ad promptly. In the spirit of learning as much as possible from our production system, we selected a few “bad cases” to provoke so we could observe how the system reacted.

Taking a leaf from Netflix’s chaos monkeys, we affectionately refer to these applications as our Monkey Bidders.
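
As a rough illustration of what such a bidder can look like (the port, paths, and exact behaviour here are invented for the example, not a description of our real monkey bidders), a single small HTTP server is enough to cover the two failure modes discussed below: responding with junk, and sleeping well past the auction timeout.

```java
import com.sun.net.httpserver.HttpServer;

import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;

// Illustrative "monkey bidder": joins the exchange's bidder pool during
// load-tests and deliberately misbehaves in one of two ways.
public class MonkeyBidder {

    public static void main(String[] args) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(8099), 0);
        server.setExecutor(Executors.newCachedThreadPool());

        // Failure mode 1: reply promptly, but with badly-formed data
        // instead of a spec-compliant Bid Response.
        server.createContext("/bid/garbage", exchange -> {
            byte[] body = "this is not OpenRTB".getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });

        // Failure mode 2: sleep far past the auction timeout before replying.
        server.createContext("/bid/timeout", exchange -> {
            try {
                Thread.sleep(30_000); // 30s, versus the ~120ms the auction allows
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            exchange.sendResponseHeaders(204, -1); // no bid, far too late
            exchange.close();
        });

        server.start();
    }
}
```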

What We Learned

By including a bidder that returns badly-formed data instead of spec-compliant Bid Responses, we ensured that the auction continued to function effectively in the face of integrations that violated the specification, but the biggest win came from forcing timeouts.

When we added a malfunctioning bidder that deliberately timed out, sleeping for 30 seconds rather than responding within the industry-standard 120ms, the load-test uncovered a bug in the HTTP client we were using. As OpenRTB is normally JSON over HTTP, each auction involves making a POST request to every bidder. However, the client we were using was not having its threads garbage-collected correctly after the auctions concluded.

Since we handle a lot of requests per second, this caused our memory usage to climb rapidly and forced OutOfMemory errors on our JVMs. Had we not caught this error during load-tests, a period of heavy load could have knocked out several exchanges before we were able to react.
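
Whatever HTTP client is in use, the general mitigation is the same: put a hard connect and read timeout on every outbound bid request so that a bidder which never answers releases its resources instead of pinning them. Here is a minimal, purely illustrative sketch using the JDK's own HttpURLConnection (the client and timeout values here are assumptions for the example, not our production setup):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative only: a bid request POST with hard connect/read timeouts,
// so a sleeping bidder cannot hold a thread (and its memory) indefinitely.
public class BidRequestClient {

    static byte[] postBidRequest(String bidderUrl, byte[] bidRequestJson) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(bidderUrl).openConnection();
        try {
            conn.setRequestMethod("POST");
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/json");
            conn.setConnectTimeout(50);  // ms - assumed value: fail fast on unreachable bidders
            conn.setReadTimeout(120);    // ms - matches the auction's response budget

            try (OutputStream out = conn.getOutputStream()) {
                out.write(bidRequestJson);
            }
            try (InputStream in = conn.getInputStream()) {
                return in.readAllBytes(); // a SocketTimeoutException here drops the slow bidder
            }
        } finally {
            conn.disconnect(); // release the underlying connection promptly
        }
    }
}
```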

Conclusion

We strongly recommend injecting application failures in your production environment. By staking out the most important edge cases and forcing them to happen, you gain an additional safety net: the knowledge that, at worst, your applications will fail gracefully when those cases occur for real.