Byte-Monkey: Bytecode-level fault injection for the JVM

The software development community is full of memes.

We replaced our monolith with micro services so that every outage could be more like a murder mystery.
— Honest Status Page (@honest_update) October 7, 2015

No, not the kind you’d expect to find on the Reddit front page, but rather “an idea, behavior, or style that spreads from person to person within a culture” (from Wikipedia).

Well-known software development memes include “TDD is dead”, “Java is slow”, “What’s today’s new JavaScript framework/build tool?”, and “MongoDB is Snapchat for databases”. This project and post grew out of a meme I noticed on Twitter about microservices, the gist being “you can simulate microservices by adding latency to all your method calls”.

Hmm. That’s actually an interesting idea. Let’s do it.

Introducing Byte-Monkey

Byte-Monkey takes inspiration from Netflix’s Chaos Monkey, which has become synonymous with fault-tolerance experiments through controlled failure injection. It runs on the JVM and twiddles the bytecode of your app to introduce the kinds of failures you might encounter, such as exceptions and latency.
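To make the idea concrete, here is a source-level sketch of what "introducing exceptions and latency with some probability" amounts to. It is a conceptual illustration only: the class and method names are hypothetical, and it is not the bytecode Byte-Monkey actually generates.

```java
import java.util.function.DoubleSupplier;

// Conceptual sketch only: hypothetical names, not Byte-Monkey's generated bytecode.
class FailureInjectionSketch {
    private final double rate;            // probability in [0, 1] that a call is affected
    private final DoubleSupplier random;  // source of values in [0, 1), e.g. Math::random

    FailureInjectionSketch(double rate, DoubleSupplier random) {
        this.rate = rate;
        this.random = random;
    }

    // Fault-style injection: throw with probability `rate`, otherwise do nothing.
    void maybeThrow() {
        if (random.getAsDouble() < rate) {
            throw new IllegalStateException("injected fault");
        }
    }

    // Latency-style injection: sleep with probability `rate`.
    // Returns true if a delay was added to this call.
    boolean maybeDelay(long millis) {
        if (random.getAsDouble() >= rate) {
            return false;
        }
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            // Preserve the interrupt status for the caller.
            Thread.currentThread().interrupt();
        }
        return true;
    }
}
```

Written by hand, a check like `maybeThrow()` would have to be added at the top of every method you wanted to disturb. The appeal of doing it at the bytecode level is that the application source stays untouched.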

Using with a JVM app

Byte-Monkey is loaded as a Java agent during JVM startup.

java -javaagent:byte-monkey.jar=mode:fault,rate:0.5,filter:org/eclipse/ -jar your-app.jar

The above configuration runs Byte-Monkey in fault mode (throwing exceptions), instruments only classes in packages under org/eclipse/, and throws exceptions 50% of the time. More information about the configuration options is on the GitHub page.
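A Java agent receives everything after the `=` in the `-javaagent` option as a single string, passed to its `premain(String agentArgs, Instrumentation inst)` method. As a rough sketch of how such a string could be interpreted, the helper below splits it into key/value pairs. The class name, the parsing rules, and the prefix-based filter match are all assumptions made for illustration, not Byte-Monkey's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not Byte-Monkey's actual code.
class AgentArgsSketch {

    // Splits "mode:fault,rate:0.5,filter:org/eclipse/" into key/value pairs.
    static Map<String, String> parse(String agentArgs) {
        Map<String, String> options = new LinkedHashMap<>();
        if (agentArgs == null || agentArgs.isEmpty()) {
            return options;
        }
        for (String pair : agentArgs.split(",")) {
            int sep = pair.indexOf(':');
            if (sep > 0) {
                // Key is everything before the first ':', value is the rest.
                options.put(pair.substring(0, sep).trim(),
                            pair.substring(sep + 1).trim());
            }
        }
        return options;
    }

    // Assumption: the filter is a plain prefix on the JVM-internal class name
    // (slashes instead of dots), e.g. "org/eclipse/jetty/server/Server".
    static boolean matchesFilter(String internalClassName, String filter) {
        return filter == null || internalClassName.startsWith(filter);
    }
}
```

With the example configuration, `parse` would yield `mode` = `fault`, `rate` = `0.5`, and `filter` = `org/eclipse/`, and `matchesFilter` would accept `org/eclipse/jetty/server/Server` but reject `java/lang/String`.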

Why would I want to use this?

I’ve made a couple of posts previously about the utility of running controlled experiments to discover failure cases, at the application level or infrastructure level. Such drills give us the power and knowledge to answer questions about the behaviour of our multi-actor systems under different modes of failure.

Byte-Monkey addresses the scenario where you want to test something inside your JVM app that may not be triggered by external factors. It lets you answer questions like “If our DB connection driver started exhibiting faults in 1 out of every 10 operations, does the app release connections properly or does it cause issues?” or “What happens if our HTTP client suddenly starts holding onto connections for an extra 100ms?”

With Byte-Monkey, you can turn up the chance of failures occurring and see how your system behaves without having to make any adjustments to the application code itself.

Implementation

The detailed internals of how Byte-Monkey changes the application code are in the GitHub project README.