Better Than Unit Tests

Don't get me wrong, I love unit testing. The practice of unit testing is probably the most important quality innovation in my whole career. Unit testing has spread beyond the agile development community where it started into the mainstream, and we are all better off for it.

We need to be aware of some serious limitations, though. For example, in "Out of the Tar Pit", Moseley and Marks say:

The key problem with testing is that a test (of any kind) that uses one particular set of inputs tells you nothing at all about the behaviour of the system or component when it is given a different set of inputs. The huge number of different possible inputs usually rules out the possibility of testing them all, hence the unavoidable concern with testing will always be, "have you performed the right tests?" The only certain answer you will ever get to this question is an answer in the negative --- when the system breaks.

We can only write unit tests with a certain number of input cases. Too few, and you miss an important edge case. Too many, and the cost of maintaining the tests themselves becomes onerous.

Worse yet, we know that unit tests are inadequate when we need to test overall system properties, and when GUIs or concurrency are involved.

So here are four testing strategies that each supplement unit tests with more ways to gain confidence in your fully assembled system.

Automated Contract Testing

Automated Contract Testing uses a data-oriented specification of a service to help with two key tasks:

  • Exercise the service and verify that it adheres to its invariants.
  • Simulate the service for development purposes.

It looks like this:

[Figure: Automated Contract Testing]

Some things to consider when using this type of testing:

  • The contract model should be written by a consumer to express only the parts of the service interface they care about. (If you overspecify by modeling things you don't actually use, then your tests will report spurious failures.)
  • The supplier should not write the contract. Consumers write models that express their desired interface partly to help validate their understanding of the protocol. They may also uncover cases that are in the supplier's blind spot.
  • The test double should not make any assumptions about the logical consistency of the results with respect to the parameters. You should only be testing the application code and the way it deals with the protocol.
  • E.g., if you are testing an "add to cart" interface, do not verify that the item you requested to add was the one actually added. That is coupling to the implementation logic of the back end service. Instead, simply verify that the service accepts a well-formed request and returns a well-formed result.

A nice library to help with this style of test is Janus.
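
As a rough illustration of the idea (not of Janus itself), here is a minimal Python sketch. The contract, its field names, and the `fake_add_to_cart` double are all hypothetical. The consumer's contract lists only the fields it reads, and the same check can be pointed at the real service's responses or at a test double.

```python
from typing import Any

# Hypothetical consumer-written contract for an "add to cart" response.
# It names only the fields this consumer actually reads.
ADD_TO_CART_RESPONSE_CONTRACT: dict[str, type] = {
    "cart_id": str,
    "item_count": int,
}


def contract_violations(contract: dict[str, type], message: dict[str, Any]) -> list[str]:
    """Return a list of contract violations; an empty list means well-formed."""
    problems = []
    for field, expected_type in contract.items():
        if field not in message:
            problems.append(f"missing field: {field}")
        elif not isinstance(message[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    # Extra fields are ignored on purpose: the consumer does not depend on them.
    return problems


def fake_add_to_cart(request: dict[str, Any]) -> dict[str, Any]:
    """Test double: returns a well-formed response that is deliberately
    unrelated to the request, so tests cannot couple to back-end logic."""
    return {"cart_id": "cart-001", "item_count": 1, "promo_banner": "ignored"}
```

Note that the check never asks whether the item in the response matches the item in the request. It only asks whether the shape is one the consumer can handle.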

Property-based Testing

Property-based testing derives from ideas in formal specification. It uses a model of the system to describe the allowed inputs, outputs, and state transitions. Then it randomly (but repeatably) generates a vast number of test cases to exercise the system. Instead of looking for success, property-based testing looks for failures. It detects states and values that could not have been produced according to the laws of the model, and flags those cases as failures.

Property-based testing looks like this:

[Figure: Property-based testing with QuickCheck]

The canonical property-based testing tool is QuickCheck. Many partial and fragmentary open-source tools claim to be "QuickCheck clones," but they lack two important parts: search-space optimization and failure minimization. Search-space optimization uses features of the model to probe the "most important" cases rather than relying on sheer brute force. Failure minimization makes the failures useful: upon finding a failing case, it searches for the simplest case that recreates the failure. Without it, understanding a failure case is just about as hard as debugging an end-user defect report.
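
To make repeatable generation and failure minimization concrete, here is a toy Python sketch. It is not QuickCheck and has no search-space optimization. The buggy `dedupe_adjacent` function, the property, and the deletion-only shrinker are all assumptions made up for illustration.

```python
import random
from typing import Callable, Optional


def dedupe_adjacent(xs: list[int]) -> list[int]:
    """Hypothetical system under test, deliberately buggy: it is meant to
    remove all duplicates but only collapses runs of equal neighbours."""
    out: list[int] = []
    for x in xs:
        if not out or out[-1] != x:
            out.append(x)
    return out


def has_no_duplicates(xs: list[int]) -> bool:
    """The property ("law of the model"): every output value is distinct."""
    result = dedupe_adjacent(xs)
    return len(result) == len(set(result))


def shrink(failing: list[int], prop: Callable[[list[int]], bool]) -> list[int]:
    """Naive failure minimization: keep deleting single elements for as long
    as the smaller list still violates the property."""
    current = failing
    progress = True
    while progress:
        progress = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if not prop(candidate):
                current = candidate
                progress = True
                break
    return current


def find_counterexample(
    prop: Callable[[list[int]], bool], seed: int, trials: int = 200
) -> Optional[list[int]]:
    """Generate random inputs from a seeded RNG; return a shrunk failure or None."""
    rng = random.Random(seed)  # a fixed seed makes the whole run repeatable
    for _ in range(trials):
        xs = [rng.randint(0, 4) for _ in range(rng.randint(0, 10))]
        if not prop(xs):
            return shrink(xs, prop)
    return None
```

Calling `find_counterexample(has_no_duplicates, seed=42)` turns a long random list into a three-element case of the form `[a, b, a]`, which is far easier to reason about than the raw failure.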

Considerations when using property-based testing:

  • Specifying the model is a specialized coding task. The model is often 10% of the size of the original system, but can still be very large.
  • Test failures sometimes indicate a problem in the system and sometimes a problem in the model. Most business systems are not very well specified (in the rigorous CS sense of the term) and suffer from many edge cases. Even formal standards from international committees can be riddled with ambiguities and errors.
  • If the system under test is not self-contained, then non-repeatable test failures can confuse the framework. Test isolation (e.g., mocks for all integration points) is essential.
  • This is not a cheap approach. Developing a robust model can take many months. If a company commits to this approach, then the model itself becomes a major investment. (In some ways, the model will be more of an asset than the code base, since it specifies the system behavior independently of any implementation technology!)

Fault Injection

Fault Injection is pretty much what it sounds like. You run the system under test in a controlled environment, then force "bad things" to happen. These days, "bad things" mostly means network problems and hacking attacks. I'll focus on the network problems for now.

One fault injection tool that has produced some interesting results lately is Jepsen. Jepsen's author, Kyle Kingsbury, has been able to demonstrate data loss in all of the current crop of eventually consistent NoSQL databases. You can clone the Jepsen repo to duplicate his results.

It looks like this:

[Figure: Fault Injection With Jepsen]

Jepsen itself runs a bunch of VMs, then generates load. While the load is running against the system, Jepsen can introduce partitions and delays into the virtual network interfaces. By introducing controlled faults and delays in the network, Jepsen lets us try out conditions that can happen "in the wild" and see how the system behaves.

After running the test scenario, we use a validator to detect incorrect results.

Considerations when using fault injection:


  • Jepsen itself doesn't provide much help for validation. As delivered, it just tries to store a monotonic sequence of integers. For application-specific tests, you must write a separate validator.
  • Load generation needs to be predictable enough that you can verify the data and messages coming out of the system. That means either scripts or pseudo-random cases with a controlled seed.
  • This is another test method that can't prove success, but can detect failures.
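
The shape of such a test can be sketched in-process, without Jepsen or any VMs. Everything below is a hypothetical stand-in: `ToyReplicatedStore` fakes a store that acknowledges writes before replicating them, a seeded RNG plays the fault injector, and the load is a monotonic sequence of integers, so the validator knows exactly what should have survived.

```python
import random


class ToyReplicatedStore:
    """Hypothetical store that acknowledges a write once the primary has it
    and replicates asynchronously, so a partition can strand acknowledged data."""

    def __init__(self) -> None:
        self.primary: set[int] = set()
        self.replica: set[int] = set()
        self.partitioned = False

    def write(self, value: int) -> bool:
        self.primary.add(value)
        if not self.partitioned:
            self.replica.add(value)  # replication only gets through when connected
        return True  # the client sees an acknowledgement either way

    def fail_over(self) -> set[int]:
        """The primary is lost; only the replica's contents survive."""
        return set(self.replica)


def run_scenario(seed: int, writes: int = 100, partition_chance: float = 0.1):
    """Write a monotonic sequence of integers while a seeded fault injector
    toggles the network partition. Returns (acknowledged, survivors)."""
    rng = random.Random(seed)  # controlled seed keeps the fault schedule repeatable
    store = ToyReplicatedStore()
    acknowledged: list[int] = []
    for value in range(writes):
        if rng.random() < partition_chance:
            store.partitioned = not store.partitioned  # inject or heal the fault
        if store.write(value):
            acknowledged.append(value)
    return acknowledged, store.fail_over()


def validate(acknowledged: list[int], survivors: set[int]) -> list[int]:
    """Validator: every acknowledged write must survive. Returns the lost ones."""
    return [value for value in acknowledged if value not in survivors]
```

An empty result from `validate` does not prove the store is safe. It only means this particular fault schedule did not expose a loss.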

Simulation Testing

Simulation testing is the most repeatable of these methods. In simulation testing, we use a traffic model to generate a large volume of plausible "actions" for the system. Instead of just running those actions, though, we store them in a database.

The activity model is typically a small number of parameters to describe things like distribution of user types, ratio of logged-in to not-logged-in users, likelihood of new registrations, and so on. We use these parameters to create a database of actions to be executed later.

The event stream database will be reused for many different test runs, so we want to keep track of which versions of the model and event generator were used to create it. This will be a recurring pattern with simulation testing: we always know the provenance of the data.
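
Here is a minimal Python sketch of that generation step, using the standard library's `sqlite3` as the event database. This is not how Simulant does it. The table layout, parameter names, version labels, and action names are all assumptions chosen for illustration.

```python
import json
import random
import sqlite3

MODEL_VERSION = "activity-model-v1"        # hypothetical version labels
GENERATOR_VERSION = "event-generator-v1"


def generate_event_stream(conn: sqlite3.Connection, params: dict, seed: int, count: int) -> int:
    """Create `count` plausible actions and store them, with provenance, for later runs."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS streams (id INTEGER PRIMARY KEY, "
        "model_version TEXT, generator_version TEXT, seed INTEGER, params TEXT)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events "
        "(stream_id INTEGER, at_ms INTEGER, user_type TEXT, action TEXT)"
    )
    # Record provenance first: which model, generator, seed, and parameters made this stream.
    cursor = conn.execute(
        "INSERT INTO streams (model_version, generator_version, seed, params) "
        "VALUES (?, ?, ?, ?)",
        (MODEL_VERSION, GENERATOR_VERSION, seed, json.dumps(params, sort_keys=True)),
    )
    stream_id = cursor.lastrowid
    rng = random.Random(seed)
    at_ms = 0
    for _ in range(count):
        at_ms += rng.randint(1, params["max_gap_ms"])  # time offset of the next action
        logged_in = rng.random() < params["logged_in_ratio"]
        if not logged_in and rng.random() < params["registration_chance"]:
            action = "register"
        else:
            action = rng.choice(["browse", "add_to_cart"])
        conn.execute(
            "INSERT INTO events VALUES (?, ?, ?, ?)",
            (stream_id, at_ms, "member" if logged_in else "guest", action),
        )
    conn.commit()
    return stream_id
```

A call such as `generate_event_stream(sqlite3.connect(":memory:"), {"logged_in_ratio": 0.6, "registration_chance": 0.2, "max_gap_ms": 500}, seed=11, count=50)` yields the same events every time for the same seed and parameters.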

[Figure: Simulation Testing with Simulant]

The simulation runner then executes the actions against the system under test. The system under test must be initialized with a known, versioned data set. (We'll also record the version of the starting dataset that was used.) Because this runner is a separate program, we can turn a dial to control how fast the simulation runs. We can go from real time, to double speed, to one-tenth speed.
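
The "dial" can be as simple as a scaling factor on the gaps between recorded events. The following Python sketch is one naive way to do it, with hypothetical names throughout. It assumes events are `(at_ms, action)` pairs sorted by time, and it ignores the time the system itself takes to execute each action.

```python
import time
from typing import Any, Callable, Iterable


def run_simulation(
    events: Iterable[tuple[int, str]],
    execute: Callable[[str], Any],
    speed: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> list[tuple[int, str, Any]]:
    """Replay (at_ms, action) events in order against the system under test.

    speed=1.0 is real time, 2.0 is double speed, 0.1 is one-tenth speed.
    Outputs are captured rather than checked; validation happens later.
    """
    outputs = []
    previous_ms = 0
    for at_ms, action in events:
        # Wait for the scaled gap since the previous event.
        sleep((at_ms - previous_ms) / 1000.0 / speed)
        previous_ms = at_ms
        outputs.append((at_ms, action, execute(action)))
    return outputs
```

Passing `sleep` as a parameter is only a convenience for trying the runner out: a stand-in that records the requested delays lets you confirm the scaling without actually waiting.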

Where most test methods would verify the system output immediately, simulation testing actually just captures everything in yet another database. This database of outputs includes the final dataset when the event stream was completed, plus all the outputs generated by the system during the simulation. (Bear in mind that this "database" could just be log files.) These are the normal outputs of the system, just captured permanently.

Like everything else here, the output database is versioned. All the parameters and versions are recorded in the test record. That way we can tie a test report to exactly the inputs that created it.

Speaking of the test report, we run validations against the resulting dataset as a final process. Validations may be boolean checks for correctness (is the resulting value what was expected?) but they may also verify global properties. Did the system process all inputs within the allowed time? Were all the inputs accepted and acted on? Does "money in" balance with "money out"? These global properties cannot be validated during the simulation run.

An interesting feature of this model is that validations don't all need to exist when you run the simulation. We can think of new things to check long after the test run. For example, on finding a bug, we can create a probe for that bug and then find out how far back it goes.
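
Because the outputs are just stored data, a validation is nothing more than a function over that data. This Python sketch uses a hypothetical record shape (`kind`, `amount_cents`, `latency_ms`) and invented check names to show global properties being evaluated after the run, and a new probe being added later without re-running anything.

```python
from typing import Callable

# Hypothetical captured outputs from one simulation run.
CAPTURED_OUTPUTS = [
    {"kind": "payment", "amount_cents": 1000, "latency_ms": 120},
    {"kind": "payout", "amount_cents": 950, "latency_ms": 80},
    {"kind": "fee", "amount_cents": 50, "latency_ms": 15},
]


def money_balances(records: list[dict]) -> bool:
    """Global property: "money in" equals "money out" across the whole run."""
    money_in = sum(r["amount_cents"] for r in records if r["kind"] == "payment")
    money_out = sum(r["amount_cents"] for r in records if r["kind"] in ("payout", "fee"))
    return money_in == money_out


def all_within_deadline(records: list[dict], limit_ms: int = 500) -> bool:
    """Global property: every input was processed within the allowed time."""
    return all(r["latency_ms"] <= limit_ms for r in records)


def validate_run(
    records: list[dict], validations: dict[str, Callable[[list[dict]], bool]]
) -> dict[str, bool]:
    """Run every named validation against the stored outputs of a finished run."""
    return {name: check(records) for name, check in validations.items()}


# A probe written long after the run can be applied to the same stored data.
def no_negative_amounts(records: list[dict]) -> bool:
    return all(r["amount_cents"] >= 0 for r in records)
```

Running `validate_run(CAPTURED_OUTPUTS, {"balance": money_balances, "deadline": all_within_deadline, "non_negative": no_negative_amounts})` against the outputs of older runs is how you would find out how far back a bug goes.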

Simulant is a tool for building simulation tests.

Considerations when using simulation testing:


  • This approach has some similarities to property-based testing. There are two major differences:
    • Property-based testing has a complete formal model of the system used for both case generation and invariant checking. In contrast, simulation testing uses a much simpler model of incoming traffic and a separate set of invariants. Splitting these two things makes each one simpler.
    • Property-based testing runs its assertions immediately. To add new assertions, you must re-run the whole test. Simulation testing can apply new assertions to the recorded outputs of earlier runs.
  • Because assertions are run after the fact, you can verify global properties about the system. For example, does "money in" balance with "money out"?
  • Using virtual machines and storage for the system under test makes it much simpler to initialize with a known good state.
  • We can create multiple event streams to represent different "weather conditions." E.g., a product launch scenario versus just a heavy shopping day.
  • The event stream doesn't need to be overly realistic. It's often better to have a naive transition model informed by a handful of useful parameters. Generate lots of data volume instead of finely modeled scenarios.
  • When there are failures, they are usually very obvious from the assertions. The tougher part is figuring out how a bad final state happened. More output in the test record is better.
  • If you squint, this looks like a load testing model. While it's possible to use this for load testing, it can be awkward. We've found a hybrid approach to be better: run a load generator for the "bulk traffic," then use simulation testing to check correctness under high load conditions. Of course, you sacrifice repeatability this way.