Don't get me wrong, I love unit testing. The practice of unit testing
is probably the most important quality innovation in my whole
career. Unit testing has spread beyond the agile development community,
where it started, into the mainstream, and we are all better off for it.
We need to be aware of some serious limitations, though. For example,
"Out of the Tar Pit",
Moseley and Marks say:
The key problem with testing is that a test (of any kind) that uses
one particular set of inputs tells you nothing at all about the
behaviour of the system or component when it is given a different set
of inputs. The huge number of different possible inputs usually rules
out the possibility of testing them all, hence the unavoidable
concern with testing will always be, "have you performed the right
tests?" The only certain answer you will ever get to this question is
an answer in the negative --- when the system breaks.
We can only write unit tests with a certain number of input cases. Too
few, and you miss an important edge case. Too many, and the cost of
maintaining the tests themselves becomes onerous.
Worse yet, we know that unit tests are inadequate when we need to
test overall system properties, in the presence of GUIs, and when
concurrency is involved.
So here are four testing strategies that each supplement unit tests
with more ways to gain confidence in your fully assembled system.
Automated Contract Testing
Automated Contract Testing uses a data-oriented specification of a
service to help with two key tasks:
- Exercise the service and verify that it adheres to its invariants.
- Simulate the service for development purposes.
It looks like this:
Some things to consider when using this type of testing:
- The contract model should be written by a consumer to express only
the parts of the service interface they care about. (If you
overspecify by modeling things you don't actually use, then your
tests will throw false negatives.)
- The supplier should not write the contract. Consumers write
models that express their desired interface partly to help validate
their understanding of the protocol. They may also uncover cases
that are in the supplier's blind spot.
- The test double should not make any assumptions about the logical
consistency of the results with respect to the parameters. You
should only be testing the application code and the way it deals
with the protocol.
- E.g., if you are testing an "add to cart" interface, do not verify
that the item you requested to add was the one actually added. That
is coupling to the implementation logic of the back end
service. Instead, simply verify that the service accepts a
well-formed request and returns a well-formed result.
There are libraries available to help with this style of test.
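To make that concrete, here's a minimal consumer-side sketch in Python. The endpoint, field names, and the fake service are all hypothetical; the point is that both the test double and the assertions check only that requests and responses are well-formed, never what the back end actually did with them.

```python
# A consumer-written contract for a hypothetical "add to cart" endpoint.
# It describes only the parts of the protocol this consumer cares about.
ADD_TO_CART_CONTRACT = {
    "request": {
        "method": "POST",
        "path": "/cart/items",
        "body_fields": {"sku": str, "quantity": int},
    },
    "response": {
        "status": 200,
        "body_fields": {"cart_id": str, "item_count": int},
    },
}

def is_well_formed(body, expected_fields):
    """Shape check only: every expected field is present with the right type.
    Deliberately does NOT check that the item we sent is the one added --
    that would couple the test to the supplier's implementation logic."""
    return all(
        name in body and isinstance(body[name], field_type)
        for name, field_type in expected_fields.items()
    )

class FakeCartService:
    """Test double derived from the contract: accepts any well-formed
    request and returns a canned, well-formed response."""
    def post(self, path, body):
        request_spec = ADD_TO_CART_CONTRACT["request"]
        assert path == request_spec["path"]
        assert is_well_formed(body, request_spec["body_fields"])
        return {"cart_id": "fake-cart-1", "item_count": 1}

def test_add_to_cart_against_double():
    double = FakeCartService()
    response = double.post("/cart/items", {"sku": "ABC-123", "quantity": 2})
    assert is_well_formed(response, ADD_TO_CART_CONTRACT["response"]["body_fields"])
```

The same contract description can also be pointed at the real service from time to time, to confirm the supplier still honors the parts of the protocol this consumer relies on.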
Property-Based Testing
Property-based testing is a derivative of the "formal specifications"
ideas. It uses a model of the system to describe the allowed inputs,
outputs, and state transitions. Then it randomly (but repeatably)
generates a vast number of test cases to exercise the system. Instead
of looking for success, property-based testing looks for failures. It
detects states and values that could not have been produced according
to the laws of the model, and flags those cases as failures.
Property-based testing looks like this:
The canonical property-based testing tool is QuickCheck. Many partial
and fragmentary open-source tools claim to be "quickcheck clones" but
they lack two really important parts: search-space optimization and
failure minimization. Search-space optimization uses features of the
model to probe the "most important" cases rather than using a sheer
brute-force approach. Failure minimization is an important technique
for making the failures useful. Upon finding a failing case,
minimization kicks in and searches for the simplest case that
recreates the failure. Without it, understanding a failure case is
just about as hard as debugging an end-user defect report.
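To give a flavor of the style, here's a minimal sketch using Python's Hypothesis library, which does implement failure minimization (it calls it "shrinking"). The run-length encoder and its round-trip property are invented for illustration; a model of a real system's allowed states and transitions would be much more involved.

```python
from hypothesis import given, strategies as st

def encode(items):
    """Toy run-length encoder: ["a", "a", "b"] becomes [("a", 2), ("b", 1)]."""
    out = []
    for item in items:
        if out and out[-1][0] == item:
            out[-1] = (item, out[-1][1] + 1)
        else:
            out.append((item, 1))
    return out

def decode(pairs):
    return [item for item, count in pairs for _ in range(count)]

@given(st.lists(st.text(min_size=1, max_size=1)))
def test_roundtrip(items):
    # The property: decoding an encoding gives back the original input.
    # Hypothesis generates many random lists; on a failure it searches for
    # the smallest input that still fails before reporting it.
    assert decode(encode(items)) == items
```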
Considerations when using property-based testing:
- Specifying the model is a specialized coding task. The model is
often 10% of the size of the original system, but can still be very large.
- Test failures sometimes indicate a problem in the system and
sometimes a problem in the model. Most business systems are not very
well specified (in the rigorous CS sense of the term) and suffer
from many edge cases. Even formal standards from international
committees can be riddled with ambiguities and errors.
- If the system under test is not self-contained, then non-repeatable
test failures can confuse the framework. Test isolation (e.g., mocks
for all integration points) is really essential.
- This is not a cheap approach. Developing a robust model can take
many months. If a company commits to this approach, then the model
itself becomes a major investment. (In some ways, the model will be
more of an asset than the code base, since it specifies the system
behavior independently of any implementation technology!)
Fault Injection
Fault Injection is pretty much what it sounds like. You run the system
under test in a controlled environment, then force "bad things" to
happen. These days, "bad things" mostly means network problems and
hacking attacks. I'll focus on the network problems for now.
One particular fault injection tool that has produced some interesting
results lately is Jepsen. Jepsen's author, Kyle Kingsbury, has been able to
demonstrate data loss in all of the current crop of eventually consistent
NoSQL databases. You can clone the Jepsen repo to duplicate his results.
It looks like this:
Jepsen itself runs a bunch of VMs, then generates load. While the load
is running against the system, Jepsen can introduce partitions and
delays into the virtual network interfaces. By introducing controlled
faults and delays in the network, Jepsen lets us try out conditions
that can happen "in the wild" and see how the system behaves.
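Jepsen has its own machinery for all of this, but the underlying idea can be sketched with ordinary Linux tooling. In this rough Python sketch the interface name and peer address are placeholders, and it is not taken from Jepsen's code:

```python
# Rough illustration of the kinds of faults a tool like Jepsen introduces.
# Requires root; interface names and peer addresses are placeholders.
import subprocess

def partition_from(peer_ip):
    # Drop all traffic coming from one peer: a one-way network partition.
    subprocess.run(["iptables", "-A", "INPUT", "-s", peer_ip, "-j", "DROP"], check=True)

def heal(peer_ip):
    # Remove the drop rule so the "partition" ends.
    subprocess.run(["iptables", "-D", "INPUT", "-s", peer_ip, "-j", "DROP"], check=True)

def add_latency(device="eth0", delay="100ms", jitter="20ms"):
    # Delay every packet on an interface to simulate a slow, flaky link.
    subprocess.run(["tc", "qdisc", "add", "dev", device, "root",
                    "netem", "delay", delay, jitter], check=True)
```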
After running the test scenario, we use a validator to detect whether
the system under test lost or corrupted data. Some things to consider:
- Jepsen itself doesn't provide much help for validation. As
delivered, it just tries to store a monotonic sequence of
integers. For application-specific tests, you must write a separate
validator (see the sketch after this list).
- Generating load needs to be predictable enough that we can verify the
data and messages coming out of the system. That either means scripts or
pseudo-random cases with a controlled seed.
- This is another test method that can't prove success, but can detect
failures.
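As a rough illustration of what an application-specific validator might check (this is not part of Jepsen, and the record shapes are invented), here's a sketch that compares acknowledged writes against the data visible after the cluster heals:

```python
# history: list of dicts like {"value": 7, "acked": True}
# final_read: set of values visible after the test run completes.
def validate(history, final_read):
    acknowledged = {op["value"] for op in history if op["acked"]}
    lost = acknowledged - final_read          # acked by the store, then dropped
    unexpected = final_read - {op["value"] for op in history}
    return {"lost": sorted(lost), "unexpected": sorted(unexpected)}

# Example: the store acked 1..5 during the partition, but only 1..3 survived.
report = validate(
    history=[{"value": v, "acked": True} for v in range(1, 6)],
    final_read={1, 2, 3},
)
assert report["lost"] == [4, 5]
```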
Simulation Testing
Simulation testing is the most repeatable
of these methods. In simulation testing, we use a traffic model to
generate a large volume of plausible "actions" for the system. Instead
of just running those actions, though, we store them in a database.
The activity model is typically a small number of parameters to describe things
like distribution of user types, ratio of logged-in to not-logged-in
users, likelihood of new registrations, and so on. We use these
parameters to create a database of actions to be executed later.
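A sketch of that generation step might look like the following, with hypothetical parameters and action names. The essential details are the controlled seed and the recorded model version, which together make the stream reproducible.

```python
import json, random

ACTIVITY_MODEL = {
    "version": "model-v3",
    "p_new_registration": 0.05,
    "p_logged_in": 0.60,
    "actions_per_user": 8,
}

def generate_events(model, n_users, seed):
    rng = random.Random(seed)           # controlled seed => reproducible stream
    events = []
    for user in range(n_users):
        logged_in = rng.random() < model["p_logged_in"]
        if rng.random() < model["p_new_registration"]:
            events.append({"user": user, "action": "register"})
        for _ in range(model["actions_per_user"]):
            action = rng.choice(["browse", "search", "add_to_cart"])
            events.append({"user": user, "action": action, "logged_in": logged_in})
    return {"model_version": model["version"], "seed": seed, "events": events}

# Store the stream for later runs (a flat file stands in for the event database).
with open("event-stream-42.json", "w") as f:
    json.dump(generate_events(ACTIVITY_MODEL, n_users=1000, seed=42), f)
```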
The event stream database will be reused for many different test runs,
so we want to keep track of which version of the model and
event generator were used to create it. This will be a recurring
pattern with simulation testing: we always know the provenance of the data.
The simulation runner then executes the actions against the system under
test. The system under test must be initialized with a known, versioned data
set. (We'll also record the version of the starting dataset that was
used.) Because this runner is a separate program, we can turn a dial
to control how fast the simulation runs. We can go from real time, to
double speed, to one-tenth speed.
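The dial can be as simple as scaling the inter-event delays. Here's a sketch; the event fields and the execute callback are placeholders:

```python
import time

def run_simulation(events, execute, rate=1.0):
    """rate=1.0 -> real time, 2.0 -> double speed, 0.1 -> one-tenth speed."""
    start = time.monotonic()
    for event in sorted(events, key=lambda e: e["offset_s"]):
        due = start + event["offset_s"] / rate
        delay = due - time.monotonic()
        if delay > 0:
            time.sleep(delay)
        execute(event)                  # send the action to the system under test
```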
Where most test methods would verify the system output immediately,
simulation testing actually just captures everything in yet another
database. This database of outputs includes the final dataset when the
event stream was completed, plus all the outputs generated by the
system during the simulation. (Bear in mind that this "database" could
just be log files.) These are the normal outputs of the system, just captured permanently.
Like everything else here, the output database is versioned. All the
parameters and versions are recorded in the test record. That way we
can tie a test report to exactly the inputs that created it.
Speaking of the test report, we run validations against the resulting
dataset as a final process. Validations may be boolean checks for correctness (is the resulting value what was expected?) but they may also verify global properties. Did the system process all inputs within the allowed time? Were all the inputs accepted and acted on? Does "money in" balance with "money out"? These global properties cannot be validated during the simulation run.
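Here's a sketch of what such after-the-fact validations could look like, with hypothetical record shapes, covering the three global properties just mentioned:

```python
def validate_run(inputs, outputs, max_latency_s=2.0):
    failures = []
    processed_ids = {o["input_id"] for o in outputs}

    # Global property: every input was accepted and acted on.
    missing = {i["id"] for i in inputs} - processed_ids
    if missing:
        failures.append(f"unprocessed inputs: {sorted(missing)}")

    # Global property: everything finished within the allowed time.
    slow = [o["input_id"] for o in outputs if o["latency_s"] > max_latency_s]
    if slow:
        failures.append(f"over latency budget: {slow}")

    # Global property: money in balances with money out.
    money_in = sum(i.get("amount", 0) for i in inputs)
    money_out = sum(o.get("amount", 0) for o in outputs)
    if money_in != money_out:
        failures.append(f"imbalance: in={money_in} out={money_out}")

    return failures
```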
An interesting feature of this model is that validations don't all need to exist when you run the simulation. We can think of new things to check long after the test run. For example, on finding a bug, we can create a probe for that bug and then find out how far back it goes.
Simulant is a tool for building simulation-based tests like this.
- This approach has some similarities to property-based testing. There
are two major differences:
- Property-based testing has a complete formal model of the system
used for both case generation and invariant checking. In
contrast, simulation testing uses a much simpler model of
incoming traffic and a separate set of invariants. Splitting
these two things makes each one simpler.
- Property-based testing runs its assertions immediately. To add
new assertions, you must re-run the whole test.
- Because assertions are run after the fact, you can verify global
properties about the system. For example, does "money in" balance
with "money out"?
- Using virtual machines and storage for the system under test makes
it much simpler to initialize with a known good state.
- We can create multiple event streams to represent different "weather
conditions." E.g., a product launch scenario versus just a heavy
- The event stream doesn't need to be overly realistic. It's often
better to have a naive transition model informed by a handful of
useful parameters. Generate lots of data volume instead of finely detailed modeling.
- When there are failures, they are usually very obvious from the
assertions. The tougher part is figuring out how a bad final state
happened. More output in the test record is better.
- If you squint, this looks like a load testing model. While it's possible
to use this for load testing, it can be awkward. We've found a hybrid
approach to be better: run a load generator for the "bulk traffic," then use
simulation testing to check correctness under high load conditions. Of course,
you sacrifice repeatability this way.