The problem with test coverage

Est. reading time: 7 minutes

“Test coverage is useless” 

Or…

“Test coverage is everything”

Two opposing extremes, both of which I’ve heard recently and many times throughout my career. As with most things, I think the truth lies somewhere in between.

What is test coverage?

When people hear “test coverage” they each have their own idea of what it means, and that idea may be entirely different from the measurements actually in place! There are plenty of ways you can measure coverage. You can measure it against the requirements (i.e. the percentage of acceptance criteria that are covered, or feature coverage), but more commonly, when it comes to automation, you measure coverage against the code itself.

Let’s explore some of the common types of code coverage and the issues inherent within them.

Line coverage

The simplest approach is to measure how many lines of the product code are executed when you run your automated tests (line coverage). For example:

def foo(x, y):
    bar()
    if x == 1 and y > 1:
        bar()

So to get 100% line coverage you just need one test:

  1. x = 1 and y > 1

What is the issue with line coverage?

If a line contains multiple decisions or conditions, you do not need to exercise all of their outcomes to reach 100% coverage. For example, in the method above, do we know what happens when y is less than 2? Maybe there’s a bug that only shows up on that path, but we won’t know, because we’ve only written one test. It might even be that every combination of x and y produces the same outcome! In this case the coverage statistic might lead us to write less automation than we otherwise would have, simply because we’ve hit 100%. Can’t beat 100%, right?
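To make this concrete, here is a minimal sketch that measures line coverage of foo() by hand using Python’s built-in sys.settrace hook. It is a deliberately simplified stand-in for a real tool such as coverage.py; the do-nothing bar() and the hard-coded body line offsets are assumptions made only for this illustration.

```python
import sys


def bar():
    """Hypothetical stand-in for the article's bar(); it does nothing."""


def foo(x, y):
    bar()
    if x == 1 and y > 1:
        bar()


# Line offsets of foo's body relative to its `def` line: bar(), if, bar().
FOO_BODY_LINES = {1, 2, 3}


def executed_lines(func, *args):
    """Run func(*args) and return the body line offsets that executed."""
    code = func.__code__
    hit = set()

    def tracer(frame, event, arg):
        # Only record "line" events that belong to func's own frame.
        if event == "line" and frame.f_code is code:
            offset = frame.f_lineno - code.co_firstlineno
            if offset > 0:  # ignore the def line itself
                hit.add(offset)
        return tracer

    sys.settrace(tracer)
    try:
        func(*args)
    finally:
        sys.settrace(None)
    return hit


def line_coverage(func, body_lines, test_inputs):
    """Fraction of body_lines executed across all the given test inputs."""
    covered = set()
    for args in test_inputs:
        covered |= executed_lines(func, *args)
    return len(covered & body_lines) / len(body_lines)


# The single test from above: x = 1 and y > 1.
print(line_coverage(foo, FOO_BODY_LINES, [(1, 2)]))  # 1.0
```

One test is enough for a perfect score, even though the path where the if statement is FALSE never runs.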

Decision Coverage

Alternatively, you could measure decision branches: wherever a decision is made, for example by an if statement or a switch statement, you ensure that every possible branch is taken at least once.

def foo(x, y):
    bar()
    if x == 1 and y > 1:
        bar()
    else:
        baz()

So to cover the above code, you would need two tests:

  1. x = 1 and y > 1
  2. x != 1 and/or y <= 1
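One way to see what decision coverage asks for is to record which branch of foo() each test takes. The sketch below does that with hypothetical recording versions of bar() and baz(); a real tool (for example coverage.py with branch measurement enabled) would track this for you.

```python
# Hypothetical recorders standing in for bar() and baz().
taken = []


def bar():
    taken.append("bar")


def baz():
    taken.append("baz")


def foo(x, y):
    bar()
    if x == 1 and y > 1:
        bar()  # TRUE branch
    else:
        baz()  # FALSE branch


def branches_taken(test_inputs):
    """Return the set of branches ("TRUE"/"FALSE") the given tests exercise."""
    branches = set()
    for x, y in test_inputs:
        taken.clear()
        foo(x, y)
        # The first bar() always runs; the last recorded call identifies the branch.
        branches.add("TRUE" if taken[-1] == "bar" else "FALSE")
    return branches


print(sorted(branches_taken([(1, 2)])))          # ['TRUE']
print(sorted(branches_taken([(1, 2), (0, 0)])))  # ['FALSE', 'TRUE']
```

The second call corresponds to the two tests listed above and reaches 100% decision coverage.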

What is the issue with decision coverage?

Let’s consider the above code again. In this scenario it’s a simple if statement, but what if we made the condition more complicated, like below:

if (x == 1 and y > 1) or z % 2 == 0: 

Now we are adding another layer of complexity: we keep the old check but also enter this branch whenever z is an even number. We would really want to check this with our unit tests, as there are now several combinations of values that can make this if statement TRUE or FALSE. However, the change does not alter our coverage percentage. This is because in decision (branch) coverage a branch counts as covered as soon as it has been taken once, so our existing two tests (given an odd value for z) still give us 100%. Again we are in the situation where we can get top marks while still missing some of the actual logic.
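A quick sketch of that situation. The snippet above uses z without declaring it, so here z is assumed to be a parameter, and the odd z value added to the two existing tests is likewise an assumption. Both branches are still taken, yet no test ever takes the TRUE branch purely because z is even.

```python
def takes_true_branch(x, y, z):
    """The extended decision; z is assumed to be a parameter."""
    return (x == 1 and y > 1) or z % 2 == 0


# The two existing decision-coverage tests, with an (assumed) odd z added.
tests = [(1, 2, 1), (0, 0, 1)]

outcomes = {takes_true_branch(*t) for t in tests}
print(outcomes == {True, False})  # True: both branches taken, 100% decision coverage

# Tests that reach the TRUE branch only because z is even.
via_z_only = [t for t in tests if not (t[0] == 1 and t[1] > 1) and t[2] % 2 == 0]
print(via_z_only)  # []: the new z logic is never exercised on its own
```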

Condition Coverage

With condition coverage, every individual condition within your decision statements must evaluate to both TRUE and FALSE at some point for it to count. Let’s consider our code again, this time with the revised if statement:

def foo(x, y, z):
    bar()
    if (x == 1 and y > 1) or z % 2 == 0:
        bar()
    else:
        baz()

To get 100% coverage now, we need to ensure that every atomic part of the if statement evaluates to both TRUE and FALSE. The atomic parts are what you get when you break the if statement’s logic up at the logical operators and/or/not. So we have three atomic parts in our statement:

  1. x == 1
  2. y > 1
  3. z % 2 == 0

In order to get 100% coverage, our tests need to make each part evaluate to TRUE and to FALSE. However, a single test can cover several atomic parts at once, meaning we can still get 100% coverage with two tests:

  1. x = 1, y = 1, z = 1 (TRUE, FALSE, FALSE)
  2. x = 2, y = 2, z = 2 (FALSE, TRUE, TRUE)
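A small sketch of how condition coverage could be checked: evaluate each atomic condition for every test and confirm each one has been both TRUE and FALSE. Note that it evaluates every condition independently, as the list above does, and ignores Python’s short-circuit evaluation (in test 2, Python never actually evaluates y > 1, because x == 1 is already FALSE).

```python
# The three atomic conditions from the revised if statement.
CONDITIONS = {
    "x == 1": lambda x, y, z: x == 1,
    "y > 1": lambda x, y, z: y > 1,
    "z % 2 == 0": lambda x, y, z: z % 2 == 0,
}


def condition_coverage(tests):
    """Fraction of conditions that evaluated to both TRUE and FALSE across tests."""
    seen = {name: set() for name in CONDITIONS}
    for x, y, z in tests:
        for name, check in CONDITIONS.items():
            seen[name].add(check(x, y, z))
    fully_covered = [name for name, values in seen.items() if values == {True, False}]
    return len(fully_covered) / len(CONDITIONS)


# The two tests above: (TRUE, FALSE, FALSE) and (FALSE, TRUE, TRUE).
print(condition_coverage([(1, 1, 1), (2, 2, 2)]))  # 1.0
print(condition_coverage([(1, 1, 1)]))             # 0.0
```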

What is the issue with condition coverage?

As you can see, we have increased the complexity of the statement but still covered it with two tests, even though there are many more combinations we could, and probably should, try. In fact, this type of coverage does not even guarantee that both branches of the if statement have been executed, only that each condition has evaluated to TRUE and FALSE.
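Here is a concrete case worked out for the revised if statement: the two tests below make every atomic condition both TRUE and FALSE, yet the whole decision is TRUE both times, so the else block (baz()) never runs. The recording bar()/baz() stubs are hypothetical.

```python
calls = []  # records which functions foo() called, in order


def bar():
    calls.append("bar")


def baz():
    calls.append("baz")


def foo(x, y, z):
    bar()
    if (x == 1 and y > 1) or z % 2 == 0:
        bar()
    else:
        baz()


# Each atomic condition is TRUE in one test and FALSE in the other:
#   x == 1: TRUE, FALSE   y > 1: TRUE, FALSE   z % 2 == 0: FALSE, TRUE
tests = [(1, 2, 1), (2, 1, 2)]

for x, y, z in tests:
    foo(x, y, z)

print("baz" in calls)  # False: the else branch was never executed
```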

Common concerns

We’ve seen issues specific to certain types of code coverage, but there are also some common pitfalls with code coverage in general.

Quantity not quality

Test coverage of any form can tell you the amount of something, whether that is lines covered, requirements hit, decision branches explored or something else. However, it does not tell you the quality of the checks you are performing. For example, your unit tests might execute every line and still miss vital checks.

Let’s say you need to check this method:

def multiply(number_one, number_two):
    return number_one * number_two

Well, that’s simple, and we can get code coverage to 100% with a single test that passes in 1 and 2 and checks that the result is 2. Awesome! 100%!

But wait… what happens if we pass in null (None, in Python) for one or both of the numbers? What happens if we pass in zero? What about negative numbers? None of these cases are tested, but relying on test coverage alone gives us a false sense of security that everything is covered.
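A sketch of what some of those extra checks might look like as plain pytest-style test functions. The expected behaviours are assumptions about what we would want multiply() to do; in particular, the None case simply pins down Python’s current behaviour (a TypeError), which a team may or may not actually want. With pytest you would more typically write the last check using pytest.raises(TypeError).

```python
def multiply(number_one, number_two):
    return number_one * number_two


def test_multiply_happy_path():
    # This test alone already gives 100% line coverage of multiply().
    assert multiply(1, 2) == 2


def test_multiply_by_zero():
    assert multiply(0, 5) == 0
    assert multiply(5, 0) == 0


def test_multiply_negative_numbers():
    assert multiply(-2, 3) == -6
    assert multiply(-2, -3) == 6


def test_multiply_none_raises_type_error():
    # Assumed contract: None is rejected rather than silently handled.
    for args in [(None, 2), (2, None), (None, None)]:
        try:
            multiply(*args)
        except TypeError:
            continue
        raise AssertionError(f"expected TypeError for {args}")


if __name__ == "__main__":
    for test in (test_multiply_happy_path, test_multiply_by_zero,
                 test_multiply_negative_numbers, test_multiply_none_raises_type_error):
        test()
    print("all multiply tests passed")
```

None of the extra tests change the coverage figure; they only add the checks that the figure cannot see.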

Playing the system

Another big issue with enforcing a code coverage metric is that it tends to encourage undesirable behaviour purely to satisfy the metric. Unit tests that have no value are the perfect example of this. Let’s say I write a unit test that covers this code:

def foo(x, y):
    if x == 1 and y > 1:
        bar()

Really, I’d want to stub bar() and check that it is actually called when I expect it to be. But I don’t _need_ to do that to satisfy the coverage metric.

Consider the following test method:

def test_bar_is_called_given_valid_details():
    foo(1, 2)
    assert True  # always passes, so it checks nothing

Admittedly this is a crude example: I couldn’t be bothered creating any kind of stub, so I didn’t. This test would happily pass and would happily hit 100% line coverage of the foo() method (decision coverage, to be fair, would also want a test where the if statement is FALSE).
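For contrast, a sketch of the test I’d actually want: replace bar() with a unittest.mock.Mock and assert that it was (or wasn’t) called. Because foo() and a placeholder bar() live in the same file here, the sketch swaps bar in foo’s module globals with patch.dict; in a real project you would more typically write patch("your_module.bar"), where your_module is a hypothetical name for wherever foo() is defined. The first test executes exactly the same lines as the assert True version, which is rather the point: the coverage figure cannot tell them apart.

```python
from unittest.mock import Mock, patch


def bar():
    """Placeholder for the real dependency; it should never run in these unit tests."""
    raise RuntimeError("real bar() called from a unit test")


def foo(x, y):
    if x == 1 and y > 1:
        bar()


def test_bar_is_called_given_valid_details():
    fake_bar = Mock()
    # foo() looks bar up in its module globals at call time, so swap it there.
    with patch.dict(foo.__globals__, {"bar": fake_bar}):
        foo(1, 2)
    fake_bar.assert_called_once_with()


def test_bar_is_not_called_given_invalid_details():
    fake_bar = Mock()
    with patch.dict(foo.__globals__, {"bar": fake_bar}):
        foo(1, 1)
    fake_bar.assert_not_called()
```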

On a single method this might seem unlikely, but when you are writing lots of code, are required to hit 100% (or similar) coverage, and delivery pressure ramps up, this kind of behaviour becomes more and more commonplace.

So don’t bother with code coverage?

Some people would say this, but it isn’t what I would advise. Code coverage can absolutely be a useful metric when used in the right way.

Understand your metric

You need to be aware of exactly what you are measuring with your code coverage. Decide as a team which kind of coverage you want to measure, and make sure everyone understands what that means and what is and is not included in it.

An indicator, not a target

The bad behaviours and common problems often stem from enforcing a particular amount of coverage onto your code, for example failing the CI build if coverage drops below a specific number. Given the problems this causes, there is little upside to using coverage in this way.

Instead, use it as an indicator that something might be an issue. If one area of code has 80% line coverage and another has 20%, it’s probably worth taking a look at the area with lower coverage. There might be a good reason for the difference, but the coverage metric gives you a handy pointer to where to look in the first place.
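A minimal sketch of the “indicator, not a target” idea: rather than failing the build below a fixed number, produce a list of areas worth a look, lowest coverage first. The area names, the figures and the margin parameter are all hypothetical.

```python
def areas_to_review(coverage_by_area, margin=0.25):
    """Return areas whose line coverage trails the best-covered area by more
    than `margin`, lowest first. Purely informational: nothing fails."""
    best = max(coverage_by_area.values())
    flagged = [area for area, cov in coverage_by_area.items() if best - cov > margin]
    return sorted(flagged, key=coverage_by_area.get)


# Hypothetical figures echoing the 80% vs 20% example above.
coverage = {"checkout": 0.80, "invoicing": 0.20, "search": 0.70}
for area in areas_to_review(coverage):
    print(f"Worth a look: {area} ({coverage[area]:.0%} line coverage)")
```

The output is a prompt for a conversation (“why is invoicing so low?”), not a gate, so there is nothing to game.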

So…

“Test coverage is useless” -> Rubbish

“Test coverage is everything” -> Rubbish

“Test coverage can be a useful metric when the measurements are well understood and it is used as an indicator of potential risk areas” -> Boring, but true
