I had an insightful experience playing with code coverage the other day. I was setting up nox to test code locally with pytest before pushing it. As part of this setup, I decided to use coverage to report which lines of the source code are exercised by the tests.
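
Concretely, the setup looked roughly like the following noxfile.py sketch; the session name and install steps are illustrative rather than the exact project configuration:

import nox

@nox.session
def tests(session):
    # Install the test dependencies and the package under test.
    session.install("pytest", "coverage")
    session.install("-e", ".")
    # Run the test suite under coverage, then print a report listing
    # the line numbers that were never executed.
    session.run("coverage", "run", "-m", "pytest")
    session.run("coverage", "report", "-m")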

This runs against my usual practice, as I’ve never been a big fan of coverage reporting. Everyone acknowledges that good coverage is a necessary but not sufficient condition for knowing that your code is well tested: a piece of code must be run in order to be tested, but it could be running with values that don’t capture edge cases, or even common problematic cases. Code coverage gives you a metric to optimize, yet the correlation between that metric and the quality of the tests can be questionable, and you can end up spending more time optimizing the metric than actually testing the code. Optimizing it becomes increasingly difficult as you converge to 100%, while the marginal usefulness of the additional tests drops towards zero.

As a basic example, your tests might call a module’s main() function directly, and as a result coverage complains that you never tested the part of the module that makes it behave as a script:

if __name__ == '__main__':
    main()

Since testing that code just to increase coverage is mostly an inefficient use of time, and because I know the bugs in the code are probably in the lines that were covered by tests but covered inadequately, I’ve typically just ignored coverage in the past.
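
For what it’s worth, coverage.py does have a standard escape hatch for lines like these: a pragma comment that excludes the clause from the report. A minimal sketch:

def main():
    print("doing the actual work")

# The pragma excludes this clause from the coverage report entirely.
if __name__ == '__main__':  # pragma: no cover
    main()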

As another example, I raise a lot of conditional ValueErrors in my code in the name of defensive programming, and often these lines never get run. Coverage penalizes this, and it can feel like a waste of time to contrive data just to trigger a ValueError that I know will get raised if the data ever is a problem. There are more such examples of lines where the time spent improving coverage could be better spent writing real tests.
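
The pattern I have in mind looks roughly like this hypothetical helper (not code from the actual project), where the raise only fires on malformed input:

def mean_span_length(spans):
    """Average character length of a list of text spans."""
    if not spans:
        # Defensive check: ideally this line never runs in practice,
        # so coverage reports it as uncovered.
        raise ValueError("expected a non-empty list of spans")
    return sum(len(s) for s in spans) / len(spans)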

However, the other day I found a good use for coverage. Rather than trying to maximize the metric, I simply inspected the lines that weren’t covered: instead of falling victim to a seemingly endless game of line-coverage optimization, I observed those lines with equanimity. Findings:

  • Sometimes this yielded dead code, such as functions from times past that were no longer relevant and could be deleted.
  • Sometimes this indicated which broad paths through the code were not being handled by the test suite. Examples:
    • One of our models was being tested in inference mode, but not in training mode - all training tests were with another model. This doesn’t necessarily need to change but it’s good to know.
    • We had implemented several different ways of breaking large spans of text down into smaller ones to be fed to our natural language processing model, and some of them weren’t being tested. These approaches may indeed never get used, but since they might, I’d rather not delete the code. So I simply dropped some logging.warning calls at such junctures (see the sketch after this list). Rather than creating a test for code that never gets run, it’s a lot more time-efficient to raise a warning so that, should the code ever get run, we are informed that we should test it. (I tried at first to raise an exception, to be more forceful and fail fast, but then pylint complained about unreachable code.)
  • In cases where the uncovered lines were conditionals that raise exceptions (and that aren’t really worth spending time writing tests for), I simply got to observe, acknowledge that I don’t care, and move on.
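
Dropping a warning at such a juncture looks roughly like the following sketch; the chunking function here is hypothetical rather than our actual code:

import logging

def split_by_sentence(text):
    # Hypothetical, untested chunking strategy. Rather than writing a test
    # for code that may never run, warn if it ever does get used so we know
    # it is time to test it.
    logging.warning("split_by_sentence is untested; add a test if you hit this.")
    return [sentence.strip() for sentence in text.split(".") if sentence.strip()]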

To summarize, I just used coverage to inspect code that hadn’t been covered, and it helped me delete dead code, flag some code with warnings, and get a better sense of what the test cases do. I don’t intend to reach 100% code coverage, but I nonetheless found it useful to run a coverage tool, and I’ve revised my attitude towards them.