Automated unit testing in a few lines of code
Automated testing, often called unit testing, can really help reduce bugs and increase productivity. Yet many data scientists don’t use it. When I talk to people about testing, they almost always agree that it sounds useful. But they will also often say things like:
I know it would help me, I just don’t have the time to do it.
Yes, we should definitely do this, but we don’t have the resources.
I want to use automated tests, but it seems difficult.
The benefits, such as catching bugs earlier, having fewer regressions, and increased velocity, should quickly pay off the investment required to get started using automated testing.
But in this article I also want to show you just how small that initial investment is, and just how easy it is to get started with automated testing.
I will show you how you can go from zero to testing in just a few lines of code and just a few minutes.
The example is in Python, but the same approach applies in other languages as well.
Before we start, let’s just define what we mean by automated testing. Automated tests are pieces of code that check the behavior and result of a given piece of software. Checking the behavior of code is fundamental to any programming, data science included. It’s what we do every time we run some code and look at the output. But manual inspection is slow, error-prone, and, in larger doses, tedious. Automated testing means writing down those manual checks in code, so that instead of going through them manually, we can automatically run all our checks, as often as we want. Sometimes these tests are called unit tests, because each test checks a unit of code.
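As a minimal sketch of that idea, consider a small helper called fraction_missing. It is a hypothetical function made up for this illustration and is not part of the project in this article. The manual check is to call it and look at the printed value. The automated check writes the expected value down as an assertion.

```python
def fraction_missing(values):
    """Return the share of entries in `values` that are None (hypothetical helper)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


# Manual check: run the code and inspect the printed output by eye.
print(fraction_missing([1, None, 3, None]))


# Automated check: the same expectation, written as code that can be re-run at any time.
def test_fraction_missing():
    assert fraction_missing([1, None, 3, None]) == 0.5
    assert fraction_missing([]) == 0.0
```

The test function does nothing more than state what we expected to see when we looked at the output manually.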
Automated testing for a new project
To start with automated testing for a new data science or analysis project, take the following steps before writing any code. We will assume that the first step of the project is loading some data.
We will use the pytest library, because it makes it very easy to run the tests. We will also use Pandas.
The first step is to install pytest. If you use the terminal, you can run the following command:
python -m pip install pytest
If you don’t have pandas installed, you can install it in the same way.
Next, we are going to write a test. Open a new Python file called test_.py (note the underscore _; it’s there so that pytest can find it):
import pandas as pd
import functions as f

def test_load_data():
    result = f.load_data()
    assert isinstance(result, pd.DataFrame)
The test test_load_data tries to run the function load_data from the module functions. It then tries to assert that result is a Pandas DataFrame. If the assertion is true, the test will pass; if it is false, it will fail.
Many testing experts and aficionados would probably think this is a horrible test to start with. Typically, unit tests are illustrated with small functions that take an argument and return a result, such as adding 1 to a number. My example here is chosen on purpose as the typical first step of a data science project. Testing the result of loading some data is not a true unit test, and it has some other downsides.
But I want to show you how to get started with automated testing when doing data science or analysis. An imperfect but practical start is better than a perfect but impractical one.
Now run this test. If you use the terminal, execute the following in the project directory:
pytest
This should execute pytest, which runs all the tests it can find (that is why we called the file test_.py; pytest will collect all files with the filename pattern test_*.py). In this case, it should find the test file and report a failure: the functions module does not exist yet, so the import at the top of the test file fails and pytest reports an error.
Congratulations! You are now doing automated testing!
That is all there is to it.
Of course, that’s great, but how do you actually go on from here?
Automated testing workflow
Once you have the basic setup explained above, you apply automated testing in the following way:
Before writing a function for the next step of your analysis, write a test for the functionality. In the example above, we wrote a test for the function load_data, asserting that it returns a Pandas DataFrame, before even writing the function.
Run the test to make sure it fails.
Write the code that you think will make the test pass.
Run the test to see if it passes.
Keep iterating on the code until the test passes.
Once the test passes, you can either write a new test for the same function, or go on to the next piece of functionality.
Keep on doing these steps until you have working software.
This approach is sometimes called test-driven development, or TDD. Often, the recommendation is to refactor the code once a test passes, instead of just going on to the next function. But for data science applications, the actual code is often quite simple, and it’s often most important to first get to a working model. And the beauty of having automated tests is that they make refactoring safer and easier once you do decide to do it.
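One turn of the cycle described above might look like the following sketch. The function drop_incomplete_rows and the small table are hypothetical, invented here to stand in for "the next step of your analysis". In a real project the test would live in the test file and the function in functions.py; they are shown in one block only so that the example runs on its own.

```python
import pandas as pd


# Step 1: write the test first. Running it before step 2 exists should fail.
def test_drop_incomplete_rows():
    raw = pd.DataFrame({"id": [1, 2, 3], "value": [0.5, None, 1.5]})
    result = drop_incomplete_rows(raw)
    # No missing values may remain, and the complete rows must be kept.
    assert not result.isna().any().any()
    assert list(result["id"]) == [1, 3]


# Step 2: write the code that you think will make the test pass.
def drop_incomplete_rows(df):
    """Hypothetical next analysis step: remove rows that contain missing values."""
    return df.dropna().reset_index(drop=True)
```

If the test fails after step 2, keep changing the function, not the test, until it passes.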
When applying automated testing to a data science workflow, the idea is to encode the checks you would typically do manually, by inspecting some output, into an assertion in an automated test. In the example above, we believe the load_data function should return a Pandas DataFrame.
A test assertion is similar to a hypothesis, with the difference that we should always be able to write code to make the assertion true, which is unfortunately not true about scientific hypotheses…
Let’s try this workflow on our example. How can we make the test pass? In a file named functions.py, write the following code:
import pandas as pd

def load_data():
    return pd.read_csv("YOUR DATA FILE HERE")
Replace YOUR DATA FILE HERE with the data you want to load. Run the tests again. They should now pass.
Now that you have passing tests, you can go on to more tests and more functionality. For example, when you load some data you likely check the column names and types, the dimensions, that there is no missing or bad data, etc. Try to write tests and code for these.
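Here is a rough sketch of what such tests could look like. The column names (id, age, city), the three example rows, and the in-memory CSV text are all assumptions made up for this illustration. In your project, load_data would be imported from functions.py and the expected values would come from your own data.

```python
import io

import pandas as pd

# Hypothetical stand-in data so the sketch runs on its own.
CSV_TEXT = "id,age,city\n1,34,Oslo\n2,28,Bergen\n3,45,Tromso\n"


def load_data():
    # Stand-in for the real load_data in functions.py.
    return pd.read_csv(io.StringIO(CSV_TEXT))


def test_column_names():
    assert list(load_data().columns) == ["id", "age", "city"]


def test_column_types():
    # The age column is expected to be parsed as integers.
    assert pd.api.types.is_integer_dtype(load_data()["age"])


def test_dimensions():
    # Three rows and three columns are expected in this example table.
    assert load_data().shape == (3, 3)


def test_no_missing_values():
    assert not load_data().isna().any().any()
```

Each test encodes one check you would otherwise do by looking at the loaded table, so a failure points directly at what changed in the data.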