04: PySpark testing interview Q&As with pytest tutorial

This extends the PySpark testing interview Q&As with unittest tutorial.

Q04: How would you go about writing unit tests to test your PySpark transformations and functions?
A04: Here is a complete example using pytest, a popular third-party Python testing library. pytest is a robust testing framework that makes it easy to write simple, scalable test cases.

Step 1: Ensure pytest is installed via pip or pip3.

Step 2: Write a pytest test file named, say, test_spark_enrich.py to test the join_languages_with_employees(..,..) function we created in the previous tutorial.

Run it as:

pytest test_spark_enrich.py

When you run your test file with the pytest command, it picks up all functions whose names begin with "test".


You can create a fail scenario by modifying one of the data_expected values in the file test_spark_enrich.py. For example, change (“Golang”, “Beginner”, “Sam”) to (“Golang”, “Beginner”, “Same”) and rerun test_spark_enrich.py to see the assertion failure.

You can also run it as:

python -m pytest test_spark_enrich.py

which is almost equivalent to invoking pytest directly, except that it also adds the current directory to sys.path.


Q05: What if you want to reuse the fake or mock DataFrames you created for testing in multiple tests?
A05: Here is the revised code, which reuses the fake or mock data across tests with pytest fixtures.

Q06: What are fixtures in pytest?
A06: Fixtures in pytest are reusable components that you define to set up a specific environment before a test runs and tear it down after it completes. Fixtures use the decorator pattern (functions decorated with @pytest.fixture) and provide a fixed baseline from which tests can run predictably and repeatably. Fixtures are very flexible and can be used for:

1) Creating a SparkSession to be shared among all the tests as shown above.
2) Creating fake or mock data to be shared among all the tests as shown above.
3) Setting up database tables that are to be used for testing.
4) Setting up system configurations for testing.
5) Cleaning up (aka tear down) after tests are run.

Fixtures can be scoped at function, class, module or session level, which means a fixture can be invoked once per test function, once per test class, once per module, or once per session, respectively. The default scope is function. The scope controls how often a fixture is set up and torn down.

Q07: When you run tests, how will you override a function’s behaviour?
A07: The patch function from Python’s unittest.mock module can be used to define what to mock.

The code below uses patch as a context manager. Learn more about Python context managers.
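As a Spark-free illustration of the context-manager form, the EmployeeRepository class and its canned rows below are made up for this sketch:

```python
from unittest.mock import patch


class EmployeeRepository:
    def load(self):
        # imagine this hits a real database
        raise RuntimeError("no database available in unit tests")


def names_of(repo):
    # transformation under test: extract names from (id, name) rows
    return [name for _, name in repo.load()]


repo = EmployeeRepository()

# Inside the with-block, load() is replaced by a mock that returns canned
# rows; when the block exits, the original method is automatically restored.
with patch.object(EmployeeRepository, "load", return_value=[(1, "Sam"), (2, "Peter")]):
    assert names_of(repo) == ["Sam", "Peter"]
```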

In PySpark tests, you can replace a function that retrieves data from a table with mock data, as shown below:

Mocking data is an essential part of unit testing. There are also scenarios where patch is applied as a decorator (annotation style) rather than as a context manager.
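A Spark-free sketch of the decorator form; patching os.getenv here is purely illustrative:

```python
import os
from unittest.mock import patch


def bucket_name():
    # imagine this reads deployment configuration
    return os.getenv("DATA_BUCKET", "prod-bucket")


# The decorator starts the patch before the test body runs and stops it
# afterwards; the mock object is injected as the first argument.
@patch("os.getenv", return_value="test-bucket")
def test_bucket_name(mock_getenv):
    assert bucket_name() == "test-bucket"
    mock_getenv.assert_called_once_with("DATA_BUCKET", "prod-bucket")


test_bucket_name()  # pytest would normally discover and call this
```

The decorator form keeps the test body free of with-blocks, which reads well when a test needs the same mock for its entire duration.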
