Data Validation

Data validation is intended to provide automated guarantees over the input data, verifying that the data are meaningful, correct, accurate, consistent, and complete. In data analysis, we often expect certain assumptions to remain invariant over time, regardless of how the dataset is manipulated. For example, we might expect a field to always be unique (an identifier) or a number to range between one and five (a rating). Tight coupling causes the same problem as discussed above, because users often design validation rules that are grounded in the meaning of the data.
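As a minimal sketch of such invariants (the column names and data below are hypothetical, chosen only for illustration), both expectations can be written as assertions over a pandas DataFrame:

    import pandas as pd

    # Hypothetical example data.
    df = pd.DataFrame({
        "user_id": [1, 2, 3],
        "rating": [4, 5, 2],
    })

    # Invariant: identifiers are unique.
    assert df["user_id"].is_unique

    # Invariant: ratings range between one and five (inclusive).
    assert df["rating"].between(1, 5).all()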

A variety of software packages have been developed to facilitate this workflow [1], [2], [3], [4], [5], [6]. These packages can benefit from decoupling semantic and machine type representations.

The part of data validation that depends on machine types, the validation methods, should be the focus of such a package. For example, the software could provide helper functions to assert that a float is approximately equal to a reference value (see pytest.approx in [pytest]). The other part, defining and checking properties of the data, depends on semantic types. For example, the International Standard Book Number (ISBN) has a check digit, which only makes sense to validate when a stored number represents an ISBN. A key observation is that this overlaps substantially with the data summarization example discussed above; large components of that application can be reused to reduce code complexity.
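The split can be sketched as follows; the ISBN-13 helper below is a hypothetical illustration, not part of any of the packages cited above. The first assertion depends only on the machine type (a float), while the check digit test is meaningful only for values carrying the ISBN semantic type:

    import pytest

    # Machine-type check: the float is approximately the reference value.
    assert 0.1 + 0.2 == pytest.approx(0.3)

    def isbn13_has_valid_check_digit(isbn: str) -> bool:
        """ISBN-13 rule: digits weighted 1, 3, 1, 3, ... must sum to a
        multiple of 10."""
        digits = [int(c) for c in isbn if c.isdigit()]
        return len(digits) == 13 and sum(
            d * (3 if i % 2 else 1) for i, d in enumerate(digits)
        ) % 10 == 0

    # Semantic-type check: only meaningful when the string represents an ISBN.
    assert isbn13_has_valid_check_digit("978-0-306-40615-7")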

pytest

Krekel et al., pytest x.y, 2004, https://github.com/pytest-dev/pytest

Footnotes

1. https://github.com/great-expectations/great_expectations
2. https://github.com/zaxr/bulwark
3. https://github.com/engarde-dev/engarde
4. https://github.com/csparpa/fluentcheck
5. https://github.com/jmenglund/pandas-validation
6. https://github.com/TMiguelT/PandasSchema