Validate

When we talk about data probity, we’re referring to the following criteria:

  • Identifiable input source data,
  • Transparent methods for restructuring of that source data into the data used to support research analysis,
  • Accessible restructured data used to support research conclusions,
  • A repeatable, auditable curation process which produces the same data.

Researchers may disagree on conclusions derived from analytical results. What they should not have cause to disagree on is the underlying data used to produce those results.

That’s where validation comes in.

Perform validation

In the Method documentation, you saw how you can produce shareable output: your method.json file, your restructured_table.xlsx, and the original input source data.

These are all that’s required to validate your methods and output:

>>> import whyqd
>>> validate = whyqd.Validate(directory=DIRECTORY)
>>> validate.set(source=METHOD_SOURCE)
>>> validate.import_input_data(path=INPUT_DATA)
>>> validate.import_restructured_data(path=RESTRUCTURED_DATA)
>>> assert validate.validate()

Where:

  • DIRECTORY is your working directory to run the validation,
  • METHOD_SOURCE is the path to the method.json file,
  • INPUT_DATA is a list of paths to input source data,
  • RESTRUCTURED_DATA is the path to the restructured output data file.

What gets validated

The method contains a checksum for each input file: a hash, based on BLAKE2b, generated from the entire file. These input data are never changed. If anyone opens these files and resaves them, even without making any further changes, the metadata and file structure will change, and a hash generated later on the changed file will differ from the original.

The check will fail.
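To illustrate why a resaved file fails the check, here is a minimal sketch of whole-file hashing with BLAKE2b from Python's standard library. The helper name `file_checksum` and the chunk size are assumptions made for this example; whyqd's internal implementation and digest settings may differ.

```python
import hashlib


def file_checksum(path, chunk_size=65536):
    """Return a BLAKE2b hex digest of a file's raw bytes.

    Hypothetical helper for illustration only; not part of the whyqd API.
    """
    digest = hashlib.blake2b()
    with open(path, "rb") as handle:
        # Read in chunks so large source files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest covers every byte of the file, any change introduced by resaving, including metadata that an application rewrites silently, yields a different checksum.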

whyqd rebuilds the restructured output table, following all the action scripts provided in the Method.

This produces a new restructured data table. This table’s content should be absolutely identical to that of the provided restructured data. whyqd hashes the content of this table (not the file) and compares the hashes.
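The difference between hashing content and hashing a file can be sketched as follows. This example assumes a table is a list of rows and serialises it to a canonical CSV string before hashing. The helper `table_checksum`, the sample rows, and the serialisation choice are all assumptions made for illustration; they do not describe how whyqd itself hashes table content.

```python
import csv
import hashlib
import io


def table_checksum(rows):
    """Hash table content (rows of cell values), not the file that stores it.

    Hypothetical helper for illustration only; not part of the whyqd API.
    """
    buffer = io.StringIO()
    # A fixed line terminator keeps the serialisation identical across platforms.
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow(["" if cell is None else str(cell) for cell in row])
    return hashlib.blake2b(buffer.getvalue().encode("utf-8")).hexdigest()


# Made-up sample data: the table shared by the author and the table rebuilt
# from the method's actions.
provided = [["country", "year", "value"], ["ZA", 2019, 10.5]]
rebuilt = [["country", "year", "value"], ["ZA", 2019, 10.5]]
print(table_checksum(provided) == table_checksum(rebuilt))
```

Hashing the cell values rather than the saved file means that differences in file metadata or container format do not affect the comparison; only a difference in the data does.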

If the provided restructured data and the newly generated data based on the method actions both have the same checksum, then, whether or not you agree with the analysis or methods, we can state that these input data, with these restructuring actions applied, do produce these restructured data.