Method

Whyqd supports trust in research by ensuring complete and unambiguous probity in the curation of all source data.

Data probity refers to the following criteria:

  • Identifiable input source data,
  • Transparent methods for restructuring of that source data into the data used to support research analysis,
  • Accessible restructured data used to support research conclusions,
  • A repeatable, auditable curation process which produces the same data.

Researchers may disagree on conclusions derived from analytical results. What they should not have cause for disagreement on is the probity of the underlying data used to produce those analytical results.

Once you have created your Schema, it can be imported and used to develop a wrangling method: a complete, structured JSON file which describes all aspects of the wrangling process. There is no ‘magic’. Only what is defined in the method will be executed during transformation.

A method file can be shared, along with your input data, and anyone can then import whyqd and validate your method to verify that your output data are the product of these inputs.

There are two worked tutorials to demonstrate how you can use whyqd to support source data curation transparency.

The first step in the process is simply to declare a method, and assign a schema which it must conform to:

>>> import whyqd
>>> SCHEMA = whyqd.Schema(source=SCHEMA_SOURCE)
>>> method = whyqd.Method(directory=DIRECTORY, schema=SCHEMA)
>>> method_details = {
    "name": "urban_population_method",
    "title": "Urban population method",
    "description": "Methods converting World Bank Urban Population source data into our analytical input data.",
}
>>> method.set(method_details)

Where:

  • DIRECTORY is your working directory to create your method and manage data files,
  • SCHEMA_SOURCE is the path to the schema.json file.

Greater detail can be found in the complete API reference. What follows are the high-level steps required to develop your method.

Import source data

Assuming you have a list of input source data you wish to restructure:

>>> INPUT_DATA = [
    SOURCE_DIRECTORY + "raw_E06000044_014_0.XLSX",
    SOURCE_DIRECTORY + "raw_E06000044_014_1.XLSX",
    SOURCE_DIRECTORY + "raw_E06000044_014_2.XLSX",
]

Data must conform to the DataSourceModel, which requires a minimum of a path field declaration:

>>> input_data = [{"path": d} for d in INPUT_DATA]
>>> method.add_data(source=input_data)

These data will be imported into your working DIRECTORY and a unique checksum assigned to each data source. The checksum is a hash based on BLAKE2b. These input data are never changed during the restructuring process, and the hash is based on the entire file. If anyone opens these files and resaves them - even if they make no further changes - the metadata and file structure will change, and a later hash generated on the changed file will differ from the original.
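To see why this matters, you can compute the same kind of digest yourself. This is an illustrative sketch using Python's standard hashlib, not whyqd's internal code:

>>> import hashlib
>>> from pathlib import Path
>>> hashlib.blake2b(Path(INPUT_DATA[0]).read_bytes()).hexdigest()  # 128-character hex digest over every byte of the file

Resaving the source file, even without edits, alters its bytes and therefore this digest.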

You now have two options:

  • Merge: since you have multiple data sources, you can merge these into one so that you only need to develop one set of restructuring actions,
  • Add actions: or you can add individual actions, and then merge.

Warning

Whyqd ensures unambiguous data curation. There are no “stranded” assets in a method. If you import an Excel spreadsheet which happens to have multiple sheets, each of these sheets is added as a separate input_data reference. It is up to you to delete any input data you have no intention of using.

Merge

merge will join, in order from right to left, your input data on a key column. Merging will generate a working_data reference.

Note

If you only have one input source file, you don’t need to merge. However, if you have multiple sources, then a merge is mandatory or your build will fail.

During import, each data source was assigned a unique reference UUID (https://en.wikipedia.org/wiki/Universally_unique_identifier). If a single Excel source file had multiple sheets, each is uniquely identified by a combination of the UUID and its sheet_name.
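You can inspect what was imported, and collect these references, with a quick loop (a sketch, assuming each DataSourceModel exposes its sheet_name alongside its uuid):

>>> for data in method.get.input_data:
        print(data.uuid.hex, data.sheet_name)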

To merge, you create a special MERGE action:

"MERGE < ['key_column'::'source_hex'::'sheet_name', etc.]"

Where, for each input data source:

  • key_column is the data column used to uniquely link each row of each data source,
  • source_hex is unique reference UUID for each data source,
  • sheet_name is the sheet name within a multi-sheet Excel source file, if one exists (otherwise None).

You should know your own source data, and you can get the references as follows:

>>> merge_reference = [
        {"source_hex": method.get.input_data[2].uuid.hex, "key_column": "Property ref no"},
        {"source_hex": method.get.input_data[1].uuid.hex, "key_column": "Property Reference Number"},
        {"source_hex": method.get.input_data[0].uuid.hex, "key_column": "Property Reference Number"},
    ]

In this example, there is no sheet_name. Generate and add your merge script as follows:

>>> merge_terms = ", ".join([f"'{m['key_column']}'::'{m['source_hex']}'" for m in merge_reference])
>>> merge_script = f"MERGE < [{merge_terms}]"
>>> method.merge(merge_script)
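If any of your sources had been a multi-sheet Excel file, each term would also carry its sheet name, as in the MERGE template above (a sketch; here each merge_reference dictionary would also include a "sheet_name" key):

>>> merge_terms = ", ".join([f"'{m['key_column']}'::'{m['source_hex']}'::'{m['sheet_name']}'" for m in merge_reference])
>>> merge_script = f"MERGE < [{merge_terms}]"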

whyqd will automatically process the merge, validate the merge works, and assign a UUID to the working_data.

Whether you needed to merge or not, you’re now ready to assign restructuring actions.

Restructure with Actions

Actions are the core of the wrangling process. This is the stage where you define the individual steps which must be performed to restructure your data.

There are two main types of restructuring actions:

  • Schema-based where the result of the action is to restructure source data columns into schema columns,
  • Morph-based where you physically, and destructively, manipulate your source data to make it usable.

Morphs include actions like deletion, or rebasing the header row. These will lose information from your source. The two types of action have slightly different structures, but the process by which you add them to your method is the same.

Schema-based actions

A template schema-based script:

"ACTION > 'destination_column' < [modifier 'source_column', {action_script}]"

Where:

  • destination_column is the schema field, or existing column, to which you wish to direct the results of your action,
  • source_column is the existing column you wish to restructure,
  • modifier terms are special characters, defined by the ACTION, which modify the way the values in the source_column are interpreted,
  • {action_script} is a nested action.

Schema-based actions can be nested. You can embed other schema-based actions inside them. In the case of CALCULATE, this may be necessary if you need to change the sign of a column of negative values:

"CALCULATE > 'destination_field' < [modifier 'source_column', modifier CALCULATE < [- 'source_column']"

Morph-based actions

A template morph-based script:

"ACTION > [columns] < [rows]"

Where:

  • columns are the specific columns affected by the morph, a list of source_column,
  • rows are the specific rows affected by the morph, a list of int.

Morph-based actions are not permitted to be nested, i.e. they are stand-alone actions.
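For example, both forms appear in the worked steps below. REBASE affects only rows (rebasing the header to row 2), while PIVOT_LONGER affects only columns (the list of year columns is truncated here):

"REBASE < [2]"
"PIVOT_LONGER > ['1960.0', '1961.0']"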

Note

It is assumed that you’re not working ‘blind’, that you’re actually looking at your data while assigning actions - especially row-level actions - otherwise you are going to get extremely erratic results. Whyqd is built on Pandas and these examples lean heavily on that package.

Review each action’s documentation in the API reference to learn how to work with them.

Assigning actions

Once you have reviewed your data, and worked out the scripts you need to produce, they can be assigned:

>>> schema_scripts = [
        "DEBLANK",
        "DEDUPE",
        "REBASE < [2]",
    ]
>>> source_data = method.get.input_data[0]
>>> method.add_actions(schema_scripts, source_data.uuid.hex, sheet_name="Data")

Note, though, that “REBASE < [2]” changed the header-row column labels. Any further actions need to reference these names or the action will fail. You can review your progress at any point by running the transform and getting a Pandas DataFrame:

>>> df = method.transform(source_data)
  Country Name Country Code Indicator Name Indicator Code 1960.0 1961.0 1962.0 1963.0 1964.0 1965.0 1966.0 1967.0 1968.0 1969.0 1970.0 1971.0 1972.0 1973.0 1974.0 1975.0 1976.0 1977.0 1978.0 1979.0 1980.0 1981.0 1982.0 1983.0 1984.0 1985.0 1986.0 1987.0 1988.0 1989.0 1990.0 1991.0 1992.0 1993.0 1994.0 1995.0 1996.0 1997.0 1998.0 1999.0 2000.0 2001.0 2002.0 2003.0 2004.0 2005.0 2006.0 2007.0 2008.0 2009.0 2010.0 2011.0 2012.0 2013.0 2014.0 2015.0 2016.0 2017.0 2018.0 2019.0
3 Aruba ABW Urban population SP.URB.TOTL 27526 28141 28532 28761 28924 29082 29253 29416 29575 29738 29900 30082 30275 30470 30605 30661 30615 30495 30353 30282 30332 30560 30943 31365 31676 31762 31560 31142 30753 30720 31273 32507 34116 35953 37719 39172 40232 40970 41488 41945 42444 43048 43670 44246 44669 44889 44882 44686 44378 44053 43778 43819 44057 44348 44665 44979 45296 45616 45948 nan
4 Afghanistan AFG Urban population SP.URB.TOTL 755836 796272 839385 885228 934135 986074 1.04119e+06 1.09927e+06 1.16136e+06 1.22827e+06 1.30095e+06 1.37946e+06 1.46329e+06 1.55104e+06 1.64087e+06 1.73093e+06 1.82161e+06 1.91208e+06 1.99758e+06 2.07094e+06 2.13637e+06 2.18149e+06 2.20897e+06 2.22507e+06 2.24132e+06 2.2679e+06 2.30581e+06 2.35734e+06 2.43955e+06 2.50291e+06 2.62855e+06 2.82817e+06 3.09339e+06 3.39171e+06 3.67709e+06 3.91625e+06 4.09384e+06 4.22082e+06 4.32158e+06 4.43476e+06 4.5878e+06 4.79005e+06 5.03116e+06 5.29338e+06 5.5635e+06 5.82429e+06 6.05502e+06 6.26375e+06 6.46484e+06 6.68073e+06 6.92776e+06 7.21252e+06 7.52859e+06 7.86507e+06 8.20488e+06 8.53561e+06 8.85286e+06 9.16484e+06 9.4771e+06 nan
5 Angola AGO Urban population SP.URB.TOTL 569222 597288 628381 660180 691532 721552 749534 776116 804107 837758 881022 944294 1.0282e+06 1.12462e+06 1.23071e+06 1.34355e+06 1.4626e+06 1.58871e+06 1.72346e+06 1.86883e+06 2.02677e+06 2.19787e+06 2.38256e+06 2.58126e+06 2.79453e+06 3.02227e+06 3.26559e+06 3.5251e+06 3.8011e+06 4.09291e+06 4.40096e+06 4.72563e+06 5.06788e+06 5.42758e+06 5.80661e+06 6.15946e+06 6.53015e+06 6.919e+06 7.32807e+06 7.75842e+06 8.212e+06 8.68876e+06 9.19086e+06 9.72127e+06 1.02845e+07 1.08828e+07 1.14379e+07 1.20256e+07 1.26446e+07 1.32911e+07 1.39631e+07 1.46603e+07 1.53831e+07 1.61303e+07 1.69008e+07 1.76915e+07 1.85022e+07 1.93329e+07 2.01847e+07 nan
6 Albania ALB Urban population SP.URB.TOTL 493982 513592 530766 547928 565248 582374 599300 616687 635924 656733 677801 698647 720649 742333 764166 786668 809052 832109 854618 876974 902120 927513 954645 982645 1.01124e+06 1.04013e+06 1.0685e+06 1.09835e+06 1.12772e+06 1.16716e+06 1.19722e+06 1.19891e+06 1.20949e+06 1.21988e+06 1.23022e+06 1.2404e+06 1.25052e+06 1.26041e+06 1.27021e+06 1.27985e+06 1.28939e+06 1.29858e+06 1.32722e+06 1.35485e+06 1.38183e+06 1.4073e+06 1.43089e+06 1.4524e+06 1.47339e+06 1.49526e+06 1.51952e+06 1.54693e+06 1.57579e+06 1.6035e+06 1.63012e+06 1.6545e+06 1.68025e+06 1.70634e+06 1.72897e+06 nan

Wide format is not exactly helpful, so we’ll modify this:

>>> source_data = method.get.input_data[0]
>>> source_columns = [c.name for c in source_data.columns]
>>> schema_script = f"PIVOT_LONGER > {source_columns[4:]}"
>>> method.add_actions(schema_script, source_data.uuid.hex, sheet_name="Data")
>>> df = method.transform(source_data)
  Country Code Indicator Name Indicator Code Country Name PIVOT_LONGER_names_idx_4 PIVOT_LONGER_values_idx_5
0 ABW Urban population SP.URB.TOTL Aruba 1960 27526
1 AFG Urban population SP.URB.TOTL Afghanistan 1960 755836
2 AGO Urban population SP.URB.TOTL Angola 1960 569222
3 ALB Urban population SP.URB.TOTL Albania 1960 493982

It may seem daunting at first, but the actions are designed to give you all the power of pandas while allowing you to focus on the complexities of restructuring your data.

Assigning categories

One of the problems with having a schema is that not everyone will agree to use it. Your source data can have a variety of terms - with a variety of different spellings - to refer to the same things. This can make an already-intimidating restructuring process a hair-tearing experience.

It takes two separate sets of actions to categorise and restructure categories.

  • First, identify the columns which contain values you wish to categorise, and specify how to treat those values,
  • Second, assign the values in each column to the schema-defined categories.

The template script to extract unique terms in columns is:

"CATEGORISE > 'destination_field' < [modifier 'source_column', modifier 'source_column', etc.]"

Where there are two modifier terms:

  • - indicates that the presence or absence of values in the column is coerced to boolean, and
  • + indicates that specific values in the column are to be assigned to a defined schema category.

Once complete, you can get a list of the unique terms and assign them using one of the two assignment actions:

"ASSIGN_CATEGORY_BOOLEANS > 'destination_field'::bool < 'source_column'"

or:

"ASSIGN_CATEGORY_UNIQUES > 'destination_field'::'destination_category' < 'source_column'::['unique_source_term', 'unique_source_term', etc.]"

Where assignment terms include:

  • destination_field is a FieldModel and is the destination column. The ::-linked CategoryModel defines the term to which the source values are to be assigned.
  • The list of unique source terms - CategoryModel values drawn from the source ColumnModel - will be assigned to the ::CategoryModel.
  • Values from the source_column ColumnModel are treated as boolean True or False, defined by ::bool.

The way to think about assigning a boolean column is that these are columns with values and nulls. The presence or absence of values can be all you need to know, not the values themselves. For example, if you want to know who - in a list of employees - has taken leave but all you have is a date column of when they took leave, an absence of a date in that column indicates they haven’t yet taken any.
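Continuing that employee-leave example, a sketch using the two templates above (the field and column names are hypothetical):

"CATEGORISE > 'leave_taken' < [- 'Leave start date']"
"ASSIGN_CATEGORY_BOOLEANS > 'leave_taken'::bool < 'Leave start date'"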

Getting a list of the unique terms in a column so you can assign them goes like this:

>>> df = method.transform(method.get.working_data)
>>> list(df["Current Relief Type"].unique())
[<NA>,
'Retail Discount',
'Small Business Relief England',
'Supporting Small Business Relief',
'Sbre Extension For 12 Months',
'Empty Property Rate Industrial',
'Empty Property Rate Non-Industrial',
'Mandatory',
'Sports Club (Registered CASC)',
'Empty Property Rate Charitable']

Note

You can assign unique values to a boolean category term (e.g. ‘Empty Property Rate Industrial’ and ‘Empty Property Rate Non-Industrial’ could be assigned to an occupation_status schema field, where True is occupied and False is vacant). You can also assign boolean values to a schema field requiring unique categories. Use your intuition and it’ll probably work the way you expect.

Here’s an example script to assign column values to a boolean schema field:

"CATEGORISE > 'occupation_state' < [+ 'Current Property Exemption Code', + 'Current Relief Type']"
"ASSIGN_CATEGORY_UNIQUES > 'occupation_state'::False < 'Current Relief Type'::['Empty Property Rate Non-Industrial', 'Empty Property Rate Industrial', 'Empty Property Rate Charitable']"

Collect these two scripts into a schema_scripts list and add them, as before:

>>> source_data = method.get.working_data
>>> method.add_actions(schema_scripts, source_data.uuid.hex)

Assigning filters

Filtering is inherently destructive, reducing the number of rows in your source data. This can make your data more manageable, or help ensure that only the latest data since a previous release are included in an ongoing data series.

A standard script is:

"ACTION > 'filter_column'::'date' < 'source_column'"

Where:

  • filter_column: the specific column for filtering,
  • source_column: a group-by column used when filtering for only the latest data in a data series,
  • date: a specific date reference, in ISO YYYY-MM-DD format. Times are not filtered, so treat with caution if your filter requirements are time-based.

There are three filters: before a specified date, after a specified date, or latest for a specified group.

As an example:

>>> filter_script = "FILTER_AFTER > 'occupation_state_date'::'2010-01-01'"
>>> method.add_actions(filter_script, source_data.uuid.hex)

Or you could do latest:

>>> filter_script = "FILTER_LATEST > 'occupation_state_date' < 'ba_ref'"
>>> method.add_actions(filter_script, source_data.uuid.hex)
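The before-date case follows the same pattern (assuming, by analogy with FILTER_AFTER, an action named FILTER_BEFORE):

>>> filter_script = "FILTER_BEFORE > 'occupation_state_date'::'2010-01-01'"
>>> method.add_actions(filter_script, source_data.uuid.hex)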

Build

Performing the build is straightforward:

>>> method.build()
>>> method.save(created_by="Gavin Chait")

Build will automatically save your restructured output data as an Excel file (to preserve source field types). You can save your method as a JSON file by calling save. A version update will automatically be added to the method, and you can add an optional created_by reference as well.

Validation

At each step of the transformation process - whether it be adding input data, merging or adding actions - whyqd performs validation, both for individual steps, and for the entire build process. This validation step exists only for the truly paranoid (as you rightly should be with input data you do not control).

>>> method.validate()
True

Citation

whyqd is designed to support a research process and ensure citation of the incredible work done by research-based data scientists.

A citation is a special set of fields, with options for:

  • author: The name(s) of the author(s) (in the case of more than one author, separated by and),
  • title: The title of the work,
  • url: The URL field is used to store the URL of a web page or FTP download. It is a non-standard BibTeX field,
  • publisher: The publisher’s name,
  • institution: The institution that was involved in the publishing, but not necessarily the publisher,
  • doi: The doi field is used to store the digital object identifier (DOI) of a journal article, conference paper, book chapter or book. It is a non-standard BibTeX field. It’s recommended to simply use the DOI, and not a DOI link,
  • month: The month of publication (or, if unpublished, the month of creation). Use three-letter abbreviation,
  • year: The year of publication (or, if unpublished, the year of creation),
  • note: Miscellaneous extra information.

Those of you familiar with Dataverse’s universal numerical fingerprint may be wondering where it is. whyqd similarly produces a unique hash for each data source, including inputs, working data, and outputs. Ours is based on BLAKE2b and is included in the citation output.

As an example:

>>> citation = {
        "author": "Gavin Chait",
        "month": "feb",
        "year": 2020,
        "title": "Portsmouth City Council normalised database of commercial ratepayers",
        "url": "https://github.com/whythawk/whyqd/tree/master/tests/data"
    }
>>> method.set_citation(citation)

You can then get your citation report:

>>> method.get_citation()
    {'author': 'Gavin Chait',
    'title': 'Portsmouth City Council normalised database of commercial ratepayers',
    'url': AnyUrl('https://github.com/whythawk/whyqd/tree/master/tests/data', scheme='https', host='github.com', tld='com', host_type='domain', path='/whythawk/whyqd/tree/master/tests/data'),
    'month': 'feb',
    'year': 2020,
    'input_sources': [{'path': 'https://www.portsmouth.gov.uk/ext/documents-external/biz-ndr-properties-january-2020.xls',
    'checksum': 'b180bd9fe8c3b1025f433e0b3377fb9a738523b9c33eac5d62ed83c51883e1f64a3895edf0fc9e96a85a4130df3392177dff262963338971114aa4f5d1b0a70e'},
    {'path': 'https://www.portsmouth.gov.uk/ext/documents-external/biz-ndr-reliefs-january-2020.xls',
    'checksum': '98e23e4eac6782873492181d6e4f3fcf308f1bb0fc47dc582c3fdf031c020a651d9f06f6510b21405c3f63b8d576a93a27bd2f3cc5b053d8d9022c884b57d3a3'},
    {'path': 'https://www.portsmouth.gov.uk/ext/documents-external/biz-empty-commercial-properties-january-2020.xls',
    'checksum': '9fd3d0df6cc1e0e58ab481ca9d46b68150b3b8d0c97148a00417af16025ba066e29a35994d0e4526edb1deda4c10b703df8f0dbcc23421dd6c0c0fd1a4c6b01c'}],
    'restructured_data': {'path': 'https://github.com/whythawk/whyqd/tree/master/tests/data',
    'checksum': '25591827b9b5ad69780dc1eea6121b4ec79f10b62f21268368c7faa5ca473ef3f613c72fea723875669d0fe8aa57c9e7890f1df5b13f922fc525c82c1239eb42'}}