Method

Once you have created your Schema it can be imported and used to develop a wrangling method, a complete, structured JSON file which describes all aspects of the wrangling process. There is no ‘magic’. Only what is defined in the method will be executed during transformation.

A method file can be shared, along with your input data, and anyone can import whyqd and validate your method to verify that your output data is the product of these inputs:

import whyqd as _w
method = _w.Method(source, directory=DIRECTORY, input_data=INPUT_DATA)

source is the full path to the schema you wish to use, and DIRECTORY will be your working directory for saved data, imports, working data and output.

INPUT_DATA is a list of filenames or file sources. This is optional at this stage, and you can add and edit your sources later. File sources can include URI’s.

Help

To get help, type:

method.help()
# or
method.help(option)

Where option can be any of:

"status"
"merge"
"structure"
"category"
"filter"

status will return the current method status, and your mostly likely next steps. The other options will return methodology, and output of that option’s result (if appropriate).

These are the steps to create a complete method:

Merge

merge will join, in order from right to left, your input data on a common column. You can modify your input data at any time. Note, however, that this will reset your status and require revalidation of all subsequent steps.

To add input data, where input_data is a filename / source, or list of filenames / sources:

method.add_input_data(input_data)

To remove input data, where id is the unique id for that input data:

method.remove_input_data(id)

To display a nicely-formatted output for review:

# Permits horizontal scroll-bar in Jupyter Notebook
from IPython.core.display import HTML
display(HTML("<style>pre { white-space: pre !important; }</style>"))

print(method.print_input_data())

Data id: c8944fed-4e8c-4cbd-807d-53fcc96b7018

====  ====================  ========================  =================================  =============================  =======================================================  =============================  ===========================
..  Account Start date      Current Rateable Value  Current Relief Award Start Date    Current Relief Type            Full Property Address                                    Primary Liable party name        Property Reference Number
====  ====================  ========================  =================================  =============================  =======================================================  =============================  ===========================
0  2003-05-14 00:00:00                       8600  2019-04-01 00:00:00                Retail Discount                Ground Floor, 25, Albert Road, Southsea, Hants, PO5 2SE  Personal details not supplied                 177500080710
1  2003-07-28 00:00:00                       9900  2005-04-01 00:00:00                Small Business Relief England  Ground Floor, 102, London Road, Portsmouth, PO2 0LZ      Personal details not supplied                 177504942310
2  2003-07-08 00:00:00                       6400  2005-04-01 00:00:00                Small Business Relief England  33, Festing Road, Southsea, Hants, PO4 0NG               Personal details not supplied                 177502823510
====  ====================  ========================  =================================  =============================  =======================================================  =============================  ===========================

Data id: a9ad7716-f777-4752-8627-dd6206bede65

====  ===================================  =================================  ========================  ================================================================  =======================================================  ===========================
..  Current Prop Exemption Start Date    Current Property Exemption Code      Current Rateable Value  Full Property Address                                             Primary Liable party name                                  Property Reference Number
====  ===================================  =================================  ========================  ================================================================  =======================================================  ===========================
0  2019-11-08 00:00:00                  LOW RV                                                  700  Advertising Right, 29 Albert Road, Portsmouth, PO5 2SE            Personal details not supplied                                           177512281010
1  2019-09-23 00:00:00                  INDUSTRIAL                                            11000  24, Ordnance Court, Ackworth Road, Portsmouth, PO3 5RZ            Personal details not supplied                                           177590107810
2  2019-09-13 00:00:00                  EPRI                                                  26500  Unit 12, Admiral Park, Airport Service Road, Portsmouth, PO3 5RQ  Legal & General Property Partners (Industrial Fund) Ltd                 177500058410
====  ===================================  =================================  ========================  ================================================================  =======================================================  ===========================

Data id: 1e5a165d-5e83-4eec-9781-d450a1d3f5f1

====  ====================  ========================  =========================================================================  ==========================================  =================
..  Account Start date      Current Rateable Value  Full Property Address                                                      Primary Liable party name                     Property ref no
====  ====================  ========================  =========================================================================  ==========================================  =================
0  2003-11-10 00:00:00                      37000  Unit 7b, The Pompey Centre, Dickinson Road, Southsea, Hants, PO4 8SH       City Electrical Factors  Ltd                     177200066910
1  2003-11-08 00:00:00                     594000  Express By Holiday Inn, The Plaza, Gunwharf Quays, Portsmouth, PO1 3FD     Kew Green Hotels (Portsmouth Lrg1) Limited       177209823010
2  1994-12-25 00:00:00                      13250  Unit 2cd, Shawcross Industrial Estate, Ackworth Road, Portsmouth, PO3 5JP  Personal details not supplied                    177500013310
====  ====================  ========================  =========================================================================  ==========================================  =================

Once you’re satisfied with your input_data, prepare an order_and_key list to define the merge order, and a unique key for merging. Each input data file needs to be defined in a list as a dict:

{id: input_data id, key: column_name for merge}

Run the merge by calling (and, optionally - if you need to overwrite an existing merge - setting overwrite_working=True):

method.merge(order_and_key, overwrite_working=True)

To view your existing input_data as a JSON output (or the print_input_data as above):

method.input_data

Structure

structure is the core of the wrangling process and is the step where you define the actions which must be performed to restructure your working data.

Create a list of methods of the form:

{
        "schema_field1": ["action", "column_name1", ["action", "column_name2"]],
        "schema_field2": ["action", "column_name1", "modifier", ["action", "column_name2"]],
}

The format for defining a structure is as follows, and - yes - this does permit you to create nested wrangling tasks:

[action, column_name, [action, column_name]]

e.g.:

["CATEGORISE", "+", ["ORDER", "column_1", "column_2"]]

This permits the creation of quite expressive wrangling structures from simple building blocks.

Every task structure must start with an action to describe what to do with the following terms. There are several “actions” which can be performed, and some require action modifiers:

  • NEW: Add in a new column, and populate it according to the value in the “new” constraint

  • RENAME: If only 1 item in list of source fields, then rename that field

  • ORDER: If > 1 item in list of source fields, pick the value from the column, replacing each value with one from the next in the order of the provided fields

  • ORDER_NEW: As in ORDER, but replacing each value with one associated with a newer “dateorder” constraint

    • MODIFIER: + between terms for source and source_date
  • ORDER_OLD: As in ORDER, but replacing each value with one associated with an older “dateorder” constraint

    • MODIFIER: + between terms for source and source_date
  • CALCULATE: Only if of “type” = “float64” (or which can be forced to float64)

    • MODIFIER: + or - before each term to define whether add or subtract
  • JOIN: Only if of “type” = “object”, join text with ” “.join()

  • CATEGORISE: Only if of “type” = “string”; look for associated constraint, “categorise” where True = keep a list of categories, False = set True if terms found in list

    • MODIFIER:

      • + before terms where column values to be classified as unique
      • - before terms where column values are treated as boolean

Category

Provide a list of categories of the form:

{
        "schema_field1": {
                "category_1": ["term1", "term2", "term3"],
                "category_2": ["term4", "term5", "term6"]
        }
}

The format for defining a category term as follows:

term_name::column_name

Get a list of available terms, and the categories for assignment, by calling:

method.category(field_name)

Once your data are prepared as above:

method.set_category(**category)

Filter

Set date filters on any date-type fields. whyqd offers only rudimentary post- wrangling functionality. Filters are there to, for example, facilitate importing data outside the bounds of a previous import.

This is also an optional step. By default, if no filters are present, the transformed output will include ALL data. Parameters for filtering:

  • field_name: Name of field on which filters to be set
  • filter_name: Name of filter type from the list of valid filter names
  • filter_date: A date in the format specified by the field type
  • foreign_field: Name of field to which filter will be applied. Defaults to field_name

There are four filter_names:

  • ALL: default, import all data
  • LATEST: only the latest date
  • BEFORE: before a specified date
  • AFTER: after a specified date

BEFORE and AFTER take an optional foreign_field term for filtering on that column. e.g:

method.set_filter("occupation_state_date", "AFTER", "2019-09-01", "ba_ref")

Filters references in column ba_ref by dates in column occupation_state_date after 2019-09-01.

Validation

Each step can be validated and, once all steps validate, you can move to transformation of your data:

method.validate_input_data
method.validate_merge_data
method.validate_merge
method.validate_structure
method.validate_category
method.validate_filter

Or, to run all the above and complete the method (setting status to ‘Ready to Transform’):

method.validate

Transform

Transformation requires only the following:

method.transform()
method.save(DIRECTORY, filename=FILENAME, overwrite=True)

With one little permutation … if you’ve ever created a transform before, you’ll need to deliberately tell the function to overwrite your original:

method.transform(overwrite_output=True)

Citation

whyqd is designed for sharing. Add information you wish to be cited to a constructor field in the method.

The constructor field is there to store any metadata you wish to add. Whether it be Dublin Core or SDMX, add that metadata by creating a dictionary and placing it in the constructor.

A citation is a special set of fields, with options for:

  • authors: a list of author names in the format, and order, you wish to reference them
  • date: publication date
  • title: a text field for the full study title
  • repository: the organisation, or distributor, responsible for hosting your data (and your method file)
  • doi: the persistent DOI for your repository

Those of you familiar with Dataverse’s universal numerical fingerprint may be wondering where it is? whyqd, similarly, produces a unique hash for each datasource, including inputs, working data, and outputs. Ours is based on BLAKE2b and is sufficiently universally available as to ensure you can run this as required.

As an example:

citation = {
        "authors": ["Gavin Chait"],
        "date": "2020-02-18",
        "title": "Portsmouth City Council normalised database of commercial ratepayers",
        "repository": "Github.com"
}
method.set_constructors({"citation": citation})
method.save(DIRECTORY, filename=FILENAME, overwrite=True)

You can then get your citation report:

method.citation

Gavin Chait, 2020-02-18, Portsmouth City Council normalised database of commercial ratepayers,
Github.com, 1367d4f02c99030f6645389141b85a93d54c226b435fb1b5a6cbccd7f703687e442a011f62c1381793a2d3fbf13cc52c176e0c5c573008991134658759eef948,
[input sources:
https://www.portsmouth.gov.uk/ext/documents-external/biz-ndr-properties-january-2020.xls,
476089d8f37581613344873068d6e94f8cd63a1a64b421edf374a2b341bc7563aff03b86db4d3fec8ca90ce150ba1e531e3ff0d374f932d13fc103fd709e01bd;
https://www.portsmouth.gov.uk/ext/documents-external/biz-ndr-reliefs-january-2020.xls,
892ec5b6e9b1f68e0b371bbaed8d93095d57f2b656753af2b279aee17b5854c5e9d731b2795aac285d7f7d9f5991311bc8fae0cfe5446a47163f30f0314cac06;
https://www.portsmouth.gov.uk/ext/documents-external/biz-empty-commercial-properties-january-2020.xls,
a41b4eb629c249fd59e6816d10d113bf2b9594c7dd7f9a61a82333a8a41bf07e59f9104eb3c1dc4269607de5a4a12eaf3215d0afc7545fdb1dfe7fe1bf5e0d29]