Method

The Method class defines a wrangling process to restructure input data into a format defined by a Schema.

class method.Method(source=None, **kwargs)

Create and manage a method to perform a wrangling process.

Parameters:
  • source – Path to a JSON file containing a saved schema. Default is None.
  • directory – Working path for creating methods, interim data files and final output.
  • kwargs – A schema defined as a dictionary, or a blank dictionary by default.
merge(order_and_key=None, overwrite_working=False)

Merge input data on a key column.

Parameters:
  • order_and_key (list) – List of dictionaries specifying input_data order and key for merge. Can also use order_and_key_input_data directly. Each dict in the list has {id: input_data id, key: column_name for merge}
  • overwrite_working (bool) – Permission to overwrite existing working data

TO_DO

While merge validates column uniqueness prior to merge, if the column is not unique there is nothing the user can do about it (without manually going back and fixing the input data). Some sort of uniqueness fix is required (probably using the filters).
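
As a rough illustration of the order_and_key format and of the uniqueness requirement on the merge key, the following sketch works on plain Python rows rather than on whyqd itself. The helper `find_duplicate_keys`, the ids and the column names are all hypothetical.

```python
from collections import Counter

# Hypothetical input_data ids and key columns, in the documented
# {id, key} shape accepted by merge(order_and_key=...).
order_and_key = [
    {"id": "input-data-id-1", "key": "country_code"},
    {"id": "input-data-id-2", "key": "iso_code"},
]

def find_duplicate_keys(rows, key):
    """Return the sorted key values that occur more than once in `rows`."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

rows = [
    {"country_code": "ZA", "value": 1},
    {"country_code": "KE", "value": 2},
    {"country_code": "ZA", "value": 3},
]
duplicates = find_duplicate_keys(rows, order_and_key[0]["key"])
print(duplicates)  # ['ZA']
```

Running a check of this kind before calling merge shows which key values would need fixing in the input data.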
structure(name)

Return a ‘markdown’ version of the formal structure for a specific name field.

Returns:list of (nested) strings in structure format
Return type:list
set_structure(**kwargs)

Receive a list of methods of the form:

{
        "schema_field1": ["action", "column_name1", ["action", "column_name2"]],
        "schema_field2": ["action", "column_name1", "modifier", ["action", "column_name2"]],
}

The format for defining a structure is as follows:

[action, column_name, [action, column_name]]

e.g.:

["CATEGORISE", "+", ["ORDER", "column_1", "column_2"]]

This permits the creation of quite expressive wrangling structures from simple building blocks.

Every task structure must start with an action to describe what to do with the following terms. There are several “actions” which can be performed, and some require action modifiers:

  • NEW: Add in a new column, and populate it according to the value in the “new” constraint

  • RENAME: If only 1 item in list of source fields, then rename that field

  • ORDER: If > 1 item in list of source fields, pick the value from the column, replacing each value with one from the next in the order of the provided fields

  • ORDER_NEW: As in ORDER, but replacing each value with one associated with a newer “dateorder” constraint

    • MODIFIER: + between terms for source and source_date
  • ORDER_OLD: As in ORDER, but replacing each value with one associated with an older “dateorder” constraint

    • MODIFIER: + between terms for source and source_date
  • CALCULATE: Only for fields of “type” = “float64” (or which can be forced to float64)

    • MODIFIER: + or - before each term to define whether to add or subtract
  • JOIN: Only for fields of “type” = “object”; joins text with " ".join()

  • CATEGORISE: Only for fields of “type” = “string”; looks for the associated constraint, “categorise”, where True = keep a list of categories, False = set True if terms found in list

    • MODIFIER:

      • + before terms where column values to be classified as unique
      • - before terms where column values are treated as boolean
Parameters:kwargs (dict) – Where key is schema target field and value is list defining the structure action
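
To make the modifier convention concrete, here is a minimal sketch of how a flat CALCULATE structure could be evaluated for a single row. It is not whyqd's implementation: the function `calculate_row` and the column names are hypothetical, and nested structures are not handled.

```python
def calculate_row(structure, row):
    """Evaluate ["CALCULATE", "+", col, "-", col, ...] against one row dict."""
    action, *terms = structure
    if action != "CALCULATE":
        raise ValueError(f"Expected CALCULATE, got {action!r}")
    if len(terms) % 2 != 0:
        raise ValueError("Each column must be preceded by a '+' or '-' modifier")
    total = 0.0
    # Terms alternate: modifier, column, modifier, column, ...
    for modifier, column in zip(terms[::2], terms[1::2]):
        if modifier not in ("+", "-"):
            raise ValueError(f"Invalid modifier {modifier!r}")
        value = float(row[column])  # force the value to float
        total += value if modifier == "+" else -value
    return total

row = {"income": "120.5", "costs": 20}
print(calculate_row(["CALCULATE", "+", "income", "-", "costs"], row))  # 100.5
```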
category(name)

Return a ‘markdown’ version of assigned and unassigned category inputs for a named field of the form:

{
        "categories": ["category_1", "category_2"],
        "assigned": {
                "category_1": ["term1", "term2", "term3"],
                "category_2": ["term4", "term5", "term6"]
        },
        "unassigned": ["term1", "term2", "term3"]
}

The format for defining a category term is as follows:

`term_name::column_name`
Returns:list of (nested) strings in structure format
Return type:list
set_category(**kwargs)

Receive a list of categories of the form:

{
        "schema_field1": {
                "category_1": ["term1", "term2", "term3"],
                "category_2": ["term4", "term5", "term6"]
        }
}

The format for defining a category term is as follows:

`term_name::column_name`
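
The `term_name::column_name` convention can be unpacked with ordinary string handling. The sketch below assumes nothing about whyqd's internals: `split_category_term`, the schema field, the categories and the column names are hypothetical.

```python
def split_category_term(term):
    """Split 'term_name::column_name' into (term_name, column_name)."""
    # rpartition splits on the last '::', so a term name containing '::' survives.
    term_name, sep, column_name = term.rpartition("::")
    if not sep or not term_name or not column_name:
        raise ValueError(f"Expected 'term_name::column_name', got {term!r}")
    return term_name, column_name

# Hypothetical schema field, categories, terms and column.
categories = {
    "sector": {
        "public": ["ministry::owner_type", "council::owner_type"],
        "private": ["company::owner_type"],
    }
}
parsed = {
    category: [split_category_term(term) for term in terms]
    for category, terms in categories["sector"].items()
}
print(parsed["private"])  # [('company', 'owner_type')]
```

Given the set_category(**kwargs) signature, a dictionary shaped like `categories` would presumably be passed with keyword unpacking.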
filter(name)

Return the filter settings for a named field. If there are no filter settings, return None.

Raises:TypeError if setting a filter on this field type is not permitted.
Returns:
Return type:dict of filter settings, or None
set_filter(field_name, filter_name, filter_date=None, foreign_field=None)

Sets the filter settings for a named field after validating all parameters.

Note

Filters can only be set on date-type fields. whyqd offers only rudimentary post-wrangling functionality. Filters are there to, for example, facilitate importing data outside the bounds of a previous import.

This is also an optional step. By default, if no filters are present, the transformed output will include ALL data.

Parameters:
  • field_name (str) – Name of field on which filters to be set
  • filter_name (str) – Name of filter type from the list of valid filter names
  • filter_date (str (optional)) – A date in the format specified by the field type
  • foreign_field (str (optional)) – Name of field to which filter will be applied. Defaults to field_name
Raises:
  • TypeError if setting a filter on this field type is not permitted.
  • ValueError for any validation failures.
transform(overwrite_output=False, filetype='csv')

Implement the method to transform input data into output data.

Parameters:
  • overwrite_output (bool) – Permission to overwrite existing output data
  • filetype (str) – Must be one of ‘xlsx’ or ‘csv’. Default is ‘csv’.
citation

Present a citation and validation report for this method. If citation data has been included in the constructor then that will be included.

A citation is a special set of fields, with options for:

  • authors: a list of author names in the format, and order, you wish to reference them
  • date: publication date (uses transformation date, if not provided)
  • title: a text field for the full study title
  • repository: the organisation, or distributor, responsible for hosting your data (and your method file)
  • doi: the persistent DOI for your repository

Format for citation is:

author/s, date, title, repository, doi, hash (for output data), [input sources: URI, hash]
Returns:Text ready for citation.
Return type:str
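
The citation format above can be pictured as a simple join over the available fields. This is only a sketch under assumptions: the function `format_citation`, the separators and the example hashes and URI are hypothetical, and whyqd's actual output may be laid out differently.

```python
def format_citation(citation, output_hash, input_sources):
    """Join citation parts in the documented order, skipping missing fields."""
    parts = [
        ", ".join(citation.get("authors", [])),
        citation.get("date", ""),
        citation.get("title", ""),
        citation.get("repository", ""),
        citation.get("doi", ""),
        output_hash,
    ]
    text = ", ".join(part for part in parts if part)
    # Each input source is a (URI, hash) pair.
    sources = "; ".join(f"{uri}, {checksum}" for uri, checksum in input_sources)
    return f"{text}, [input sources: {sources}]"

citation = {
    "authors": ["Author Name 1", "Author Name 2"],
    "date": "2020-01-01",
    "title": "Citation Title",
    "repository": "Data distributor",
    "doi": "Persistent URI",
}
print(format_citation(citation, "abc123", [("https://example.com/data.csv", "def456")]))
```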
constructors

Constructors are additional metadata to be included with the method. Ordinarily, this is a dictionary of key:value pairs defining any metadata that may be used post-wrangling and need to be maintained with the target data.

set_constructors(constructors, overwrite=False)

Define additional metadata to be included with the method.

Citation data must be specifically included as:

{
        "citation": {
                "authors": ["Author Name 1", "Author Name 2"],
                "title": "Citation Title",
                "repository": "Data distributor",
                "doi": "Persistent URI"
        }
}

Parameters:
  • constructors (dict) – A set of key:value pairs. These will not be validated, or used during transformation.
  • overwrite (boolean) – To overwrite any existing data in the constructor, set to True
Raises:

TypeError if not a dict.
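
One way to picture the overwrite flag is as a guard on key replacement. The sketch below is an assumption about the behaviour, not whyqd's code: it treats overwrite=False as "only add keys that are not yet present", and the function `merge_constructors` is hypothetical.

```python
def merge_constructors(existing, constructors, overwrite=False):
    """Return constructor metadata after applying an update."""
    if not isinstance(constructors, dict):
        raise TypeError("constructors must be a dict")
    merged = dict(existing)
    for key, value in constructors.items():
        # Assumption: existing keys are only replaced when overwrite is True.
        if overwrite or key not in merged:
            merged[key] = value
    return merged

existing = {"citation": {"title": "Old title"}}
update = {"citation": {"title": "New title"}, "licence": "CC-BY"}
print(merge_constructors(existing, update))
# {'citation': {'title': 'Old title'}, 'licence': 'CC-BY'}
print(merge_constructors(existing, update, overwrite=True))
# {'citation': {'title': 'New title'}, 'licence': 'CC-BY'}
```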

reset_data_checksums(reset_status=False, reset_output_only=False)

If input or working data are modified, then the checksums for working and output data must be deleted (i.e. they’re no longer valid and everything else must be re-run).

Parameters:
  • reset_status (bool) – Requires a deliberate choice. Default False.
  • reset_output_only (bool) – Requires a deliberate choice. Default False. Only resets output data.
input_dataframe(_id, do_morph=True)

Return dataframe of a specified input_data source. Perform the current morph method.

Parameters:
  • _id (str) – Unique id for an input data source. View all input data from input_data
  • do_morph (boolean, default True) – Perform the current morph method.
Returns:

Return type:

DataFrame

add_input_data(input_data, reset_status=False)

Provide a list of strings, each the filename of input data for wrangling.

Parameters:
  • input_data (str or list of str) – Each input data can be a filename, or a file_source (where filename is remote)
  • reset_status (bool) – Requires a deliberate choice. Default False.
Raises:

TypeError if not a list of str.

remove_input_data(_id, reset_status=False)

Remove an input data source defined by a source _id. If data have already been merged, reset data processing, or raise an error.

Parameters:
  • _id (str) – Unique id for an input data source. View all input data from input_data
  • reset_status (bool) – Requires a deliberate choice. Default False.
Raises:

TypeError if not a list of str.

reset_input_data_morph(_id, empty=False)

Wrapper around reset_morph. Reset list of morph methods to base. Automatically adds DEBLANK and DEDUPE unless empty=True.

Parameters:
  • _id (str) – Unique id for an input data source. View all input data from input_data
  • empty (boolean) – Start with an empty morph method. Default False.
add_input_data_morph(_id, new_morph=None)

Wrapper around add_morph. Append a new morph method defined by new_morph to morph_methods, ensuring that the first term is a morph, and that the subsequent terms conform to that morph’s validation requirements.

The format for defining a new_morph is as follows:

[morph, rows, columns, column_names]

e.g.:

["REBASE", [2]]
Parameters:
  • _id (str) – Unique id for an input data source. View all input data from input_data
  • new_morph (list) – Each parameter list must start with a morph, with subsequent terms conforming to the requirements for that morph.
delete_input_data_morph(_id, morph_id)

Wrapper around delete_morph. Delete morph method defined by morph_id.

Parameters:
  • _id (str) – Unique id for an input data source. View all input data from input_data
  • morph_id (str) – Unique id for morph method. View all morph methods from input_data_morphs.
reorder_input_data_morph(_id, order)

Wrapper around reorder_morph. Reorder morph methods defined by order.

Parameters:
  • _id (str) – Unique id for an input data source. View all input data from input_data
  • order (list) – List of id strings.
input_data_morphs(_id)

Wrapper around get_morph_markup. Return a markup version of a formal morph method. Useful for re-ordering methods.

Parameters:_id (str) – Unique id for an input data source. View all input data from input_data
Returns:
Return type:list of dicts
default_morph_types

Default list of morphs available to transform tabular data. Returns only a list of types. Details for individual default morphs can be returned with default_morph_settings.

Returns:
Return type:list
default_morph_settings(morph)

Get the default settings available for a specific morph type.

Parameters:morph (string) – A specific term for a morph type (as listed in default_morph_types).
Returns:
Return type:dict, or empty dict if no such morph_type
add_morph(df=DataFrame(), new_morph=None, morph_methods=None)

Append a new morph method defined by new_morph to morph_methods, ensuring that the first term is a morph, and that the subsequent terms conform to that morph’s validation requirements.

The format for defining a new_morph is as follows:

[morph, rows, columns, column_names]

e.g.:

["REBASE", [2]]
Parameters:
  • df (dataframe) – DataFrame must be explicitly provided.
  • new_morph (list) – Each parameter list must start with a morph, with subsequent terms conforming to the requirements for that morph.
  • morph_methods (list of morphs) – Existing morph methods. If None provided, creates a new list.
delete_morph_id(_id, morph_methods)

Delete morph method by id.

Parameters:
  • morph_methods (list of dicts of morphs) – Existing morph methods.
  • _id (string) – Unique id for the morph method to delete.
reset_morph(empty=False)

Reset list of morph methods to base. Automatically adds DEBLANK and DEDUPE unless empty=True.

Parameters:empty (boolean) – Start with an empty morph method. Default False.
reorder_morph(morph_methods, order)

Reorder morph methods.

Parameters:order (list) – List of id strings.
Raises:ValueError if not all ids in list, or duplicates in list.
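
The validation rule for reordering (every id present, no duplicates) can be sketched in a few lines. The function `reorder_by_id` and the "id" and "name" keys of each morph dict are hypothetical; the source does not specify the shape of a morph entry.

```python
def reorder_by_id(morph_methods, order):
    """Return morph_methods rearranged to follow the ids in `order`."""
    by_id = {morph["id"]: morph for morph in morph_methods}
    if len(set(order)) != len(order):
        raise ValueError("Duplicate ids in order")
    if set(order) != set(by_id):
        raise ValueError("order must contain every morph id exactly once")
    return [by_id[_id] for _id in order]

# Hypothetical morph entries; only the morph names come from the documentation.
morphs = [
    {"id": "m1", "name": "DEBLANK"},
    {"id": "m2", "name": "DEDUPE"},
    {"id": "m3", "name": "REBASE"},
]
print([m["name"] for m in reorder_by_id(morphs, ["m3", "m1", "m2"])])
# ['REBASE', 'DEBLANK', 'DEDUPE']
```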
get_morph_markup(morph_methods)

Return a markup version of a formal morph method. Useful for re-ordering methods.

Returns:
Return type:list of dicts
perform_merge

Helper function to perform the merge. Also used by merge_validation step.

Returns:Merged dataframe derived from input_data
Return type:DataFrame
order_and_key_input_data(*order_and_key)

Reorder a list of input_data prior to merging, and add in the merge keys.

Parameters:order_and_key (list of dicts) – Each dict in the list has {id: input_data id, key: column_name for merge}
Raises:ValueError if not all input_data are assigned an order and key.
deduplicate_columns(idx, fmt=None, ignoreFirst=True)

Returns a new column list permitting deduplication of dataframes which may result from merge. Source: https://stackoverflow.com/a/55405151

Parameters:
  • idx – df.columns (i.e. the indexed column list)
  • fmt – A string format that receives two arguments: name and a counter. By default: fmt='%s.%03d'
  • ignoreFirst – Disable/enable postfixing of first element.
Returns:

Updated column names

Return type:

list of strings
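
The renaming scheme can be illustrated without pandas. The following is a plain-list sketch of the same idea, not the function documented here: `deduplicate_names` is a hypothetical name, and it applies the documented default format '%s.%03d' to repeated column names only.

```python
from collections import Counter

def deduplicate_names(names, fmt="%s.%03d", ignore_first=True):
    """Postfix repeated names with a counter; unique names are left alone."""
    totals = Counter(names)  # how often each name occurs overall
    seen = Counter()         # how often each name has occurred so far
    result = []
    for name in names:
        index = seen[name]
        seen[name] += 1
        if totals[name] > 1 and (index != 0 or not ignore_first):
            result.append(fmt % (name, index))
        else:
            result.append(name)
    return result

print(deduplicate_names(["id", "name", "id", "id"]))
# ['id', 'name', 'id.001', 'id.002']
```

With ignore_first=False the first occurrence is postfixed as well, giving 'id.000'.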

add_working_data_morph(new_morph=None)

Wrapper around add_morph. Append a new morph method defined by new_morph to morph_methods, ensuring that the first term is a morph, and that the subsequent terms conform to that morph’s validation requirements.

The format for defining a new_morph is as follows:

[morph, rows, columns, column_names]

e.g.:

["REBASE", [2]]
Parameters:new_morph (list) – Each parameter list must start with a morph, with subsequent terms conforming to the requirements for that morph.
delete_working_data_morph(_id)

Wrapper around delete_morph. Delete morph method defined by morph_id.

Parameters:_id (str) – Unique id for morph method. View all morph methods from working_data_morphs.
reorder_working_data_morph(order)

Wrapper around reorder_morph. Reorder morph methods defined by order.

Parameters:order (list) – List of id strings.
working_data_morphs()

Wrapper around get_morph_markup. Return a markup version of a formal morph method. Useful for re-ordering methods.

Returns:
Return type:list of dicts
set_field_structure_categories(modifier, column)

If a structure action is CATEGORISE, then specify the terms available for categorisation. Each field must have a modifier, including the first (e.g. +A -B +C).

The modifier is one of:

  • -: presence/absence of column values as true/false for a specific term
  • +: unique terms in the field must be matched to the unique terms defined by the field constraints

As with set_structure, the recursive step of managing nested structures is left to the calling function.

Parameters:
  • modifier (str) – One of - or +
  • column (str) – Must be a valid column from working_column_list
set_field_structure(structure_list)

A recursive function which traverses a list defined by structure_list, ensuring that the first term is an action, and that the subsequent terms conform to that action’s requirements. Nested structures are permitted.

The format for defining a structure is as follows:

[action, column_name, [action, column_name]]

e.g.:

["CATEGORISE", "+", ["ORDER", "column_1", "column_2"]]

This permits the creation of quite expressive wrangling structures from simple building blocks.

Parameters:structure_list (list) – Each structure list must start with an action, with subsequent terms conforming to the requirements for that action. Nested actions defined by nested lists.
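
The recursive traversal can be sketched as follows. This is a simplified illustration rather than whyqd's validator: `check_structure` is a hypothetical name, the action names are taken from the list documented under set_structure, and the per-action requirements (for example, the value expected by NEW) are deliberately not checked.

```python
ACTIONS = {"NEW", "RENAME", "ORDER", "ORDER_NEW", "ORDER_OLD",
           "CALCULATE", "JOIN", "CATEGORISE"}
MODIFIERS = {"+", "-"}

def check_structure(structure, columns):
    """Recursively check a structure list; return True or raise ValueError."""
    if not isinstance(structure, list) or not structure:
        raise ValueError("A structure must be a non-empty list")
    action, *terms = structure
    if action not in ACTIONS:
        raise ValueError(f"First term must be an action, got {action!r}")
    for term in terms:
        if isinstance(term, list):
            check_structure(term, columns)  # nested structure
        elif term not in MODIFIERS and term not in columns:
            raise ValueError(f"Unknown column or modifier: {term!r}")
    return True

columns = {"column_1", "column_2"}
print(check_structure(["CATEGORISE", "+", ["ORDER", "column_1", "column_2"]], columns))
# True
```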
default_action_types

Default list of actions available to define methods. Returns only a list of types. Details for individual default actions can be returned with default_action_settings.

Returns:
Return type:list
default_action_settings(action)

Get the default settings available for a specific action type.

Parameters:action (string) – A specific term for an action type (as listed in default_action_types).
Returns:
Return type:dict, or empty dict if no such action_type
build_structure_markdown(structure)

Recursive function that iteratively builds a markdown version of a formal field structure.

Returns:
Return type:list
build_category_markdown(category_input)

Converts category_terms dict into a markdown format:

["term1", "term2", "term3"]

Where the format for defining a category term is as follows:

`term_name::column_name`

From:

{
        "column": "column_name",
        "terms": [
                "term_1",
                "term_2",
                "term_3"
        ]
}
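
The conversion described above amounts to pairing each term with its column. A minimal sketch, with the hypothetical function name `category_markdown`:

```python
def category_markdown(category_input):
    """Convert {"column": ..., "terms": [...]} into 'term::column' strings."""
    column = category_input["column"]
    return [f"{term}::{column}" for term in category_input["terms"]]

category_input = {
    "column": "column_name",
    "terms": ["term_1", "term_2", "term_3"],
}
print(category_markdown(category_input))
# ['term_1::column_name', 'term_2::column_name', 'term_3::column_name']
```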
morph_transform(df, morph_methods=None)

Performs the morph transforms on a DataFrame. Assumes parameters have been validated.

Parameters:
  • df (dataframe) – DataFrame must be explicitly provided.
  • morph_methods (list of morphs) – Existing morph methods.
Returns:

Containing the implementation of all morph transformations

Return type:

Dataframe

action_transform(df, field_name, structure, **kwargs)

A recursive transformation. A method should be a list of fields upon which actions are applied, but each field may have nested sub-fields requiring their own actions. Before the action on the current field can be completed, it is necessary to perform the actions on each sub-field.

Parameters:
  • df (DataFrame) – Working data to be transformed
  • field_name (str) – Name of the target schema field
  • structure (list) – List of fields with restructuring action defined by term 0 (i.e. this action)
  • **kwargs – Other fields which may be required in custom transforms
Returns:

Containing the implementation of all nested transformations

Return type:

Dataframe
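
The "sub-fields first" order of evaluation can be shown on a single row represented as a dict. This sketch is illustrative only: `apply_structure` is a hypothetical name, only ORDER and JOIN are covered, and the reading of ORDER (later fields replace earlier ones wherever they hold a value) is an interpretation of the description under set_structure.

```python
def apply_structure(structure, row):
    """Evaluate a nested ORDER/JOIN structure against one row dict."""
    action, *terms = structure
    # Resolve sub-fields first: nested lists are evaluated, names are looked up.
    values = [
        apply_structure(term, row) if isinstance(term, list) else row.get(term)
        for term in terms
    ]
    if action == "ORDER":
        # Later fields replace earlier ones wherever they hold a value.
        result = None
        for value in values:
            if value is not None:
                result = value
        return result
    if action == "JOIN":
        return " ".join(str(value) for value in values if value is not None)
    raise ValueError(f"Action {action!r} is not covered by this sketch")

row = {"first": "Ada", "surname": None, "family_name": "Lovelace"}
print(apply_structure(["JOIN", "first", ["ORDER", "surname", "family_name"]], row))
# Ada Lovelace
```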

perform_transform

Helper function to perform the transformation. Also used by validate_transform step.

Returns:Transformed dataframe derived from method
Return type:DataFrame
validate_input_data

Test input data for checksum errors.

Raises:ValueError on checksum failure.
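
A checksum test of this kind compares a freshly computed hash with the one recorded when the data were added. The sketch below assumes SHA-256 purely for illustration (the source does not say which hash whyqd uses), and the function names are hypothetical.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return a hex digest for the given bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()

def validate_checksum(data: bytes, expected: str) -> bool:
    """Return True if the data still match the recorded checksum."""
    if checksum(data) != expected:
        raise ValueError("Checksum mismatch: data changed since it was recorded")
    return True

original = b"id,value\n1,10\n"
recorded = checksum(original)
print(validate_checksum(original, recorded))  # True
```

A failed check of this sort is the situation in which reset_data_checksums would be needed before re-running the later steps.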
validate_merge_data

Test input data ready to merge; that it has a merge key, and that the data in that column are unique. Only required if there is more than one input_data source.

Raises:ValueError on uniqueness failure.
validate_merge

Validate merge output.

Raises:ValueError on checksum failure.
Returns:True if validation passes
Return type:bool
validate_structure

Method validates structure formats.

Raises:ValueError on structure failure.
Returns:True if validation passes
Return type:bool
validate_category

Method validates category terms.

Raises:ValueError on category failure.
Returns:True if validation passes
Return type:bool
validate_filter

Method validates filter terms.

Raises:ValueError on filter failure.
Returns:True if validation passes
Return type:bool
validate_transform

Validate output data.

Raises:ValueError on checksum failure.
Returns:True if validation passes
Return type:bool
validates

Method validates all steps. Sets READY_TRANSFORM if all pass.

Returns:True if validation passes
Return type:bool
build()

Build and validate the Method. Note, this replaces the Schema base-class.

save_data(df, filetype='xlsx', prefix=None)

Generate a unique filename for a dataframe, save to the working directory, and return the unique filename and data summary.

Parameters:
  • df (Pandas DataFrame) –
  • filetype (str) – File type in which to save the dataframe. Default is "xlsx".
Returns:

Keys include: id, filename, checksum, df header, columns

Return type:

dict

save(directory=None, filename=None, overwrite=False, created_by=None)

Save the schema settings to a file in the destination directory.

Parameters:
  • directory – The destination directory
  • filename – Defaults to the schema name
  • overwrite (bool) – True to overwrite an existing file
  • created_by (string, or None) – Defines the schema creator/updater
Raises:

ValueError if no filename

Returns:

Return type:

bool True if saved

help(option=None)

Get generic help, or help on a specific method.

Parameters:option (str) – Any of None, ‘status’, ‘merge’, ‘structure’, ‘category’, ‘filter’, ‘transform’, ‘error’.
Returns:
Return type:Help