Method

The Method class defines a wrangling process to restructure input data into a format defined by a Schema.

class whyqd.method.method.Method(directory: str, schema: Type[whyqd.schema.schema.Schema], method: Optional[whyqd.models.method_model.MethodModel] = None)

Create and manage a method to perform a wrangling process.

Parameters:
  • directory (str) – Working path for creating methods, interim data files and final output
  • schema (Schema) – The Schema defining the destination format for the wrangled data
  • method (MethodModel, default None) – An existing method definition to load
describe

Get the method name, title and description.

  • name: Term used for filename and referencing. Will be lower-cased and spaces replaced with _
  • title: Human-readable term used as name.
  • description: Detailed description for the method. Reference its objective and use-case.
Returns:
Return type:dict or None
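
The naming rule above (lower-cased, spaces replaced with _) can be sketched in plain Python. normalise_name is a hypothetical helper for illustration only; it is not part of whyqd, and the library may apply further rules.

```python
def normalise_name(name: str) -> str:
    """Lower-case a name and replace spaces with underscores.

    Hypothetical helper: it mirrors the documented rule only,
    not whyqd's actual implementation.
    """
    return name.lower().replace(" ", "_")


print(normalise_name("Urban Population Method"))  # urban_population_method
```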
get

Get the method model.

Returns:
Return type:MethodModel or None
set(method: whyqd.models.method_model.MethodModel) → None

Update or create the method.

Parameters:method (MethodModel) – A dictionary conforming to the MethodModel.
add_data(source: Union[str, List[str], whyqd.models.datasource_model.DataSourceModel, List[whyqd.models.datasource_model.DataSourceModel]]) → None

Provide the data for wrangling as a path string, a list of path strings, or one or more dictionaries conforming to the DataSourceModel.

If conforming to the DataSourceModel, each source dictionary requires the minimum of:

{
    "path": "path/to/source/file"
}

An optional citation conforming to CitationModel can also be provided.

Parameters:source (str, list of str, DataSourceModel, or list of DataSourceModel) – A path string, a list of path strings, or one or more dictionaries conforming to the DataSourceModel. Each path can be a filename or a URL.
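
The accepted input shapes can be illustrated without whyqd itself. The sketch below coerces each shape into a list of dictionaries carrying the minimum "path" key. as_source_dicts is a hypothetical helper, not a whyqd function, and the paths are placeholders; whyqd's own validation is done by the DataSourceModel and may differ.

```python
from typing import Any, Dict, List, Union

SourceInput = Union[str, Dict[str, Any], List[Union[str, Dict[str, Any]]]]


def as_source_dicts(source: SourceInput) -> List[Dict[str, Any]]:
    """Coerce a path, a dict, or a list of either into a list of source dicts.

    Hypothetical helper: it only checks for the minimum "path" key.
    """
    items = source if isinstance(source, list) else [source]
    result: List[Dict[str, Any]] = []
    for item in items:
        if isinstance(item, str):
            # A bare path string becomes the minimal dictionary form.
            result.append({"path": item})
        elif isinstance(item, dict):
            if "path" not in item:
                raise ValueError("Each source dictionary requires a 'path'.")
            result.append(dict(item))
        else:
            raise TypeError(f"Unsupported source type: {type(item).__name__}")
    return result


single = as_source_dicts("path/to/source/file")
several = as_source_dicts(["path/to/first/file", {"path": "path/to/second/file"}])
print(single)
print(several)
```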
remove_data(uid: uuid.UUID, sheet_name: Optional[str] = None) → None

Remove an input data source defined by its source uuid4.

Note

You can remove references to individual sheets of a data source if you provide sheet_name. If not, the entire data source will be removed.

Parameters:
  • uid (UUID) – Unique uuid4 for an input data source. View all input data from method input_data.
  • sheet_name (str, default None) – If the data source has multiple sheets, provide the specific sheet to remove, or - by default - the entire data source will be removed.
update_data(source: whyqd.models.datasource_model.DataSourceModel, uid: uuid.UUID, sheet_name: Optional[str] = None) → None

Update an existing data source.

Can be used to modify which columns are to be preserved, or other specific changes.

Warning

You can only modify the following definitions: names, preserve, citation. Attempting to change any other definitions will raise an exception. Remove the source data instead.

Parameters:
  • source (DataSourceModel) – A dictionary conforming to the DataSourceModel. Each path can be a filename or a URL.
  • uid (UUID) – Unique uuid4 for an input data source. View all input data from method input_data.
  • sheet_name (str, default None) – If the data source has multiple sheets, provide the specific sheet to update.
Raises:

ValueError if the data source has a sheet_name and no sheet_name is provided.

reorder_data(order: List[Union[uuid.UUID, Tuple[uuid.UUID, str]]]) → None

Reorder a list of source data prior to merging them.

Parameters:order (list of UUID or tuples of UUID, str) – Either a list of UUIDs, or tuples of hexed UUIDs and sheet_names, e.g. (‘uuid.hex’, ‘sheet_name’)
Raises:ValueError if the list of uuid4s doesn’t conform to that in the list of source data.
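
The conformance check described above amounts to requiring that the new order is a permutation of the existing source references. A minimal sketch, assuming references are either a UUID or a (uuid.hex, sheet_name) tuple; check_order is a hypothetical helper and not whyqd's implementation.

```python
from typing import Hashable, List
from uuid import uuid4


def check_order(order: List[Hashable], existing: List[Hashable]) -> List[Hashable]:
    """Return `order` if it is a permutation of `existing`, else raise ValueError.

    Hypothetical helper: assumes every reference in `existing` is unique.
    """
    if len(order) != len(existing) or set(order) != set(existing):
        raise ValueError("Order does not conform to the list of source data.")
    return list(order)


a, b = uuid4(), uuid4()
# One single-sheet source and one source with two sheets.
existing = [a, (b.hex, "Sheet1"), (b.hex, "Sheet2")]
reordered = check_order([(b.hex, "Sheet2"), a, (b.hex, "Sheet1")], existing)
print(reordered)
```

Passing a list that omits or duplicates a reference, such as check_order([a], existing), raises ValueError in this sketch.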
add_actions(actions: Union[str, List[str]], uid: uuid.UUID, sheet_name: Optional[str] = None) → None

Add an action script to a data source specified by its uid and optional sheet name.

Warning

Morph-type ACTIONS (such as ‘REBASE’, ‘PIVOT_LONGER’, and ‘PIVOT_WIDER’) change the header-row column names, and - with that - any of your subsequent referencing that relies on these names. It is best to run your morphs first, then your schema ACTIONS, that way you won’t get any weird referencing errors. If column errors do arise, check your ACTION ordering.

Parameters:
  • actions (str or list of str) – An action script.
  • uid (UUID) – Unique uuid4 for either an input or interim data source.
  • sheet_name (str, default None) – If the data source has multiple sheets, provide the specific sheet to update.
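
The ordering advice in the warning above (morphs first, then schema ACTIONS) can be sketched as a stable partition of a list of scripts. This assumes each script starts with its ACTION name, which is an assumption about script layout; the script bodies below are placeholders rather than real whyqd syntax, and morphs_first is a hypothetical helper.

```python
from typing import List

# Morph-type ACTIONS named in the warning; whyqd may define others.
MORPH_ACTIONS = {"REBASE", "PIVOT_LONGER", "PIVOT_WIDER"}


def morphs_first(scripts: List[str]) -> List[str]:
    """Stable partition: morph scripts first, everything else after.

    Relative order within each group is preserved.
    """

    def is_morph(script: str) -> bool:
        parts = script.split()
        # Assumption: the first token of a script is its ACTION name.
        return bool(parts) and parts[0] in MORPH_ACTIONS

    morphs = [s for s in scripts if is_morph(s)]
    others = [s for s in scripts if not is_morph(s)]
    return morphs + others


scripts = [
    "SCHEMA_ACTION_A <placeholder>",
    "REBASE <placeholder>",
    "SCHEMA_ACTION_B <placeholder>",
    "PIVOT_LONGER <placeholder>",
]
ordered = morphs_first(scripts)
print(ordered)
```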
remove_action(uid: uuid.UUID, action_uid: uuid.UUID, sheet_name: Optional[str] = None) → None

Remove an action from a data source defined by its source uuid4. Raises an exception if sheet_name applies to that data source and is not provided.

Parameters:
  • uid (UUID) – Unique uuid4 for either an input or interim data source.
  • action_uid (UUID) – Unique uuid4 for an action.
  • sheet_name (str, default None) – If the data source has multiple sheets, provide the specific sheet to update.
Raises:

ValueError if the data source has a sheet_name and no sheet_name is provided.

update_action(uid: uuid.UUID, action_uid: uuid.UUID, action: str, sheet_name: Optional[str] = None) → None

Update an action from a list of actions.

Parameters:
  • uid (UUID) – Unique uuid4 for either an input or interim data source.
  • action_uid (UUID) – Unique uuid4 for an action.
  • action (str) – An updated action script.
  • sheet_name (str, default None) – If the data source has multiple sheets, provide the specific sheet to update.
reorder_actions(uid: uuid.UUID, order: List[uuid.UUID], sheet_name: Optional[str] = None) → None

Reorder a list of actions.

Parameters:
  • uid (UUID) – Unique uuid4 for either an input or interim data source.
  • sheet_name (str, default None) – If the data source has multiple sheets, provide the specific sheet to update.
  • order (list of UUID) – List of uuid4 action strings.
Raises:

ValueError if the list of uuid4s doesn’t conform to that in the list of actions.

merge(script: str) → None

Merge input data to generate any required interim data. Will perform all actions on each interim data source.

Note

Neither merging nor an interim data source is required to produce a schema-defined destination data output.

Warning

There is only so much hand-holding possible:

  • If an interim data source already exists, and has existing actions, this function will reset the action list, placing this script first.
  • If further actions are added to input data, this function must be run again.
  • The first two points are, obviously, detrimental to each other.
  • And then there are ‘filters’ which are intrinsically destructive.

Merge script is of the form:

"MERGE < ['key_column'::'source_hex'::'sheet_name', etc.]"

Where the source terms are in order for merging.

Parameters:script (str) – Merge script, as defined.
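
A merge script in the form shown above can be assembled from an ordered list of terms. The sketch below assumes terms are comma-separated inside the brackets (the "etc." in the template suggests this but does not confirm it) and that every term carries a sheet name. build_merge_script is a hypothetical helper, and the column and sheet names are illustrative.

```python
from typing import List, Tuple
from uuid import UUID


def build_merge_script(terms: List[Tuple[str, str, str]]) -> str:
    """Format (key_column, source_hex, sheet_name) terms as a MERGE script.

    Terms are emitted in the order given, which is the merge order.
    """
    parts = [
        f"'{key_column}'::'{source_hex}'::'{sheet_name}'"
        for key_column, source_hex, sheet_name in terms
    ]
    return "MERGE < [" + ", ".join(parts) + "]"


# Fixed UUID so the printed script is reproducible.
source = UUID("12345678123456781234567812345678")
script = build_merge_script(
    [
        ("key_column", source.hex, "Sheet1"),
        ("key_column", source.hex, "Sheet2"),
    ]
)
print(script)
```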
transform(data: whyqd.models.datasource_model.DataSourceModel) → pandas.core.frame.DataFrame

Returns a transformed DataFrame after performing assigned action scripts, in order, to transform a data source.

Parameters:data (DataSourceModel) – The data source to transform.
Returns:
Return type:Pandas DataFrame
build() → None

Merge input data to generate any required interim data. Will perform all actions on each interim data source.

Note

Neither merging nor an interim data source is required to produce a schema-defined destination data output.

Warning

There is only so much hand-holding possible:

  • If an interim data source already exists, and has existing actions, this function will reset the action list, placing this script first.
  • If further actions are added to input data, this function must be run again.
  • The first two points are, obviously, detrimental to each other.
  • And then there are ‘filters’ which are intrinsically destructive.

validate() → bool

Validate the build process and all data checksums. Will perform all actions on each interim data source.

Raises:ValueError if any steps fail to validate.
Returns:
Return type:bool
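
Checksum validation of this kind generally means re-hashing each file and comparing the result with a recorded value. The sketch below shows that idea with the standard library only. The choice of BLAKE2b, the helper names file_checksum and verify_checksum, and the sample file are all assumptions for illustration; this is not whyqd's implementation.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, chunk_size: int = 65536) -> str:
    """Return a BLAKE2b hex digest of a file, read in chunks."""
    digest = hashlib.blake2b()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected: str) -> bool:
    """Return True if the file matches `expected`, else raise ValueError."""
    if file_checksum(path) != expected:
        raise ValueError(f"Checksum mismatch for {path}")
    return True


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.csv")
    with open(path, "wb") as f:
        f.write(b"a,b\n1,2\n")
    recorded = file_checksum(path)
    unchanged_ok = verify_checksum(path, recorded)

    # Any change to the file after recording is detected.
    with open(path, "ab") as f:
        f.write(b"3,4\n")
    try:
        verify_checksum(path, recorded)
        changed_detected = False
    except ValueError:
        changed_detected = True

print(unchanged_ok, changed_detected)
```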
get_citation() → Dict[str, Union[str, Dict[str, str]]]

Get the citation as a dictionary.

Raises:ValueError if no citation has been declared or the build is incomplete.
Returns:
Return type:dict
set_citation(citation: whyqd.models.citation_model.CitationModel) → None

Update or create the citation.

Parameters:citation (CitationModel) – A dictionary conforming to the CitationModel.
get_json(hide_uuid: Optional[bool] = False) → Optional[pydantic.types.Json]

Get the json method model.

Parameters:hide_uuid (bool, default False) – Hide all UUIDs in the nested JSON output.
Returns:
Return type:Json or None
save(directory: Optional[str] = None, filename: Optional[str] = None, created_by: Optional[str] = None, hide_uuid: Optional[bool] = False) → bool

Save the method as a json file.

Parameters:
  • directory (str) – Defaults to working directory
  • filename (str) – Defaults to method name
  • created_by (str, default None) – Declare the method creator/updater
  • hide_uuid (bool, default False) – Hide all UUIDs in the nested JSON output.
Returns:True if saved
Return type:bool