DataSourceDefinition

`DataSourceDefinition`

Bases: BaseDefinition

Create and manage a data source schema.

Strategy

For guidance on using this definition, see DataSource Strategies in the strategies section.

Info

whyqd supports any of the following file mime types:

CSV: "text/csv"
XLS: "application/vnd.ms-excel"
XLSX: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
PARQUET (or PRQ): "application/vnd.apache.parquet"
FEATHER (or FTR): "application/vnd.apache.feather"

Declare it like so:

MIMETYPE = "xlsx" # upper- or lower-case is fine

Specify the mime type as a text string, in upper- or lower-case. Neither Parquet nor Feather has an official mime type yet, so these are the values whyqd uses for now.
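The case-insensitive lookup described above can be sketched as follows. This is a minimal illustration built only from the values listed in this section; `MIME_TYPES` and `resolve_mimetype` are hypothetical names, not part of the whyqd API.

```python
# Hypothetical lookup table built from the mime types listed above.
MIME_TYPES = {
    "CSV": "text/csv",
    "XLS": "application/vnd.ms-excel",
    "XLSX": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "PARQUET": "application/vnd.apache.parquet",
    "PRQ": "application/vnd.apache.parquet",
    "FEATHER": "application/vnd.apache.feather",
    "FTR": "application/vnd.apache.feather",
}


def resolve_mimetype(name: str) -> str:
    """Return the mime type for a short name, ignoring case."""
    key = name.strip().upper()
    if key not in MIME_TYPES:
        raise ValueError(f"Unsupported mime type: {name!r}")
    return MIME_TYPES[key]


# Upper- and lower-case names resolve to the same mime type.
print(resolve_mimetype("xlsx") == resolve_mimetype("XLSX"))
```

Keeping the aliases (`PRQ`, `FTR`) in the same table means a single lookup covers both spellings.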

Parameters:

Name	Type	Description	Default
`source`	`Path \| str \| DataSourceModel \| None`	A path to a JSON file containing a saved schema, or a dictionary conforming to the DataSourceModel.	`None`

Example

Create and validate a new DataSourceDefinition as follows:

import whyqd as qd

# DATASOURCE_PATH, MIMETYPE and DIRECTORY are placeholders for your own values.
datasource = qd.DataSourceDefinition()
datasource.derive_model(source=DATASOURCE_PATH, mimetype=MIMETYPE)
datasource.save(directory=DIRECTORY)
datasource.validate()

`get: DataSourceModel | list[DataSourceModel] | None` `property`

Get the data source model.

Warning

If your source data are an Excel spreadsheet with multiple sheets, whyqd produces one data model per sheet and returns them as a list. Each model reflects the metadata for its sheet.

As always, look at your data and test before implementing in code. Each model should include an additional sheet_name field.

Returns:

Type	Description
`DataSourceModel \| list[DataSourceModel] \| None`	A single Pydantic DataSourceModel, a list of them, or None.
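Because the property may return None, a single model, or a list, calling code often benefits from normalising the result first. The sketch below uses plain dictionaries as stand-ins for DataSourceModel; `as_model_list` and the `name` field are hypothetical and exist only for illustration.

```python
from typing import Any


def as_model_list(models: Any) -> list:
    """Normalise None, a single model, or a list of models into a list."""
    if models is None:
        return []
    if isinstance(models, list):
        return models
    return [models]


# Stand-in dictionaries; real code would pass `datasource.get` instead.
single_sheet = {"name": "example"}
multi_sheet = [
    {"name": "example", "sheet_name": "Sheet1"},
    {"name": "example", "sheet_name": "Sheet2"},
]

# The same loop now works for any of the three return shapes.
for model in as_model_list(multi_sheet):
    print(model["sheet_name"])
```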

`derive_model(*, source, mimetype, header=0, **attributes)`

Derive a data model schema (or a list of schemas) from a data source. All columns are coerced to string type to preserve data quality, even though this is far less efficient.

Parameters:

Name	Type	Description	Default
`source`	`Path \| str`	Source filename.	required
`mimetype`	`str \| MimeType`	Pandas can read a variety of mime types. whyqd supports `xls`, `xlsx`, `csv`, `parquet` and `feather`.	required
`header`	`int \| list[int \| None] \| None`	Row (0-indexed) to use for the column labels of the parsed DataFrame. If there are multiple sheets, then a list of integers should be provided. If `header` is `None`, row 0 will be treated as values and a set of field names will be generated indexed to the number of data columns.	`0`
`attributes`		Dictionary of `mimetype`-specific Pandas attributes. Use sparingly.	`{}`

Returns:

Type	Description
`DataSourceModel \| list[DataSourceModel]`	A single DataSourceModel, or a list of DataSourceModel.
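To illustrate why string coercion preserves data quality, and how the `header` parameter is described as behaving, here is a standard-library sketch. It does not call whyqd or Pandas; `read_as_strings` and the generated `column_N` names are assumptions made for this example, and whyqd's generated field names may differ.

```python
import csv
import io
from typing import Optional


def read_as_strings(text: str, header: Optional[int] = 0) -> list:
    """Parse CSV text, keeping every value as a string.

    `header` is the 0-indexed row holding the column labels. If it is None,
    every row is treated as data and placeholder names are generated.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if header is None:
        # Hypothetical naming scheme; whyqd's generated names may differ.
        names = [f"column_{i}" for i in range(len(rows[0]))]
        data = rows
    else:
        names = rows[header]
        data = rows[header + 1:]
    return [dict(zip(names, row)) for row in data]


sample = "id,amount\n001,12.50\n002,7.00\n"

# Leading and trailing zeros survive because nothing is cast to a number.
records = read_as_strings(sample)
print(records[0])

# With header=None, row 0 is kept as data under generated names.
unlabelled = read_as_strings(sample, header=None)
print(unlabelled[0])
```

Numeric inference would turn `001` into `1` and `12.50` into `12.5`, which is the kind of silent change that string coercion avoids.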

`get_citation()`

Get the citation as a dictionary.

Raises:

Type	Description
`ValueError`	If no citation has been declared or the build is incomplete.

Returns:

Type	Description
`dict[str, str \| dict[str, str]]`	A dictionary conforming to the CitationModel.

`get_data(*, refresh=False)`

Get a Pandas (Modin) dataframe.

Parameters:

Name	Type	Description	Default
`refresh`	`bool`	Force an update of the dataframe if there have been attribute changes.	`False`

Returns:

Type	Description
`DataFrame \| None`	A dataframe, or None.

`get_json(hide_uuid=False)`

Get the model as JSON.

Parameters:

Name	Type	Description	Default
`hide_uuid`	`bool`	Hide all UUIDs in the nested JSON output. Mostly useful for validation assertions where the only differences between sources are the UUIDs.	`False`

Returns:

Type	Description
`Json \| None`	Json-conforming output, or None.
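The validation use case for `hide_uuid` can be sketched without whyqd: two outputs that differ only in their UUIDs compare equal once those keys are removed. The helper `strip_uuids`, the key name `uuid`, and the sample documents are assumptions for illustration, not whyqd's implementation.

```python
import json
from typing import Any


def strip_uuids(value: Any) -> Any:
    """Recursively drop every key named "uuid" from nested dicts and lists."""
    if isinstance(value, dict):
        return {k: strip_uuids(v) for k, v in value.items() if k != "uuid"}
    if isinstance(value, list):
        return [strip_uuids(v) for v in value]
    return value


# Two stand-in documents that differ only in their UUIDs.
first = json.loads(
    '{"uuid": "a1", "name": "data", "columns": [{"uuid": "b2", "name": "id"}]}'
)
second = json.loads(
    '{"uuid": "c3", "name": "data", "columns": [{"uuid": "d4", "name": "id"}]}'
)

# The raw documents differ, but match once the UUIDs are hidden.
assert first != second
assert strip_uuids(first) == strip_uuids(second)
```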

`save(directory=None, filename=None, created_by=None, hide_uuid=False)`

Save the model as a JSON file.

Parameters:

Name	Type	Description	Default
`directory`	`str \| None`	Defaults to working directory	`None`
`filename`	`str \| None`	Defaults to model name	`None`
`created_by`	`str \| None`	Declare the model creator/updater	`None`
`hide_uuid`	`bool`	Hide all UUIDs in the nested JSON output.	`False`

Returns:

Type	Description
`bool`	Boolean True if saved.

`set(*, schema=None)`

Update or create the schema.

Parameters:

Name	Type	Description	Default
`schema`	`Path \| str \| DataSourceModel \| None`	A path to a JSON file containing a saved schema, or a dictionary conforming to the DataSourceModel.	`None`

`set_citation(*, citation, index=None)`

Update or create the citation.

Parameters:

Name	Type	Description	Default
`citation`	`CitationModel`	A dictionary conforming to the CitationModel.	required
`index`	`int`	If the source data yield multiple data models, provide the 0-based index of the model the citation applies to.	`None`

`validate()`

Validate that all required fields are returned from the crosswalk.
