DataSourceDefinition

Bases: BaseDefinition

Create and manage a data source schema.

Strategy

Guidance on how to use this definition is in the strategies section on DataSource Strategies.

Info

whyqd supports any of the following file mime types:

  • CSV: "text/csv"
  • XLS: "application/vnd.ms-excel"
  • XLSX: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
  • PARQUET (or PRQ): "application/vnd.apache.parquet"
  • FEATHER (or FTR): "application/vnd.apache.feather"

Declare it like so:

MIMETYPE = "xlsx" # upper- or lower-case is fine

Specify the mime type as a text string, upper- or lower-case. Neither Parquet nor Feather has an official mime type yet, so these are the values whyqd uses for now.
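The case-insensitive declaration can be sketched as a small lookup. This is an illustrative helper only, not part of whyqd: the names `MIMETYPES` and `resolve_mimetype` are assumptions, and the table holds just the values listed above.

```python
# Hypothetical helper (not part of whyqd): map a short type name to the
# mime type strings listed above, ignoring case.
MIMETYPES = {
    "CSV": "text/csv",
    "XLS": "application/vnd.ms-excel",
    "XLSX": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "PARQUET": "application/vnd.apache.parquet",
    "PRQ": "application/vnd.apache.parquet",
    "FEATHER": "application/vnd.apache.feather",
    "FTR": "application/vnd.apache.feather",
}


def resolve_mimetype(name: str) -> str:
    """Return the full mime type for a short name such as "xlsx" or "XLSX"."""
    key = name.strip().upper()
    if key not in MIMETYPES:
        raise ValueError(f"Unsupported mime type: {name!r}")
    return MIMETYPES[key]


print(resolve_mimetype("xlsx"))
```

Failing early on an unknown name keeps a typo in `MIMETYPE` from surfacing later as a confusing read error.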

Parameters:

Name Type Description Default
source Path | str | DataSourceModel | None

A path to a json file containing a saved schema, or a dictionary conforming to the DataSourceModel.

None

Example

Create and validate a new DataSourceDefinition as follows:

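import whyqd as qd

# DATASOURCE_PATH, MIMETYPE and DIRECTORY are placeholders you define yourself.
datasource = qd.DataSourceDefinition()
datasource.derive_model(source=DATASOURCE_PATH, mimetype=MIMETYPE)  # build the model from the source file
datasource.save(directory=DIRECTORY)  # write the model as a json file
datasource.validate()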

get: DataSourceModel | list[DataSourceModel] | None property

Get the data source model.

Warning

If your source data are an Excel spreadsheet with multiple sheets, whyqd produces one data model per sheet and returns them as a list. Each model reflects the metadata for its sheet.

As always, look at your data and test before implementing in code. You should see an additional sheet_name field.

Returns:

Type Description
DataSourceModel | list[DataSourceModel] | None

A single Pydantic DataSourceModel, a list of them, or None.
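Because `get` may return a single model, a list, or None, calling code often wants a uniform shape. A minimal sketch, assuming nothing about whyqd beyond the return type above; `as_model_list` is a hypothetical helper and the dictionary stands in for a real model:

```python
from typing import Any


def as_model_list(result: Any) -> list:
    """Normalise a single model, a list of models, or None into a list."""
    if result is None:
        return []
    if isinstance(result, list):
        return result
    return [result]


# Stand-in value; in practice `result` would come from `datasource.get`.
for model in as_model_list({"sheet_name": "Sheet1"}):
    print(model)
```

With this wrapper the same loop handles a single-sheet CSV and a multi-sheet Excel file.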

derive_model(*, source, mimetype, header=0, **attributes)

Derive a data model schema (or a list of schemas) from a data source. All columns are coerced to string type to preserve data quality, even though this is far less efficient.

Parameters:

Name Type Description Default
source Path | str

Source filename.

required
mimetype str | MimeType

Pandas can read a wide range of mime types; whyqd supports xls, xlsx, csv, parquet and feather.

required
header int | list[int | None] | None

Row (0-indexed) to use for the column labels of the parsed DataFrame. If there are multiple sheets, then a list of integers should be provided. If header is None, row 0 will be treated as values and a set of field names will be generated indexed to the number of data columns.

0
attributes

dict of specific mimetype related Pandas attributes. Use sparingly.

{}

Returns:

Type Description
DataSourceModel | list[DataSourceModel]

List of DataSourceModel, or DataSourceModel

get_citation()

Get the citation as a dictionary.

Raises:

Type Description
ValueError

If no citation has been declared or the build is incomplete.

Returns:

Type Description
dict[str, str | dict[str, str]]

A dictionary conforming to the CitationModel.

get_data(*, refresh=False)

Get a Pandas (Modin) dataframe.

Parameters:

Name Type Description Default
refresh bool

Force an update of the dataframe if there have been attribute changes.

False

Returns:

Type Description
DataFrame | None

A dataframe, or None.

get_json(hide_uuid=False)

Get the json model.

Parameters:

Name Type Description Default
hide_uuid bool

Hide all UUIDs in the nested JSON output. Mostly useful for validation assertions where the only differences between sources are the UUIDs.

False

Returns:

Type Description
Json | None

Json-conforming output, or None.
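To show why hiding UUIDs helps in validation assertions, here is a sketch that strips identifier fields from nested JSON before comparing. It is not whyqd's implementation: `strip_uuids` is a hypothetical helper, and the field name `uuid` and the sample documents are assumptions made for illustration.

```python
import json


def strip_uuids(value, key="uuid"):
    """Return a copy of nested JSON-like data without any `key` entries."""
    if isinstance(value, dict):
        return {k: strip_uuids(v, key) for k, v in value.items() if k != key}
    if isinstance(value, list):
        return [strip_uuids(v, key) for v in value]
    return value


# Two made-up documents that differ only in their identifiers.
first = json.loads('{"uuid": "a1", "name": "data", "columns": [{"uuid": "b2", "name": "id"}]}')
second = json.loads('{"uuid": "c3", "name": "data", "columns": [{"uuid": "d4", "name": "id"}]}')

assert first != second
assert strip_uuids(first) == strip_uuids(second)
```

With whyqd itself, passing `hide_uuid=True` to `get_json` should make this manual step unnecessary.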

save(directory=None, filename=None, created_by=None, hide_uuid=False)

Save model as a json file.

Parameters:

Name Type Description Default
directory str | None

Defaults to working directory

None
filename str | None

Defaults to model name

None
created_by str | None

Declare the model creator/updater

None
hide_uuid bool

Hide all UUIDs in the nested JSON output.

False

Returns:

Type Description
bool

Boolean True if saved.

set(*, schema=None)

Update or create the schema.

Parameters:

Name Type Description Default
schema Path | str | DataSourceModel | None

A path to a json file containing a saved schema, or a dictionary conforming to the DataSourceModel.

None

set_citation(*, citation, index=None)

Update or create the citation.

Parameters:

Name Type Description Default
citation CitationModel

A dictionary conforming to the CitationModel.

required
index int

If there are multiple sources from the source data, provide the index (base 0) for the resource citation.

None

validate()

Validate that all required fields are returned from the crosswalk.