DataSourceDefinition
Bases: BaseDefinition
Create and manage a data source schema.
Strategy
Guidance on how to use this definition is in the strategies section on DataSource Strategies.
Info
whyqd supports any of the following file mime types:
- CSV: "text/csv"
- XLS: "application/vnd.ms-excel"
- XLSX: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
- PARQUET (or PRQ): "application/vnd.apache.parquet"
- FEATHER (or FTR): "application/vnd.apache.feather"
Declare it like so:
MIMETYPE = "xlsx" # upper- or lower-case is fine
Specify the mime type as a text string, in upper- or lower-case. Neither Parquet nor Feather has an official mime type yet, so these are the values whyqd uses for now.
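The case-insensitive lookup described above can be sketched as follows. This is a minimal illustration, not whyqd's own code: `SupportedMime` and `resolve_mimetype` are hypothetical names, and only the mime type strings are taken from the list above.

```python
from enum import Enum


class SupportedMime(str, Enum):
    # Hypothetical stand-in enum; the values are copied from the list above.
    CSV = "text/csv"
    XLS = "application/vnd.ms-excel"
    XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
    PARQUET = "application/vnd.apache.parquet"
    PRQ = "application/vnd.apache.parquet"  # same value, so an alias of PARQUET
    FEATHER = "application/vnd.apache.feather"
    FTR = "application/vnd.apache.feather"  # same value, so an alias of FEATHER


def resolve_mimetype(name: str) -> str:
    """Return the mime type string for a short name, ignoring case."""
    try:
        return SupportedMime[name.strip().upper()].value
    except KeyError:
        raise ValueError(f"Unsupported mime type: {name!r}") from None


print(resolve_mimetype("xlsx"))
print(resolve_mimetype("PRQ") == resolve_mimetype("parquet"))
```

Looking names up through an enum keeps the short names and their aliases in one place, and an unsupported name fails early with a clear error.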
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | `Path \| str \| DataSourceModel \| None` | A path to a json file containing a saved schema, or a dictionary conforming to the DataSourceModel. | None |
Example
Create and validate a new DataSourceDefinition as follows:
import whyqd as qd
datasource = qd.DataSourceDefinition()
datasource.derive_model(source=DATASOURCE_PATH, mimetype=MIMETYPE)
datasource.save(directory=DIRECTORY)
datasource.validate()
get: DataSourceModel | list[DataSourceModel] | None
property
Get the data source model.
Warning
If your source data are Excel, and that spreadsheet consists of multiple sheets, then whyqd will
produce multiple data models which will be returned as a list. Each model will reflect the metadata for
each sheet.
As always, look at your data and test before implementing in code. Each model should include an additional sheet_name
field.
Returns:
| Type | Description |
|---|---|
| `DataSourceModel \| list[DataSourceModel] \| None` | A single Pydantic DataSourceModel, a list of them, or None. |
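Because `get` may return a single model, a list, or None, calling code often normalises the result before iterating. The sketch below shows one way to do that. `SheetModel` is a hypothetical stand-in for DataSourceModel and `as_model_list` is an illustrative helper, neither is part of whyqd.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SheetModel:
    # Hypothetical stand-in for a DataSourceModel; only the fields needed here.
    name: str
    sheet_name: Optional[str] = None


def as_model_list(models: Union[SheetModel, list, None]) -> list:
    """Normalise the three possible shapes of `get` into a plain list."""
    if models is None:
        return []
    if isinstance(models, list):
        return models
    return [models]


# A multi-sheet source yields a list; a single-sheet source yields one model.
multi = [SheetModel("budget", "2021"), SheetModel("budget", "2022")]
single = SheetModel("budget")

for model in as_model_list(multi) + as_model_list(single):
    print(model.name, model.sheet_name)
```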
derive_model(*, source, mimetype, header=0, **attributes)
Derive a data model schema (or a list of schemas) from a data source. All columns will be coerced to string type to
preserve data quality, even though this is far less efficient.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| source | `Path \| str` | Source filename. | required |
| mimetype | `str \| MimeType` | Pandas can read a variety of mime types. whyqd supports CSV, XLS, XLSX, Parquet and Feather. | required |
| header | `int \| list[int \| None] \| None` | Row (0-indexed) to use for the column labels of the parsed DataFrame. If there are multiple sheets, then a list of integers should be provided. If | 0 |
| attributes | | dict of specific | {} |
Returns:
| Type | Description |
|---|---|
| `DataSourceModel \| list[DataSourceModel]` | List of DataSourceModel, or DataSourceModel |
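To see why coercing every column to string preserves data quality, consider values that look numeric but are really identifiers. The sketch below uses only the standard library `csv` module with made-up sample data. It illustrates the general problem and does not show how whyqd reads files.

```python
import csv
import io

# A tiny in-memory CSV; the postcode looks numeric but is really an identifier.
raw = "postcode,phone\n00501,0044123456\n"

row = next(csv.DictReader(io.StringIO(raw)))

kept_as_text = row["postcode"]         # the value exactly as it appears in the source
parsed_as_int = int(row["postcode"])   # numeric parsing silently drops the leading zeros

print(repr(kept_as_text), repr(parsed_as_int))
```

Once the leading zeros are dropped they cannot be recovered from the parsed value, so deferring type conversion until the schema is defined avoids this kind of silent loss.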
get_citation()
Get the citation as a dictionary.
Raises:
| Type | Description |
|---|---|
| `ValueError` | If no citation has been declared or the build is incomplete. |
Returns:
| Type | Description |
|---|---|
| `dict[str, str \| dict[str, str]]` | A dictionary conforming to the CitationModel. |
get_data(*, refresh=False)
Get a Pandas (Modin) dataframe.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| refresh | `bool` | Force an update of the dataframe if there have been attribute changes. | False |
Returns:
| Type | Description |
|---|---|
| `DataFrame \| None` | A dataframe, or None. |
get_json(hide_uuid=False)
Get the json model.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hide_uuid | `bool` | Hide all UUIDs in the nested JSON output. Mostly useful for validation assertions where the only differences between sources are the UUIDs. | False |
Returns:
| Type | Description |
|---|---|
| `Json \| None` | Json-conforming output, or None. |
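The validation use case for `hide_uuid` can be illustrated with a small sketch: two nested structures that differ only in their UUIDs compare equal once those entries are removed. `strip_uuids` is a hypothetical helper, and the assumption that UUIDs sit under a key named `"uuid"` is made for illustration only. The sample dictionaries are invented and do not reflect the real DataSourceModel layout.

```python
from typing import Any


def strip_uuids(node: Any, key: str = "uuid") -> Any:
    """Return a copy of nested JSON-like data with every `key` entry removed."""
    if isinstance(node, dict):
        return {k: strip_uuids(v, key) for k, v in node.items() if k != key}
    if isinstance(node, list):
        return [strip_uuids(item, key) for item in node]
    return node


# Two invented outputs that differ only in their (assumed) "uuid" fields.
first = {"uuid": "a1", "name": "data", "columns": [{"uuid": "b2", "name": "year"}]}
second = {"uuid": "c3", "name": "data", "columns": [{"uuid": "d4", "name": "year"}]}

print(first == second)
print(strip_uuids(first) == strip_uuids(second))
```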
save(directory=None, filename=None, created_by=None, hide_uuid=False)
Save model as a json file.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| directory | `str \| None` | Defaults to working directory | None |
| filename | `str \| None` | Defaults to model name | None |
| created_by | `str \| None` | Declare the model creator/updater | None |
| hide_uuid | `bool` | Hide all UUIDs in the nested JSON output. | False |
Returns:
| Type | Description |
|---|---|
| `bool` | Boolean True if saved. |
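The defaults in the table above (working directory, model name) can be sketched with `pathlib`. This is an illustration of the fallback logic only. `resolve_save_path` is a hypothetical helper, and appending a `.json` extension is an assumption rather than documented behaviour.

```python
from pathlib import Path
from typing import Optional


def resolve_save_path(directory: Optional[str], filename: Optional[str], model_name: str) -> Path:
    """Apply the documented fallbacks: working directory and model name."""
    base = Path(directory) if directory else Path.cwd()
    stem = filename if filename else model_name
    # Assumption for this sketch: the saved file carries a .json extension.
    return base / f"{stem}.json"


print(resolve_save_path(None, None, "my_source"))
print(resolve_save_path("output", "custom_name", "my_source"))
```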
set(*, schema=None)
Update or create the schema.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| schema | `Path \| str \| DataSourceModel \| None` | A dictionary conforming to the DataSourceModel. | None |
set_citation(*, citation, index=None)
Update or create the citation.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| citation | `CitationModel` | A dictionary conforming to the CitationModel. | required |
| index | `int` | If there are multiple sources from the source data, provide the index (base 0) for the resource citation. | None |
validate()
Validate that all required fields are returned from the crosswalk.