DataSourceDefinition
Bases: BaseDefinition
Create and manage a data source schema.
Strategy
Guidance on how to use this definition is in the strategies section on DataSource Strategies.
Info
whyqd supports any of the following file mime types:
CSV: "text/csv"
XLS: "application/vnd.ms-excel"
XLSX: "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"
PARQUET (or PRQ): "application/vnd.apache.parquet"
FEATHER (or FTR): "application/vnd.apache.feather"
Declare it like so:
MIMETYPE = "xlsx" # upper- or lower-case is fine
Specify the mime type as a text string, upper- or lower-case. Neither Parquet nor Feather has an official mimetype yet, so these are the values whyqd uses for now.
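The case-insensitive declaration described above can be sketched as a simple lookup. This is an illustration only, not whyqd's own code: the `MIME_TYPES` dictionary and the `resolve_mimetype` helper are hypothetical names, and only the short names and mime strings come from the list above.

```python
# Short names and mime strings are taken from the list above.
# The dictionary and helper are hypothetical, for illustration only.
MIME_TYPES = {
    "CSV": "text/csv",
    "XLS": "application/vnd.ms-excel",
    "XLSX": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "PARQUET": "application/vnd.apache.parquet",
    "PRQ": "application/vnd.apache.parquet",
    "FEATHER": "application/vnd.apache.feather",
    "FTR": "application/vnd.apache.feather",
}


def resolve_mimetype(name: str) -> str:
    """Return the mime string for a short name, ignoring case."""
    key = name.strip().upper()
    if key not in MIME_TYPES:
        raise ValueError(f"Unsupported mime type: {name!r}")
    return MIME_TYPES[key]


print(resolve_mimetype("xlsx"))
print(resolve_mimetype("PRQ"))
```

Normalising to upper case before the lookup is one way to make "xlsx" and "XLSX" equivalent; whyqd may handle this differently internally.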
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | Path \| str \| DataSourceModel \| None | A path to a json file containing a saved schema, or a dictionary conforming to the DataSourceModel. | None |
Example
Create and validate a new DataSourceDefinition as follows:
import whyqd as qd
datasource = qd.DataSourceDefinition()
datasource.derive_model(source=DATASOURCE_PATH, mimetype=MIMETYPE)
datasource.save(directory=DIRECTORY)
datasource.validate()
get: DataSourceModel | list[DataSourceModel] | None (property)
Get the data source model.
Warning
If your source data are Excel, and that spreadsheet consists of multiple sheets, then whyqd will produce multiple data models, which will be returned as a list. Each model will reflect the metadata for its sheet.
As always, look at your data and test before implementing in code. You should see an additional sheet_name field.
Returns:
Type | Description |
---|---|
DataSourceModel \| list[DataSourceModel] \| None | A single Pydantic DataSourceModel, a list of them, or None. |
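Because the property can return a single model, a list, or None, calling code may find it convenient to normalise the result before looping. The sketch below is a hypothetical helper, not part of whyqd, and it uses plain dictionaries as stand-ins for DataSourceModel objects.

```python
from typing import Any


def as_model_list(models: Any) -> list:
    """Normalise None, a single model, or a list of models to a list."""
    if models is None:
        return []
    if isinstance(models, list):
        return models
    return [models]


# Stand-in values; in practice these would come from `datasource.get`.
single = {"name": "budget"}
multiple = [
    {"name": "budget", "sheet_name": "2021"},
    {"name": "budget", "sheet_name": "2022"},
]

for model in as_model_list(multiple):
    print(model.get("sheet_name"))
```

With this in place, the same loop works whether the source was a CSV file (one model) or a multi-sheet Excel workbook (several).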
derive_model(*, source, mimetype, header=0, **attributes)
Derive a data model schema (or list) from a data source. All columns will be coerced to string type to preserve data quality, even though this is far less efficient.
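To see why string coercion preserves data quality, consider identifiers with leading zeros. The sketch below uses only the standard library `csv` module rather than whyqd or Pandas, and the sample data is made up for illustration.

```python
import csv
import io

# Made-up sample data: `code` values carry meaningful leading zeros.
raw = "code,amount\n007,1.50\n010,2.00\n"

rows = list(csv.DictReader(io.StringIO(raw)))

# Kept as strings, the values survive exactly as written in the source.
as_text = [row["code"] for row in rows]

# Parsed as integers, the leading zeros are lost and cannot be recovered.
as_int = [int(row["code"]) for row in rows]

print(as_text)  # ['007', '010']
print(as_int)   # [7, 10]
```

The trade-off is that every later numeric operation needs an explicit conversion, which is the efficiency cost the description mentions.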
Parameters:
Name | Type | Description | Default |
---|---|---|---|
source | Path \| str | Source filename. | required |
mimetype | str \| MimeType | Pandas can read a diversity of mimetypes. whyqd supports | required |
header | int \| list[int \| None] \| None | Row (0-indexed) to use for the column labels of the parsed DataFrame. If there are multiple sheets, then a list of integers should be provided. If | 0 |
attributes | | dict of specific | {} |
Returns:
Type | Description |
---|---|
DataSourceModel \| list[DataSourceModel] | List of DataSourceModel, or DataSourceModel |
get_citation()
Get the citation as a dictionary.
Raises:
Type | Description |
---|---|
ValueError | If no citation has been declared or the build is incomplete. |
Returns:
Type | Description |
---|---|
dict[str, str \| dict[str, str]] | A dictionary conforming to the CitationModel. |
get_data(*, refresh=False)
Get a Pandas (Modin) dataframe.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
refresh | bool | Force an update of the dataframe if there have been attribute changes. | False |
Returns:
Type | Description |
---|---|
pd.DataFrame \| None | A dataframe, or None. |
get_json(hide_uuid=False)
Get the json model.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
hide_uuid | bool | Hide all UUIDs in the nested JSON output. Mostly useful for validation assertions where the only differences between sources are the UUIDs. | False |
Returns:
Type | Description |
---|---|
Json \| None | Json-conforming output, or None. |
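The use case for hide_uuid, comparing two outputs that differ only in their UUIDs, can be sketched with a recursive filter over nested JSON. This is a minimal illustration, not whyqd's implementation: the `strip_uuids` helper is hypothetical, and the assumption that identifiers live under a key named "uuid" is made for the example only.

```python
import json
from typing import Any


def strip_uuids(node: Any, key: str = "uuid") -> Any:
    """Return a copy of nested JSON-like data without any `key` entries.

    Assumes identifiers are stored under a key named "uuid"; that key
    name is an assumption for this sketch.
    """
    if isinstance(node, dict):
        return {k: strip_uuids(v, key) for k, v in node.items() if k != key}
    if isinstance(node, list):
        return [strip_uuids(item, key) for item in node]
    return node


# Two made-up outputs that are identical apart from their UUIDs.
first = {"uuid": "a1", "name": "source", "columns": [{"uuid": "b2", "name": "id"}]}
second = {"uuid": "c3", "name": "source", "columns": [{"uuid": "d4", "name": "id"}]}

print(first == second)
print(strip_uuids(first) == strip_uuids(second))
print(json.dumps(strip_uuids(first), indent=2))
```

In a test suite, passing hide_uuid=True to get_json on both sides would serve the same purpose without a separate helper.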
save(directory=None, filename=None, created_by=None, hide_uuid=False)
Save model as a json file.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
directory | str \| None | Defaults to working directory | None |
filename | str \| None | Defaults to model name | None |
created_by | str \| None | Declare the model creator/updater | None |
hide_uuid | bool | Hide all UUIDs in the nested JSON output. | False |
Returns:
Type | Description |
---|---|
bool | Boolean True if saved. |
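The defaulting behaviour in the table above (working directory when no directory is given, model name when no filename is given) can be pictured as follows. This is a hypothetical sketch, not whyqd's code: the `resolve_save_path` helper is invented for illustration, and appending a ".json" suffix is an assumption.

```python
from pathlib import Path
from typing import Optional


def resolve_save_path(
    model_name: str,
    directory: Optional[str] = None,
    filename: Optional[str] = None,
) -> Path:
    """Build a target path, falling back to the cwd and the model name."""
    base = Path(directory) if directory is not None else Path.cwd()
    stem = filename if filename is not None else model_name
    # Assumption: the saved file carries a ".json" suffix.
    return base / f"{stem}.json"


print(resolve_save_path("my-model"))
print(resolve_save_path("my-model", directory="output", filename="snapshot"))
```

Checking the resolved location before saving can help avoid overwriting an existing schema file of the same name.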
set(*, schema=None)
Update or create the schema.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
schema | Path \| str \| DataSourceModel \| None | A dictionary conforming to the DataSourceModel. | None |
set_citation(*, citation, index=None)
Update or create the citation.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
citation | CitationModel | A dictionary conforming to the CitationModel. | required |
index | int | If there are multiple sources from the source data, provide the index (base 0) for the resource citation. | None |
validate()
Validate that all required fields are returned from the crosswalk.