Create and manage a Schema definition¶
A whyqd (/wɪkɪd/) schema
definition describes the structural organisation of tabular data. Each column is
identified by a field name and defined by conformance to technical specifications. These, along with field constraints
and sensible defaults, ensure interoperability.
In simple terms, the columns in source input data must correspond to the fields defined in your schema. These fields may also be reflected as columns in your database, or analytical software.
If your source data conform to your schema definition, you can develop your software and analysis without any data being present. Data validation with your schema are a separate process and permit you to encapsulate each component of your project and data management.
API
Review the class
API definitions: SchemaDefinition and CRUDFields.
Minimum valid requirements¶
A minimum valid schema requires a name
to identify the schema, and a single, minimally-valid field
containing a
name
and type
:
{
"name": "A simple name",
"fields": [
{
"name": "Field name, e.g. 'column_name'",
"type": "Valid data type, e.g. 'string', 'number'"
}
]
}
Everything else is optional, unless specifically required by that field-type.
Schema descriptors¶
Schema terms include:
name
¶
This is a required term. Spaces will be replaced with _
and the string will be lowercased.
title
¶
A human-readable version of the schema name.
description
¶
A complete description of the schema. Depending on how complex your work becomes, try and be as helpful as possible to 'future-you'. You'll thank yourself later.
missingValues
¶
missingValues
indicates which string values should be treated as null values. There could be a variety of these, such
as '..', or '-'.
primaryKey
¶
A field or set of fields which uniquely identifies each row in the table. Specify using the name
of relevant fields.
Data in this field will not be tested for uniqueness. Instead, these data will remain immutable, not being 'forced' into a date or number type to preserve whatever fruity formatting are described in your input data.
index
¶
Maximum value of a zero-base index for tabular data defined by this schema. Necessary where actions
apply row-level transforms.
citation
¶
Full citation for definition. More information can be found in the data source section.
version
¶
Version and update history for the schema. This is automatically generated when you save the definition. It includes a
minimum of updated
with the date. Can also include a name
for the person producing the version, and a description
of the changes or updates made.
Field descriptors¶
Fields, similarly, contain name
, title
and description
, as well as type
as compulsory. The available types are:
string
: Any text-based string (this is the default),number
: Any number-based value, including integers and floats,integer
: Any integer-based value,boolean
: A boolean [true, false] value. Can set category constraints to fix term used,object
: Any valid JSON data,array
: Any valid array-based data,date
: Any date without a time. Must be in ISO8601 format,YYYY-MM-DD
,datetime
: Any date with a time. Must be in ISO8601 format, with UTC time specified (optionally) asYYYY-MM-DD hh:mm:ss Zz
,year
: Any year, formatted asYYYY
.
To see all the parameter options for the SchemaModel
:
import whyqd as qd
qd.models.SchemaModel.schema()
{'title': 'SchemaModel',
'type': 'object',
'properties': {'uuid': {'title': 'Uuid',
'description': 'Automatically generated unique identity for the schema.',
'type': 'string',
'format': 'uuid'},
'name': {'title': 'Name',
'description': 'Machine-readable term to uniquely address this schema. Cannot have spaces. CamelCase or snake_case.',
'type': 'string'},
'title': {'title': 'Title',
'description': 'A human-readable version of the schema name.',
'type': 'string'},
'description': {'title': 'Description',
'description': "A complete description of the schema. Depending on how complex your work becomes, try and be as helpful as possible to 'future-you'. You'll thank yourself later.",
'type': 'string'},
'fields': {'title': 'Fields',
'description': 'A list of fields which define the schema. Fields, similarly, contain `name`, `title` and `description`, as well as `type` as compulsory.',
'default': [],
'type': 'array',
'items': {'$ref': '#/definitions/FieldModel'}},
'version': {'title': 'Version',
'description': 'Version and update history for the schema.',
'default': [],
'type': 'array',
'items': {'$ref': '#/definitions/VersionModel'}}},
'required': ['name'], ...
Allowing you to define your initial schema
:
{
"name": "urban_population",
"title": "Urban population",
"description": "Urban population refers to people living in urban areas as defined by national statistical offices. It is calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects. Aggregation of urban and rural population may not add up to total population because of different country coverages.",
}
name
¶
This is a required term and is equivalent to a column header. It must be defined exactly as it appears in the tabular source data.
By convention, this should be snake_case with spaces replaced with underscore (e.g. field_1
), or as camelCase with
linked words capitalised (e.g. fieldOne
). However, given the range of naming conventions, this can only be a
recommendation.
title
¶
A human-readable version of the field name.
description
¶
A complete description of the field. As for the schema, try and be as helpful as possible to future-you.
dtype
or type
¶
dtype
or type
defines the data-type of the field. The core supported types:
string
: any text-based string.number
: any number-based value, including integers and floats.integer
: any integer-based value.boolean
: a boolean [true
,false
] value. Can set category constraints to fix term used.object
: any valid JSON data.array
: any valid array-based data.date
: any date without a time. Must be in ISO8601 format,YYYY-MM-DD
.datetime
: any date with a time. Must be in ISO8601 format, with UTC time specified (optionally) asYYYY-MM-DD hh:mm:ss Zz
.year
: any year, formatted asYYYY
.
Since the type
variable is protected in Python, you'll see it used interchangeably as type
or dtype
depending on
the context. To comply with the JSON Schema definitions, JSON outputs will convert the field name to type
.
example
¶
An example value, as a string, for the field.
Field constraints¶
Constraints
are optional parameters that act as a primary form of validation. Not all of these are available to every
type
, and default_field_settings(type)
will list constraints available to a specific field type.
Define these as part of your schema definition for a specific field:
{
"name": "indicator_code",
"title": "Indicator Code",
"type": "string",
"description": "World Bank code reference for Indicator Name.",
"constraints": {"required": True, "unique": True},
}
All available constraints:
required
:boolean
, indicates whether this field is compulsory (but blank values in the input column are permitted and will be set to the missing default)unique
:boolean
, ifTrue
then all values for that input column must be uniquedefault
: Default category (or string) term used when source values are ambiguous, or unstated.category
: The set of unique category terms permitted in this field, withname
& (optional)description
.minimum
:integer
/number
, as appropriate defining min number of characters in a string, or the min values of numbers or integersmaximum
:integer
/number
, as appropriate defining max number of characters in a string, or the max values of numbers or integers
category
¶
Category
data are the set of unique category terms permitted in this field. When you define your crosswalk you can
define values which should be assigned to each of these categories.
In JSON Schema, this is called enum
. In the whyqd
API, you will refer to .category
to
reference the list of categories. However, in the json output files, these will be referenced as enum
for compliance
with the standard.
Define these as part of your schema definition for a specific field:
{
"name": "test_field",
"type": "string",
"constraints": {
"required": True,
"category": [
{"name": "dog", "description": "A type of mammal"},
{"name": "cat", "description": "A different type of mammal"},
{"name": "mouse", "description": "A small type of mammal"},
],
"default": {"name": "dog", "description": "A type of mammal"},
},
}
Each category
can have a name
, and a description
, but the minimum is a name
.
Each field type
will have its own category constraints. For example, boolean categories can use a different term than
True / False defined by the category, but only permits two terms. Others have a minimum of one term in a category, but
require the list member type to be string
, number
, etc. Ordinarily, category
terms must be unique.
Creating a Schema¶
The objective of your schema is not only to define a structure for your data, but also provide reference and contextual information for anyone using it. In a research context, definitions are critical to avoid ambiguity, ensure replication, and build trust.
You can import a schema definition from a file, or you can build it interactively, as shown here. We'll start by importing whyqd and defining a new schema.
The minimum requirement for a schema is that it have a name
, but we're going to give it a title
and description
as well, because more information is better. We're not barbarians:
import whyqd as qd
schema: qd.models.SchemaModel = {
"name": "urban_population",
"title": "Urban population",
"description": "Urban population refers to people living in urban areas as defined by national statistical offices.",
}
schema_destination = qd.SchemaDefinition()
schema_destination.set(schema=schema)
directory
:
directory = "/path/to/directory"
filename = "urban_population_2020"
schema.save(directory=directory, filename=filename, created_by="Gavin Chait")
If the optional filename
is not provided, the name you specified in the schema
dictionary will be used. The file is
a JSON format text file, but will have the extension .schema
. A version history will be automatically created, and you
can add your name as created_by
.
We'll now start to create each of our schema fields
.
Info
You can think of a schema field
as a column
in a table, or a field
in a database. Each field, unsurprisingly,
has a name
, title
and description
, of which only the name
is required.
Fields have a type
, such as number
or string
. This describes the data expected and limits the actions which
can be performed during the wrangling process
We want our destination data to conform to the following structure:
la_code | ba_ref | occupant_name | postcode | occupation_state | occupation_state_date | prop_ba_rates | occupation_state_reliefs |
---|---|---|---|---|---|---|---|
E06000044 | 177500080710 | A company | PO5 2SE | True | 2019-04-01 | 98530 | [small_business, retail] |
We'll build a single dictionary and then iterate over the list to add each field:
import whyqd as qd
schema: qd.models.SchemaModel = {
"name": "rates_data",
"title": "Commercial rates data",
"description": "Standardised schema for archival and analysis of commercial / non-domestic rates data.",
}
fields: list[qd.models.FieldModel] = [
{
"name": "la_code",
"title": "Local authority code",
"type": "string",
"description": "Standard code for local authority."
},
{
"name": "ba_ref",
"title": "Billing reference",
"type": "string",
"description": "Unique code for a specific hereditament. May be multiple rows for history."
},
{
"name": "prop_ba_rates",
"title": "Property billing rates",
"type": "number",
"description": "Actual rates paid by a specific ratepayer."
},
{
"name": "occupant_name",
"title": "Occupier name",
"type": "string",
"description": "Name of the ratepayer."
},
{
"name": "postcode",
"title": "Postcode",
"type": "string",
"description": "Full address or postcode of ratepayer."
},
{
"name": "occupation_state",
"title": "Occupation state",
"type": "boolean",
"description": "Occupation status, void or occupied."
},
{
"name": "occupation_state_date",
"title": "Date of occupation state",
"type": "date",
"description": "Date of the start of status in occupation_state."
},
{
"name": "occupation_state_reliefs",
"title": "Occupation state reliefs",
"type": "array",
"description": "Array of the categories of reliefs / exemptions applied."
}
]
schema_destination = qd.SchemaDefinition()
schema_destination.set(schema=schema)
schema_destination.fields.add_multi(terms=fields)
You could also add
:
for field in fields:
schema.fields.add(terms=field)
From here on we can access any field
by calling it by name
and then updating it as required:
schema.fields.get(name="occupation_state_reliefs")
{'name': 'occupation_state_reliefs',
'type': 'array',
'title': 'Occupation state reliefs',
'description': 'Array of the categories of reliefs / exemptions applied.'}
Let's add a list of category
terms as a constraint for occupation_state_reliefs
:
categories = ["small_business", "rural", "charity", "enterprise_zone", "vacancy", "hardship", "retail", "discretionary", "exempt", "transitional", "other"]
constraints = {
"categories": [{
"name": category for category in categories
}]
}
schema.fields.set_constraints(name="occupation_state_reliefs", constraints=constraints)
schema.fields.get(name="occupation_state_reliefs").dict(by_alias=True, exclude_defaults=True, exclude_none=True)
{'uuid': UUID('cf4d066e-22a8-4b76-8956-f6120eec4c52'),
'name': 'occupation_state_reliefs',
'title': 'Occupation state reliefs',
'description': 'Array of the categories of reliefs / exemptions applied.',
'type': 'array',
'constraints': {'enum': [{'uuid': UUID('daa206a9-ac8c-41a9-a504-06410780ee50'),
'name': 'small_business'},
{'uuid': UUID('5964e9fc-dd50-4856-acdc-2326ea48ef1d'), 'name': 'rural'},
{'uuid': UUID('498654f9-8825-4f3d-a573-0c110726fba4'), 'name': 'charity'},
{'uuid': UUID('f94353ce-a489-4fb1-ad78-5435b3dd54a4'),
'name': 'enterprise_zone'},
{'uuid': UUID('41285fc0-2321-4542-b7f1-e8e535588559'), 'name': 'vacancy'},
{'uuid': UUID('28068ff2-15ff-409a-9a8f-f97c39407812'), 'name': 'hardship'},
{'uuid': UUID('b8041d21-f8ca-47b9-b3fe-7b9077388459'), 'name': 'retail'},
{'uuid': UUID('83bda0d4-3d94-4738-a580-cfe0881c8e4d'),
'name': 'discretionary'},
{'uuid': UUID('ff2cbc0c-839b-430c-bdca-ac4238634f05'), 'name': 'exempt'},
{'uuid': UUID('f4300571-c04b-4cbf-b835-16c5ae3343b0'),
'name': 'transitional'},
{'uuid': UUID('8a3af6f4-f48c-4614-83f2-ba472b2129e9'), 'name': 'other'}]}}
The term .dict(by_alias=True, exclude_defaults=True, exclude_none=True)
is used to extract a dictionary format from
the underlying Pydantic model used by whyqd
.
Info
These are the official business rates reliefs permitted by the UK government. Unsurprisingly, only by accident do any local authorities actually use these terms when awarding a relief.
Review your schema, then save
and we're ready to begin schema-to-schema conversion:
schema.get.dict(by_alias=True, exclude_defaults=True, exclude_none=True)
{'uuid': UUID('19692345-2caf-46b1-9a8f-276491520c6b'),
'name': 'test_schema',
'title': 'Test Schema',
'description': 'A test Schema',
'fields': [{'uuid': UUID('615d2cd0-f8b6-4449-b3d2-642fa4836888'),
'name': 'la_code',
'title': 'Local authority code',
'description': 'Standard code for local authority.',
'type': 'string',
'constraints': {'default': {'uuid': UUID('579342cd-bba8-41cd-bf45-3c517b8cd75e'),
'name': 'E06000044'}}},
{'uuid': UUID('95f5c53c-59e1-4bb7-917d-7177b01d2d3c'),
'name': 'ba_ref',
'title': 'Billing reference',
'description': 'Unique code for a specific hereditament. May be multiple rows for history.',
'type': 'string'},
{'uuid': UUID('7572ae3e-d725-4897-84fb-5c5b45bd4edb'),
'name': 'prop_ba_rates',
'title': 'Property billing rates',
'description': 'Actual rates paid by a specific ratepayer.',
'type': 'number'},
{'uuid': UUID('ac76c3ab-5ef8-4641-99ec-aab2c5b7414c'),
'name': 'occupant_name',
'title': 'Occupier name',
'description': 'Name of the ratepayer.',
'type': 'string'},
{'uuid': UUID('26440eba-fd1d-40af-a52c-a9351fad2fd9'),
'name': 'postcode',
'title': 'Postcode',
'description': 'Full address or postcode of ratepayer.',
'type': 'string'},
{'uuid': UUID('28d7863b-22fa-4bd5-a221-0607643f0111'),
'name': 'occupation_state',
'title': 'Occupation state',
'description': 'Occupation status, void or occupied.',
'type': 'boolean',
'constraints': {'enum': [{'uuid': UUID('353bd4ac-d677-47c4-af40-6f651af2cc5e'),
'name': True},
{'uuid': UUID('33f8b2f8-9ac5-412a-9507-879bb7f845ce'), 'name': False}],
'default': {'uuid': UUID('353bd4ac-d677-47c4-af40-6f651af2cc5e'),
'name': True}}},
{'uuid': UUID('79a70822-4e24-4a68-9036-992def200cd6'),
'name': 'occupation_state_date',
'title': 'Date of occupation state',
'description': 'Date of the start of status in occupation_state.',
'type': 'date'},
{'uuid': UUID('cf4d066e-22a8-4b76-8956-f6120eec4c52'),
'name': 'occupation_state_reliefs',
'title': 'Occupation state reliefs',
'description': 'Array of the categories of reliefs / exemptions applied.',
'type': 'array',
'constraints': {'enum': [{'uuid': UUID('daa206a9-ac8c-41a9-a504-06410780ee50'),
'name': 'small_business'},
{'uuid': UUID('5964e9fc-dd50-4856-acdc-2326ea48ef1d'), 'name': 'rural'},
{'uuid': UUID('498654f9-8825-4f3d-a573-0c110726fba4'), 'name': 'charity'},
{'uuid': UUID('f94353ce-a489-4fb1-ad78-5435b3dd54a4'),
'name': 'enterprise_zone'},
{'uuid': UUID('41285fc0-2321-4542-b7f1-e8e535588559'), 'name': 'vacancy'},
{'uuid': UUID('28068ff2-15ff-409a-9a8f-f97c39407812'),
'name': 'hardship'},
{'uuid': UUID('b8041d21-f8ca-47b9-b3fe-7b9077388459'), 'name': 'retail'},
{'uuid': UUID('83bda0d4-3d94-4738-a580-cfe0881c8e4d'),
'name': 'discretionary'},
{'uuid': UUID('ff2cbc0c-839b-430c-bdca-ac4238634f05'), 'name': 'exempt'},
{'uuid': UUID('f4300571-c04b-4cbf-b835-16c5ae3343b0'),
'name': 'transitional'},
{'uuid': UUID('8a3af6f4-f48c-4614-83f2-ba472b2129e9'),
'name': 'other'}]}}]}
schema.save(directory=directory, filename=filename, created_by="Gavin Chait")
Whyqd's data source strategies show you how to derive a schema to reflect source data.