Schema

Creating a schema is the first part of the wrangling process. Your schema defines the structural metadata target for your wrangling process. This is not the format your input data arrive in, but it is what you require it to look like when you’re done.

Your schema sets the requirements, constraints and sensible defaults available for creating each Method that will describe the process for wrangling input data into the fields defined by the schema. It can be reused to create multiple methods. Once complete, automated scripts can perform further cleaning and validation.

In simple terms, the columns in an input CSV or Excel-file will be restructured into new columns defined by the fields in your schema. These target fields are likely to be those in your database, or in your analytical software. Until your input data conform to this structure, you can’t do your real work.

Minimum valid requirements

A minimum valid schema requires a name to identify the schema, and a single, minimally-valid field containing a name and type:

{
    "name": "A simple name",
    "fields": [
    {
        "name": "Field name, e.g. 'column_name'",
        "type": "Valid data type, e.g. 'string', 'number'"
    }
    ]
}

Everything else is optional, unless specifically required by that field-type.

Schema descriptors

Schema terms include:

name

This is a required term. Spaces will be replaced with _ and the string will be lowercased.

title

A human-readable version of the schema name.

description

A complete description of the schema. Depending on how complex your work becomes, try and be as helpful as possible to ‘future-you’. You’ll thank yourself later.

Field descriptors

Fields, similarly, contain name, title and description, as well as type as compulsory. The available types are:

  • string: Any text-based string (this is the default),
  • number: Any number-based value, including integers and floats,
  • integer: Any integer-based value,
  • boolean: A boolean [true, false] value. Can set category constraints to fix term used,
  • object: Any valid JSON data,
  • array: Any valid array-based data,
  • date: Any date without a time. Must be in ISO8601 format, YYYY-MM-DD,
  • datetime: Any date with a time. Must be in ISO8601 format, with UTC time specified (optionally) as YYYY-MM-DD hh:mm:ss Zz,
  • year: Any year, formatted as YYYY.

To see all the parameter options for the SchemaModel:

>>> import whyqd
>>> whyqd.models.SchemaModel.schema()

   {'title': 'SchemaModel',
    'type': 'object',
    'properties': {'uuid': {'title': 'Uuid',
    'description': 'Automatically generated unique identity for the schema.',
    'type': 'string',
    'format': 'uuid'},
    'name': {'title': 'Name',
    'description': 'Machine-readable term to uniquely address this schema. Cannot have spaces. CamelCase or snake_case.',
    'type': 'string'},
    'title': {'title': 'Title',
    'description': 'A human-readable version of the schema name.',
    'type': 'string'},
    'description': {'title': 'Description',
    'description': "A complete description of the schema. Depending on how complex your work becomes, try and be as helpful as possible to 'future-you'. You'll thank yourself later.",
    'type': 'string'},
    'fields': {'title': 'Fields',
    'description': 'A list of fields which define the schema. Fields, similarly, contain `name`, `title` and `description`, as well as `type` as compulsory.',
    'default': [],
    'type': 'array',
    'items': {'$ref': '#/definitions/FieldModel'}},
    'version': {'title': 'Version',
    'description': 'Version and update history for the schema.',
    'default': [],
    'type': 'array',
    'items': {'$ref': '#/definitions/VersionModel'}}},
    'required': ['name'], ...

Allowing you to declare your initial schema:

{
    "name": "urban_population",
    "title": "Urban population",
    "description": "Urban population refers to people living in urban areas as defined by national statistical offices. It is calculated using World Bank population estimates and urban ratios from the United Nations World Urbanization Prospects. Aggregation of urban and rural population may not add up to total population because of different country coverages.",
}

name

This is a required term and is equivalent to a column header. Spaces will be replaced with _ and the string will be lowercased.

title

A human-readable version of the field name.

description

A complete description of the field. As for the schema, try and be as helpful as possible to future-you.

foreignKey

This is a boolean term, only required if you need this field to be treated as a foreign-key or identifier for your destination data:

"foreignKey": True

Data in this field will not be tested for uniqueness. Instead, these data will remain immutable, not being ‘forced’ into a date or number type to preserve whatever fruity formatting are described in your input data.

During the wrangling process, this field can be used for merging with other input data, ensuring consistency between sources.

type

type defines the data-type of the field. The core supported types:

  • string: any text-based string.
  • number: any number-based value, including integers and floats.
  • integer: any integer-based value.
  • boolean: a boolean [true, false] value. Can set category constraints to fix term used.
  • object: any valid JSON data.
  • array: any valid array-based data.
  • date: any date without a time. Must be in ISO8601 format, YYYY-MM-DD.
  • datetime: any date with a time. Must be in ISO8601 format, with UTC time specified (optionally) as YYYY-MM-DD hh:mm:ss Zz.
  • year: any year, formatted as YYYY.

missing

missing defines the value to be used for any blank values in a column. This is normally “” for text and np.nan for numbers or dates. You can, however, set your own defaults for each field.

Field constraints

Constraints are optional parameters that refine input data wrangling and act as a primary form of validation. Not all of these are available to every type, and default_field_settings(type) will list constraints available to a specific field type.

Define these as part of your schema definition for a specific field:

{
    "name": "Indicator Code",
    "title": "Indicator Code",
    "type": "string",
    "description": "World Bank code reference for Indicator Name.",
    "constraints": {"required": True, "unique": True},
}

All available constraints:

  • required: boolean, indicates whether this field is compulsory (but blank values in the input column are permitted and will be set to the missing default)
  • unique: boolean, if true then all values for that input column must be unique
  • minimum: integer / number, as appropriate defining min number of characters in a string, or the min values of numbers or integers
  • maximum: integer / number, as appropriate defining max number of characters in a string, or the max values of numbers or integers

There are two others, category and default.

Field constraints: category

Category data are the set of unique category terms permitted in this field. During wrangling you will be able to define values which should be assigned to each of these categories.

In JSON Schema, this is called enum, which isn’t particularly meaningful. In the whyqd API, you will refer to .category to reference the list of categories. However, in .json output files, these will be referenced as enum for compliance with the standard.

Define these as part of your schema definition for a specific field:

{
    "name": "test_field",
    "type": "string",
    "constraints": {
        "required": True,
        "category": [
            {"name": "dog", "description": "A type of mammal"},
            {"name": "cat", "description": "A different type of mammal"},
            {"name": "mouse", "description": "A small type of mammal"},
        ],
        "default": {"name": "dog", "description": "A type of mammal"},
    },
}

Each category can have a name, and a description, but the minimum is a name. You can also define a default to be used where source data are not defined.

Each field type will have its own category constraints. For example, boolean categories can use a different term than True / False defined by the category, but only permits two terms. Others have a minimum of one term in a category, but require the list member type to be string, number, etc. Ordinarily, category terms must be unique.