Schema

Creating a schema is the first part of the wrangling process. Your schema defines the structural metadata target for your wrangling process. This is not the format your input data arrive in, but it is what you require it to look like when you’re done.

Your schema sets the requirements, constraints and sensible defaults available for creating each method that will describe the process for wrangling input data into the fields defined by the schema. It can be reused to create multiple methods. Once complete, automated scripts can perform further cleaning and validation.

In simple terms, the columns in an input CSV or Excel-file will be restructured into new columns defined by the fields in your schema. These target fields are likely to be those in your database, or in your analytical software. Until your input data conform to this structure, you can’t do your real work.

Minimum valid requirements

A minimum valid schema requires a name to identify the schema, and a single, minimally-valid field containing a name and type:

{
  "name": "A simple name",
  "fields": [
        {
                "name": "Field name, e.g. 'column_name'",
                "type": "Valid data type, e.g. 'string', 'number'"
        }
  ]
}

Everything else is optional, unless specifically required by that field-type.

Schema descriptors

Schema terms include:

name

This is a required term. Spaces will be replaced with _ and the string will be lowercased.

title

A human-readable version of the schema name.

description

A complete description of the schema. Depending on how complex your work becomes, try and be as helpful as possible to ‘future-you’. You’ll thank yourself later.

Field descriptors

Fields, similarly, contain name, title and description, as well as type as compulsory. To see the available field types:

>>> sc = _d.Schema()
>>> sc.default_field_types
['string',
'number',
'integer',
'boolean',
'object',
'array',
'date',
'datetime',
'year']

To see further parameter options for each default type:

>>> sc.default_field_settings('string')
{
        'required': ['name', 'type'],
        'name': 'field_name',
        'title': 'A human-readable version of the field name',
        'description': 'Any text-based string.',
        'type': 'string',
        'format': {
                'description': 'The format keyword options for `string` are `default`, `email`, `uri`, `binary`, and `uuid`.',
                'category': ['default', 'email', 'uri', 'binary', 'uuid'],
                'default': 'default'
                },
        'foreignKey': {
                'type': 'boolean',
                'description': 'Set `foreignKey` `true` if the field is to be treated as an immutable value.'
                },
        'constraints': {
                'description': 'The following constraints are supported.',
                'properties': {
                        'required': {
                                'type': 'boolean',
                                'description': 'Indicates whether a property must have a value for each instance.'
                        },
                        'unique': {
                                'type': 'boolean',
                                'description': 'When `true`, each value for the property `MUST` be unique.'
                        },
                'category': {
                        'type': 'array',
                        'minItems': 1,
                        'uniqueItems': True,
                        'terms': {
                                'type': 'string'
                                }
                        },
                'minimum': {
                        'type': 'integer',
                        'description': 'An integer that specifies the minimum length of a value.'
                        },
                'maximum': {
                        'type': 'integer',
                        'description': 'An integer that specifies the maximum length of a value.'
                        }
                }
        },
'missing': {
        'description': 'Default to be used for missing values.',
        'default': ''
        }
}

name

This is a required term and is equivalent to a column header. Spaces will be replaced with _ and the string will be lowercased.

title

A human-readable version of the field name.

description

A complete description of the field. As for the schema, try and be as helpful as possible to future-you.

foreignKey

This is a boolean term, only required if you need this field to be treated as a foreign-key or identifier for your destination data:

"foreignKey": True

Data in this field will not be tested for uniqueness. Instead, these data will remain immutable, not being ‘forced’ into a date or number type to preserve whatever fruity formatting are described in your input data.

During the wrangling process, this field can be used for merging with other input data, ensuring consistency between sources.

type and format

type defines the data-type of the field, while format - which is currently unsupported in wrangling - further refines the specific type properties. The core supported types, with indents for formats:

  • string: any text-based string.
    • default: any string
    • email: an email address
    • uri: any web address / URI
  • number: any number-based value, including integers and floats.
  • integer: any integer-based value.
  • boolean: a boolean [true, false] value. Can set category constraints to fix term used.
  • object: any valid JSON data.
  • array: any valid array-based data.
  • date: any date without a time. Must be in ISO8601 format, YYYY-MM-DD.
  • datetime: any date with a time. Must be in ISO8601 format, with UTC time specified (optionally) as YYYY-MM-DD hh:mm:ss Zz.
  • year: any year, formatted as YYYY.

missing

missing defines the value to be used for any blank values in a column. This is normally “” for text and np.nan for numbers or dates. You can, however, set your own defaults for each field.

Field constraints

Constraints are optional parameters that refine input data wrangling and act as a primary form of validation. Not all of these are available to every type, and default_field_settings(type) will list constraints available to a specific field type.

Define these as part of your schema definition for a specific field:

{
  "name": "A simple name",
  "fields": [
        {
                "name": "Field name, e.g. 'column_name'",
                "type": "Valid data type, e.g. 'string', 'number'",
                "constraints": {
                        "required": True,
                        "unique": True
                }
        }
  ]
}

All available constraints:

  • required: boolean, indicates whether this field is compulsory (but blank values in the input column are permitted and will be set to the missing default)
  • unique: boolean, if true then all values for that input column must be unique
  • minimum: integer / number, as appropriate defining min number of characters in a string, or the min values of numbers or integers
  • maximum: integer / number, as appropriate defining max number of characters in a string, or the max values of numbers or integers

Field constraints: category

In FrictionlessData.io, this is called enum, which isn’t particularly meaningful. Category data are the set of unique category terms permitted in this field. During wrangling you will be able to define values which should be assigned to each of these categories.

Define these as part of your schema definition for a specific field:

{
  "name": "A simple name",
  "fields": [
        {
                "name": "Field name, e.g. 'column_name'",
                "type": "Valid data type, e.g. 'string', 'number'",
                "constraints": {
                        "category": ["cheddar", "gouda", "other"]
                }
        }
  ]
}

Each field type will have its own category constraints. For example, boolean categories can use a different term than True / False defined by the category, but only permits two terms. Others have a minimum of one term in a category, but require the list member type to be string, number, etc. Ordinarily, category terms must be unique.

Review the default_field_settings(type) for that field’s specific category constraints.

Field constraints: filter

Filters are a constraint that filter a named field, or the foreignKey, by date-limited data.

Define these as part of your schema definition for a valid field:

{
  "name": "A simple name",
  "fields": [
        {
                "name": "Field name, e.g. 'column_name'",
                "type": "Valid data type, e.g. 'date', 'datetime'",
                "filter": {
                        "field": "foreignKey",
                        "modifiers": ["LATEST", "AFTER"]
                }
        }
  ]
}

There are two compulsory parameters defining a filter:

  • field: another field which is the subject of this filter, or by default the ‘foreignKey’.
  • modifiers: an array of permitted filter terms, including any of [“LATEST”, “AFTER”, “BEFORE”, “ALL”].

call default_filter_names() to get a list, and default_filter_settings(filter_name) to get a definition

For example, to filter all foreign keys (which may be duplicated as part of a time-series) to be more recent than a specified date, include “AFTER” in your list of filter modifiers.