Skip to content

Tutorial 4: Transforming data containing American dates, currencies as strings and misaligned columns

whyqd (/wɪkɪd/) can support you with even the most poorly-formatted and misaligned data through judicious use of schema definitions and array transforms.

Learning outcomes

  • Develop and apply a transformation strategy
  • Apply data types to a source schema to coerce format corrections
  • Collate misaligned data columns into same-length arrays
  • Explode array columns into single-value long-form transformed data

SOURCE_DATA are from Basildon City Council and it is assumed you have familiarity with Python and Pydantic.

SOURCE_DATA = "https://github.com/whythawk/whyqd/raw/master/tests/data/raw-e07000066-tutorial-4.xlsx"
MIMETYPE = "xlsx"
SCHEMA_DESTINATION = "https://raw.githubusercontent.com/whythawk/whyqd/master/tests/data/test_schema.json"

Background

This is a similar use-case as in the first tutorial based on our openLocal.uk project, which is a quarterly-updated commercial location database, aggregating open data on vacancies, rental valuations, rates and ratepayers, into an integrated time-series database of individual retail, industrial, office and leisure business units.

In this tutorial, the source data have been deliberately made worse to illustrate commonly-occuring challenges. These are problems we regularly experience, just not always in one spreadsheet.

Our source contains the following:

  • Dates stored as US-formatted strings ("MM/DD/YYYY")
  • Numbers stored as currency strings ("£45,234.0")
  • Misaligned fields (there are four tax relief types, but we only have relief amounts for three of these)

We will use whyqd (/wɪkɪd/) to transform the data into our schema, and then pandas to convert our array values into a long-form table, using explode:

DataFrame.explode(column, ignore_index=False)

Strategy

Strategy

Our source data are in a wide format, meaning that values we would like to be in a single column are provided in columns in the source. There's a number of ways we can go about pivoting from wide to long, and here we will assemble an array of values, and then explode these fields afterwards. We must be very careful to ensure that the sequence of each array corresponds so that terms on one array refer to the term at the same position in another.

Our source data also include some peculiar data type formatting which can be very stubborn to correct. These include numbers formatted as currency strings. As example, "£45,234.0" where the £ and , are not stylistic but present in the "number". Most programs will see this as text. Adding "£45,234.0" to "£45,234.0" will get you "£45,234.0£45,234.0", not "£90,468.0".

We also have to deal with American date formats. Americans are the only country in the world to use this format, and it causes incredible headaches. How should your program interpret something like 1\04\03? Any of these terms can be any of a month, day or year. There is no way to algorithmically determine this. You - as the data curator - will need to make a format decision.

Software can speed up, simplify and make transparent any curation decisions you make, but it cannot replace the insight which a data owner brings to bear. Don't leave these decisions to the software.

Define a destination schema

We want our destination data to conform to the following structure:

la_code ba_ref occupant_name postcode occupation_state occupation_state_date prop_ba_rates occupation_state_reliefs
E07000066 177500080710 A company PO5 2SE True 2019-04-01 [98530, None, 1234, None] [small_business, None, retail, None]

We're going to reuse the schema we developed in the first tutorial but make sure that the dtype is an array so we can assemble our destination data:

import whyqd as qd

schema_destination = qd.SchemaDefinition()
schema_destination.set(schema=SCHEMA_DESTINATION)
for field in schema_destination.fields.get_all():
    if field.name in ["occupation_state_reliefs", "prop_ba_rates"]:
        field.dtype = "array"

This is our curation foundation for this tutorial and we can save it, ensuring citation and version control.

Set DIRECTORY, FILENAME and CREATOR as per your requirements.

schema_destination.save(directory=DIRECTORY, filename=FILENAME, created_by=CREATOR)

We'll reference this definition saved source path as SCHEMA_DESTINATION in the rest of this tutorial.

Source data and source schema definitions

Basildon's data are well-structured but incredibly poor quality:

PlaceRef FOIName PropAddress1 PropAddress2 PropAddress3 PropAddress4 PropAddress5 PropPostCode PropDesc RateableVal MandRlfCd MandRlf DiscRlfCd DiscRlf AddRlfCd AdditionalRlf LiabStart RenewableRV RnwEn SBRFlag ChgType
0 '0664302 Childs Property Ltd 31-33 The Broadway Wickford Essex SS11 7AD SHOP AND PREMISES £156,000 £80 £80 £80 2018-08-01 00:00:00 £80 no no V
1 '1065061 DVS GHL Basildon Limited Aviva House Southernhay Basildon Essex SS14 1EZ OFFICES INCAPABLE OF BENEFICAL OCCUPATION £157,000 £88 £88 £88 2023-06-03 00:00:00 £88 no no V
2 '1065061 Infrared Uk Lion Gp Ltd (in Admin) Aviva House Southernhay Basildon Essex SS14 1EZ OFFICES INCAPABLE OF BENEFICAL OCCUPATION £158,000 £128 £128 £128 2022-01-04 00:00:00 £128 no no V
3 '1091914 Kames Capital UK A V P Unit Trust Suite 3d Southgate House 88 Town Square Basildon Essex SS14 1DT UNDERGOING REDEVELOPMENT £158,000 £128 £128 £128 1/14/15 £128 no no V
4 '0648793 Orwell (Basildon) Limited 62-64 Town Square Basildon Essex SS14 1DT UNERGOING REDEVELOPMENT £158,000 £150 £150 £150 2/14/20 £150 no no O
5 '0648760 Orwell (Basildon) Limited 51-52 Town Square Basildon Essex SS14 1DT PREMISES UNDERGOING RECONSTRUCTION £158,000 £150 £150 £150 2/14/20 £150 no no O
6 '0648748 Orwell (Basildon) Limited 38-40 Town Square Basildon Essex SS14 1DT PREMISES UNDERGOING RECONSRTUCTION £158,000 £152 £152 £152 2/14/20 £152 no no O
7 '0648679 Orwell (Basildon) Limited 22-24 Town Square Basildon Essex SS14 1DT BLDING UNDERGOING WORKS £159,000 £168 £168 £168 2/14/20 £168 no no O
8 '0720730 Orwell (Basildon) Limited 50-52 Town Square Basildon Essex SS14 1DT UNDER RECONSTRUCTION £159,000 £177 £177 £177 2/14/20 £177 no no O
9 '0648726 Orwell (Basildon) Limited 30-36 Town Square Basildon Essex SS14 1DT SHOP & PREMISES £16,000 £200 £200 £200 2/14/20 £200 no no O

We're going to use whyqd's builtin data type coercion to interpret these problems. We do that through setting the dtype:

  • Column LiabStart is of dtype usdate,
  • Columns MandRlf, DiscRlf, and AdditionalRlf are of dtype number.
CATEGORY_FIELDS = ["MandRlfCd", "DiscRlfCd", "AddRlfCd", "SBRFlag", "ChgType"]
datasource = qd.DataSourceDefinition()
datasource.derive_model(source=SOURCE_DATA, mimetype=MIMETYPE)
schema_source = qd.SchemaDefinition()
schema_source.derive_model(data=datasource.get)
for field in schema_source.fields.get_all():
    if field.name in ["LiabStart"]:
        field.dtype = "usdate"
    if field.name in ["MandRlf", "DiscRlf", "AdditionalRlf"]:
        field.dtype = "number"
for cat_field in CATEGORY_FIELDS:
    if cat_field in datasource.get_data().columns:
        schema_source.fields.set_categories(name=cat_field, terms=datasource.get_data())

Now we can rely on whyqd to correctly interpret the meaning of the values in these columns for the crosswalk operations that follow.

Transformations with crosswalks

Let's define our crosswalk script:

SCRIPTS = [
    "NEW > 'la_code' < ['E07000066']",
    "RENAME > 'ba_ref' < ['PlaceRef']",
    "RENAME > 'occupant_name' < ['FOIName']",
    "RENAME > 'occupation_state_date' < ['LiabStart']",
    "UNITE > 'postcode' < ', '::['PropAddress1','PropAddress2','PropAddress3','PropAddress4','PropAddress5','PropPostCode']",
    "CATEGORISE > 'occupation_state_reliefs'::'exempt' < 'MandRlfCd'::['CASC','EDUC80','MAND80','PCON','POSTO2']",
    "CATEGORISE > 'occupation_state_reliefs'::'discretionary' < 'DiscRlfCd'::['DIS100','DISC10','DISC15','DISC30','DISC40','DISC50','DISCXX','POSTOF']",
    "CATEGORISE > 'occupation_state_reliefs'::'retail' < 'AddRlfCd'::['RETDS3']",
    "CATEGORISE > 'occupation_state_reliefs'::'small_business' < 'SBRFlag'::['yes']",
    "CATEGORISE > 'occupation_state'::False < 'ChgType'::['V']",
    "COLLATE > 'prop_ba_rates' < ['MandRlf', 'DiscRlf', 'AdditionalRlf', ~]"
]

Most of this should be self-explanatory from the previous tutorials, but here we must pay attention to the order of the scripts. Four of the CATEGORISE scripts refer to the same occupation_state_reliefs destination field. However, when you check the corresponding source data, you'll see that we only have three reliefs value columns. There is no relief amount for the small_business category.

If we didn't recognise this, we'd end up with four values in the array for each row the occupation_state_reliefs column, and only three in the prop_ba_rates column. We need to use a spacer modifier, here defined as ~.

This allows us to define the COLLATE script, with the ~ for the missing column:

"COLLATE > 'prop_ba_rates' < ['MandRlf', 'DiscRlf', 'AdditionalRlf', ~]"

We can now go ahead and run the crosswalk and transformation:

# Define a Crosswalk
crosswalk = qd.CrosswalkDefinition()
crosswalk.set(schema_source=schema_source, schema_destination=schema_destination)
crosswalk.actions.add_multi(terms=SCRIPTS)
crosswalk.save(directory=DIRECTORY)
# Transform a data source
transform = qd.TransformDefinition(crosswalk=crosswalk, data_source=datasource.get)
transform.process()
transform.save(directory=DIRECTORY, mimetype=DESTINATION_MIMETYPE)
# Validate a data source
DESTINATION_DATA = DIRECTORY / transform.model.dataDestination.name
TRANSFORM = DIRECTORY / f"{transform.model.name}.transform"
valiform = qd.TransformDefinition()
valiform.validate(
    transform=TRANSFORM, data_destination=DESTINATION_DATA, mimetype_destination=DESTINATION_MIMETYPE
)

Our output data look like this:

la_code ba_ref prop_ba_rates occupant_name postcode occupation_state occupation_state_date occupation_state_reliefs
200 E07000066 '1157466 [850.0, 850.0, 850.0, None] Unit 2, Great Broomfields, Cranfield Park Road, Wickford, Essex, SS12 9EP True 2021-01-01 00:00:00 [None, None, None, 'small_business']
201 E07000066 '1033734 [850.0, 850.0, 850.0, None] Unit 3, Tiffaynes Farm, Burnt Mills Road, North Benfleet, Wickford Essex, SS12 9JX False 2011-01-04 00:00:00 [None, None, None, None]
202 E07000066 '1139793 [850.0, 850.0, 850.0, None] Alpi Uk Ltd Car Spaces X 6 Alpi, At Alpi House, Miles Gray Road, Basildon, Essex, SS14 3BZ True 2018-01-01 00:00:00 [None, None, None, None]
203 E07000066 '1117835 [850.0, 850.0, 850.0, None] A12 Electrical & Technical Limited 13 Bowers Court Drive, Bowers Gifford, Basildon, Essex, SS13 2HH True 2010-01-04 00:00:00 [None, None, None, 'small_business']
204 E07000066 '1002284 [850.0, 850.0, 850.0, None] Rotamead Limited Portakabin Building At, Sadlers Hall Farm, London Road, Basildon, Essex, SS13 2HD False 2020-03-27 00:00:00 [None, None, None, None]
205 E07000066 '1099134 [850.0, 850.0, 850.0, None] The Electricity Network Company Limited Independent Distribution Network Operator, Phase 1 Gloucester Park, Broadmayne, Basildon, Essex, SS14 2EB True 2013-01-04 00:00:00 [None, None, None, None]
206 E07000066 '1014671 [850.0, 850.0, 850.0, None] Allied Self Drive, Blunts Wall Farm, Blunts Wall Road, Billericay, Essex, CM12 9SA False 2013-01-11 00:00:00 [None, None, None, None]
207 E07000066 '1126950 [850.0, 850.0, 850.0, None] Basildon & Thurrock Uni Hosp Nhs Trust Phlebotomy Office, Billericay St Andrews Centre, Stock Road, Billericay, Essex, CM12 0BH True 2015-01-04 00:00:00 [None, None, None, None]
208 E07000066 '1126938 [850.0, 850.0, 850.0, None] Basildon & Thurrock Uni Hosp Nhs Trust Phlebotomy Cubicle 1, Billericay St Andrews Centre, Stock Road, Billericay, Essex, CM12 0BH True 2015-01-04 00:00:00 [None, None, None, None]
209 E07000066 '1114461 [850.0, 850.0, 850.0, None] Veolia Pitsea Marshes Maintenance Trust Unit D White Cottages & Workshop, Wat Tyler Country Park, Pitsea Hall Lane, Pitsea, Basildon Essex, SS16 4UH True 2016-09-26 00:00:00 ['exempt', None, None, None]

Integration of transformed data into other applications

The objective of a whyqd transform isn't just archival data, but as an input in other research or applications. These arrays are very efficient for storage, but aren't particularly useful for anything else. We can quickly recover the full long-form data structure using explode applied to the array fields:

df = transform.data.copy()
df = df.explode(["occupation_state_reliefs", "prop_ba_rates"])
df.drop_duplicates(inplace=True)

Which gives us this:

la_code ba_ref prop_ba_rates occupant_name postcode occupation_state occupation_state_date occupation_state_reliefs
1382 E07000066 '1127679 5800 Gould Barbers East Anglia Limited Gould Barbers, At Tesco, Mayflower Retail Park, Gardiners Lane South, Basildon Essex, SS14 3HZ True 2018-10-01 00:00:00 retail
1383 E07000066 '0638697 5800 Infrared Uk Lion Gp Ltd (in Admin) Public Conveniences, Eastgate, Basildon, Essex, SS14 1AE True 2022-01-04 00:00:00 exempt
1386 E07000066 '1058511 5800 Workshop @, Basildon Motoring Centre, Long Riding, Basildon, Essex, SS14 1QY True 2010-02-04 00:00:00 retail
1388 E07000066 '0654079 5800 Noak Bridge Community Association Noak Bridge Village Hall, Coppice Lane, Basildon, Laindon, Essex, SS15 4JS True 1995-01-04 00:00:00 exempt
1388 E07000066 '0654079 5800 Noak Bridge Community Association Noak Bridge Village Hall, Coppice Lane, Basildon, Laindon, Essex, SS15 4JS True 1995-01-04 00:00:00 retail
1434 E07000066 '0643641 6000 South Green and District War Memorial Fund South Green Memorial Hall, Southend Road, Billericay, Essex, CM11 2PR True 1995-01-04 00:00:00 exempt
1434 E07000066 '0643641 6000 South Green and District War Memorial Fund South Green Memorial Hall, Southend Road, Billericay, Essex, CM11 2PR True 1995-01-04 00:00:00 retail
1435 E07000066 '0433089 6000 86 Pound Lane, Basildon, Pitsea, Essex, SS13 2HW True 2004-05-11 00:00:00 retail
1448 E07000066 '1162681 6100 New Life Wood West Of Unit A The Old Laboratory, Wat Tyler Country Park, Pitsea Hall Lane, Pitsea, Basildon Essex, SS16 4UH True 2021-05-31 00:00:00 exempt
1452 E07000066 '1127975 6100 Royal Voluntary Service RVS Shop At Antenatal Unit, Basildon Hospital, Nethermayne, Basildon, Essex, SS16 5NL True 2015-01-04 00:00:00 exempt