Case Study 1: Data Cleaning and Transformation project using Google's BigQuery.
Preview
Attached is a link to a sample Nashville housing data set for cleaning and transforming by executing the following processes;
Deleting unnecesary columns
Standardizing the date format
Populate the property address data
Splitting individual columns
Correct SoldAsVacant entries
Removing duplicates
https://docs.google.com/spreadsheets/d/1ibH_xtU1D7w8bpXZcH7uC1bpwBg8lkV6/edit?usp=sharing&ouid=102413136241949900770&rtpof=true&sd=true
Exploring the data set
Query - https://console.cloud.google.com/bigquery?sq=322571174750:44d4d62a5f4d479b889bc16c627b390c
Results - https://drive.google.com/file/d/1o3EttJ2vlLGQkCC-j26anoY05byzCgA_/view?usp=sharing
NB : - Employing the services of a temp table in the subsequent queries to ease working the requisite number of columns.
- Considering bigquery sandbox does not permit use of Data Definition(alter, drop) and Data Manipulation(update, delete) language, all of the linked queries are select statements with their respective changes to the data set. However there are suggested queries as substitutes for use in a biquery billing account which allows use of DDL and DML.
Remove unnecessary columns
Query -https://console.cloud.google.com/bigquery?sq=322571174750:54508bc019864c90b0cbc054c2d3bc3e
Results - https://drive.google.com/file/d/1CqWQ2HBjQAd2Pl9aLRmNSx1VghQb1FPS/view?usp=sharing
Standardize Date Fomart
Convert or change SaleDate datatype from string to date.
Query - https://console.cloud.google.com/bigquery?sq=322571174750:f8272321ce07478bb6cb14dd1ff16c99
Results - https://drive.google.com/file/d/1YNpoIM7NtbaU6jK6jPllyN6g7qgszQJw/view?usp=sharing
Written Query - 1. alter table [tablename] add column converted_SaleDate date;
- 2. update [tablename] set SaleDate = cast(SaleDate as date);
Populate the Property Address
There are null values in PropertyAddress, they need to be filled on basis of a reference point like the column ParcelID. Considering the address of a property does not change, we can use this reference column to match ParcelID(s) to their respective PropertyAddress(es).
Query to check for null values - https://console.cloud.google.com/bigquery?sq=322571174750:cc187d0fb0754f10a50031b3e60b0f08
Null Values - https://drive.google.com/file/d/1FL5dvzPidO0gZcoru1MLPUDDLUs-JmPD/view?usp=sharing
When populating, use a self join, two columns are required, one on which to join which has the same column values and the other with different(unique), to help locate null PropertyAddress fields whose corresponding ParcelID fields have correspoding PropertyAddress fields which are populated in other observations.
To put it simply, if a PropertyAddress in a certain observation is null, we populate it with the values another PropertyAddress that has the same ParcelID but isn't null in another observation.
Self Join Query to check for ParcelIDs with corresponding PropertyAddresses - https://console.cloud.google.com/bigquery?sq=322571174750:dd7ea2e1904040a786c731ac632f443f
Reults - https://drive.google.com/file/d/1xQVjmAEqU2pCc3otCd6Lxm6rdMaPRbzl/view?usp=sharing
Populating the null fields with the last column - https://console.cloud.google.com/bigquery?sq=322571174750:936aedc00887401f9b8d3c9dd892cbb8
Results - https://drive.google.com/file/d/1btWI0hkwmsn5k9hz6FrMttk7UEIQp_pa/view?usp=sharing
Note - ISNULL function can also be employed and better when using the Update query to make permanent changes to the dataset.
Splitting individual columns
Take a look at the PropertyAddress which is a single column stated in the fomart [address , city] - https://console.cloud.google.com/bigquery?sq=322571174750:936aedc00887401f9b8d3c9dd892cbb8
Results - https://drive.google.com/file/d/1J0u49qCf1LNw2tDBHDaUd5hkkapJHqm5/view?usp=sharing
The goal is to break the single column into 2 individual ones, namely Property_address, Property_city.
Query - https://console.cloud.google.com/bigquery?sq=322571174750:de4701577d944057a009e252487daaad
Results - https://drive.google.com/file/d/1HdK_I_9y0570ggmbVeYhxrCJyKA6Vl98/view?usp=sharing
Correct SoldAsVacant entires
There are incomplete entries in the column SoldAsVacant, 'Y' instead of 'Yes' and 'N' instead of 'No' - https://console.cloud.google.com/bigquery?sq=322571174750:be3df11508c446298c3bb1e3160b845b
Results - https://drive.google.com/file/d/1z9YD2B3JY9ILjFuBm6TQTlcU42mqH7Qe/view?usp=sharing
Correcting the innacuracy - https://console.cloud.google.com/bigquery?sq=322571174750:be3df11508c446298c3bb1e3160b845b
Results - https://drive.google.com/file/d/1IjhUXUwg6ddZU81ixeqaiOFoXiq7ipLj/view?usp=sharing
Removing Duplicates
Query - https://console.cloud.google.com/bigquery?sq=322571174750:be3df11508c446298c3bb1e3160b845b
Results - https://drive.google.com/file/d/1NoDqfuYAJN4lpSXC64plHQKsaJYzGBPP/view?usp=sharing