Data Source: original data - https://www.kaggle.com/chicago/chicago-energy-usage-2010
or https://data.cityofchicago.org/Environment-Sustainable-Development/Energy-Usage-2010/8yq3-m6wp
Files: Need a account to download data from kaggle, but the file has the same name for both links - energy_usage_2010.csv
How to clean Data:
To clean the data, you would have to clean it on excel for a majority of the part.
Once the data file has been downloaded delete all the columns that are not needed. The columns you'll need are Community Area Name, Census Block, Building Type, KWH total, KWH monthly, THERM total, THERM monthly, total population, average stories, average building age, occupied units percent, and renter occupied housing percentage.
Next to get the TRACT number for each block start by creating a new column for GEOID without a block number. Trim the left 4 numbers of the GEOID by using the function LEFT(D2,LEN(D2)-4). Double click the corner of the cell to apply it all the rows.
Next create a another column for the TRACT value. Trim the first 5 numbers of the new GEOID created without the the block numbers using the RIGHT function in excel. ei. RIGHT(D2,LEN(D2) - 5). The result should be the TRACT number for the block.
When applying it to the first row, double click the bottom of the cell to apply to the rest of the rows, and you should finally have a column with the TRACT value of the CENSUS block.
To get the data on 10% census tracts within the entire city for specific topics such as the ones mentioned above, create new columns and name them appropriately such as 10%_topic.
Then do the PERCENTILE function on the first row for a specific column, and click the the bottom right square of the cell where function was applied and double click. This will apply the function to all rows. For each row it will either be true or false depending on whether the row value is within the 10% percentile of any specific column.
Repeat steps 2 and 3 for each column that has data on oldest buildings, newest buildings, tallest buildings, block that use the most electricity and gas over the year, most population, most occupied unit percentage, and highest percentage of renters.
Also all the cells that have blanks for number values convert to zero.
In my project I saved the file as .xlsx file type.