Data Source and Data Preprocessing

Data Source

The data used for this project comes from the Federal Bureau of Investigation (FBI) and Kaggle websites. We collected hate crime data and a dataset of mental well-being facilities, both covering the United States.

Hate Crime Dataset: We worked with multiple tables of hate crime data.

Motivation-Based: Hate crime incidents by bias motivation, reported at the state and federal levels. This table contains state, type, county, motivation, and crime count columns. The data covers the years 2015 to 2019.

Population-Based: Agency hate crime reporting at the state and federal levels. This table contains the participating state/federal entity, the number of participating agencies, the population covered, the number of agencies submitting incident reports, and the total number of incidents reported. The data covers the years 2015 to 2019.

Crime Data source: fbi.gov

Mental Well-Being Facilities Dataset: We collected data on the mental well-being facilities available in the United States. This dataset contains each facility's name, address, ZIP code, city, county, state, and area code. It has 10,210 rows. The underlying data was scraped from various official documents published by the US government.

Mental Wellbeing Facilities Data Source: Kaggle

Data Preparation

We performed the following steps to prepare a ready-to-use dataset:

Data Collection: We collected the data from the FBI and Kaggle websites and verified that it fits the objectives of the analytics planned for this project.
Data Discovery and Profiling: We explored the relationships between the data elements in the raw dataset and computed summary statistics to characterize the data and surface quality issues.
Data Cleaning: We removed null and empty values, removed non-ASCII characters, changed data types, standardized address fields, fixed mislabelled data, removed irrelevant data from the raw dataset, and identified and fixed duplicates.
Data Structuring: We modeled the data, merged the yearly datasets from 2015 to 2019, and organized the result to meet the analytical requirements.
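
The cleaning and structuring steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the project's actual code: the column names (state, motivation, crime_count), the helper names clean_record and merge_years, and the sample rows are all hypothetical and made up for the example.

```python
from typing import Optional

# Hypothetical column names; the real tables may use different headers.
REQUIRED = ("state", "motivation", "crime_count")


def clean_record(row: dict) -> Optional[dict]:
    """Return a cleaned copy of one raw row, or None if it should be dropped."""
    cleaned = {}
    for key, value in row.items():
        text = "" if value is None else str(value)
        # Drop non-ASCII characters, then trim surrounding whitespace.
        text = text.encode("ascii", errors="ignore").decode("ascii").strip()
        cleaned[key] = text
    # Remove rows whose required fields are null or empty.
    if any(not cleaned.get(col) for col in REQUIRED):
        return None
    # Change the count column from text to an integer.
    cleaned["crime_count"] = int(cleaned["crime_count"].replace(",", ""))
    # Standardize the capitalization of the state label.
    cleaned["state"] = cleaned["state"].title()
    return cleaned


def merge_years(tables: dict) -> list:
    """Combine per-year tables into one list, tagging each row with its year
    and skipping exact duplicates."""
    merged, seen = [], set()
    for year in sorted(tables):
        for row in tables[year]:
            record = clean_record(row)
            if record is None:
                continue
            record["year"] = year
            key = tuple(sorted(record.items()))
            if key in seen:
                continue  # identical row already kept
            seen.add(key)
            merged.append(record)
    return merged


# Made-up sample rows, for illustration only.
raw = {
    2015: [
        {"state": " alabama ", "motivation": "Anti-Black", "crime_count": "12"},
        {"state": "alabama", "motivation": "Anti-Black", "crime_count": "12"},
        {"state": "Alaska", "motivation": None, "crime_count": "3"},
    ],
    2016: [
        {"state": "Alabama\u00a0", "motivation": "Anti-Black", "crime_count": "1,004"},
    ],
}
merged = merge_years(raw)
```

In this sketch the second 2015 row is dropped as a duplicate after normalization and the third is dropped for its missing motivation, so merged holds one row per year. Cleaning before deduplicating matters here, because rows that differ only in whitespace or capitalization would otherwise be counted twice.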

The cleaned data files are available on GitHub.
