USA Air Pollution
by
GitHub Repository: https://github.com/schoi15-umbc/DATA606
Protecting our environment is a key topic these days. Every form of pollution matters to our ecosystem, but air is essential to life, and air quality strongly affects our health. Every year, millions of Americans suffer adverse health impacts linked to air pollution, and tens of thousands have their lives cut short. This project focuses on air pollution, using air quality data retrieved from the EPA.
So what is Air data?
The United States Environmental Protection Agency (EPA) provides access to air quality data (primarily from the Air Quality System (AQS) database) collected at outdoor monitors across the United States, Puerto Rico, and the U.S. Virgin Islands. The data can be used to create visuals and graphical displays, investigate monitor locations, and so on. It is accessible to the public and can be downloaded as hourly, daily, and annual concentration data, AQI data, and speciated particle pollution data.
How do the four criteria gases affect us?
Ozone
Coughing and pain when taking a deep breath
Lung and throat irritation
Wheezing and trouble breathing during exercise or outdoor activities
Can affect everyone, but especially people with asthma or other lung diseases, older adults, babies, and children
Carbon monoxide
Breathing air with a high concentration of CO reduces the amount of oxygen that can be transported in the blood stream to critical organs like the heart and brain.
Can cause dizziness, confusion, unconsciousness and death.
Very high levels of CO are not likely to occur outdoors. However, when CO levels are elevated outdoors, they can be of particular concern for people with some types of heart disease.
Nitrogen dioxide
Increased inflammation of the airways
Worsened cough and wheezing
Reduced lung function
Increased asthma attacks
Greater likelihood of emergency department and hospital admissions.
Sulfur dioxide
Irritates the skin and mucous membranes of the eyes, nose, throat, and lungs.
Can cause inflammation and irritation of the respiratory system.
Pain when taking a deep breath, coughing, throat irritation, and breathing difficulties. High concentrations of SO2 can affect lung function, worsen asthma attacks, and worsen existing heart disease in sensitive groups.
•Cluster analysis (CA) of European ozone measurements, using 4 years of 3-hourly ozone data from 1492 European surface monitoring stations in the Airbase database
•Proposes the use of the air quality index together with advanced data processing, analysis, and visualization techniques based on an AI-based k-clustering method (China)
•A novel hybrid learning method for forecasting the urban air quality index (AQI). Wavelet packet decomposition (WPD) is first performed to decompose the original AQI data into lower-frequency subseries. (China)
1.Cluster analysis of European surface ozone observations for evaluation of MACC reanalysis data
Authors: Olga Lyapina, Martin G. Schultz, and Andreas Hens
2. K-Clustering Methods for Investigating Social-Environmental and Natural-Environmental Features Based on Air Quality Index
Authors: Victor Chang; Pin Ni; Yuming L
3. A clustering-based ensemble approach with improved pigeon-inspired optimization and extreme learning machine for air quality prediction
Authors: Feng Jiang, Jiaqi He, Tianhai Tian
The dataset is retrieved from the United States Environmental Protection Agency (EPA).
EPA has annual datasets from 1980 - 2021 (as of 2021-05-18) categorized by concentration by monitor, Air Quality Index (AQI) by Core Based Statistical Areas (CBSA), and AQI by County.
Each daily summary file contains data for every monitor (sampled parameter) in our database for each day. These files are separated by parameter (or parameter group) to make the sizes more manageable.
This file will contain a daily summary record that is:
1) The aggregate of all sub-daily measurements taken at the monitor.
2) The single sample value if the monitor takes a single, daily sample (e.g., there is only one sample with a 24-hour duration). In this case, the mean and max daily sample will have the same value.
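The two cases above can be illustrated with a minimal sketch. It assumes the aggregate is a plain arithmetic mean and maximum of the day's samples; the function name, the field names, and the readings are hypothetical and are not taken from the EPA files.

```python
from statistics import mean

def daily_summary(samples):
    """Collapse one monitor's samples for one day into a summary record.

    Assumption: the daily aggregate is a simple mean and max of the samples.
    """
    if not samples:
        raise ValueError("at least one sample is required")
    return {
        "observation_count": len(samples),
        "arithmetic_mean": mean(samples),
        "max_value": max(samples),
    }

hourly = [0.031, 0.035, 0.042, 0.038]  # hypothetical sub-daily ozone readings (ppm)
single = [0.040]                       # hypothetical single 24-hour sample

print(daily_summary(hourly))
print(daily_summary(single))  # mean and max are the same value here
```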
The daily summary files contain (at least) one record for each monitor that reported data for the given day. There may be multiple records for the monitor if:
There are calculated sample durations for the pollutant. For example, PM2.5 is sometimes reported as 1-hour samples and EPA calculates 24-hour averages.
There are multiple standards for the pollutant (q.v. pollutant standards).
There were exceptional events associated with some measurements that the monitoring agency has or may request be excluded from comparison to the standard.
(Information taken from website)
The images on the left show the 29 columns within the dataset, along with a description of each.
Out of the 29 elements, the ones that will be used in this project are:
Features:
State Name
Arithmetic mean
Latitude
Longitude
AQI
Target variable: AQI
Unit of measure
Ozone, carbon monoxide: Parts per million
Nitrogen dioxide (NO2), sulfur dioxide (SO2): Parts per billion
Daily Summary dataset for Ozone (44201) 2020 contains 391,923 Rows, 24 Columns, and is 3,127 KB. It contains numerical and categorical data.
Daily Summary dataset for Sulfur dioxide (SO2 (42401)) 2020 contains 324,817 Rows, 24 Columns, and is 4,357 KB. It contains numerical and categorical data.
Daily Summary dataset for Carbon Monoxide (CO (42101)) 2020 contains 178,789 Rows, 24 Columns, and is 1,784 KB. It contains numerical and categorical data.
Daily Summary dataset for Nitrogen dioxide (NO2 (42602)) 2020 contains 157,726 Rows, 24 Columns, and is 2,159 KB. It contains numerical and categorical data.
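As a rough sketch of how one of these files could be read while keeping only the columns listed above, the example below uses Python's standard `csv` module. The column headers are assumed to match the EPA daily summary files, the helper name `load_daily_summary` is hypothetical, and the three sample rows are made up so that the block runs without downloading anything.

```python
import csv
import io

def load_daily_summary(handle):
    """Read a daily summary CSV and keep only the columns used in this project."""
    rows = []
    for record in csv.DictReader(handle):
        if not record["AQI"]:  # skip records that carry no AQI value
            continue
        rows.append({
            "State Name": record["State Name"],
            "Latitude": float(record["Latitude"]),
            "Longitude": float(record["Longitude"]),
            "Arithmetic Mean": float(record["Arithmetic Mean"]),
            "AQI": int(float(record["AQI"])),
        })
    return rows

# Made-up sample standing in for a downloaded daily summary file.
sample = io.StringIO(
    "State Name,Latitude,Longitude,Arithmetic Mean,AQI,Units of Measure\n"
    "Arizona,33.4,-112.0,0.052,90,Parts per million\n"
    "Hawaii,21.3,-157.8,0.021,19,Parts per million\n"
    "Alaska,64.8,-147.7,0.018,,Parts per million\n"
)
rows = load_daily_summary(sample)
print(len(rows), rows[0]["State Name"], rows[0]["AQI"])
```

With a real download, the same function could be called on an opened file, for example `open(path, newline="")`, where `path` points to the extracted CSV.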
image from: https://swachhindia.ndtv.com/air-pollution-what-is-air-quality-index-how-is-it-measured-and-its-health-impact-40387/
How is air pollution different by state/county?
Clustering Analysis to see how the pollution is distributed geographically.
What are some solutions for the different clusters?
Datasets for all four types of pollution (ozone, sulfur dioxide, carbon monoxide, and nitrogen dioxide) will be used to find answers to the research questions, along with comparing the AQI for the different types of pollution.
The chart on the left shows the distribution for each criteria gas. Although it is not detailed, it shows a quick overview of how the AQI is distributed.
Box Plot for Each State's AQI Per Criteria Gas
Each plot shows the distribution and skewness of every state's AQI by displaying the quartiles (or percentiles) and averages, along with the outliers. Because there are many states, the state names are hard to read, so two additional charts were generated to show the 10 states with the highest and the lowest AQI. Some notable observations are also listed.
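The numbers behind such box plots can be sketched without a plotting library. The example below assumes the common 1.5 × IQR rule for outliers and ranks states by median AQI; both are assumptions, since the statistic used for the "highest" and "lowest" charts is not stated here. The function names and the sample records are hypothetical.

```python
from collections import defaultdict
from statistics import median, quantiles

def box_stats(values):
    """Quartiles plus the values outside the 1.5 * IQR whiskers (needs >= 2 values)."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": q2, "q3": q3, "outliers": outliers}

def rank_states(records, n=10):
    """Group (state, AQI) pairs by state and rank the states by median AQI."""
    by_state = defaultdict(list)
    for state, aqi in records:
        by_state[state].append(aqi)
    ranked = sorted(by_state, key=lambda s: median(by_state[s]), reverse=True)
    highest = ranked[:n]
    lowest = ranked[-n:][::-1]  # lowest median first
    return highest, lowest, by_state

# Made-up (state, AQI) records for illustration only.
records = [
    ("Arizona", 90), ("Arizona", 85), ("Arizona", 95), ("Arizona", 88), ("Arizona", 200),
    ("Hawaii", 19), ("Hawaii", 21), ("Hawaii", 20), ("Hawaii", 22),
    ("Utah", 60), ("Utah", 64), ("Utah", 62), ("Utah", 66),
]
highest, lowest, by_state = rank_states(records, n=2)
print(highest, lowest)
print(box_stats(by_state["Arizona"]))
```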
1. Ozone
10 Highest AQI States
10 Lowest AQI States
Arizona has a high AQI, with the highest and the most outliers. This suggests that the AQI in the state is very inconsistent.
On the other hand, Alaska and Hawaii have a low AQI with no outliers, so their ozone AQI seems to be consistent.
Highest : Arizona, Colorado, New Mexico, Utah
Lowest: Hawaii, Washington, Oregon, Alaska
2. Carbon Monoxide
Oregon is the only state whose outliers reach a very high AQI.
Highest: Georgia, Arizona, Alaska, California
Lowest: South Carolina, Wyoming, Mississippi, Nebraska
3. Nitrogen Dioxide
Nevada has a high AQI with no outliers. This indicates that the AQI is very consistent.
Highest: Nevada, Idaho, Arizona, Georgia
Lowest: Montana, Wyoming, New Hampshire, North Dakota
4. Sulfur Dioxide
Overall, most states have many outliers, which suggests that sulfur dioxide AQI is inconsistent.
Alaska has the highest AQI, with some low outliers.
Hawaii, Texas, and Virginia have a low AQI but very high outliers.
Highest: Alaska, West Virginia, Alabama, Illinois
Lowest: Wyoming, New Mexico, New Jersey, New Hampshire
K-Means Clustering
To find the best way to analyze the dataset, a K-Means clustering model was run to see how many clusters there were and where they were located. The four datasets (one per criteria gas) were run separately, using only the records whose AQI was over the limit of the good-to-moderate range (in other words, the clusters show the states/regions that had air pollution for each type of gas).
The elbow method was used to find the number of clusters. The elbow plots are shown below.
Ozone: 3 clusters
Nitrogen Dioxide: 3 clusters
Sulfur Dioxide: 4 clusters
Carbon Monoxide: 4 clusters
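The clustering and elbow steps described above can be sketched in plain Python. This is an illustration of the technique under stated assumptions, not the project's notebook code: the AQI cutoff of 100 is an assumed reading of "over the good-to-moderate range", clustering is done on latitude and longitude with Euclidean distance (an approximation on geographic coordinates), and the records are made up.

```python
import math
import random

AQI_THRESHOLD = 100  # assumption: keep only records worse than the "Moderate" band

def kmeans(points, k, iterations=100, seed=0):
    """Plain Lloyd's algorithm on (latitude, longitude) tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise with k distinct input points
    labels = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

def inertia(points, centroids, labels):
    """Within-cluster sum of squared distances, the quantity plotted in an elbow chart."""
    return sum(math.dist(p, centroids[label]) ** 2 for p, label in zip(points, labels))

# Made-up (latitude, longitude, AQI) records for illustration only.
records = [
    (34.0, -118.2, 140), (33.4, -112.0, 155), (36.1, -115.1, 120),
    (41.8, -87.6, 110), (38.6, -90.2, 105), (39.1, -94.6, 45),
    (40.7, -74.0, 130), (42.3, -71.0, 115), (39.9, -75.1, 125),
]
polluted = [(lat, lon) for lat, lon, aqi in records if aqi > AQI_THRESHOLD]

# Elbow method: compute the inertia for increasing k and look for the bend.
for k in range(1, 6):
    centroids, labels = kmeans(polluted, k)
    print(k, round(inertia(polluted, centroids, labels), 1))
```

The number of clusters is then read off at the point where adding another cluster stops reducing the inertia by much. Because the result of K-Means depends on the initial centroids, a fixed `seed` is used here to keep runs repeatable.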
Ozone Clustering
Nitrogen Dioxide Clustering
Sulfur Dioxide Clustering
Carbon Monoxide Clustering
-Sulfur dioxide pollution is the most severe of the 4 gases
-Carbon monoxide pollution is the least severe of the 4 gases
-Central states have the best AQI overall
-The clusters seem to be generated by region (West, Central, East)
1. Ozone
-Limiting driving
-Maintain vehicles/tools well
-Try to use tools without motors.
2. Carbon Monoxide
-Regular inspections, maintenance
-Proper storage in house
3. Nitrogen Dioxide
-Manage and Reduce Emissions
-Conserve energy, Less driving
4. Sulfur Dioxide
- Set regulations by State
Find more details for the clusters.
Find a detailed dataset that has the indicators for pollution.
Generate clusters using time (See when the pollution is more severe).
Think of other analysis/ machine learning that I can do with the dataset.
Executive Notebook / Presentation / Youtube
Sooyeon Choi's Linkedin Profile
Sooyeon Choi's Github
United States Environmental Protection Agency. (n.d.). Download files | AirData | US EPA. Retrieved from https://aqs.epa.gov/aqsweb/airdata/download_files.html#Daily
https://www.epa.gov/co-pollution/basic-information-about-carbon-monoxide-co-outdoor-air-pollution
https://www.lung.org/clean-air/outdoors/what-makes-air-unhealthy/nitrogen-dioxide
https://swachhindia.ndtv.com/air-pollution-what-is-air-quality-index-how-is-it-measured-and-its-health-impact-40387/