The dataset for this project was sourced from the U.S. Energy Information Administration (EIA), specifically the coal shipments and plant aggregates API. This API reports coal quality metrics (ash content, heat content, and sulfur content) together with price and quantity, organized by plant location and time period. The dataset was chosen for its comprehensive coverage of the coal industry, which is crucial for understanding energy production, pricing trends, and environmental impacts. The data supports an in-depth analysis of coal shipments to power plants across different states and years, giving a clearer view of the factors that influence energy consumption and emissions.
The EIA API offers a flexible querying system. A sample API call can retrieve annual coal shipment data, including ash, heat, and sulfur content, price, and quantity, sorted by period in descending order:
API URL: https://api.eia.gov/v2/coal/shipments/plant-aggregates/data/?frequency=annual&data[0]=ash-content&data[1]=heat-content&data[2]=price&data[3]=quantity&data[4]=sulfur-content&sort[0][column]=period&sort[0][direction]=desc&api_key=YOUR_API_KEY
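As a minimal sketch of how the query above could be assembled in Python, the snippet below builds the same URL from its individual parameters. The function name build_query_url and the placeholder key "YOUR_API_KEY" are illustrative assumptions, not part of the original project.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.eia.gov/v2/coal/shipments/plant-aggregates/data/"


def build_query_url(api_key: str) -> str:
    """Assemble the annual plant-aggregates query shown above (hypothetical helper)."""
    params = [
        ("frequency", "annual"),
        ("data[0]", "ash-content"),
        ("data[1]", "heat-content"),
        ("data[2]", "price"),
        ("data[3]", "quantity"),
        ("data[4]", "sulfur-content"),
        ("sort[0][column]", "period"),
        ("sort[0][direction]", "desc"),
        ("api_key", api_key),
    ]
    # safe="[]" leaves the square brackets unescaped so the URL matches the form above
    return BASE_URL + "?" + urlencode(params, safe="[]")


# "YOUR_API_KEY" is a placeholder; substitute a key registered with the EIA.
url = build_query_url("YOUR_API_KEY")
print(url)
```

Keeping the key out of the source (for example, reading it from an environment variable) avoids publishing it along with the report.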
The data includes the following important fields needed for analysis:
Ash content: The percentage of ash present in the coal shipments.
Heat content: The energy released by burning a unit mass of coal, measured in Btu per pound.
Price: The cost of the coal shipments in dollars per ton.
Quantity: The total amount of coal shipped, measured in tons.
Sulfur content: The sulfur concentration in the coal, which has a direct impact on emissions when coal is burned.
This dataset spans several years, offering a historical perspective on coal shipments and their associated environmental and economic impacts. It is a valuable resource for understanding shifts in the coal industry, trends in energy pricing, and the environmental implications of different coal types used in power generation.
UNCLEANED DATA
CLEANED DATA
Dropping Irrelevant Columns:
Columns that duplicated information available elsewhere were dropped from the dataset. For instance, the location column was removed because plantStateDescription already provided the necessary geographic context. This reduced unnecessary complexity and kept the analysis focused on the most relevant variables.
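A minimal sketch of this step, assuming the data is held in a pandas DataFrame. The sample rows are hypothetical; only the column names location and plantStateDescription come from the text above.

```python
import pandas as pd

# Hypothetical sample rows for illustration only.
df = pd.DataFrame({
    "period": [2022, 2022],
    "location": ["TX", "WY"],
    "plantStateDescription": ["Texas", "Wyoming"],
    "price": [38.5, 14.2],
})

# errors="ignore" makes the call a no-op if the column has already been removed.
df = df.drop(columns=["location"], errors="ignore")

print(df.columns.tolist())
```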
Handling Missing Values:
Inspection showed that several columns, particularly the price column, contained missing values. Because price is a critical factor in the analysis, missing prices were imputed where enough data was available and the affected rows were removed where it was not. This approach ensured that incomplete data did not skew the analysis results.
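The text does not name the imputation technique, so the sketch below shows one possible choice rather than the method actually used: filling a missing price with the median price of the same state, and dropping rows whose state has no price to borrow from. The sample values are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; None marks a missing price.
df = pd.DataFrame({
    "plantStateDescription": ["Texas", "Texas", "Texas", "Wyoming", "Wyoming"],
    "price": [40.0, None, 44.0, None, None],
})

# Count the gaps before deciding how to treat them.
missing_before = int(df["price"].isna().sum())

# Assumed technique: impute with the per-state median price.
state_median = df.groupby("plantStateDescription")["price"].transform("median")
df["price"] = df["price"].fillna(state_median)

# States with no observed price at all cannot be imputed, so those rows are dropped.
df = df.dropna(subset=["price"]).reset_index(drop=True)

print(missing_before, len(df))
```

A group-level median is less sensitive to outliers than a mean, but any imputation changes the price distribution, so it is worth reporting how many values were filled.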
Converting Data Types:
Some columns, such as price, ash-content, heat-content, and sulfur-content, were stored as strings (object type) instead of numeric data types (float or int). These columns were converted to the appropriate numeric types to allow for statistical analysis and visualization. This step is essential for any calculation, because arithmetic on numbers stored as text either fails or produces incorrect results.
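The conversion can be sketched with pandas as follows, assuming the data is in a DataFrame. The sample values, including the non-numeric entries "" and "w", are hypothetical and stand in for whatever unparseable strings the raw data contains.

```python
import pandas as pd

# Hypothetical sample rows with numbers stored as text.
df = pd.DataFrame({
    "price": ["38.5", "14.2", ""],
    "ash-content": ["9.1", "5.3", "7.0"],
    "heat-content": ["11000", "8800", "w"],
    "sulfur-content": ["1.2", "0.3", "0.9"],
})

numeric_cols = ["price", "ash-content", "heat-content", "sulfur-content"]

for col in numeric_cols:
    # errors="coerce" turns strings that cannot be parsed into NaN instead of raising.
    df[col] = pd.to_numeric(df[col], errors="coerce")

print(df.dtypes)
```

Coercion creates new missing values wherever a string cannot be parsed, so running this step before the missing-value handling lets both kinds of gap be treated together.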
Dealing with Duplicate Entries:
Some records occurred more than once in the data, with duplicate rows repeating the same plantName and period values. These duplicates were identified and removed to prevent certain plants or time periods from being overrepresented in the analysis. This step ensured that every data point contributed equally to the insights drawn from the dataset.
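A minimal sketch of the de-duplication, assuming a pandas DataFrame and treating a duplicate as a row that is identical in every column. The sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; the second row repeats the first exactly.
df = pd.DataFrame({
    "plantName": ["Plant A", "Plant A", "Plant B", "Plant A"],
    "period": [2022, 2022, 2022, 2021],
    "quantity": [1000, 1000, 500, 900],
})

# Count fully identical rows before removing them.
dup_count = int(df.duplicated().sum())

# Keep the first occurrence of each identical row.
df = df.drop_duplicates().reset_index(drop=True)

print(dup_count, len(df))
```

Passing subset=["plantName", "period"] to drop_duplicates would instead keep one row per plant and period. That is a stricter rule and is only appropriate if each plant is expected to have a single record per period.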