Phase 8: Cleaning/Summarizing DataFrames

Overview of Program

My Python program analyzes NBA stats from the 2023–2024 season, focusing on how player performance varies by position. It reads a CSV file, inspects the data, and cleans it by removing duplicate rows and rows with missing values in key columns like PPG, RPG, and POS. The missing values were mandatory errors, and the duplicate rows were uniqueness errors.

After cleaning, I filtered the data to focus on guards and created summary tables with groupby and pivot_table to compare average points and rebounds across positions. Visualizations like scatter, line, and bar charts highlighted clear trends. For example, guards score more on average but rebound less than other positions. These charts support the core question of the project: How does position affect player performance in the NBA?

Visual Inspection/Data Cleaning

During the visual inspection of my DataFrame, I reviewed the first and last few rows and checked the shape, column names, and summary statistics. The original dataset had 213 rows and 29 columns. I removed duplicate rows using drop_duplicates and dropped rows with missing values in key columns like POS, PPG, and RPG using dropna. All columns had the correct data types, so no conversions were needed. After cleaning, the number of columns remained the same, but the number of rows decreased slightly. I then ran the visually_inspect() function again, which confirmed that the missing values and duplicates had been removed. The cleaned data was complete and ready for analysis.
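The cleaning step described above can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: the body of clean_dataframe() is an assumption based on the description, the NAME column is hypothetical, and the sample rows are made up rather than taken from the NBA dataset.

```python
import pandas as pd

# Key columns named in the write-up; rows missing any of these are dropped.
KEY_COLUMNS = ["POS", "PPG", "RPG"]


def clean_dataframe(df: pd.DataFrame) -> pd.DataFrame:
    """Remove exact duplicate rows, then rows missing a key-column value."""
    deduped = df.drop_duplicates()
    cleaned = deduped.dropna(subset=KEY_COLUMNS)
    return cleaned.reset_index(drop=True)


# Made-up sample data (hypothetical players, not real stats).
sample = pd.DataFrame({
    "NAME": ["Player A", "Player A", "Player B", "Player C", "Player D"],
    "POS": ["G", "G", None, "F", "C"],
    "PPG": [20.0, 20.0, 11.5, 8.0, 15.0],
    "RPG": [4.0, 4.0, 6.0, None, 10.0],
})

cleaned = clean_dataframe(sample)
print("Rows before:", len(sample), "Rows after:", len(cleaned))
print("Remaining duplicates:", cleaned.duplicated().sum())
print("Missing key values:", cleaned[KEY_COLUMNS].isna().sum().sum())
```

In this sample, the repeated Player A row is a uniqueness error, and the rows with a missing POS or RPG are mandatory errors. The column count stays the same while the row count drops, which matches the behavior described above.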

Summary Tables

I created two summary tables to compare NBA player performance by position. The first used groupby and the second used pivot_table; both calculate average PPG and RPG for each position. I chose the mean to show typical performance across positions. Both tables revealed trends, like guards scoring more but rebounding less than other positions. These summaries support my project question about how player position affects performance.
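As a rough sketch of the two approaches, the snippet below builds the same table of means both ways. The data is made up for illustration, and the variable names are assumptions rather than the ones used in the program.

```python
import pandas as pd

# Made-up sample data (not real NBA stats).
players = pd.DataFrame({
    "POS": ["G", "G", "F", "F", "C"],
    "PPG": [20.0, 10.0, 12.0, 8.0, 14.0],
    "RPG": [4.0, 2.0, 6.0, 8.0, 11.0],
})

# Approach 1: group rows by position, then average the two stat columns.
by_group = players.groupby("POS")[["PPG", "RPG"]].mean()

# Approach 2: a pivot table with position as the index and mean as the aggregate.
by_pivot = players.pivot_table(index="POS", values=["PPG", "RPG"], aggfunc="mean")

print(by_group)
print(by_pivot)
```

Both calls produce one row per position with the mean PPG and RPG, so either table can feed the charts.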

Visualizations

I created four visualizations to explore trends in NBA player performance. The scatter plot shows the relationship between PPG and RPG for guards. This helps visualize how scoring and rebounding vary among them. The line chart displays average PPG by position, highlighting which positions score the most on average. The bar chart compares average RPG by position, showing how rebounding differs across roles. Lastly, the summary bar chart from the pivot table shows both PPG and RPG by position side-by-side. These charts clearly support my question about how player performance differs by position, making trends easy to see and compare.
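Two of these charts could be produced along the lines below. This is a hedged sketch with made-up data: the function bodies, axis labels, and titles are assumptions, and putting PPG on the x-axis is one possible choice.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up sample data (not real NBA stats).
players = pd.DataFrame({
    "POS": ["G", "G", "F", "F", "C"],
    "PPG": [20.0, 10.0, 12.0, 8.0, 14.0],
    "RPG": [4.0, 2.0, 6.0, 8.0, 11.0],
})


def scatter_plot(guards: pd.DataFrame):
    """Scatter of points vs. rebounds for the guard subset."""
    fig, ax = plt.subplots()
    ax.scatter(guards["PPG"], guards["RPG"])
    ax.set_xlabel("Points per game (PPG)")
    ax.set_ylabel("Rebounds per game (RPG)")
    ax.set_title("Guards: PPG vs. RPG")
    return ax


def summary_chart(pivot: pd.DataFrame):
    """Grouped bar chart: one PPG bar and one RPG bar per position."""
    ax = pivot.plot(kind="bar", rot=0)
    ax.set_xlabel("Position")
    ax.set_ylabel("Average per game")
    ax.set_title("Average PPG and RPG by position")
    return ax


guards = players[players["POS"] == "G"]
pivot = players.pivot_table(index="POS", values=["PPG", "RPG"], aggfunc="mean")

scatter_ax = scatter_plot(guards)
summary_ax = summary_chart(pivot)
```

Calling plt.show() afterward displays the figures in an interactive session. Plotting straight from the pivot table keeps the chart in sync with the summary numbers.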

Advantages of Python

Using Python to clean data, create pivot tables, and build visualizations offers several advantages over Google Sheets. Python handles large datasets more efficiently. It also allows for faster, repeatable cleaning with functions like dropna and drop_duplicates. Creating pivot tables and summary stats with groupby or pivot_table is quicker and more flexible than using spreadsheet formulas. Visualizations in Python with matplotlib are also more customizable and can be generated directly from cleaned data. Overall, Python makes the process more scalable, consistent, and less error-prone than manual spreadsheet work.

General Solution

This program loads NBA stats from a CSV file and analyzes player performance by position. It uses functions to keep everything organized and easy to manage. The main() function controls the flow of the program. read_as_dataframe() reads the CSV into a DataFrame using pandas. visually_inspect() displays the shape, columns, and a quick preview of the data. clean_dataframe() removes duplicates and rows with missing values in key columns. get_subset() filters for players whose position includes "G" (guards). groupby_sum_table() and piv_sum_table() calculate average points and rebounds by position. scatter_plot() shows the relationship between points and rebounds for guards. line_chart() and bar_chart() show average PPG and RPG by position. summary_chart() uses pivot table data to show PPG and RPG in one grouped chart. Together, these functions let the program explore and compare NBA player stats in a clean and visual way.
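The guard filter in get_subset() might look something like this. It is a sketch under assumptions: the function body is inferred from the description, and the position labels in the sample ("PG", "SG", and so on) are hypothetical examples, not values confirmed from the dataset.

```python
import pandas as pd


def get_subset(df: pd.DataFrame) -> pd.DataFrame:
    """Keep rows whose POS value contains the letter "G" (guards)."""
    # na=False treats a missing position as "not a guard" instead of raising.
    is_guard = df["POS"].str.contains("G", na=False)
    return df[is_guard]


# Made-up sample data with hypothetical position labels.
players = pd.DataFrame({
    "POS": ["PG", "SG", "SF", "PF", "C", "G", None],
    "PPG": [18.0, 15.0, 12.0, 10.0, 14.0, 9.0, 7.0],
})

guards = get_subset(players)
print(guards)
```

Matching on "contains G" rather than "equals G" means combined or specific guard labels are also included.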

Expected Output

The expected output of my program includes a printout of the original DataFrame’s shape, column names, first and last few rows, and summary statistics. After cleaning, the program outputs a cleaned version of the DataFrame with no missing values or duplicate rows. It also produces two summary tables: one created with groupby that shows average PPG and RPG by position, and another created with pivot_table that shows the same stats in a more structured format. The program then displays four visualizations: a scatter plot of PPG vs. RPG for guards, a line chart of average PPG by position, a bar chart of average RPG by position, and a combined bar chart comparing PPG and RPG using the pivot table.
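The inspection printout described above could come from a function along these lines. This is only a sketch: the body of visually_inspect() and its n parameter are assumptions, and the sample data is made up.

```python
import pandas as pd


def visually_inspect(df: pd.DataFrame, n: int = 5) -> None:
    """Print the shape, column names, first/last n rows, and summary stats."""
    print("Shape:", df.shape)
    print("Columns:", list(df.columns))
    print(df.head(n))
    print(df.tail(n))
    print(df.describe())


# Made-up sample data (not real NBA stats).
sample = pd.DataFrame({
    "POS": ["G", "G", "F", "C"],
    "PPG": [20.0, 10.0, 12.0, 14.0],
    "RPG": [4.0, 2.0, 6.0, 11.0],
})

visually_inspect(sample, n=2)
```

Running the same function before and after cleaning makes it easy to compare the row counts and confirm the fixes.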
