This Python program analyzes teacher salary data across U.S. states, focusing on jobs within the Educational Services industry. It loads and cleans the dataset, filters it to the relevant rows, computes average salaries, and generates summary tables and visualizations. Along the way it identifies and resolves data errors such as missing values, duplicates, and incorrect data types.
The program is organized into modular functions. It uses pandas to read and clean the dataset, apply filters, compute statistics, and summarize data with groupby and pivot tables. Visualizations are created with matplotlib to show trends in average teacher salaries by area and occupation. The program follows a clear sequence: read → inspect → clean → analyze → visualize.
Key variables:
df: Original DataFrame from CSV
cleaned_df: Cleaned version of df
teacher_subset: Filtered subset where occupation contains “Teacher”
avg_salary: Mean salary for the subset
group_df: Summary table grouped by area
pivot_df: Pivot summary table of area vs occupation
Outputs:
A cleaned DataFrame with errors corrected
Printed filtered subset and mean salary
Two summary tables: groupby() and pivot_table()
Four visualizations:
  Scatter plot of salary vs area
  Line chart of mean salary by area
  Bar chart of mean salary by area
  Stacked bar chart from pivot table
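The two summary tables listed above can be sketched roughly as follows. This is a minimal illustration, not the program's actual code: the sample rows are made up, and the choice of `AREA_TITLE`, `OCC_TITLE`, and `A_MEAN` as the grouping and value columns is an assumption based on the column names in the dataset.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
teacher_subset = pd.DataFrame({
    "AREA_TITLE": ["Alabama", "Alabama", "Texas", "Texas"],
    "OCC_TITLE": [
        "Substitute Teachers, Short-Term",
        "Special Education Teachers, Middle School",
        "Substitute Teachers, Short-Term",
        "Special Education Teachers, Middle School",
    ],
    "A_MEAN": [30000.0, 60000.0, 34000.0, 66000.0],
})

# Summary table 1: mean salary per area.
group_df = teacher_subset.groupby("AREA_TITLE")["A_MEAN"].mean().reset_index()

# Summary table 2: areas as rows, occupations as columns,
# mean salary in each cell.
pivot_df = teacher_subset.pivot_table(
    index="AREA_TITLE",
    columns="OCC_TITLE",
    values="A_MEAN",
    aggfunc="mean",
)

print(group_df)
print(pivot_df)
```

With matplotlib installed, tables shaped like these can feed the charts directly, for example `group_df.plot(x="AREA_TITLE", y="A_MEAN", kind="bar")` for the bar chart and `pivot_df.plot(kind="bar", stacked=True)` for the stacked bar chart.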
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45157 entries, 0 to 45156
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AREA 45157 non-null int64
1 AREA_TITLE 45157 non-null object
2 NAICS 45157 non-null int64
3 NAICS_TITLE 45157 non-null object
4 I_GROUP 45157 non-null object
5 OCC_CODE 45157 non-null object
6 OCC_TITLE 45157 non-null object
7 O_GROUP 45157 non-null object
8 TOT_EMP 45157 non-null object
9 EMP_PRSE 45157 non-null object
10 PCT_TOTAL 45157 non-null object
11 H_MEAN 45157 non-null object
12 A_MEAN 45157 non-null object
13 MEAN_PRSE 45157 non-null object
14 H_PCT10 45157 non-null object
15 H_PCT25 45157 non-null object
16 H_MEDIAN 45157 non-null object
17 H_PCT75 45157 non-null object
18 H_PCT90 45157 non-null object
19 A_PCT10 45157 non-null object
20 A_PCT25 45157 non-null object
21 A_MEDIAN 45157 non-null object
22 A_PCT75 45157 non-null object
23 A_PCT90 45157 non-null object
24 ANNUAL 8464 non-null object
25 HOURLY 133 non-null object
dtypes: int64(2), object(24)
memory usage: 9.0+ MB
None
OCC_TITLE
All Occupations 471
Educational Instruction and Library Occupations 455
Management Occupations 452
Office and Administrative Support Occupations 449
Business and Financial Operations Occupations 444
...
Drafters, All Other 2
Desktop Publishers 2
Dancers 2
Credit Authorizers, Checkers, and Clerks 2
Hazardous Materials Removal Workers 2
Name: count, Length: 523, dtype: int64
AREA AREA_TITLE NAICS NAICS_TITLE I_GROUP \
count 45157.000000 45157 45157.000000 45157 45157
unique NaN 54 NaN 8 3
top NaN Texas NaN Educational Services 4-digit
freq NaN 1799 NaN 23264 21893
mean 29.725115 NaN 453771.950130 NaN NaN
std 16.305335 NaN 267255.693554 NaN NaN
min 1.000000 NaN 61.000000 NaN NaN
25% 17.000000 NaN 61.000000 NaN NaN
50% 29.000000 NaN 611000.000000 NaN NaN
75% 42.000000 NaN 611300.000000 NaN NaN
max 78.000000 NaN 611700.000000 NaN NaN
OCC_CODE OCC_TITLE O_GROUP TOT_EMP EMP_PRSE ... H_MEDIAN \
count 45157 45157 45157 45157 45157 ... 45157
unique 523 523 3 1768 501 ... 5170
top 00-0000 All Occupations detailed 40 ** ... *
freq 471 471 38394 3012 1405 ... 8691
mean NaN NaN NaN NaN NaN ... NaN
std NaN NaN NaN NaN NaN ... NaN
min NaN NaN NaN NaN NaN ... NaN
25% NaN NaN NaN NaN NaN ... NaN
50% NaN NaN NaN NaN NaN ... NaN
75% NaN NaN NaN NaN NaN ... NaN
max NaN NaN NaN NaN NaN ... NaN
H_PCT75 H_PCT90 A_PCT10 A_PCT25 A_MEDIAN A_PCT75 A_PCT90 ANNUAL HOURLY
count 45157 45157 45157 45157 45157 45157 45157 8464 133
unique 6025 6862 6619 7780 9185 10745 12121 1 1
top * * * * * * # True True
freq 8691 8691 382 382 382 382 822 8464 133
mean NaN NaN NaN NaN NaN NaN NaN NaN NaN
std NaN NaN NaN NaN NaN NaN NaN NaN NaN
min NaN NaN NaN NaN NaN NaN NaN NaN NaN
25% NaN NaN NaN NaN NaN NaN NaN NaN NaN
50% NaN NaN NaN NaN NaN NaN NaN NaN NaN
75% NaN NaN NaN NaN NaN NaN NaN NaN NaN
max NaN NaN NaN NaN NaN NaN NaN NaN NaN
[11 rows x 26 columns]
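The summary above shows non-numeric markers (`*`, `**`, `#`) as the most frequent values in several employment and wage columns, which is consistent with those columns loading as `object` rather than as numbers. A minimal sketch of how such markers could be counted per column before cleaning is shown below; the sample rows are made up and the `MARKERS` list is an assumption drawn only from the values visible in the summary.

```python
import pandas as pd

# Made-up sample; in the raw data these columns are stored as text.
df = pd.DataFrame({
    "TOT_EMP": ["40", "**", "1,200", "50"],
    "A_MEAN": ["52,000", "*", "61,500", "*"],
    "A_PCT90": ["#", "98,000", "#", "*"],
})

# Markers visible in the describe() output (assumed to be the full set).
MARKERS = ["*", "**", "#"]

# Rows are markers, columns are dataset columns, cells are occurrence counts.
marker_counts = pd.DataFrame(
    {col: [(df[col] == m).sum() for m in MARKERS] for col in df.columns},
    index=MARKERS,
)

print(marker_counts)
```

A table like this makes it easier to decide which columns need conversion and how many rows would be lost if marked values were dropped.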
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44770 entries, 0 to 44769
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 AREA 44770 non-null int64
1 AREA_TITLE 44770 non-null object
2 NAICS 44770 non-null int64
3 NAICS_TITLE 44770 non-null object
4 I_GROUP 44770 non-null object
5 OCC_CODE 44770 non-null object
6 OCC_TITLE 44770 non-null object
7 O_GROUP 44770 non-null object
8 TOT_EMP 44770 non-null object
9 EMP_PRSE 44770 non-null object
10 PCT_TOTAL 44770 non-null object
11 H_MEAN 44770 non-null object
12 A_MEAN 44770 non-null float64
13 MEAN_PRSE 44770 non-null object
14 H_PCT10 44770 non-null object
15 H_PCT25 44770 non-null object
16 H_MEDIAN 44770 non-null object
17 H_PCT75 44770 non-null object
18 H_PCT90 44770 non-null object
19 A_PCT10 44770 non-null object
20 A_PCT25 44770 non-null object
21 A_MEDIAN 44770 non-null object
22 A_PCT75 44770 non-null object
23 A_PCT90 44770 non-null object
24 ANNUAL 8436 non-null object
25 HOURLY 0 non-null object
dtypes: float64(1), int64(2), object(23)
memory usage: 8.9+ MB
None
OCC_TITLE
All Occupations 471
Management Occupations 450
Educational Instruction and Library Occupations 450
Office and Administrative Support Occupations 449
Business and Financial Operations Occupations 437
...
Boilermakers 2
Cost Estimators 2
Lighting Technicians 2
Firefighters 2
Desktop Publishers 2
Name: count, Length: 519, dtype: int64
AREA AREA_TITLE NAICS NAICS_TITLE I_GROUP \
count 44770.000000 44770 44770.000000 44770 44770
unique NaN 54 NaN 8 3
top NaN Texas NaN Educational Services 4-digit
freq NaN 1789 NaN 23076 21694
mean 29.720281 NaN 453693.846728 NaN NaN
std 16.305373 NaN 267298.464516 NaN NaN
min 1.000000 NaN 61.000000 NaN NaN
25% 17.000000 NaN 61.000000 NaN NaN
50% 29.000000 NaN 611000.000000 NaN NaN
75% 42.000000 NaN 611300.000000 NaN NaN
max 78.000000 NaN 611700.000000 NaN NaN
OCC_CODE OCC_TITLE O_GROUP TOT_EMP EMP_PRSE ... H_MEDIAN \
count 44770 44770 44770 44770 44770 ... 44770
unique 519 519 3 1768 501 ... 5169
top 00-0000 All Occupations detailed 40 ** ... *
freq 471 471 38038 2962 1397 ... 8436
mean NaN NaN NaN NaN NaN ... NaN
std NaN NaN NaN NaN NaN ... NaN
min NaN NaN NaN NaN NaN ... NaN
25% NaN NaN NaN NaN NaN ... NaN
50% NaN NaN NaN NaN NaN ... NaN
75% NaN NaN NaN NaN NaN ... NaN
max NaN NaN NaN NaN NaN ... NaN
H_PCT75 H_PCT90 A_PCT10 A_PCT25 A_MEDIAN A_PCT75 A_PCT90 ANNUAL HOURLY
count 44770 44770 44770 44770 44770 44770 44770 8436 0
unique 6023 6855 6617 7779 9184 10744 12120 1 0
top * * 24,960 24,960 24,960 # # True NaN
freq 8436 8436 213 86 46 208 817 8436 NaN
mean NaN NaN NaN NaN NaN NaN NaN NaN NaN
std NaN NaN NaN NaN NaN NaN NaN NaN NaN
min NaN NaN NaN NaN NaN NaN NaN NaN NaN
25% NaN NaN NaN NaN NaN NaN NaN NaN NaN
50% NaN NaN NaN NaN NaN NaN NaN NaN NaN
75% NaN NaN NaN NaN NaN NaN NaN NaN NaN
max NaN NaN NaN NaN NaN NaN NaN NaN NaN
[11 rows x 26 columns]
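Compared with the first inspection, the output above has 387 fewer rows (45157 down to 44770) and `A_MEAN` is now `float64`. One way such a result could arise is sketched below. The function name `clean_salary_data` and the sample rows are hypothetical, and the program's exact cleaning rules are not shown in this document, so treat this as an illustration of the general technique only.

```python
import pandas as pd

# Made-up sample mimicking the raw layout: salaries stored as text
# with thousands separators and "*" markers.
df = pd.DataFrame({
    "AREA_TITLE": ["Alabama", "Alabama", "Texas", "Wyoming"],
    "OCC_TITLE": ["Substitute Teachers, Short-Term"] * 4,
    "A_MEAN": ["30,000", "30,000", "*", "34,770"],
})


def clean_salary_data(raw):
    """Return a copy with exact duplicates removed and A_MEAN as float."""
    cleaned = raw.drop_duplicates().copy()
    # Strip thousands separators, then coerce non-numeric values
    # (for example "*") to NaN.
    cleaned["A_MEAN"] = pd.to_numeric(
        cleaned["A_MEAN"].str.replace(",", "", regex=False),
        errors="coerce",
    )
    # Drop rows whose salary could not be parsed, then renumber the index.
    cleaned = cleaned.dropna(subset=["A_MEAN"]).reset_index(drop=True)
    return cleaned


cleaned_df = clean_salary_data(df)
print(cleaned_df.dtypes)
print(len(df), "->", len(cleaned_df))
```

Only `A_MEAN` is converted here, which matches the output above, where the other wage columns remain `object` and still contain `*` and `#`.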
Filtered Subset of Teachers in Educational Services:
AREA_TITLE OCC_TITLE A_MEAN
731 Alabama Agricultural Sciences Teachers, Postsecondary 104480.0
732 Arizona Agricultural Sciences Teachers, Postsecondary 89130.0
733 Arkansas Agricultural Sciences Teachers, Postsecondary 59470.0
734 California Agricultural Sciences Teachers, Postsecondary 121140.0
735 Colorado Agricultural Sciences Teachers, Postsecondary 91600.0
... ... ... ...
42628 Wyoming Special Education Teachers, Middle School 65850.0
42630 Wyoming Special Education Teachers, Secondary School 65620.0
42631 Wyoming Special Education Teachers, Secondary School 65620.0
42633 Wyoming Substitute Teachers, Short-Term 34770.0
42634 Wyoming Substitute Teachers, Short-Term 34770.0
[4690 rows x 3 columns]
Average Annual Salary (A_MEAN) across subset: $82,984.55
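The filtered subset and average printed above could be produced along the following lines. This is a sketch under stated assumptions: the sample rows are made up, and filtering with an exact match on `NAICS_TITLE == "Educational Services"` combined with a substring match on "Teacher" in `OCC_TITLE` is a guess at the filter, based on the printed heading and the variable descriptions.

```python
import pandas as pd

# Made-up cleaned rows for illustration only.
cleaned_df = pd.DataFrame({
    "AREA_TITLE": ["Alabama", "Alabama", "Texas", "Wyoming"],
    "NAICS_TITLE": [
        "Educational Services",
        "Educational Services",
        "Elementary and Secondary Schools",
        "Educational Services",
    ],
    "OCC_TITLE": [
        "Substitute Teachers, Short-Term",
        "Management Occupations",
        "Substitute Teachers, Short-Term",
        "Special Education Teachers, Middle School",
    ],
    "A_MEAN": [30000.0, 90000.0, 32000.0, 66000.0],
})

# Boolean masks: industry match (assumed exact) and occupation substring match.
is_education = cleaned_df["NAICS_TITLE"] == "Educational Services"
is_teacher = cleaned_df["OCC_TITLE"].str.contains("Teacher", na=False)

# Keep only the three columns shown in the printed subset.
teacher_subset = cleaned_df.loc[
    is_education & is_teacher, ["AREA_TITLE", "OCC_TITLE", "A_MEAN"]
]
avg_salary = teacher_subset["A_MEAN"].mean()

print("Filtered Subset of Teachers in Educational Services:")
print(teacher_subset)
print(f"Average Annual Salary (A_MEAN) across subset: ${avg_salary:,.2f}")
```

Note that the printed subset above contains repeated area and occupation pairs (for example the two Wyoming "Special Education Teachers, Secondary School" rows), so an unweighted mean like this one counts each such row separately.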