Overview
This site provides a summary of my work on my Data Science Capstone Project, which involves identifying features that can predict a high school's academic achievement, as measured by proficiency on standardized state assessments in reading and math.
The project is organized into three phases that cover the background, literature, data, methodology, process, and results, along with a summary of the findings and next steps.
Phase 1
The federal Department of Education uses assessment proficiency as a measure of academic achievement to evaluate how districts and schools perform against state content standards. The Every Student Succeeds Act (ESSA), a US law passed in December 2015, governs K–12 public education policy in the United States [8].
Under ESSA, each state is required to administer standardized assessments annually to assess, evaluate, and ensure educational equity for all students within the state.
Performance on standardized assessments can be influenced by several factors, such as school locale, staffing, and poverty.
Various studies have shown a correlation between school geography, environmental factors, and socioeconomic conditions (specifically, poverty level) and student outcomes such as academic achievement and attendance [1, 2, 5].
Studies conducted at lower grade levels have shown that various socioeconomic factors can predict students' academic performance on assessments.
Tienken et al. (2017) conducted a study in New Jersey middle schools and found that poverty, family income, and educational attainment derived from census data could predict student proficiency on assessments close to 80% of the time [3].
Pennington (2013) researched the association between district-level characteristics and statewide test results and found that a poverty indicator, defined as the percentage of students receiving free or reduced-price lunch, had the greatest impact on test scores [6, 7].
Gaps in the literature: Few studies have used publicly available school-level datasets to identify the features that could predict proficiency on state assessments.
Similar to these studies, my project looks at factors such as neighborhood poverty and school locale to determine which are good predictors of student outcomes (i.e., proficiency on state assessments). In addition, I incorporate several other factors, such as school staffing, school characteristics, and school status, using school-level datasets from various federal data sources, and I look at all high schools across the country.
My project goal is to use school-level features (such as school neighborhood poverty, staffing, course offerings, and enrollment) to predict assessment proficiency in math and reading for high schools across the country.
I hope this project will help education policymakers identify at-risk schools at an early stage and direct targeted interventions, funding, or resources toward such schools to help improve their educational outcomes.
Can school-level features predict a school's proficiency level on state assessments for math vs reading?
Is there a difference in the relative feature importance for math vs reading? Or is it the same?
Data will be collected from the following repositories for a single school year (SY 2017–18):
Common Core of Data (CCD): contains data files pertaining to school directory, membership, enrollment, and finance information for all public schools across the country.
Civil Rights Data Collection (CRDC): collects school-level data provided by districts through a biennial survey that gathers information on school characteristics, course enrollments, staffing, etc.
EdFacts: provides state assessment proficiency data for reading and math at the school level for each state.
EDGE data: provides school neighborhood poverty estimates based on the Census data.
Phase 1: Download, clean and process dataset
Download data files from each of the repositories
Combine all the data files
Clean and apply the necessary transformations
Produce a final master file for further analysis (a sketch of this pipeline is shown below)
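As a rough illustration, here is a minimal sketch of what this pipeline could look like in pandas for a single source; the file names and the school_id column are hypothetical placeholders, not the actual repository file names.

```python
# A sketch of the Phase 1 pipeline for one source, assuming the raw
# files have already been downloaded as CSVs. File names and the
# "school_id" column are hypothetical placeholders.
import pandas as pd

# Hypothetical file names for a single repository (e.g., CCD)
files = ["ccd_directory.csv", "ccd_membership.csv", "ccd_staff.csv"]

# Read each file, keeping the school identifier as a string so that
# leading zeros are preserved
frames = [pd.read_csv(f, dtype={"school_id": str}) for f in files]

# Combine the files one by one on the shared school identifier
combined = frames[0]
for frame in frames[1:]:
    combined = combined.merge(frame, on="school_id", how="inner")

# Basic cleaning: drop rows with missing values, then save the result
combined = combined.dropna()
combined.to_csv("ccd_combined.csv", index=False)
```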
Phase 2: EDA and Model Construction
Exploration
Feature Engineering
Data Modeling
Phase 3: Execution and Interpretation
Revised Data Modeling and tuning
Feature Engineering (as needed)
Model evaluation
Provides a brief overview of the project, literature, and datasets
Phase 2
The image above depicts the process of combining files from the CCD data source to generate the master data files.
About the Data:
As illustrated in the methodology, data was extracted from various sources (CCD, CRDC, EDGE, and EdFacts). Each source contained multiple data files, and each of these files was cleaned, processed, and combined into a single data file per source. Finally, the data files from each source were merged sequentially to create the two master data files that are used for further exploration and data modeling; a sketch of this merge is shown below.
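The sequential merge could look like the following, assuming each source has already been reduced to a single file keyed by a common school identifier (here called school_id, an illustrative placeholder):

```python
# A sketch of the sequential merge across sources; file names and the
# "school_id" key are illustrative assumptions.
import pandas as pd

sources = ["ccd_combined.csv", "crdc_combined.csv",
           "edge_poverty.csv", "edfacts_reading.csv"]

master = pd.read_csv(sources[0], dtype={"school_id": str})
for path in sources[1:]:
    frame = pd.read_csv(path, dtype={"school_id": str})
    master = master.merge(frame, on="school_id", how="inner")

# The same process with the EdFacts math file yields master_math.csv
master.to_csv("master_reading.csv", index=False)
```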
Two datasets were created for data exploration and modeling:
master_reading.csv: This data file contains 14,507 non-null rows and 18 columns including the target variable.
master_math.csv: This data file contains 13,799 rows and 22 columns including the target variable.
Details about the specific target and features included are shown below.
Showing the top 5 rows of the reading dataset
Note: Some of the features have been truncated in this view. What is key to note is that there is one categorical feature (Title1_Status), which will need to be encoded into one-hot vectors for use in the data modeling step, as sketched below.
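As a quick sketch of that encoding step (assuming the master file and column names described above), pd.get_dummies can expand Title1_Status into one binary column per category:

```python
# A minimal sketch of one-hot encoding the categorical feature; the
# resulting dummy column names depend on the actual category values.
import pandas as pd

df = pd.read_csv("master_reading.csv")

# Replaces Title1_Status with one 0/1 column per category
df = pd.get_dummies(df, columns=["Title1_Status"])
```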
I calculated the Pearson correlation coefficient of each feature with the target variable ([Percent_Reading_Proficient] or [Percent_Math_Proficient]) to get a basic idea of how the features relate to it. Visualizations of a few of the more correlated variables are shown below.
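A minimal sketch of that correlation check, assuming the reading master file described above:

```python
# Pearson correlation of each numeric feature with the target,
# sorted from most positive to most negative
import pandas as pd

df = pd.read_csv("master_reading.csv")
target = "Percent_Reading_Proficient"

correlations = (df.select_dtypes("number")
                  .corrwith(df[target])
                  .drop(target)
                  .sort_values(ascending=False))
print(correlations)
```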
It seems that a few features (IPR_estimate, No.ofAP_courses_offer, Total_students_tookAP) have a weak to moderate positive correlation with the target (Percent_Reading_Proficient), and a couple (e.g., School_type) are somewhat negatively correlated.
It seems that a few features (IPR_estimate, No.ofAP_courses_offer, Total_AP_math_students) have a weak to moderate positive correlation with the target (Percent_Math_Proficient), and a couple (e.g., School_type) are somewhat negatively correlated.
To prepare the data for machine learning, I applied feature engineering steps such as one-hot encoding the categorical feature (Title1_Status) and splitting each dataset into training and test sets.
Regression Models
Initially, I used various regression models, namely Linear Regression, Random Forest Regressor, SVM Regression, and KNN, to train on and predict the target variables given the different features within each dataset.
The key metrics I reviewed for the regression models were R-squared and Root mean squared error.
R-squared is a statistical measure of how well the data fit the regression line. It indicates the percentage of the variance in the dependent variable that can be explained by the independent variables.
Root mean squared error (RMSE) is the standard deviation of the residuals (prediction errors).
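A sketch of this model comparison is shown below; it assumes the math master file and the feature engineering steps described above, and uses default (untuned) hyperparameters:

```python
# A sketch of the regression comparison on the math dataset; the same
# code applies to the reading dataset with its file and target column.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error, r2_score

df = pd.read_csv("master_math.csv")
df = pd.get_dummies(df, columns=["Title1_Status"]).dropna()
y = df.pop("Percent_Math_Proficient")  # continuous target
X = df  # identifier columns, if any, should be dropped first

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=42),
    "SVM": SVR(),
    "KNN": KNeighborsRegressor(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    rmse = np.sqrt(mean_squared_error(y_test, preds))
    print(f"{name}: RMSE={rmse:.2f}, R-squared={r2_score(y_test, preds):.2f}")
```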
Below are some of the results attained from this initial approach for the math and reading datasets.
Results of the regression models on the math dataset alone.
Fig 1. The Random Forest Model had the lowest RMSE at 18.93 compared to other models indicating that the model was a better fit at predicting the proficiency scores in math.
Fig 2. The Random Forest Model had the highest R-squared value, indicating that only 36% of the variance in proficiency scores for math was accounted for by the features.
Table 1. Shows the RMSE and R-squared values of the different models on predicting the math proficiency percentages.
Results of the regression models on the reading dataset alone.
Fig 3. The Random Forest Model had the lowest RMSE at 18.90 compared to other models indicating that the model was a better fit at predicting the proficiency scores in reading.
Fig 4. The Random Forest Model had the highest R-squared value, indicating that only 35% of the variance in proficiency scores for reading was accounted for by the features.
Table 2. Shows the RMSE and R-squared values of the different models on predicting the reading proficiency percentages.
Based on the results above, the Random Forest Regressor performed only slightly better than the other three regression models. Moreover, the low R-squared values indicate that the selected features do not explain much of the variance seen in the proficiency scores for either reading or math.
Thus, I will consider a different approach to the problem in the next phase.
Provides a brief overview of the EDA and model construction thus far
Phase 3
Fig 5. The chart above shows the distribution of the math proficiency percentages and how the data is binned into 4 parts based on equal frequency.
Given the outcomes of the regression models from the earlier attempt, I decided to revise my approach to the project problem. I discretized my continuous target variables by binning them into quantiles (1-Low, 2-Moderate, 3-High, and 4-Very High) based on an equal-frequency distribution of the percentages in the dataset, as sketched below.
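A minimal sketch of this binning step, assuming the math master file described earlier (the new label column name is illustrative):

```python
# Discretize the continuous target into four equal-frequency bins
import pandas as pd

df = pd.read_csv("master_math.csv")
df["Math_Proficiency_Label"] = pd.qcut(
    df["Percent_Math_Proficient"],
    q=4,
    labels=["1-Low", "2-Moderate", "3-High", "4-Very High"])
```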
Once the target variables were appropriately recoded, I reapplied the previously discussed feature engineering steps and began training and testing several classification models.
Classification Models
I used various classification models, namely Logistic Regression, Random Forest Classifier, Gradient Boosting Classifier, Extra Trees Classifier, SVM Classifier, and KNN Classifier, to train on and predict the target variables given the different features within each dataset.
The key metrics I reviewed for the classification models were Accuracy of the model, Precision, Recall and F1-score.
Accuracy is the proportion of predictions the model got right out of the total number of predictions.
Precision is the accuracy of positive predictions: TP / (TP + FP).
Recall (or sensitivity) is the ratio of correct positive predictions to the total number of actual instances in the class: TP / (TP + FN).
F1-score is the harmonic mean of precision and recall.
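A sketch of this classification comparison is shown below, reusing the binning and encoding steps above with default hyperparameters:

```python
# A sketch of the classification comparison on the math dataset; the
# same code applies to reading with its file and target column.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier,
                              GradientBoostingClassifier,
                              ExtraTreesClassifier)
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("master_math.csv")
df = pd.get_dummies(df, columns=["Title1_Status"]).dropna()
y = pd.qcut(df.pop("Percent_Math_Proficient"), q=4,
            labels=["1-Low", "2-Moderate", "3-High", "4-Very High"])
X = df  # identifier columns, if any, should be dropped first

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
    "Extra Trees": ExtraTreesClassifier(random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    # classification_report prints per-class precision, recall, and
    # F1-score along with overall accuracy
    print(name)
    print(classification_report(y_test, preds))
```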
Below are some of the results attained from this approach for the math and reading datasets.
Results of the classification models on the reading dataset alone.
Fig 6. As shown above, the Gradient Boosting model had the highest accuracy compared to the others.
Table 3. The Gradient Boosting model (48%) performed better than the other classifiers in accuracy as well as in precision, recall and f1-score.
Results of the classification models on the math dataset alone.
Fig 7. As shown in the chart above, the Gradient Boosting model had the highest accuracy compared to the others.
Table 4. The Gradient Boosting model (49%) performed better than the other classifiers in accuracy as well as in precision, recall and f1-score.
Using the Gradient Boosting model, the relative feature importance scores were determined to identify which of the selected features are most relevant in predicting the target variable, i.e., the proficiency labels for math and reading.
The school neighborhood income-to-poverty ratio (IPR) had the highest relative feature importance score for predicting both the math and reading proficiency labels.
Fig 8. IPR estimate had the highest relative feature importance score for predicting reading proficiency labels.
Fig 9. IPR estimate had the highest relative feature importance score for predicting math proficiency labels.
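For reference, a minimal sketch of how these importance scores can be pulled from a fitted model (reusing X_train and y_train from the classification sketch above):

```python
# Relative feature importance from the fitted Gradient Boosting model
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train, y_train)

# One importance score per feature column, summing to 1.0
importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False))
```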
Of the different classification models used, the Gradient Boosting Classifier had the highest accuracy, precision, recall, and F1-score. This is not surprising, because gradient boosting iteratively learns from each of the weak learners to build a stronger model [4].
Although the final model was successful at predicting the proficiency labels for math and reading roughly 50% of the time given the selected school-level features, there is still room for improvement, and further work can be done to improve the overall accuracy of the model.
Binning of the target variable may contribute to some loss of information or predictive power.
Selection of features from the various data sources was limited due to missing data or suppression applied for privacy reasons.
Utilize other binning strategies, such as entropy-based binning, for assigning labels to the target variables.
Incorporate economic, education-level, housing, demographic, and crime data from the census database for the neighborhoods surrounding the schools.
Utilize data from multiple years and additional grade levels to determine whether there are any underlying factors affecting the target.
Final presentation with summary of findings and future work.
Barshay, J. (2019). An analysis of achievement gaps in every school in America shows that poverty is the biggest hurdle. https://hechingerreport.org/an-analysis-of-achievement-gaps-in-every-school-in-america-shows-that-being-poor-is-the-biggest-hurdle/
Berman, J. D., McCormack, M. C., Koehler, K. A., Connolly, F., Clemons-Erby, D., Davis, M. F., Gummerson, C., Leaf, P. J., Jones, T. D., & Curriero, F. C. (2018). School environmental conditions and links to academic performance and absenteeism in urban, mid-Atlantic public schools. International journal of hygiene and environmental health, 221(5), 800–808. https://doi.org/10.1016/j.ijheh.2018.04.015
Tienken, C. H., Colella, A., Angelillo, C., Fox, M., McCahill, K. R., & Wolfe, A. (2017). Predicting Middle Level State Standardized Test Results Using Family and Community Demographic Data. RMLE Online, 40(1), 1–13. https://doi.org/10.1080/19404476.2016.1252304
Kurama, V. (2021, April 9). Gradient Boosting for Classification. Paperspace Blog. https://blog.paperspace.com/gradient-boosting-for-classification/
Muskan097. (2020, October 11). Choosing Evaluation Metrics For Classification Model. Analytics Vidhya. https://www.analyticsvidhya.com/blog/2020/10/how-to-choose-evaluation-metrics-for-classification-model/
Panizzon, D. (2014). Impact of Geographical Location on Student Achievement: Unpacking the Complexity of Diversity. https://doi.org/10.1007/978-3-319-05978-5_3
Pennington, J. (2013). District characteristics: What factors impact student achievement. https://educateiowa.gov/documents/newsroom/2014/12/district-characteristics-what-factors-impact-student-achievement
The U.S. Department of Education. (n.d.). Final Regulations: Assessments—Title I Parts A & B. Retrieved from https://www2.ed.gov/policy/elsec/leg/essa/essaassessmentfactsheet1207.pdf