College Scorecard Analysis
Introduction and Background
The College Scorecard was designed by the US Department of Education to help students and parents decide on the next steps in pursuing higher education. It is meant to increase transparency, putting power in the hands of students to compare how well individual postsecondary institutions prepare their students to be successful. The College Scorecard provides data on institutions, earnings, debt, and fields of study. Students can also access information on available federal funding, which can help in planning for loans and debt. [1]
As an international student in the United States, I know the importance of selecting a university while on a budget. I still remember how much time I spent on the internet searching for universities within my budget that offer Data Science courses. The College Scorecard project will help students choose a university based on their finances and their field of interest.
A study by Royal & Co shows that 18.6% of students end up not attending their first-choice college due to financial issues.
About the data
The data is provided through federal reporting from institutions, data on federal financial aid, and tax information. A complete set of these data for all active Integrated Postsecondary Education Data System (IPEDS) institutions that participate in Title IV programs (either by disbursing aid or through deferments) and that are not solely administrative offices is available on the Scorecard data webpage and API. [3]
The data is available on data.ed.gov [4] and includes the following:
Field of study data and merged files:
'Institution-level' data files for 1996-97 through 2018-19 containing aggregate data for each institution. It includes information on institutional characteristics, enrollment, student aid, costs, and student outcomes.
'Field of study-level' data files for the pooled 2014-15, 2015-16 award years through the pooled 2016-17, 2017-18 award years containing data at the credential level and 4-digit CIP code combination for each institution. Includes information on cumulative debt at graduation and earnings one year after graduation.
Crosswalk files for 2000-01 through 2018-19 that link the Department’s OPEID with an IPEDS UNITID for each institution.
Most recent institution-level data
Includes information such as net tuition revenue, institutional expenditure, and Pell grants (all per FTE student).
Most recent data by field of study
Aim of the project
Part one of the project will include exploratory data analysis to find answers to the following questions:
It is a general belief that studying at the main campus will give the students better work opportunities and leave them with less debt. Is this true?
Which field of study leads to the highest debt after graduation?
Bachelor's vs Master's vs Doctoral degree: which degree leads to high earnings and low debt?
Part two will involve developing a model to predict suitable universities for a student based on the field of study and the loan amount the student can afford.
Methodology
Data Collection: from data.gov and Google GeoAPI
Data preprocessing: Data cleaning and feature extraction
Exploratory data analysis: to answer the questions listed in Part I
Modeling: a machine learning model will be developed for the predictions described in Part II
Documenting the results
I will be talking about the project here. Watch the video for more details!
Part 1
The main aim of this part is to understand the data thoroughly. Understanding the underlying properties at first glance is nearly impossible, and this analysis will help me with that.
Watch the video to glance through my analysis. For a detailed analysis, keep reading!
Data Exploration and Cleaning
Understanding and documenting the attributes took a long time, as I had to rely on the government documentation to decide which features to consider for the analysis. The datasets contain a lot of privacy-suppressed values: if the count for an instance is less than 20, the values are suppressed to safeguard individuals' information and prevent them from being traced. Because of the suppressed values, only 20% of the dataset is used in the analysis. Despite this large drop, I still have ~40,000 instances covering over 250 universities. The entries are independent, so the missing values cannot be interpolated.
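A minimal sketch of this filtering step with pandas. It assumes the suppressed cells carry the literal text "PrivacySuppressed", as in the Scorecard CSV files; the column names `debt_median` and `earnings_median` and the sample rows are hypothetical stand-ins, not the columns actually used in the project.

```python
import pandas as pd

def drop_suppressed(df, value_cols):
    """Return only the rows where every column in value_cols holds a usable number."""
    cleaned = df.copy()
    for col in value_cols:
        # "PrivacySuppressed" (and any other non-numeric text) becomes NaN.
        cleaned[col] = pd.to_numeric(cleaned[col], errors="coerce")
    return cleaned.dropna(subset=value_cols).reset_index(drop=True)

# Hypothetical sample rows; the real files have many more columns.
raw = pd.DataFrame({
    "INSTNM": ["Univ A", "Univ A", "Univ B", "Univ C"],
    "CIPDESC": ["Computer Science", "Biology", "Dentistry", "Medicine"],
    "debt_median": ["21000", "PrivacySuppressed", "15500", "PrivacySuppressed"],
    "earnings_median": ["52000", "48000", "PrivacySuppressed", "PrivacySuppressed"],
})

clean = drop_suppressed(raw, ["debt_median", "earnings_median"])
kept_share = len(clean) / len(raw)  # fraction of rows that survive the filter
print(clean)
print(f"kept {kept_share:.0%} of the rows")
```

Rows are dropped rather than filled in because, as noted above, the entries are independent and cannot be interpolated.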
Dimension Reduction: The dataset initially contained 18 columns of which 4 have been removed due to redundancy.
Data Integration: Adding the following columns will be beneficial for analysis and model development.
Location of the universities (City, State, and Zipcode): The Scorecard website has multiple datasets; I used 'Most-Recent-Cohorts-All-Data-Elements.csv' to get the location details.
Cost of living index: The Missouri Economic Research and Information Center offers a small, up-to-date cost of living data series. It is joined to the main data by state.
National ranks of the universities: The Center for World University Rankings has a dataset with national and world rankings of universities. I used the national ranking because my analysis is restricted to the United States. The dataset has multiple rankings for a single university, with no additional details on what each rank indicates. On inspection, I found that the rankings for a given university are not far apart, so I used their average.
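The integration steps above can be sketched with pandas merges. This is only an illustration: the join keys (`INSTNM` for the institution name, `STABBR` for the state), the column names `col_index` and `national_rank`, and all the values are assumptions made for the example.

```python
import pandas as pd

# One row per (university, field of study), as in the field-of-study data.
programs = pd.DataFrame({
    "INSTNM": ["Univ A", "Univ A", "Univ B"],
    "STABBR": ["MD", "MD", "VA"],
    "CIPDESC": ["Computer Science", "Biology", "Computer Science"],
})

# One cost of living index per state (made-up values).
cost_of_living = pd.DataFrame({"STABBR": ["MD", "VA"], "col_index": [120.0, 100.0]})

# Several national ranks per university, as described above (made-up values).
rankings = pd.DataFrame({
    "INSTNM": ["Univ A", "Univ A", "Univ B"],
    "national_rank": [40, 44, 90],
})

# Collapse the multiple ranks of each university into their average.
avg_rank = (
    rankings.groupby("INSTNM", as_index=False)["national_rank"]
    .mean()
    .rename(columns={"national_rank": "avg_national_rank"})
)

# Left joins keep every program row even when a lookup has no match.
enriched = (
    programs
    .merge(cost_of_living, on="STABBR", how="left")
    .merge(avg_rank, on="INSTNM", how="left")
)
print(enriched)
```

Because the cost of living index is joined by state, every program at every university in a state receives the same index value.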
Answering the questions!
Q1: It is a general belief that studying at the main campus will give the students better work opportunities and leave them with less debt. Is this true?
To answer this question, I trained two linear regression models. The first uses the median earnings and university ranking to predict the debt faced by the students graduating from a university. The second adds the main campus feature to earnings and ranking. The 'main' column has a value of 1 if the listed university is the main campus and 0 otherwise. For example, UMBC has a value of 1, while the Shady Grove campus has 0.
After training the models and predicting the debt values, I evaluated each model using the R-squared score, which measures how well a regression model fits the observed data; a higher R-squared indicates a better fit. Both models produced the same R-squared value, which means the main campus factor did not play an important role in determining students' debt after graduation.
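The comparison can be sketched as follows. This is not the project's code: it fits ordinary least squares with NumPy instead of a modeling library, and it runs on synthetic data in which debt is generated from earnings and rank only, so the 'main' flag is uninformative by construction.

```python
import numpy as np

def r2_of_linear_fit(X, y):
    """Fit ordinary least squares with an intercept and return the in-sample R-squared."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend an intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    residuals = y - A @ coef
    ss_res = float(residuals @ residuals)           # unexplained variation
    ss_tot = float(((y - y.mean()) ** 2).sum())     # total variation
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(0)
n = 500
earnings = rng.normal(50_000, 10_000, n)
rank = rng.uniform(1, 200, n)
main = rng.integers(0, 2, n)  # 1 = main campus, 0 = branch campus

# Synthetic debt: depends on earnings and rank, not on `main`.
debt = 5_000 + 0.3 * earnings + 20 * rank + rng.normal(0, 2_000, n)

r2_base = r2_of_linear_fit(np.column_stack([earnings, rank]), debt)
r2_with_main = r2_of_linear_fit(np.column_stack([earnings, rank, main]), debt)
print(f"without main: {r2_base:.4f}, with main: {r2_with_main:.4f}")
```

When both models are scored on the data they were trained on, adding a feature can never lower R-squared, so what matters is the size of the gain. A gain close to zero, as in this synthetic case, suggests the added feature explains almost nothing beyond the other two.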
Q2: Which field of study leads to the highest debt after graduation?
I have answered the question in two ways:
2.1. The field of study leaving students with the highest debt
To have a fair answer, I chose the top 50 entries by debt. 'Dentistry' tops the list of fields leaving students with the highest debt after graduation, followed by 'Osteopathy' and 'Medicine'. The graph on the left shows all the fields from the top 50 list.
2.2 The fields with the highest number of students left with debt
To answer this, I grouped the data by course description (CIPDESC) and picked the top 5 fields. 'Business Administration', 'Registered Nursing and Nursing Research', and 'Liberal Arts and Sciences' have left the highest number of students with debt.
It can be observed that fields having the highest debt are not the fields leaving the highest number of students in debt and vice versa.
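Both views of Q2 can be sketched with pandas. The column names `debt_median` and `debt_count` and the numbers are hypothetical, and summing a per-row student count is an assumption about how "number of students left with debt" was measured.

```python
import pandas as pd

# Made-up rows: one per (university, field) with a median debt and a student count.
df = pd.DataFrame({
    "CIPDESC": ["Dentistry", "Business Administration", "Business Administration",
                "Medicine", "Liberal Arts and Sciences"],
    "debt_median": [290_000, 24_000, 26_000, 210_000, 18_000],
    "debt_count": [80, 900, 1_100, 150, 700],
})

# 2.1: fields that appear among the N rows with the largest median debt
# (N = 3 here; the analysis above uses 50).
top_rows = df.nlargest(3, "debt_median")
fields_by_debt = top_rows["CIPDESC"].drop_duplicates().tolist()

# 2.2: fields with the most indebted students, summed over all universities.
students_by_field = (
    df.groupby("CIPDESC")["debt_count"].sum().sort_values(ascending=False)
)
top_fields_by_count = students_by_field.head(2).index.tolist()

print(fields_by_debt)
print(top_fields_by_count)
```

The two lists barely overlap in this toy data either: the first ranks individual rows by amount, while the second aggregates head counts across universities.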
Q3: Bachelor's vs Master's vs Doctoral degree: which degree leads to high earnings and low debt?
I used box plots to compare the three most popular degrees (bachelor's, master's, and doctoral) and to get an overall idea of how the data is distributed. The vertical line that goes through each box is the median value of the feature.
After comparing the box plots, we can conclude that no degree leaves a student with both high earnings and low debt. Students with a higher degree seem to have both higher pay and higher debt, and vice versa.
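The median lines of those box plots correspond to a grouped median, which can be sketched with pandas. The column names and all the values below are made up for illustration; they are not results from the Scorecard data.

```python
import pandas as pd

# Hypothetical rows: three programs per degree level.
df = pd.DataFrame({
    "credential": ["Bachelor's"] * 3 + ["Master's"] * 3 + ["Doctoral"] * 3,
    "earnings_median": [40_000, 45_000, 50_000,
                        55_000, 60_000, 70_000,
                        80_000, 90_000, 100_000],
    "debt_median": [20_000, 25_000, 27_000,
                    35_000, 40_000, 45_000,
                    90_000, 120_000, 150_000],
})

# One median per degree level and feature: the line drawn inside each box.
summary = df.groupby("credential")[["earnings_median", "debt_median"]].median()
print(summary)
```

Reading the two columns side by side gives the same comparison the box plots show visually.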
Part 2
The aim of this part of the project is to suggest universities keeping the financial aspect and field of study in mind.
Watch the video to glance through my analysis!
First, I create a new attribute named 'Financial Factor'. This factor is the amount of money a student earns annually after graduation minus the amount the student spends on tuition and living in a year.
As the factor increases, the amount of money a student saves increases. A negative value implies the student is at a loss: a year of education cost more than the student earns in a year after graduation.
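The definition above translates directly into a small function. The parameter names and the dollar amounts are illustrative assumptions.

```python
def financial_factor(annual_earnings, annual_tuition, annual_living_cost):
    """Annual earnings after graduation minus one year of tuition and living costs."""
    return annual_earnings - (annual_tuition + annual_living_cost)

# Earnings exceed yearly costs: positive factor.
saving = financial_factor(60_000, 12_000, 18_000)

# Yearly costs exceed earnings: negative factor, the student is at a loss.
loss = financial_factor(25_000, 20_000, 15_000)

print(saving, loss)
```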
I first drew a scatter plot of the financial factor against the cost of living index to see how the universities are distributed.
I found a single outlier, which has been removed.
Scaling the columns: If numerical columns are not scaled before training, a distance-based algorithm such as K-means gives more weight to columns with larger values, regardless of their units. To avoid this problem, I scaled the financial factor and cost of living columns.
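One common way to do this is standardization (subtract the mean, divide by the standard deviation). The sketch below assumes that method; the post does not say which scaler was used, and min-max scaling would be an equally plausible choice. The column names and values are hypothetical.

```python
import pandas as pd

def standardize(df, cols):
    """Return a copy of df with each listed column rescaled to mean 0 and std 1."""
    scaled = df.copy()
    for col in cols:
        scaled[col] = (df[col] - df[col].mean()) / df[col].std(ddof=0)
    return scaled

# The financial factor is in dollars; the index is a small unitless number.
data = pd.DataFrame({
    "financial_factor": [-10_000.0, 5_000.0, 30_000.0, 15_000.0],
    "col_index": [90.0, 100.0, 150.0, 120.0],
})

scaled = standardize(data, ["financial_factor", "col_index"])
print(scaled)
```

After scaling, a difference of one unit means "one standard deviation" in both columns, so neither dominates the distance calculation.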
Clustering the data using K-means
I have clustered the data to divide the universities into best, safe and risky categories.
Conclusions from the clusters:
The best universities (clusters 2 and 3): universities with a high financial factor and a low cost of living.
The safe universities (cluster 0): universities with a low financial factor and a moderate cost of living.
The risky universities (clusters 1 and 4): universities where the financial factor is low and the cost of living is high.
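A minimal sketch of the clustering step, assuming scikit-learn's `KMeans` with five clusters (matching cluster ids 0 to 4 above). The points are synthetic stand-ins for scaled (financial factor, cost of living) pairs, and `n_init` and `random_state` are choices made for the example.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Five well-separated synthetic groups of 40 points each.
true_centers = np.array([[0, 0], [-3, 3], [3, -3], [3, 3], [-3, -3]], dtype=float)
points = np.vstack([c + rng.normal(0, 0.3, size=(40, 2)) for c in true_centers])

model = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = model.fit_predict(points)  # one cluster id per point

# Each center is (mean scaled financial factor, mean scaled cost of living).
# Inspecting the centers is how a cluster can be named best, safe, or risky.
for cluster_id, (ff, col) in enumerate(model.cluster_centers_):
    print(f"cluster {cluster_id}: financial factor {ff:+.2f}, cost of living {col:+.2f}")
```

Cluster ids are arbitrary and can change between runs or seeds, so the mapping from id to category has to be read from the centers each time rather than hard-coded.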
Here, my aim is to select the universities where students spend less on living and have a high financial factor, meaning the amount they earn annually after graduation is more than the amount they spend yearly to earn the degree.
If you’re wondering about the constant horizontal lines, here’s the explanation: the dataset has information about the universities and the courses available at them, so every university has multiple entries. The cost of living index is per state, and every state has multiple universities. Since all the universities in a state share the same cost of living index, we end up with horizontal lines.
Ex: If UMBC has 10 courses and UMCP has 15 courses, all 25 entries have the same cost of living index and plot as a horizontal line.
Now that the clustering is done, let’s put the program to test. The course, course level, and the tuition fee are taken as dynamic inputs from the user running the program.
For this example, I have chosen the course to be Computer Science, the level of degree as Bachelors and the tuition fee to be $10,000.
You can see the list of the best universities suggested by the program. To verify the output, I looked up the tuition fees of these universities; they can be seen on the right.
The fee range is set to within $5,000 above or below the entered fee, and all the suggested universities fall in that range, which means the project is a success!
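The suggestion step can be sketched as a filter over the clustered table. The function name, the column names (`credential`, `cluster`, `tuition`, `financial_factor`), the sample rows, and the final sort by financial factor are all assumptions for illustration; only the "best" clusters (2 and 3) and the $5,000 tolerance come from the description above.

```python
import pandas as pd

def suggest_universities(df, course, level, tuition, best_clusters=(2, 3), tolerance=5_000):
    """Return names of 'best' cluster universities offering the course near the given fee."""
    mask = (
        (df["CIPDESC"] == course)
        & (df["credential"] == level)
        & df["cluster"].isin(list(best_clusters))
        & df["tuition"].between(tuition - tolerance, tuition + tolerance)
    )
    # Highest financial factor first (an assumed ordering).
    return df.loc[mask].sort_values("financial_factor", ascending=False)["INSTNM"].tolist()

catalog = pd.DataFrame({
    "INSTNM": ["Univ A", "Univ B", "Univ C", "Univ D", "Univ E"],
    "CIPDESC": ["Computer Science"] * 4 + ["Biology"],
    "credential": ["Bachelors"] * 5,
    "cluster": [2, 3, 1, 3, 2],
    "tuition": [9_000, 16_000, 11_000, 14_000, 10_000],
    "financial_factor": [30_000, 40_000, 20_000, 35_000, 25_000],
})

suggestions = suggest_universities(catalog, "Computer Science", "Bachelors", 10_000)
print(suggestions)
```

In this toy table, Univ B is dropped for being outside the fee range, Univ C for sitting in a risky cluster, and Univ E for offering a different course.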
Limitations of the College Scorecard Analysis
Every project has limitations, and mine has a few too:
The dataset initially had about 5000 universities, but most of the entries are privacy suppressed. After data cleaning, I ended up with only about 300 universities, and most of the suppressed data relates to higher degree levels.
The tuition fee in the dataset is in-state tuition, so currently the model is not very useful for international students planning to study in the USA.
For the first time, a governmental organization (the U.S. Department of Education) has published such information. This will hopefully lead to more universities releasing information without suppressing the data, and once that data is public, the model will work for every case!
This brings us to the end of my project. I am happy to have developed a project that can be used in real time by prospective students. Even though it is a simple project, it can help students filter universities with the financial aspect in mind.
The project can be viewed here!
References:
[1] US Department of Education (Documentation, 2020), https://collegescorecard.ed.gov/assets/FullDataDocumentation.pdf
[2] https://www.naicu.edu/policy-advocacy/issue-brief-index/regulation/college-scorecard#:~:text=The%20College%20Scorecard%2C%20which%20was,diversity%20%E2%80%93%20to%20create%20institutional%20profiles
[3] https://collegescorecard.ed.gov/assets/FullDataDocumentation.pdf
[4] https://data.ed.gov/dataset/college-scorecard-all-data-files-through-6-2020/resources