DATA DESCRIPTION
I chose this dataset because I am interested in how much Tuition Assistance Program (TAP) funding each college in New York, especially the CUNY colleges, has been awarded over the years. I found the data on data.ny.gov (data.ny.gov/Education/Tuition-Assistance-Program-TAP-Recipients-Dollars-/ich7-7ewa); it is owned by NY Open Data.
The Tuition Assistance Program (TAP), New York's largest student financial aid grant program, helps eligible New York residents attending in-state postsecondary institutions pay for tuition. TAP grants are based on the applicant’s and his or her family’s New York State taxable income. This data includes TAP award recipients and dollar amounts by college, sector groups, and Level of Study for academic years 2000-2019.
The notebook for this project can be found at github.com/agyapongeli77/data-science-project
DATA ASSESSMENT
Structure of the Data
The data is rectangular and stored in CSV format. It contains 3 quantitative columns (two continuous and one discrete) and 5 qualitative columns (all nominal).
Granularity of the Data
Each data point in the raw dataset represents one academic year in which a college received TAP grants.
Scope & Completeness of the Data
The raw dataset contains CUNY Colleges, SUNY Colleges, Independent Colleges, Business Degree Granting Institutions, Non-Degree Business Schools, Chapter XXII TAP Schools and All Other Institutions. However, I will be analyzing only the TAP grants awarded to CUNY Colleges, so the scope of the raw dataset is broader than the analysis requires.
Temporality of the Data
This dataset was created in April 2013 and last updated on January 26, 2021. Its data points cover the years 2000 to 2019, but I will be analyzing only the 2010 to 2019 academic years.
Faithfulness of the Data
There is no missing data, but there are erroneous values in the dataset. Every CUNY Community College except the CUNY Stella & Charles Guttman Community College is listed as receiving TAP awards for 4 or 5 year degree programs. I believe this is an error because Community Colleges offer only 2 year programs, not 4 or 5 year programs.
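The two checks described above can be sketched with pandas. This is a minimal illustration, not the notebook's actual code: the toy rows, the college names, and the exact spelling of the sector and level values are assumptions made for the example.

```python
import pandas as pd

# Toy rows standing in for the raw dataset; all values are illustrative only.
df = pd.DataFrame({
    "TAP College Name": ["CUNY COLLEGE A", "CUNY CC B", "CUNY CC B"],
    "TAP Sector Group": ["CUNY Senior College", "CUNY Community College", "CUNY Community College"],
    "TAP Level of Study": ["4 yr Undergrad", "2 yr Undergrad", "4 yr Undergrad"],
    "TAP Recipient Dollars": [1000.0, 500.0, 50.0],
})

# Check 1: count missing values in every column.
missing_per_column = df.isna().sum()

# Check 2: flag community-college rows whose level of study is not a 2 year program.
is_community = df["TAP Sector Group"] == "CUNY Community College"
is_two_year = df["TAP Level of Study"].str.startswith("2")
suspect_rows = df[is_community & ~is_two_year]
```

Here `suspect_rows` holds the rows that would be treated as erroneous, so they can be inspected before anything is dropped.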
DATA CLEANING
I filtered the dataset to include TAP grants for ONLY CUNY Colleges from the 2010-2019 academic years and dropped the columns that are not useful for this analysis (i.e., Sector Type, Level, TAP College Code and Federal School Code).
I dropped the rows in which CUNY Community Colleges are listed with 4 or 5 year degree programs. I believe these rows are errors because community colleges offer 2 year degree programs, not 4 or 5 year programs.
I also removed the rows with the College Names 'CUNY GRAD SCH UNDERGRAD PROG' and 'CUNY GRAD CTR-SCHOL OF LABOR UG' because I am analyzing only the TAP grants awarded to CUNY Senior Colleges and Community Colleges.
Finally, I sorted the data frame by TAP College Name and Academic Year, reindexed it, and reordered the columns so that TAP College Name comes first.
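The cleaning steps above could look roughly like this in pandas. The column names come from the text; the toy rows, the sector and level labels, and the assumption that Academic Year is stored as an integer are hypothetical and may differ from the real file.

```python
import pandas as pd

SENIOR, COMMUNITY = "CUNY Senior College", "CUNY Community College"  # assumed labels

raw = pd.DataFrame({
    "Academic Year": [2009, 2012, 2011, 2015, 2013, 2016],
    "TAP College Code": [1, 1, 2, 2, 3, 4],
    "Federal School Code": ["a", "a", "b", "b", "c", "d"],
    "Level": ["U", "U", "U", "U", "U", "U"],
    "Sector Type": ["PUBLIC", "PUBLIC", "PUBLIC", "PUBLIC", "PRIVATE", "PUBLIC"],
    "TAP College Name": ["CUNY COLLEGE A", "CUNY COLLEGE A", "CUNY CC B", "CUNY CC B",
                         "PRIVATE COLLEGE C", "CUNY GRAD SCH UNDERGRAD PROG"],
    "TAP Sector Group": [SENIOR, SENIOR, COMMUNITY, COMMUNITY, "Independent", SENIOR],
    "TAP Level of Study": ["4 yr", "4 yr", "2 yr", "4 yr", "4 yr", "4 yr"],
    "TAP Recipient Dollars": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Keep only CUNY rows from the 2010-2019 academic years.
in_cuny = raw["TAP Sector Group"].isin([SENIOR, COMMUNITY])
in_years = raw["Academic Year"].between(2010, 2019)
clean = raw[in_cuny & in_years]

# Drop the columns that are not used in the analysis.
clean = clean.drop(columns=["Sector Type", "Level", "TAP College Code", "Federal School Code"])

# Drop community-college rows listed with a 4 or 5 year program.
bad_level = (clean["TAP Sector Group"] == COMMUNITY) & ~clean["TAP Level of Study"].str.startswith("2")
clean = clean[~bad_level]

# Remove the two graduate-school undergraduate programs.
excluded = ["CUNY GRAD SCH UNDERGRAD PROG", "CUNY GRAD CTR-SCHOL OF LABOR UG"]
clean = clean[~clean["TAP College Name"].isin(excluded)]

# Sort, reindex, and move the college name to the first column.
clean = clean.sort_values(["TAP College Name", "Academic Year"]).reset_index(drop=True)
clean = clean[["TAP College Name"] + [c for c in clean.columns if c != "TAP College Name"]]
```

Each step reassigns `clean`, so any single filter can be commented out to see how many rows it removes.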
Bar Charts
Histograms
Line Plot
Bar Chart
DUMMY VARIABLES
I created dummy variables by converting 2 qualitative columns into quantitative variables. The columns I converted are TAP Sector Group (which indicates whether the CUNY college is a community college or a senior college) and TAP Level of Study (which indicates whether the program of study is a 2 year program (Associate Degree), a 4 year program (Bachelor's Degree) or a 5 year program (Combined Bachelor's/Master's Degree)).
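One common way to build such dummy variables is `pandas.get_dummies`. The sketch below assumes that approach; the category labels and dollar amounts are placeholders, and the notebook may have encoded the columns differently.

```python
import pandas as pd

# Toy rows; the category labels are assumed for illustration.
df = pd.DataFrame({
    "TAP Sector Group": ["CUNY Senior College", "CUNY Community College", "CUNY Senior College"],
    "TAP Level of Study": ["4 yr", "2 yr", "5 yr"],
    "TAP Recipient Dollars": [1200.0, 800.0, 300.0],
})

# Replace each qualitative column with one 0/1 column per category.
encoded = pd.get_dummies(df, columns=["TAP Sector Group", "TAP Level of Study"], dtype=int)
```

The result keeps TAP Recipient Dollars and adds columns such as `TAP Sector Group_CUNY Community College`, which is 1 for community-college rows and 0 otherwise.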
DEPENDENT VARIABLE & INDEPENDENT VARIABLES
To apply a machine learning model to the data and make predictions, the dataset must first be separated into independent variables and a dependent variable. The dependent variable is the column we are trying to predict, and the independent variables are the columns in the dataset that will be used to predict the dependent variable.
In this dataset, I am trying to predict TAP Recipient Dollars, the amount of New York State Tuition Assistance Program (TAP) grants awarded to CUNY Colleges, so our dependent variable is TAP Recipient Dollars. Each machine learning model will predict this column, and comparing their predictions will show which model performs best.
FIRST DATA
The first data uses all of the following as the independent variables for making predictions: TAP Recipient Headcount (the number of recipients, measured as students receiving at least one term award during the academic year), TAP Recipient FTEs (the number of recipients, measured in academic year Full-Time Equivalents), TAP Sector Group (which indicates whether the CUNY college is a community college or a senior college) and TAP Level of Study (which indicates whether the program of study is a 2 year program (Associate Degree), a 4 year program (Bachelor's Degree) or a 5 year program (Combined Bachelor's/Master's Degree)).
I then split the data into 80% training data and 20% testing data with train_test_split from the scikit-learn module.
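A minimal sketch of the 80/20 split with `train_test_split`. The ten toy rows, the dummy column name `is_community_college`, and the fixed `random_state` are assumptions added so the example is reproducible; the text does not say whether a seed was used.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten toy rows; the values are made up for illustration.
df = pd.DataFrame({
    "TAP Recipient Headcount": [120, 450, 300, 800, 95, 610, 220, 1500, 70, 940],
    "TAP Recipient FTEs": [100.5, 400.0, 260.2, 710.9, 80.0, 555.5, 190.3, 1320.0, 61.7, 850.1],
    "is_community_college": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # hypothetical dummy column
    "TAP Recipient Dollars": [2.1e5, 1.6e6, 5.9e5, 3.0e6, 1.7e5, 2.3e6, 4.2e5, 5.8e6, 1.3e5, 3.6e6],
})

# Dependent variable (target) and independent variables (features).
y = df["TAP Recipient Dollars"]
X = df.drop(columns=["TAP Recipient Dollars"])

# test_size=0.2 holds out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```

With ten rows this leaves eight for training and two for testing.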
SECOND DATA
The second data uses only TAP Sector Group and TAP Level of Study as the independent variables for making predictions.
The TAP Recipient Headcount and TAP Recipient FTEs columns were dropped.
I then split the data into 80% training data and 20% testing data with train_test_split from the scikit-learn module.
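The sketch below builds a second feature set by dropping the two numeric columns and then splits it. Reusing the same `random_state` for both splits is my assumption, not something the text states; it puts the same rows in each test set so the two feature sets are compared on identical examples. The dummy column names are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy rows; values and dummy column names are illustrative only.
df = pd.DataFrame({
    "TAP Recipient Headcount": [120, 450, 300, 800, 95, 610, 220, 1500, 70, 940],
    "TAP Recipient FTEs": [100.5, 400.0, 260.2, 710.9, 80.0, 555.5, 190.3, 1320.0, 61.7, 850.1],
    "is_community_college": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "is_4_year_program": [0, 1, 0, 1, 0, 1, 0, 1, 0, 1],
    "TAP Recipient Dollars": [2.1e5, 1.6e6, 5.9e5, 3.0e6, 1.7e5, 2.3e6, 4.2e5, 5.8e6, 1.3e5, 3.6e6],
})

y = df["TAP Recipient Dollars"]
X_first = df.drop(columns=["TAP Recipient Dollars"])
X_second = X_first.drop(columns=["TAP Recipient Headcount", "TAP Recipient FTEs"])

# Same test_size and random_state, so both splits select the same rows.
X1_train, X1_test, y1_train, y1_test = train_test_split(X_first, y, test_size=0.2, random_state=0)
X2_train, X2_test, y2_train, y2_test = train_test_split(X_second, y, test_size=0.2, random_state=0)

same_rows = X1_test.index.equals(X2_test.index)
```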
SELECTED MACHINE LEARNING ALGORITHMS FOR MAKING PREDICTIONS
1. LINEAR REGRESSION MODEL
For both the first data and the second data, I fit the linear regression model on the training data and used it to make predictions on the test data. I then computed the mean squared error for each. The mean_squared_error function from the scikit-learn module measures the amount of error in a set of predictions: the lower the mean squared error, the better the predictions, so the goal is to minimize it.
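To make the metric concrete, here is a small hand-rolled version of the mean squared error: the average of the squared differences between actual and predicted values. The helper name `mse_by_hand` and the four toy values are hypothetical.

```python
def mse_by_hand(y_true, y_pred):
    """Average of the squared differences between actual and predicted values."""
    squared_errors = [(actual - predicted) ** 2 for actual, predicted in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)

actual_dollars = [3.0, 5.0, 7.0, 9.0]
predicted_dollars = [2.0, 5.0, 9.0, 9.0]

# Squared errors are 1, 0, 4 and 0, so the mean is 5 / 4 = 1.25.
mse = mse_by_hand(actual_dollars, predicted_dollars)
```

Because the errors are squared, the result is in squared dollars, which is why the values reported for this dataset are so large.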
After checking the mean squared error on both, I observed that the first data had the lower mean squared error, 998108675805.5972, while the second data had a mean squared error of 158290233158039.97. This means that the independent variables dropped from the first data to create the second (TAP Recipient Headcount and TAP Recipient FTEs) have a large impact on the predictions, and that including them with this model gives better predictions for our dataset.
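The comparison can be illustrated on synthetic data. Everything generated below is made up (the relationship between FTEs and dollars, the noise levels, the seed), and `held_out_mse` is a hypothetical helper, so the numbers it produces say nothing about the real dataset. It only shows the mechanics of fitting `LinearRegression` on two feature sets and comparing test-set errors.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: dollars depend mostly on FTEs (an assumption).
rng = np.random.default_rng(0)
n = 200
headcount = rng.integers(100, 5000, size=n)
fte = 0.8 * headcount + rng.normal(0, 20, size=n)
is_community = rng.integers(0, 2, size=n)
dollars = 3000 * fte + rng.normal(0, 50_000, size=n)

X_first = np.column_stack([headcount, fte, is_community])  # with the numeric columns
X_second = is_community.reshape(-1, 1)                      # dummy variable only

def held_out_mse(X, y):
    """Fit on 80% of the rows and return the mean squared error on the other 20%."""
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    model = LinearRegression().fit(X_train, y_train)
    return mean_squared_error(y_test, model.predict(X_test))

mse_first = held_out_mse(X_first, dollars)
mse_second = held_out_mse(X_second, dollars)
```

On this synthetic data the feature set that includes headcount and FTEs gives the lower error, which mirrors the pattern reported above.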
2. DECISION TREE MODEL
After applying the Decision Tree model and checking the mean squared error on both data, I observed that the lowest mean squared error for the first data was 1442856254433.26, at a maximum depth of 5, while the lowest for the second data was higher, 24757505151600.73, also at a maximum depth of 5.
The graph below shows that the first data has the lower mean squared error when this model is applied. This means that the independent variables dropped from the first data to create the second (TAP Recipient Headcount and TAP Recipient FTEs) have a large impact on the predictions, and that including them with this model gives better predictions for our dataset.
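A search over maximum depths might be sketched as follows. The synthetic data, the depth range of 1 to 10, and the fixed seeds are assumptions; the text reports only the best depth, not how the search was run.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(0)
n = 200
headcount = rng.integers(100, 5000, size=n)
fte = 0.8 * headcount + rng.normal(0, 20, size=n)
is_community = rng.integers(0, 2, size=n)
dollars = 3000 * fte + rng.normal(0, 50_000, size=n)

X = np.column_stack([headcount, fte, is_community])
X_train, X_test, y_train, y_test = train_test_split(X, dollars, test_size=0.2, random_state=0)

# Fit one tree per candidate depth and record its test-set error.
mse_by_depth = {}
for depth in range(1, 11):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    mse_by_depth[depth] = mean_squared_error(y_test, tree.predict(X_test))

# The depth with the lowest recorded error.
best_depth = min(mse_by_depth, key=mse_by_depth.get)
```

Plotting `mse_by_depth` against depth gives the kind of error curve the graph in this section describes.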
3. K-NEIGHBORS MODEL
After applying the K-Neighbors model and checking the mean squared error on both data, I observed that the lowest mean squared error for the first data was 1345078792233.6118, with k = 3, while the lowest for the second data was higher, 23850344688776.293, with k = 6.
The graph below shows that the first data has the lower mean squared error when this model is applied. This means that the independent variables dropped from the first data to create the second (TAP Recipient Headcount and TAP Recipient FTEs) have a large impact on the predictions, and that including them with this model gives better predictions for our dataset.
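Choosing k can be sketched the same way with `KNeighborsRegressor`. The synthetic data and the range of k values tried (1 to 10) are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (illustrative only).
rng = np.random.default_rng(0)
n = 200
headcount = rng.integers(100, 5000, size=n)
fte = 0.8 * headcount + rng.normal(0, 20, size=n)
is_community = rng.integers(0, 2, size=n)
dollars = 3000 * fte + rng.normal(0, 50_000, size=n)

X = np.column_stack([headcount, fte, is_community])
X_train, X_test, y_train, y_test = train_test_split(X, dollars, test_size=0.2, random_state=0)

# Fit one model per candidate k and record its test-set error.
mse_by_k = {}
for k in range(1, 11):
    knn = KNeighborsRegressor(n_neighbors=k)
    knn.fit(X_train, y_train)
    mse_by_k[k] = mean_squared_error(y_test, knn.predict(X_test))

# The k with the lowest recorded error.
best_k = min(mse_by_k, key=mse_by_k.get)
```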
BEST REGRESSION MODEL FOR OUR DATA
Looking at all the models used, we can conclude that the linear regression model applied to the first data is the best choice for making our predictions, because its mean squared error (998108675805.5972) is lower than that of the K-Neighbors model (1345078792233.6118) and the Decision Tree model (1442856254433.26) on the same data.
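The reported errors can be compared side by side with a few lines of Python. The dictionary below only restates the values given in the sections above. Taking the square root to get a root mean squared error in dollars is an extra step I added for readability, not something the original analysis reports.

```python
import math

# Mean squared errors as reported in the sections above.
reported_mse = {
    "first data": {
        "linear regression": 998108675805.5972,
        "decision tree": 1442856254433.26,
        "k-neighbors": 1345078792233.6118,
    },
    "second data": {
        "linear regression": 158290233158039.97,
        "decision tree": 24757505151600.73,
        "k-neighbors": 23850344688776.293,
    },
}

# Model with the lowest reported error for each feature set.
best_model = {name: min(scores, key=scores.get) for name, scores in reported_mse.items()}

# Root mean squared error, expressed in dollars rather than squared dollars.
rmse_dollars = {
    name: {model: math.sqrt(mse) for model, mse in scores.items()}
    for name, scores in reported_mse.items()
}
```

On the first data the lowest value belongs to linear regression, with a root mean squared error of roughly one million dollars. On the second data the lowest reported value belongs to the K-Neighbors model.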
CLASSIFICATION TASK
A classification task is one in which we try to predict a qualitative (categorical) value, for example whether a person was diagnosed with an illness or not, or whether a person is pregnant or not.
In our case, we are going to predict whether a school is a CUNY Community College or a CUNY Senior College by fitting a Logistic Regression model with TAP Recipient Headcount, TAP Recipient FTEs and TAP Recipient Dollars as the independent variables.
After applying and evaluating the Logistic Regression model, we achieved 83% accuracy in determining whether a school is a Senior College or a Community College.
In our dataset, 0 indicates that a school is a Senior College and 1 indicates that the school is a Community College.
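A minimal sketch of this classification setup with scikit-learn. The synthetic data is made up, and the `StandardScaler` step is my addition (the three features sit on very different scales); the text does not say whether the features were scaled, so the accuracy here is unrelated to the 83% reported above.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 0 = Senior College, 1 = Community College.
rng = np.random.default_rng(0)
n = 200
is_community = rng.integers(0, 2, size=n)
headcount = rng.integers(100, 3000, size=n) + 1500 * is_community  # assumed relationship
fte = 0.8 * headcount + rng.normal(0, 20, size=n)
dollars = 3000 * fte + rng.normal(0, 50_000, size=n)

X = np.column_stack([headcount, fte, dollars])
X_train, X_test, y_train, y_test = train_test_split(
    X, is_community, test_size=0.2, random_state=0
)

# Scale the features (an added assumption), then fit the logistic regression.
classifier = make_pipeline(StandardScaler(), LogisticRegression())
classifier.fit(X_train, y_train)

predictions = classifier.predict(X_test)
accuracy = accuracy_score(y_test, predictions)  # share of test rows classified correctly
```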
This graph indicates that CUNY Community Colleges have a higher TAP Recipient Headcount than the CUNY Senior Colleges. This means that more students at the Community Colleges than at the Senior Colleges received TAP grants from 2010 to 2019.