CMP 485 Techniques In Data Science Project

New York State Tuition Assistance Program (TAP) Data Analysis

DATA DESCRIPTION

I chose this data because I am interested in how much each college in New York, especially the CUNY colleges, was awarded through the Tuition Assistance Program (TAP) over the years. I found the data on data.ny.gov (data.ny.gov/Education/Tuition-Assistance-Program-TAP-Recipients-Dollars-/ich7-7ewa); its owner is NY Open Data.

The Tuition Assistance Program (TAP), New York's largest student financial aid grant program, helps eligible New York residents attending in-state postsecondary institutions pay for tuition. TAP grants are based on the applicant’s and his or her family’s New York State taxable income. This data includes TAP award recipients and dollar amounts by college, sector groups, and Level of Study for academic years 2000-2019.

The notebook for this project can be found at github.com/agyapongeli77/data-science-project

DATA ASSESSMENT

Structure of the Data

This data is rectangular and stored in CSV format. It contains 3 quantitative columns (two continuous and one discrete) and 5 qualitative columns (all nominal).

Granularity of the Data

Each row in the raw dataset represents one academic year in which a college received TAP grants for a given level of study.

Scope & Completeness of the Data

The raw dataset contains CUNY Colleges, SUNY Colleges, Independent Colleges, Business Degree Granting Institutions, Non-Degree Business Schools, Chapter XXII TAP Schools and All Other Institutions. However, I will be analyzing only the TAP grants awarded to CUNY Colleges, so the scope of the raw dataset is broader than the analysis requires.

Temporality of the Data

This dataset was created in April 2013 and last updated on January 26, 2021. Its data points cover the academic years 2000 to 2019, but I will be analyzing only the 2010 to 2019 academic years.

Faithfulness of the Data

There is no missing data; however, there are erroneous values in the dataset. Every CUNY Community College except CUNY Stella & Charles Guttman Community College is listed as receiving TAP awards for 4 or 5 year degree programs. I believe this is an error because community colleges offer only 2 year programs, not 4 or 5 year ones.

DATA CLEANING

I filtered the dataset to include TAP grants for only CUNY Colleges in the 2010-2019 academic years and dropped the columns (i.e. Sector Type, Level, TAP College Code and Federal School Code) that are not useful for this analysis.

I dropped the rows in which CUNY Community Colleges are listed with 4 or 5 year degree programs. As noted above, I treat these as errors because community colleges offer 2 year degree programs, not 4 or 5 year ones.

I also removed the rows with the College Names 'CUNY GRAD SCH UNDERGRAD PROG' and 'CUNY GRAD CTR-SCHOL OF LABOR UG' because I am analyzing only the TAP grants awarded to CUNY Senior Colleges and Community Colleges.

I also sorted the data frame by TAP College Name and Academic Year, reset its index, and reordered the columns so that TAP College Name comes first.
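The cleaning steps above can be sketched with pandas as follows. This is a minimal illustration on made-up rows, not the notebook's actual code: the column names come from this report, but the Sector Group labels ("CUNY SENIOR", "CUNY COMMUNITY", "SUNY"), the Level of Study labels ("2 yr", "4 yr") and the helper name clean_tap are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical category labels; the real dataset may spell these differently.
SENIOR, COMMUNITY, TWO_YEAR = "CUNY SENIOR", "CUNY COMMUNITY", "2 yr"
EXCLUDED = ["CUNY GRAD SCH UNDERGRAD PROG", "CUNY GRAD CTR-SCHOL OF LABOR UG"]
UNUSED = ["Sector Type", "Level", "TAP College Code", "Federal School Code"]

# Made-up rows standing in for the raw CSV.
raw = pd.DataFrame({
    "Academic Year": [2009, 2015, 2015, 2015, 2012, 2016, 2011],
    "TAP College Name": [
        "CUNY HUNTER COLLEGE", "CUNY HUNTER COLLEGE", "CUNY BRONX CC",
        "CUNY BRONX CC", "CUNY GRAD SCH UNDERGRAD PROG", "SUNY ALBANY",
        "CUNY BRONX CC",
    ],
    "TAP Sector Group": [SENIOR, SENIOR, COMMUNITY, COMMUNITY, SENIOR, "SUNY", COMMUNITY],
    "TAP Level of Study": ["4 yr", "4 yr", "4 yr", "2 yr", "4 yr", "4 yr", "2 yr"],
    "Sector Type": ["PUBLIC"] * 7,
    "TAP Recipient Dollars": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})


def clean_tap(df):
    """Apply the filtering, dropping, sorting and reordering steps in order."""
    # Keep only CUNY rows from the 2010 to 2019 academic years.
    is_cuny = df["TAP Sector Group"].isin([SENIOR, COMMUNITY])
    in_years = df["Academic Year"].between(2010, 2019)
    out = df[is_cuny & in_years]

    # Drop community-college rows that claim a 4 or 5 year program.
    bad_cc = (out["TAP Sector Group"] == COMMUNITY) & (out["TAP Level of Study"] != TWO_YEAR)
    out = out[~bad_cc]

    # Drop the two graduate-school undergraduate programs.
    out = out[~out["TAP College Name"].isin(EXCLUDED)]

    # Drop unused columns (ignored if a column is absent from the toy data).
    out = out.drop(columns=UNUSED, errors="ignore")

    # Sort, reset the index, and move the college name to the front.
    out = out.sort_values(["TAP College Name", "Academic Year"]).reset_index(drop=True)
    first = ["TAP College Name"]
    return out[first + [c for c in out.columns if c not in first]]


cleaned = clean_tap(raw)
print(cleaned)
```

Writing the steps as one function keeps the order of the filters explicit, which matters because the community-college check is only meaningful after the CUNY filter has been applied.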

SINGLE VARIABLE DISTRIBUTION PLOTS

Bar Charts

This bar chart shows the number of times each CUNY College received a TAP grant over a period of 10 years (2010 to 2019 academic years). Medgar Evers College and New York City College of Technology received TAP grants for all 3 of their degree program types (i.e. 2 yr Associate Degree, 4 yr Bachelor's and 5 yr Combined Bachelor's/Master's program of study) in every year of the 10 year period.

College of Staten Island and John Jay College did not receive TAP grants for all 3 of the program types they offer in every year of the 10 year period. The colleges from Lehman College through Brooklyn College on the chart offer only 2 programs (i.e. 4 yr and 5 yr) and received TAP grants for all or some of those programs during the 10 year period.

All the community colleges received TAP grants for all 10 years except CUNY Stella & Charles Guttman Community College, because the school was founded in 2011 and started receiving TAP in 2012.

The counts are higher for the CUNY Senior Colleges because there are more senior colleges than community colleges.

In addition, senior colleges offer a 4 year program of study together with a 5 year or 2 year program, so they appear more than once per academic year.

Histograms

The distribution peaks at 1000 or fewer students receiving TAP, followed by the range of 5000 to 6500 students.

This means that in most cases a school had 1000 or fewer students in a given degree program (i.e. 2 yr Associate Degree, 4 yr Bachelor's or 5 yr Combined Bachelor's/Master's program of study) receiving TAP, and the next most common case was between 5000 and 6500 students in a degree program.

The distribution of TAP dollars peaks at $2.5 million or less, with the second highest bar at $14 million to $16 million. According to my analysis, only one school (CUNY Manhattan College) had more than $30 million in TAP grants, in 3 consecutive years (2015 to 2017 academic years). This is because it had the highest number of students and the highest number of full-time enrollments receiving those awards (over 11000 students with over 8000 full-time enrollments getting the TAP awards).

DOUBLE VARIABLE DISTRIBUTION PLOTS

Line Plot

We see from the line plot here that more TAP dollars were awarded to the schools in 2015 than any other year.

Also, CUNY Community Colleges received more TAP award dollars than the CUNY Senior Colleges even though there are more senior colleges than community colleges.
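The yearly totals behind a line plot like this one can be sketched with a pandas groupby. The rows, dollar values and Sector Group labels below are made up for illustration, so the printed totals say nothing about the real data.

```python
import pandas as pd

# Made-up rows; the sector labels are hypothetical stand-ins.
toy = pd.DataFrame({
    "Academic Year": [2014, 2014, 2015, 2015, 2015],
    "TAP Sector Group": [
        "CUNY SENIOR", "CUNY COMMUNITY",
        "CUNY SENIOR", "CUNY COMMUNITY", "CUNY COMMUNITY",
    ],
    "TAP Recipient Dollars": [10.0, 12.0, 11.0, 9.0, 6.0],
})

# One row per year, one column per sector, summed dollars in the cells.
totals = (
    toy.groupby(["Academic Year", "TAP Sector Group"])["TAP Recipient Dollars"]
    .sum()
    .unstack()
)

# The year with the largest combined total across both sectors.
peak_year = totals.sum(axis=1).idxmax()
print(totals)
print("Peak year:", peak_year)
```

With matplotlib installed, calling totals.plot() on a table shaped like this draws one line per sector against the academic year.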

Bar Chart

We see on this bar chart that many of the CUNY Community Colleges had more students receiving TAP grants than students in the Senior Colleges, with CUNY Manhattan College leading the charge. This helps to explain the above line plot where community colleges received more TAP dollars than the senior colleges.

DUMMY VARIABLES

I created dummy variables by converting 2 qualitative columns into quantitative variables. The columns that I converted are TAP Sector Group (which determines whether the CUNY college is a community college or a senior college) and TAP Level of Study (which determines whether the program of study is a 2 year program (Associate Degree), a 4 year program (Bachelor's Degree) or a 5 year program (Combined Bachelor's/Master's Degree)).
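As a rough sketch of this step, pandas can produce the dummy columns with get_dummies. The category labels and dollar values here are hypothetical; only the two column names are taken from this report.

```python
import pandas as pd

# Made-up rows with hypothetical category labels.
df = pd.DataFrame({
    "TAP Sector Group": ["CUNY SENIOR", "CUNY COMMUNITY", "CUNY SENIOR"],
    "TAP Level of Study": ["4 yr", "2 yr", "5 yr"],
    "TAP Recipient Dollars": [1_000_000, 2_000_000, 150_000],
})

# Each category becomes its own 0/1 column; other columns pass through unchanged.
encoded = pd.get_dummies(
    df,
    columns=["TAP Sector Group", "TAP Level of Study"],
    dtype=int,
)
print(encoded.columns.tolist())
```

For a linear model, passing drop_first=True is a common variation that removes one redundant column per category.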

DEPENDENT VARIABLE & INDEPENDENT VARIABLES

To successfully apply any machine learning model to the data and make predictions, it's important that the dataset is separated into independent variables and a dependent variable. The dependent variable is the column we are trying to predict, and the independent variables are the columns in the dataset that will be used to predict the dependent variable.

In this dataset, I am trying to predict TAP Recipient Dollars, which is the amount of New York State Tuition Assistance Program (TAP) grants the CUNY Colleges were awarded, so our dependent variable is TAP Recipient Dollars. Each of the machine learning models will predict this value so that we can determine which one makes the best predictions.

FIRST DATA

The first data uses all of the following as the independent variables for making the predictions: TAP Recipient Headcount (number of recipients as measured by students receiving at least one term award during the academic year), TAP Recipient FTEs (number of recipients as measured by academic year Full-Time Equivalents), TAP Sector Group (which determines whether the CUNY college is a community college or a senior college) and TAP Level of Study (which determines whether the program of study is a 2 year program (Associate Degree), a 4 year program (Bachelor's Degree) or a 5 year program (Combined Bachelor's/Master's Degree)).

Then I split the data into 80% training data and 20% testing data with train_test_split from the scikit-learn library.

SECOND DATA

The second data has only TAP Sector Group and TAP Level of Study as the independent variables that we will use to make the predictions.

I dropped the TAP Recipient Headcount and TAP Recipient FTEs columns.

Then I split the data into 80% training data and 20% testing data with train_test_split from the scikit-learn library.
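A minimal sketch of building the two feature sets and splitting each 80/20 is shown below. The data is randomly generated, the single is_community column is a simplified stand-in for the dummy variables, and random_state=42 is an arbitrary choice made so the example is repeatable; none of these come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cleaned data (values are not real).
rng = np.random.default_rng(0)
n = 50
headcount = rng.integers(100, 10000, size=n)
toy = pd.DataFrame({
    "TAP Recipient Headcount": headcount,
    "TAP Recipient FTEs": headcount * 0.8,
    "is_community": rng.integers(0, 2, size=n),  # hypothetical dummy column
    "TAP Recipient Dollars": headcount * 3000.0,
})

# Dependent variable.
y = toy["TAP Recipient Dollars"]

# First data: every remaining column. Second data: dummy columns only.
X_first = toy.drop(columns=["TAP Recipient Dollars"])
X_second = X_first.drop(columns=["TAP Recipient Headcount", "TAP Recipient FTEs"])

# 80% training, 20% testing for each feature set.
X1_train, X1_test, y1_train, y1_test = train_test_split(
    X_first, y, test_size=0.2, random_state=42
)
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X_second, y, test_size=0.2, random_state=42
)
print(len(X1_train), len(X1_test))
```

Using the same random_state for both splits puts the same rows in each test set, so the two feature sets are compared on identical examples.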

SELECTED MACHINE LEARNING ALGORITHMS FOR MAKING PREDICTIONS

1. LINEAR REGRESSION MODEL

For both the first data and the second data, I fitted the linear regression model on the training data and used it to make predictions on the test data, after which I computed the mean squared error for each. The mean_squared_error function from the scikit-learn library measures the amount of error in a set of predictions, so the lower the mean squared error, the better the predictions.

After checking the mean squared error on both data, I observed that the first data had the lower mean squared error, 998108675805.5972, while the second data had a mean squared error of 158290233158039.97. This means that the independent variables dropped from the first data to form the second data (TAP Recipient Headcount and TAP Recipient FTEs) have a large impact on the predictions, and that including them with this model gives the best predictions for our dataset.
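The fit, predict and score steps can be sketched as below on synthetic data. The relationship between headcount and dollars, the noise level and the random seeds are all assumptions made for the example, so the printed error has no connection to the figures reported above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: dollars grow roughly linearly with headcount, plus noise.
rng = np.random.default_rng(0)
headcount = rng.integers(100, 10000, size=200).astype(float)
X = headcount.reshape(-1, 1)
y = 3000.0 * headcount + rng.normal(0, 50000, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit on the training data, predict on the held-out test data.
model = LinearRegression().fit(X_train, y_train)
pred = model.predict(X_test)

# Library MSE and the same quantity computed by hand.
mse = mean_squared_error(y_test, pred)
manual_mse = float(np.mean((y_test - pred) ** 2))

# The square root puts the error back in dollars.
rmse = float(np.sqrt(mse))
print(mse, manual_mse, rmse)
```

Because the mean squared error is measured in squared dollars, its values look very large; taking the square root gives an error in dollars that is easier to compare with the size of the grants.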

2. DECISION TREE MODEL

After applying the Decision Tree model and checking the mean squared error on both data, I observed that the lowest mean squared error for the first data was 1442856254433.26 with a maximum depth of 5, while the second data had a higher mean squared error of 24757505151600.73, also with a maximum depth of 5.

The graph below shows that the first data has the lower mean squared error when this model is applied. This means that the independent variables dropped from the first data to form the second data (TAP Recipient Headcount and TAP Recipient FTEs) have a large impact on the predictions, and that including them with this model gives the best predictions for our dataset.
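One way to arrive at a best maximum depth is to fit a tree at each depth and keep the one with the lowest test error. The sketch below does this on synthetic data; the depth range of 1 to 10 and the data itself are assumptions, and this report does not state how the depth of 5 was actually selected.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (values are not real).
rng = np.random.default_rng(1)
headcount = rng.integers(100, 10000, size=300).astype(float)
X = headcount.reshape(-1, 1)
y = 3000.0 * headcount + rng.normal(0, 500000, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit one tree per candidate depth and record its test-set MSE.
errors = {}
for depth in range(1, 11):
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    errors[depth] = mean_squared_error(y_test, tree.predict(X_test))

# Depth with the smallest recorded error.
best_depth = min(errors, key=errors.get)
print(best_depth, errors[best_depth])
```

Choosing the depth on the test set tends to make the reported error look better than it would on new data; cross-validation on the training data is a common alternative.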

3. K-NEIGHBORS MODEL

After applying the K-Neighbors model and checking the mean squared error on both data, I observed that the first data had the lower mean squared error of 1345078792233.6118 with k as 3, while the second data had a higher mean squared error of 23850344688776.293 with k as 6.
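A search over k can be sketched as follows with scikit-learn's KNeighborsRegressor. The data is synthetic and the range of k is arbitrary. The StandardScaler step is an assumption added because K-Neighbors relies on distances between rows, and this report does not say whether the features were scaled.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with two features on very different scales.
rng = np.random.default_rng(2)
n = 300
headcount = rng.integers(100, 10000, size=n).astype(float)
is_community = rng.integers(0, 2, size=n).astype(float)
X = np.column_stack([headcount, is_community])
y = 3000.0 * headcount + 200000.0 * is_community + rng.normal(0, 300000, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit one model per k and record its test-set MSE.
knn_errors = {}
for k in range(1, 11):
    knn = make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=k))
    knn.fit(X_train, y_train)
    knn_errors[k] = mean_squared_error(y_test, knn.predict(X_test))

# k with the smallest recorded error.
best_k = min(knn_errors, key=knn_errors.get)
print(best_k, knn_errors[best_k])
```

Without scaling, a column measured in thousands of students would dominate the distance calculation and the 0/1 dummy columns would have almost no influence on which neighbors are chosen.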

BEST REGRESSION MODEL FOR OUR DATA

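Looking at all the models used, we can conclude that the linear regression model fitted to the first data is the best choice for making our predictions because it has the lowest mean squared error (998108675805.5972) of all the model and data combinations. On the second data, linear regression has the highest mean squared error of the three models, so this conclusion depends on keeping TAP Recipient Headcount and TAP Recipient FTEs as independent variables.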

CLASSIFICATION TASK

A classification task is one where we try to predict a qualitative or categorical value, for example whether a person was diagnosed with an illness or not, whether a person is pregnant or not, and so on.

In our case, we are going to predict whether a school is a CUNY Community College or a CUNY Senior College by fitting a Logistic Regression Model with TAP Recipient Headcount, TAP Recipient FTEs and TAP Recipient Dollars as the independent variables.

After applying the Logistic Regression Model and evaluating it, we achieve an accuracy of 83% in determining whether a school is a Senior College or a Community College.

In our dataset, 0 indicates that a school is a Senior College and 1 indicates that the school is a Community College.

This graph indicates that CUNY Community Colleges had a higher TAP Recipient Headcount than the CUNY Senior Colleges. This means that more students from the Community Colleges than from the Senior Colleges received TAP grants from 2010 to 2019.
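The classification step can be sketched as below, using the same 0 (Senior College) and 1 (Community College) encoding. The data is synthetic, so the printed accuracy is unrelated to the 83% reported above. The StandardScaler step and the 80/20 split are assumptions added for the example, since this report does not describe how the classifier's inputs were prepared.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: community colleges (label 1) get a higher headcount on average.
rng = np.random.default_rng(0)
n = 200
is_community = rng.integers(0, 2, size=n)
headcount = rng.normal(3000, 800, size=n) + 1500 * is_community
ftes = 0.8 * headcount
dollars = 3000.0 * ftes + rng.normal(0, 200000, size=n)

# Independent variables: headcount, FTEs, dollars. Dependent variable: the 0/1 label.
X = np.column_stack([headcount, ftes, dollars])
y = is_community

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the features, then fit the logistic regression classifier.
clf = make_pipeline(StandardScaler(), LogisticRegression())
clf.fit(X_train, y_train)

# Accuracy is the share of test rows whose predicted label matches the true one.
pred = clf.predict(X_test)
accuracy = accuracy_score(y_test, pred)
print(accuracy)
```

Scaling is included because the dollar column is several orders of magnitude larger than the headcount column, which can slow or destabilize the solver when the raw values are used.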
