1) Analysis of E-commerce company data using Linear Regression:
In this project it was assumed that we had contract work with an e-commerce company based in New York City that sells clothing online and also offers in-store style and clothing advice sessions. Customers come into the store, have sessions/meetings with a personal stylist, and can then go home and order the clothes they want on either the mobile app or the website. The (fictitious) company was trying to decide whether to focus its efforts on the mobile app experience or the website. This project's main goal was to help them figure that out.
The Ecommerce Customers.csv file was used as the data obtained from the company. It contained customer information such as Email, Address, and Avatar color, along with the following numerical columns:
Avg. Session Length: Average length of in-store style advice sessions.
Time on App: Average time spent on the app, in minutes.
Time on Website: Average time spent on the website, in minutes.
Length of Membership: How many years the customer has been a member.
The first step of this project was Exploratory Data Analysis (EDA). We then split our data: 70% of the data went into the training set and 30% into the test set. Next, a linear regression model was created using the training data set and used to predict the yearly amount spent by each customer. The model was then evaluated by comparing the true labels with the predicted values:
Mean Absolute Error: 7.22814865343
Mean Squared Error: 79.813051651
Root Mean Squared Error: 8.93381506698
Finally, the model coefficients were examined for insights on the original question of whether the company should focus on its mobile app or its website; the final decision also depends on other factors within the company.
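A minimal scikit-learn sketch of this workflow is shown below. The feature column names follow the description above; the target column name 'Yearly Amount Spent', the exact file name, and the random_state value are assumptions for illustration.

    import numpy as np
    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LinearRegression
    from sklearn import metrics

    # Load the customer data (file name as described above)
    customers = pd.read_csv('Ecommerce Customers.csv')

    # Numerical features; the target column name is an assumption
    X = customers[['Avg. Session Length', 'Time on App',
                   'Time on Website', 'Length of Membership']]
    y = customers['Yearly Amount Spent']

    # 70/30 train/test split, as described above
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=101)

    # Fit a linear regression model on the training set
    lm = LinearRegression()
    lm.fit(X_train, y_train)

    # Predict on the test set and compute the three error metrics
    predictions = lm.predict(X_test)
    print('MAE :', metrics.mean_absolute_error(y_test, predictions))
    print('MSE :', metrics.mean_squared_error(y_test, predictions))
    print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, predictions)))

    # Inspect coefficients to compare Time on App vs. Time on Website
    print(pd.DataFrame(lm.coef_, X.columns, columns=['Coefficient']))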
2) Predicting ad clicks using the Logistic Regression technique:
In this project we worked with a fake advertising data set indicating whether or not a particular internet user clicked on an advertisement on a company website. We then split our data: 70% of the data was in the training set and 30% was in the test set. Using the training data set, we created a model that predicted whether or not a user would click on an ad based on the features of that user.
This data set contained the following features:
'Daily Time Spent on Site': consumer time on-site in minutes
'Age': customer age in years
'Area Income': Avg. Income of geographical area of consumer
'Daily Internet Usage': Avg. minutes a day consumer is on the internet
'Ad Topic Line': Headline of the advertisement
'City': City of consumer
'Male': Whether or not consumer was male
'Country': Country of consumer
'Timestamp': Time at which consumer clicked on Ad or closed window
'Clicked on Ad': 0 or 1, indicating whether the consumer clicked on the ad
For this project, we first did some exploratory data analysis to explore and visualize the data features. Then the logistic regression model was created and evaluated based on its precision and recall. We also generated the classification report based on the true labels and predicted labels.
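The sketch below shows one way this could be done with scikit-learn's LogisticRegression; the file name, the exact feature subset, and the random_state value are assumptions.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import classification_report, confusion_matrix

    # Load the advertising data (file name assumed)
    ad_data = pd.read_csv('advertising.csv')

    # Use the numerical features to predict the 'Clicked on Ad' label
    X = ad_data[['Daily Time Spent on Site', 'Age', 'Area Income',
                 'Daily Internet Usage', 'Male']]
    y = ad_data['Clicked on Ad']

    # 70/30 train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42)

    # Fit the logistic regression model on the training data
    logmodel = LogisticRegression()
    logmodel.fit(X_train, y_train)

    # Evaluate precision and recall on the held-out test set
    predictions = logmodel.predict(X_test)
    print(classification_report(y_test, predictions))
    print(confusion_matrix(y_test, predictions))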
3) K Nearest Neighbors Project:
In this project we explored an anonymized data set in which the field names were hidden, so we standardized the data using the StandardScaler class from scikit-learn. The scaler was fit to the features so that each one was rescaled to zero mean and unit standard deviation. We then split our data: 70% of the data was in the training set and 30% was in the test set. We applied the K Nearest Neighbors algorithm and used the elbow method (plotting the error rate against a range of K values) to choose a suitable K. We evaluated our model both before and after finding an optimal K value, generating the classification report based on the true labels and predicted labels.
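A rough sketch of the scaling, elbow search, and final fit is given below; the file name, the 'TARGET CLASS' column name, and the K range of 1–39 are assumptions made for illustration.

    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import classification_report

    # Load the anonymized data set (file name and label column assumed)
    df = pd.read_csv('Classified Data.csv')

    # Standardize the features to zero mean and unit variance
    scaler = StandardScaler()
    scaler.fit(df.drop('TARGET CLASS', axis=1))
    scaled_features = scaler.transform(df.drop('TARGET CLASS', axis=1))

    # 70/30 train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        scaled_features, df['TARGET CLASS'], test_size=0.3, random_state=101)

    # Elbow method: record the test error rate for a range of K values
    error_rate = []
    for k in range(1, 40):
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)
        pred_k = knn.predict(X_test)
        error_rate.append(np.mean(pred_k != y_test))

    # Refit with the K that minimizes the error rate and evaluate
    best_k = int(np.argmin(error_rate)) + 1
    knn = KNeighborsClassifier(n_neighbors=best_k)
    knn.fit(X_train, y_train)
    print(classification_report(y_test, knn.predict(X_test)))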
4) Random Forest Project:
For this project we explored publicly available data from LendingClub.com. Lending Club connects people who need money (borrowers) with people who have money (investors). Hopefully, as investors we would want to invest in people who showed a profile of having a high probability of paying us back. We tried to create a model that would help predict this. We used lending data from 2007-2010 and tried to classify and predict whether or not the borrower paid back their loan in full.
Here are what the columns represent:
credit.policy: 1 if the customer meets the credit underwriting criteria of LendingClub.com, and 0 otherwise.
purpose: The purpose of the loan (takes values "credit_card", "debt_consolidation", "educational", "major_purchase", "small_business", and "all_other").
int.rate: The interest rate of the loan, as a proportion (a rate of 11% would be stored as 0.11). Borrowers judged by LendingClub.com to be more risky are assigned higher interest rates.
installment: The monthly installments owed by the borrower if the loan is funded.
log.annual.inc: The natural log of the self-reported annual income of the borrower.
dti: The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico: The FICO credit score of the borrower.
days.with.cr.line: The number of days the borrower has had a credit line.
revol.bal: The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
revol.util: The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq.last.6mths: The borrower's number of inquiries by creditors in the last 6 months.
delinq.2yrs: The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub.rec: The borrower's number of derogatory public records (bankruptcy filings, tax liens, or judgments).
We first did some exploratory data analysis and then transformed the one categorical feature (purpose) into dummy variables. We then split our data: 70% of the data was in the training set and 30% was in the test set. After that, we built a decision tree classifier model using the training data set and evaluated it against the true labels and predicted labels by generating the classification report, which contained the precision and recall of the model. Similarly, we trained a random forest classifier model to predict the labels and evaluated it based on the classification report and confusion matrix.
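The following sketch outlines the dummy-variable step and the two classifiers in scikit-learn; the file name, the 'not.fully.paid' label column, and the n_estimators value are assumptions.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report, confusion_matrix

    # Load the LendingClub data (file name assumed)
    loans = pd.read_csv('loan_data.csv')

    # Turn the categorical 'purpose' column into dummy variables
    final_data = pd.get_dummies(loans, columns=['purpose'], drop_first=True)

    # Label column name is an assumption (1 = loan not fully paid back)
    X = final_data.drop('not.fully.paid', axis=1)
    y = final_data['not.fully.paid']

    # 70/30 train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=101)

    # Single decision tree baseline
    dtree = DecisionTreeClassifier()
    dtree.fit(X_train, y_train)
    print(classification_report(y_test, dtree.predict(X_test)))

    # Random forest classifier
    rfc = RandomForestClassifier(n_estimators=300)
    rfc.fit(X_train, y_train)
    rfc_pred = rfc.predict(X_test)
    print(classification_report(y_test, rfc_pred))
    print(confusion_matrix(y_test, rfc_pred))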
5) Analysis of Iris dataset using Support Vector Machine (SVM) algorithm:
For this project we analyzed the famous Iris flower data set. The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by Sir Ronald Fisher in 1936 as an example of discriminant analysis. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica, and Iris versicolor), for 150 samples in total. Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. The Python data-visualization library Seaborn was used to analyze the Iris data set for this project.
In the first phase of the project we did exploratory data analysis, and we then split our data: 70% of the data was in the training set and 30% was in the test set. Next, we developed a support vector machine classifier using the training data set with the default parameters of the SVC class in Python's sklearn module to distinguish the different iris species. We then evaluated our model by generating the classification report based on the true labels and predicted labels. Finally, we used the Grid Search method to tune the parameters of the SVC class to improve the performance of our model and generated the classification report and confusion matrix to evaluate the tuned model.
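A condensed sketch of the default SVC fit and the grid search is shown below; the iris data is loaded through seaborn for convenience, and the grid of C and gamma values is illustrative rather than the exact grid used.

    import seaborn as sns
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.svm import SVC
    from sklearn.metrics import classification_report, confusion_matrix

    # Load the iris data set bundled with seaborn
    iris = sns.load_dataset('iris')
    X = iris.drop('species', axis=1)
    y = iris['species']

    # 70/30 train/test split
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=101)

    # SVC with default parameters
    svc = SVC()
    svc.fit(X_train, y_train)
    print(classification_report(y_test, svc.predict(X_test)))

    # Grid search over C and gamma to tune the model (grid values illustrative)
    param_grid = {'C': [0.1, 1, 10, 100], 'gamma': [1, 0.1, 0.01, 0.001]}
    grid = GridSearchCV(SVC(), param_grid)
    grid.fit(X_train, y_train)
    grid_pred = grid.predict(X_test)
    print(classification_report(y_test, grid_pred))
    print(confusion_matrix(y_test, grid_pred))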
6) Clustering a university-information dataset using the K Means algorithm:
For this project we attempted to use K Means clustering to cluster universities into two groups, private and public. We actually had the labels for this data set, but we did not use them for the K Means clustering algorithm, since it is an unsupervised learning algorithm. Under normal circumstances we would not have labels when using K Means. In this case we used the labels only to get an idea of how well the algorithm performed, showing the classification report and confusion matrix at the end of the project.
A data frame with 777 observations on the following 18 variables was used for this project:
Private: A factor with levels No and Yes indicating private or public university
Apps: Number of applications received
Accept: Number of applications accepted
Enroll: Number of new students enrolled
Top10perc: Pct. new students from top 10% of H.S. class
Top25perc: Pct. new students from top 25% of H.S. class
F.Undergrad: Number of full-time undergraduates
P.Undergrad: Number of part-time undergraduates
Outstate: Out-of-state tuition
Room.Board: Room and board costs
Books: Estimated book costs
Personal: Estimated personal spending
PhD: Pct. of faculty with Ph.D.'s
Terminal: Pct. of faculty with terminal degree
S.F.Ratio: Student/faculty ratio
perc.alumni: Pct. alumni who donate
Expend: Instructional expenditure per student
Grad.Rate: Graduation rate
In the first phase of this project, we did some exploratory data analysis and visualization of the data frame using pandas, NumPy, and seaborn. Then we clustered the data into two clusters and predicted the labels of each observation. Because we had the original labels for this exercise, we took advantage of them to evaluate our K Means model by generating the classification report and confusion matrix, showing how well the clustering worked without ever being given the labels. The takeaway of this project was to see how K Means can be used to cluster unlabeled data.
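Below is a minimal sketch of the clustering and its evaluation; the file name 'College_Data.csv' is an assumption. Note that K Means assigns its cluster indices 0 and 1 arbitrarily, so the report and confusion matrix may appear inverted depending on which cluster lines up with the Private label.

    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.metrics import classification_report, confusion_matrix

    # Load the college data (file name assumed)
    df = pd.read_csv('College_Data.csv', index_col=0)

    # Fit K Means with two clusters on the numerical features only (labels unused)
    kmeans = KMeans(n_clusters=2)
    kmeans.fit(df.drop('Private', axis=1))

    # Convert the known Private/Public labels to 0/1 purely for evaluation
    df['Cluster'] = df['Private'].apply(lambda x: 1 if x == 'Yes' else 0)

    # Compare cluster assignments with the true labels
    print(confusion_matrix(df['Cluster'], kmeans.labels_))
    print(classification_report(df['Cluster'], kmeans.labels_))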
7) Vehicle license plate detection using advanced image processing techniques and machine learning
8) Facial expression recognition using Eigen sub-face method and principal component analysis
9) Detection of the number of faces in an image using the normalized pixel difference method and a deep quadratic tree
10) Natural Language Processing Project:
In this NLP project we attempted to classify Yelp reviews into 1-star or 5-star categories based on the text content of the reviews. We used the Yelp Review Data Set from Kaggle. Each observation in this data set was a review of a particular business by a particular user. The "stars" column was the number of stars (1 through 5) assigned by the reviewer to the business (higher is better); in other words, it was the rating of the business by the person who wrote the review. The "cool" column was the number of "cool" votes the review received from other Yelp users. All reviews start with 0 "cool" votes, and there was no limit to how many "cool" votes a review could receive; in other words, it was a rating of the review itself, not a rating of the business. The "useful" and "funny" columns were similar to the "cool" column.
We first explored the data using pandas, NumPy, and seaborn and did some analysis of the data set. Then, for the NLP classification task, we kept only the reviews having 1 star or 5 stars. We used the CountVectorizer class to convert the review text into a matrix of token counts and fit a Naïve Bayes classifier to that matrix. We generated the classification report and confusion matrix based on the true labels and predicted labels to evaluate the performance of our model. Next, we used text-processing and pipeline methods to train our model: in the pipeline we combined the scikit-learn classes CountVectorizer(), TfidfTransformer(), and MultinomialNB() into a single pipeline model. We then fit the pipeline on the training data and predicted on the test data. Finally, we used the classification report and confusion matrix to check the performance of our pipeline model.
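A compact sketch of the pipeline approach is shown below; the file name 'yelp.csv' and the random_state value are assumptions.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import Pipeline
    from sklearn.metrics import classification_report, confusion_matrix

    # Load the Yelp reviews (file name assumed) and keep only 1- and 5-star reviews
    yelp = pd.read_csv('yelp.csv')
    yelp_class = yelp[(yelp['stars'] == 1) | (yelp['stars'] == 5)]

    X = yelp_class['text']
    y = yelp_class['stars']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=101)

    # Pipeline: token counts -> TF-IDF weighting -> Naive Bayes classifier
    pipeline = Pipeline([
        ('bow', CountVectorizer()),
        ('tfidf', TfidfTransformer()),
        ('classifier', MultinomialNB()),
    ])
    pipeline.fit(X_train, y_train)

    # Evaluate the pipeline on the held-out reviews
    predictions = pipeline.predict(X_test)
    print(classification_report(y_test, predictions))
    print(confusion_matrix(y_test, predictions))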