Data Wrangling Exercises with R

Classification using K-Nearest-Neighbor (KNN), Support Vector Machine (SVM), and K-Means Methods

In this project, I examined several datasets (German credit, iris flowers, etc.) and explored three data science methods for classifying or clustering the data points into groups based on their features.

Here is the full code for these exercises.

To classify the credit card customers into high- and low-default-risk groups, I first used the K-Nearest-Neighbors (KNN) method together with cross-validation, given the limited sample size for the analysis. I then applied a Support Vector Machine (SVM) to classify the data and identified the optimal C parameter based on classification accuracy.

In addition, for the popular iris dataset, after initial exploratory and correlation analysis, I tested an unsupervised K-means algorithm for clustering the iris observations into distinct groups.

A function that runs the KNN model for any given K value, looping through 10 folds for cross-validation.
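The exercise itself is written in R; purely as an illustration of the idea, here is a minimal pure-Python sketch of KNN with 10-fold cross-validation. The function names and the stride-based fold assignment are my own choices, not the original code.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x, k):
    """Classify x by majority vote among its k nearest training points (Euclidean)."""
    nearest = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def cv_accuracy(X, y, k, folds=10):
    """Mean KNN accuracy over `folds` cross-validation folds.

    Fold f holds out every `folds`-th point starting at f, trains on the
    rest, and scores the held-out points; the result is overall accuracy.
    """
    n, correct = len(X), 0
    for f in range(folds):
        test_idx = set(range(f, n, folds))
        train_X = [X[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        correct += sum(knn_predict(train_X, train_y, X[i], k) == y[i]
                       for i in test_idx)
    return correct / n
```

Looping `cv_accuracy` over a range of K values and keeping the best-scoring K mirrors the model-selection step described above.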

A graph showing the accuracy levels of SVM models with different C parameter values.
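For illustration only (the original uses an R SVM package), here is a small pure-Python sketch of the idea behind that graph: train a linear SVM for each candidate C value and compare accuracies. The subgradient-descent trainer and all names are my own simplification, not the original code.

```python
def train_linear_svm(X, y, C, epochs=500, lr=0.01):
    """Linear SVM fit by full-batch subgradient descent on the soft-margin
    objective (1/2)||w||^2 + C * sum_i max(0, 1 - y_i(w.x_i + b)).
    Labels must be in {-1, +1}."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        gw, gb = list(w), 0.0                 # gradient of (1/2)||w||^2 is w itself
        for xi, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b) < 1:
                for j in range(dim):          # hinge subgradient for margin violators
                    gw[j] -= C * yi * xi[j]
                gb -= C * yi
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

def svm_accuracy(X, y, w, b):
    """Fraction of points on the correct side of the separating hyperplane."""
    return sum((sum(wj * xj for wj, xj in zip(w, x)) + b >= 0) == (yi > 0)
               for x, yi in zip(X, y)) / len(y)
```

Sweeping C over, say, (0.01, 0.1, 1, 10, 100) and plotting `svm_accuracy` for each value reproduces the accuracy-vs-C comparison the figure shows.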

Pairwise relationships between the four variables, visualized with a matrix of mini scatter plots.

Correlation Matrix for the four variables.

Identifying the optimal K value for the K-means algorithm, using the within-cluster sum of squares.
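The original clustering uses R's built-in kmeans, which reports the within-cluster sum of squares directly; to illustrate what that number is, here is a minimal pure-Python sketch of Lloyd's algorithm (all names are mine, not the original code).

```python
import random

def _sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def kmeans_wss(points, k, iters=100, seed=0):
    """Lloyd's algorithm; returns the total within-cluster sum of squares (WSS)."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                      # assign each point to its nearest center
            clusters[min(range(k), key=lambda j: _sq_dist(p, centers[j]))].append(p)
        # move each center to the mean of its cluster (keep it if the cluster is empty)
        centers = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return sum(_sq_dist(p, min(centers, key=lambda c: _sq_dist(p, c)))
               for p in points)
```

Plotting `kmeans_wss` for K = 1..10 and looking for the bend ("elbow") in the curve is the selection rule the figure illustrates.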

Final classification results.

CUSUM and Exponential Smoothing with Time Series Data

A quick analysis to determine when summer ends in Atlanta, and whether the city has become hotter over the past several decades. The Cumulative Sum Control Chart (CUSUM) method was used to detect signals of changing seasons and trend fluctuations.
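The analysis itself is in R; as a language-agnostic sketch of the technique (function name and parameters are my own), a one-sided CUSUM that flags a sustained drop in temperature works like this:

```python
def cusum_detect_drop(xs, threshold, slack):
    """One-sided CUSUM for a downward shift.

    Accumulates S_t = max(0, S_{t-1} + (mu - x_t - slack)), where mu is the
    series mean and `slack` absorbs ordinary day-to-day noise; returns the
    first index where S_t crosses `threshold` (a detected change), else None.
    """
    mu = sum(xs) / len(xs)
    s = 0.0
    for t, x in enumerate(xs):
        s = max(0.0, s + (mu - x) - slack)
        if s >= threshold:
            return t
    return None
```

Applied to one summer of daily highs per year, the returned index is that year's detected "end of summer"; tightening `threshold` and `slack` trades faster detection against more false alarms.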

Full code for this analysis is here.

Daily high temperatures from July to October 1996.

End-of-summer dates identified for each year from 1996 to 2015, using the CUSUM method.

Exponential Smoothing

I also did some additional analysis using the Exponential Smoothing (ES) method on the same temperature data. Code for this analysis is here.
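The R analysis presumably relies on a built-in smoother; for illustration, the core recursion of simple exponential smoothing fits in a few lines of pure Python (names are mine, not the original code):

```python
def exp_smooth(xs, alpha):
    """Simple exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}.

    alpha near 1 trusts new observations (little smoothing); alpha near 0
    trusts the running estimate (heavy smoothing).
    """
    s = [float(xs[0])]                      # initialize with the first observation
    for x in xs[1:]:
        s.append(alpha * x + (1 - alpha) * s[-1])
    return s
```

Each smoothed value is a weighted average of the entire history, with weights decaying geometrically into the past.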

Decomposing the time series gives a clear look at the separate components (trend, seasonality, and remainder) behind the data.
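The decomposition itself was done in R; to show what a classical additive decomposition actually computes, here is a pure-Python sketch for an odd seasonal period (function and variable names are my own, not the original code):

```python
def decompose_additive(series, period):
    """Classical additive decomposition for an odd seasonal period.

    trend    : centered moving average of width `period` (None at the edges)
    seasonal : mean detrended value at each position in the cycle, re-centered
    remainder: observation minus trend minus seasonal (None where trend is None)
    """
    n, half = len(series), period // 2
    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half:t + half + 1]) / period
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    seasonal = [sum(b) / len(b) for b in buckets]
    offset = sum(seasonal) / period               # force the season to sum to zero
    seasonal = [s - offset for s in seasonal]
    remainder = [series[t] - trend[t] - seasonal[t % period]
                 if trend[t] is not None else None for t in range(n)]
    return trend, seasonal, remainder
```

For daily temperatures, the trend captures the slow seasonal drift while the seasonal component captures any repeating cycle at the chosen period.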

Several Exponential Smoothing models, from simple to more complex, fitted to the temperature data from 1996 to 2015.
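As one example of stepping up in complexity, Holt's linear method adds a trend component to the single smoothing equation; here is a pure-Python sketch for illustration (names are mine, not the original R code):

```python
def holt_linear(xs, alpha, beta):
    """Holt's (double) exponential smoothing: a level plus a trend component.

    level_t = alpha * x_t + (1 - alpha) * (level_{t-1} + trend_{t-1})
    trend_t = beta * (level_t - level_{t-1}) + (1 - beta) * trend_{t-1}
    Returns the fitted levels and the final (level, trend) for forecasting.
    """
    level, trend = float(xs[0]), float(xs[1] - xs[0])
    fitted = [level]
    for x in xs[1:]:
        prev = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev) + (1 - beta) * trend
        fitted.append(level)
    return fitted, level, trend
```

A one-step-ahead forecast is simply `level + trend`; comparing the fitted lines from the simple and trend-aware variants is what the "simple to more complex" comparison refers to.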

The exponentially smoothed line plotted against the actual data points.

Interested to learn more? Feel free to contact me at wenhaowu92@gmail.com!