Ashley Schuliger, 26 January 2021
While working on the Flowability project, we decided to try using LDA/QDA models for prediction. As we continued down this path, we found that our data set had high dimensionality and a low sample size, which is not a great fit for these models. Such data sets tend to produce overfit models: models that are extremely accurate when predicting any data point in the initial data set, but struggle to predict new data points.
Luckily for us, we found that applying regularization when fitting a model would reduce this “curse of dimensionality”. Possible regularization techniques include running PCA before LDA, or using RDA (regularized discriminant analysis). In this blog, we will discuss how to perform RDA on the “Classifying Wine Varieties” dataset from Kaggle using the scikit-learn library. We chose this dataset because it has high dimensionality and a low sample size.
To start, we import a few important libraries: pandas, NumPy, Matplotlib, and scikit-learn. These libraries make it easier for us to manipulate, analyze, and visualize the data.
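A minimal set of imports for this walkthrough might look like the following; the exact import list in the original code is an assumption.

```python
# Core libraries for data manipulation, numerics, and plotting
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# scikit-learn's LDA class, which also supports shrinkage (RDA-style regularization)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
```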
Once we have our imports, we need to read the data file into a pandas DataFrame, a data type in the pandas library that makes it easier to work with the data. After reading the file, we split the data into x (features) and y (class labels) as shown below.
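A sketch of this step is below; the file name "wine.csv" and the label column name "class" are assumptions about how the Kaggle dataset is stored locally.

```python
# Read the Kaggle wine data into a pandas DataFrame.
# The file name "wine.csv" and the "class" column name are assumptions.
df = pd.read_csv("wine.csv")

# x holds the chemical measurements (features); y holds the wine variety labels.
x = df.drop(columns=["class"])
y = df["class"]
```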
Once we have our data, we can run RDA on the dataset using scikit-learn. The library contains a LinearDiscriminantAnalysis class, used below, which lets users apply regularization to the model with a parameter called shrinkage. This parameter is a value between 0 and 1 that can be set manually, or set to “auto” to let the algorithm determine the best shrinkage for us; we use “auto” here. For the sake of simplicity, we use the eigen solver, since it supports shrinkage. Once the model is created, we can fit it to our data and transform the data accordingly.
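A minimal sketch of this step, reusing the variable names from the snippet above:

```python
# Build an LDA model with shrinkage, which is how scikit-learn exposes
# RDA-style regularization. The "eigen" solver supports shrinkage, and
# "auto" estimates the shrinkage intensity automatically (Ledoit-Wolf).
rda = LinearDiscriminantAnalysis(solver="eigen", shrinkage="auto")

# Fit the model on the labeled data and project the samples onto the
# discriminant axes (at most n_classes - 1 of them).
x_transformed = rda.fit_transform(x, y)
```

Automatic shrinkage is convenient here because it estimates the regularization strength directly from the data, which tends to help exactly when the sample size is small relative to the number of features.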
Once we have our fitted model, we can visualize the results to determine whether the classes have been separated well. We do this by creating a scatter plot of the transformed data points rather than the original data, coloring each point according to its class label.
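A plotting sketch under the same assumed variable names; with three wine classes, LDA yields two discriminant components to plot.

```python
# Plot the two discriminant components, coloring each point by its class.
y_arr = np.asarray(y)
for label in np.unique(y_arr):
    mask = y_arr == label
    plt.scatter(x_transformed[mask, 0], x_transformed[mask, 1], label=f"Class {label}")

plt.xlabel("Discriminant 1")
plt.ylabel("Discriminant 2")
plt.title("RDA-transformed wine data")
plt.legend()
plt.show()
```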
Below is the visualized result of our RDA model. As you can see, the model cleanly separated the classes, allowing us to draw boundaries to classify our data.
We hope that you learned a bit about RDA and how it can be used to classify datasets with high dimensionality.