## Introduction

In R, we often use multiple packages to carry out different machine learning tasks. For example: we impute missing values with one package, build a model with another, and finally evaluate its performance with a third. The problem is that every package has its own set of parameters, and while working with many packages we end up spending a lot of time figuring out which parameters are important. To solve this problem, I researched and came across an R package named `mlr`. In this tutorial, I've taken up a classification problem and tried improving its accuracy using machine learning. I haven't explained the ML algorithms theoretically; the focus is kept on their implementation. By the end of this article, you are expected to become proficient at implementing several ML algorithms in R. But only if you practice alongside.
## Table of Contents

- Getting Data
- Exploring Data
- Missing Value Imputation
- Feature Engineering
- Outlier Removal by Capping
- New Features
- Machine Learning
- Feature Importance
- QDA
- Logistic Regression
- Cross Validation
- Decision Tree
- Cross Validation
- Parameter Tuning using Grid Search
- Random Forest
- SVM
- GBM (Gradient Boosting)
- Cross Validation
- Parameter Tuning using Random Search (Faster)
- XGBoost (Extreme Gradient Boosting)
- Feature Selection
## Machine Learning with the MLR Package

Until now, R didn't have any package / library similar to Python's Scikit-learn, wherein you could get all the functions required to do machine learning. But since February 2016, R users have had the mlr package, with which they can perform most of their ML tasks. Let's now understand the basic concept of how this package works. If you get it right here, understanding the whole package is a mere cakewalk. The entire structure of this package relies on this premise: Create a Task. Make a Learner. Train them. Creating a task means loading data into the package. Making a learner means choosing an algorithm (learner) which learns from the task (or data). Finally, you train them. The mlr package has several algorithms in its bouquet. These algorithms have been categorized into regression, classification, clustering, survival, multilabel and cost-sensitive classification. Let's look at some of the available algorithms for classification problems:
And, there are many more. Let’s start working now!
## 1. Getting Data

For this tutorial, I've taken up one of the popular ML problems from DataHack (a one-time login will be required to get the data): Download Data. After you've downloaded the data, let's quickly get done with the initial commands such as setting the working directory and loading the data.
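A minimal sketch of these initial commands; the folder path and file names (`train.csv`, `test.csv`) are assumptions based on the usual DataHack download:

```r
# set working directory (path is a placeholder)
setwd("path/to/your/folder")

# treat blank strings as missing values while loading
train <- read.csv("train.csv", na.strings = c("", " ", NA))
test  <- read.csv("test.csv",  na.strings = c("", " ", NA))
```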
## 2. Exploring Data

Once the data is loaded, you can access it with:
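One way to do this is mlr's `summarizeColumns()`, assuming `train` and `test` were loaded as above:

```r
library(mlr)

# per-column summary: class, missing count (na), mean, dispersion, min, max, levels
summarizeColumns(train)
summarizeColumns(test)
```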
This function gives a much more comprehensive view of the data set than base R's `summary()`.
From these outputs, we can make the following inferences:

- The data has 12 variables, of which `Loan_Status` is the dependent variable; the rest are independent variables.
- Train data has 614 observations. Test data has 367 observations.
- In train and test data, 6 variables have missing values (see the `na` column).
- `ApplicantIncome` and `CoapplicantIncome` are highly skewed variables. How do we know that? Compare their min, max and median values. We'll have to normalize these variables.
- `LoanAmount`, `ApplicantIncome` and `CoapplicantIncome` have outlier values, which should be treated.
- `Credit_History` is an integer type variable. But, being binary in nature, we should convert it to a factor.
Also, you can check the presence of skewness in variables mentioned above using a simple histogram.
As you can see in the charts above, skewness is nothing but a concentration of the majority of the data on one side of the chart. What we see is a right-skewed graph. To visualize outliers, we can use a boxplot:
Similarly, you can create a boxplot for `CoapplicantIncome` and `LoanAmount`. Let's now change the class of `Credit_History` to factor.
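The plotting and class-conversion steps above can be sketched as follows, assuming the `train` / `test` data frames from earlier:

```r
# histogram to check skewness
hist(train$ApplicantIncome, breaks = 300,
     main = "Applicant Income", xlab = "ApplicantIncome")

# boxplot to visualize outliers
boxplot(train$ApplicantIncome)

# Credit_History is binary, so recode it as a factor in both data sets
train$Credit_History <- as.factor(train$Credit_History)
test$Credit_History  <- as.factor(test$Credit_History)
```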
To check the changes, you can do:
You can further scrutinize the data using:
From this, we find that one of the factor variables has a level that needs attention before modeling.
## 3. Missing Value Imputation

Not just beginners, even good R analysts struggle with missing value imputation. The mlr package offers a nice and convenient way to impute missing values using multiple methods. Now that the much-needed modifications to the data are done, let's impute the missing values. In our case, we'll use basic mean and mode imputation. You can also use an ML algorithm to impute these values, but that comes at the cost of computation.
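A sketch using mlr's `impute()` function; the output data frame names (`imp_train`, `imp_test`) are my own:

```r
# mean imputation for integer columns, mode imputation for factors;
# dummy.classes adds indicator columns flagging where values were missing
imp <- impute(train,
              classes = list(integer = imputeMean(), factor = imputeMode()),
              dummy.classes = c("integer", "factor"), dummy.type = "numeric")
imp1 <- impute(test,
               classes = list(integer = imputeMean(), factor = imputeMode()),
               dummy.classes = c("integer", "factor"), dummy.type = "numeric")

# the imputed data frames live in the $data slot
imp_train <- imp$data
imp_test  <- imp1$data
```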
This function is convenient because you don’t have to specify each variable name to impute. It selects variables on the basis of their classes. It also creates new dummy variables for missing values. Sometimes, these (dummy) features contain a trend which can be captured using this function.
Now, we have the complete data. You can check the new variables using:
Did you notice a disparity between the two data sets? No? Look again. Imputation can create a dummy variable in one data set that is absent from the other; such columns must be aligned before modeling. However, it is always advisable to treat missing values thoughtfully rather than mechanically. Let's see how you can treat missing values using rpart:
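A minimal sketch of tree-based imputation with rpart; the predictor set chosen here is an assumption for illustration:

```r
library(rpart)

# fit a regression tree on rows where LoanAmount is known
fit <- rpart(LoanAmount ~ Gender + Married + Education + Self_Employed,
             data = train[!is.na(train$LoanAmount), ], method = "anova")

# fill in the missing values with the tree's predictions
train$LoanAmount[is.na(train$LoanAmount)] <-
  predict(fit, newdata = train[is.na(train$LoanAmount), ])
```

The same pattern (with `method = "class"`) works for categorical variables.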
## 4. Feature Engineering

Feature Engineering is the most interesting part of predictive modeling. It has two aspects: feature transformation and feature creation. We'll try to work on both aspects here. First, let's remove outliers from variables like `ApplicantIncome`, `CoapplicantIncome` and `LoanAmount` by capping them.
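mlr's `capLargeValues()` can do the capping; the thresholds below are illustrative, and `cd_train` is my name for the capped data:

```r
# cap values above the threshold; thresholds chosen from the distributions
cd_train <- capLargeValues(imp_train, target = "Loan_Status",
                           cols = c("ApplicantIncome"), threshold = 40000)
cd_train <- capLargeValues(cd_train, target = "Loan_Status",
                           cols = c("CoapplicantIncome"), threshold = 21000)
cd_train <- capLargeValues(cd_train, target = "Loan_Status",
                           cols = c("LoanAmount"), threshold = 520)
# repeat the same calls on imp_test to build cd_test
```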
I've chosen the threshold values at my discretion, after analyzing the variable distributions. To check the effects, you can run `summarizeColumns()` again. In both data sets, we see that all dummy variables are numeric in nature. Being binary in form, they should be categorical. Let's convert their classes to factor. This time, we'll use simple `for` loops.
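A sketch of those loops; the column range 14 to 20 assumes the dummy columns sit at the end of the data frame, as produced by the imputation step:

```r
# convert the numeric dummy columns (columns 14 to 20) to factors
for (f in names(cd_train)[14:20]) {
  if (is.numeric(cd_train[[f]])) {
    cd_train[[f]] <- as.factor(cd_train[[f]])
  }
}
# repeat the same loop for cd_test
```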
These loops say: 'for every column name which falls in column numbers 14 to 20 of the cd_train / cd_test data frame, if the class of that variable is numeric, take the unique values of the column as levels and convert it into a factor (categorical) variable'. Let's create some new features now.
While creating new numeric features, we must check their correlation with existing variables, since there is a high chance they are correlated. Let's see if our new variables happen to be correlated:
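As an example of a new feature and its correlation check; the feature name `Total_Income` is my own illustration, not necessarily the article's exact variable:

```r
# a plausible new feature: combined income of applicant and co-applicant
cd_train$Total_Income <- cd_train$ApplicantIncome + cd_train$CoapplicantIncome
cd_test$Total_Income  <- cd_test$ApplicantIncome + cd_test$CoapplicantIncome

# check correlation against the variables it was built from
cor(cd_train$Total_Income, cd_train$ApplicantIncome)
```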
As we see, the new variable is very highly correlated with the variable it was derived from, so we can remove it.
There is still enough potential left to create new variables. Before proceeding, I want you to think deeper about this problem and try creating newer variables. After making so many modifications to the data, let's check the data again:
## 5. Machine Learning

Until here, we've performed all the important transformation steps except normalizing the skewed variables. That will be done after we create the task. As explained in the beginning, for mlr a task is nothing but the data set on which a learner learns. Since it's a classification problem, we'll create a classification task. So, the task type solely depends on the type of problem at hand.
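Creating the task is one call, assuming the processed data frame is `cd_train` and `Y` is the positive class of `Loan_Status`:

```r
# load data into mlr as a classification task
trainTask <- makeClassifTask(data = cd_train, target = "Loan_Status",
                             positive = "Y")
```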
Let’s check trainTask
As you can see, printing the task provides a description of it: the target variable, the number of observations, and the feature types.
For a deeper view, you can check your task data using `getTaskData()`. Now, we will normalize the data. For this step, we'll use mlr's `normalizeFeatures()` function, which by default operates on all numeric features.
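A sketch of the normalization step on the task created above:

```r
# standardize numeric features to zero mean and unit variance
trainTask <- normalizeFeatures(trainTask, method = "standardize")
```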
Before we start applying algorithms, we should remove the variables which are not required.
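For instance, the ID column carries no predictive signal; assuming it is named `Loan_ID`, it can be dropped from the task:

```r
# remove the identifier column from the task
trainTask <- dropFeatures(task = trainTask, features = c("Loan_ID"))
```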
The mlr package has an in-built function which returns the important variables from the data. Let's see which variables are important. Later, we can use this knowledge to subset our input predictors for model improvement. While running this code, R might prompt you to install the 'FSelector' package, which you should do.
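A sketch of the importance check using mlr's filter interface (this requires the FSelector package for the information-gain method):

```r
# score each feature by information gain and chi-squared
im_feat <- generateFilterValuesData(trainTask,
                                    method = c("information.gain", "chi.squared"))

# visualize the top features
plotFilterValues(im_feat, n.show = 20)
```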
If you are still wondering about the filter methods used above, the mlr documentation lists all supported options. Let's start modeling now. I won't explain these algorithms in detail, but I've provided links to helpful resources. We'll take up simpler algorithms first and end this tutorial with more complex ones. With mlr, we can choose and configure algorithms using `makeLearner()`.
## 1. QDA

In general, QDA is a parametric algorithm. Parametric means that it makes certain assumptions about the data. If the data actually follows those assumptions, such algorithms sometimes outperform several non-parametric algorithms. Read More.
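A sketch of the QDA workflow with mlr; the submission column names follow the competition's assumed format:

```r
# make a QDA learner, train it, and predict on the test data
qda.learner <- makeLearner("classif.qda", predict.type = "response")
qmodel   <- train(qda.learner, trainTask)
qpredict <- predict(qmodel, newdata = cd_test)

# write a submission file
submit <- data.frame(Loan_ID = test$Loan_ID,
                     Loan_Status = qpredict$data$response)
write.csv(submit, "submit_qda.csv", row.names = FALSE)
```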
Upload this submission file and check your leaderboard rank (it won't be good). Our accuracy is ~71.5%. I understand this submission might not put you at the top of the leaderboard, but there's a long way to go. So, let's proceed.
## 2. Logistic Regression

This time, let's also check the cross-validation accuracy. A higher CV accuracy indicates that our model does not suffer from high variance and generalizes well on unseen data.
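A sketch of the logistic regression learner with stratified 3-fold CV, via mlr's `crossval()` helper:

```r
logistic.learner <- makeLearner("classif.logreg", predict.type = "response")

# stratified 3-fold cross validation on accuracy
cv.logistic <- crossval(learner = logistic.learner, task = trainTask,
                        iters = 3, stratify = TRUE,
                        measures = acc, show.info = FALSE)

cv.logistic$aggr   # average CV accuracy
```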
Similarly, you can perform CV for any learner. Isn't it incredibly easy? So, I've used stratified sampling with 3-fold CV. I'd always recommend stratified sampling in classification problems, since it maintains the proportion of the target class across the n folds. We can check CV accuracy with:
This is the average accuracy calculated over the 3 folds. To see the accuracy of each individual fold, we can do this:
Now, we’ll train the model and check the prediction accuracy on test data.
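A minimal sketch of that final training and prediction step, continuing with the learner above:

```r
# train on the full task and predict on the processed test data
fmodel  <- train(logistic.learner, trainTask)
fpmodel <- predict(fmodel, newdata = cd_test)

submit <- data.frame(Loan_ID = test$Loan_ID,
                     Loan_Status = fpmodel$data$response)
write.csv(submit, "submit_logistic.csv", row.names = FALSE)
```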
Woah! This algorithm gave us a significant boost in accuracy. Moreover, this is a stable model, since our CV score and leaderboard score match closely. This submission returns an accuracy of 79.16%. Good, we are improving now. Let's get ahead to the next algorithm.
## 3. Decision Tree

A decision tree is said to capture non-linear relationships better than a logistic regression model. Let's see if we can improve our model further. This time we'll tune the tree's hyperparameters to achieve optimal results. To get the list of parameters for any algorithm, simply write (in this case, for rpart):
This will return a long list of tunable and non-tunable parameters. Let's build a decision tree now. Make sure you have installed the rpart package.
I’m doing a 3 fold CV because we have less data. Now, let’s set tunable parameters:
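A sketch of the tree learner, CV scheme, parameter space and grid search; the parameter bounds are illustrative choices:

```r
getParamSet("classif.rpart")   # list tunable parameters

makeatree <- makeLearner("classif.rpart", predict.type = "response")

# 3-fold cross validation
set_cv <- makeResampleDesc("CV", iters = 3L)

# the three parameters tuned here
gs <- makeParamSet(
  makeIntegerParam("minsplit",  lower = 10, upper = 50),
  makeIntegerParam("minbucket", lower = 5,  upper = 50),
  makeNumericParam("cp", lower = 0.001, upper = 0.2)
)

# grid search over the parameter space
gscontrol <- makeTuneControlGrid()
stune <- tuneParams(learner = makeatree, resampling = set_cv,
                    task = trainTask, par.set = gs,
                    control = gscontrol, measures = acc)
```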
As you can see, I’ve set 3 parameters.
You may go and take a walk until the parameter tuning completes. Maybe go catch some Pokémon! It took 15 minutes to run on my machine (Windows, Intel i5, 8GB RAM).
It returns a list of best parameters. You can check the CV accuracy with:
Using `setHyperPars()`, we can apply the best parameters to the learner before training.
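A sketch of applying the tuned parameters and predicting, assuming `stune` holds the tuning result from above:

```r
stune$y   # cross-validated accuracy of the best parameter set

# apply the best parameters, retrain, and predict
t.tree  <- setHyperPars(makeatree, par.vals = stune$x)
t.rpart <- train(t.tree, trainTask)
tpmodel <- predict(t.rpart, newdata = cd_test)
```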
The decision tree is doing no better than logistic regression. This algorithm returned the same accuracy of 79.14% as logistic regression. So, one tree isn't enough. Let's build a forest now.
## 4. Random Forest

Random Forest is a powerful algorithm known to produce astonishing results. Its predictions derive from an ensemble of trees: it averages the predictions given by the individual trees and produces a generalized result. From here, most of the steps are similar to those followed above, but this time I've done the parameter tuning using random search instead of grid search.
Though random search is faster than grid search, it sometimes turns out to be less effective. In grid search, the algorithm tunes over every possible combination of the parameters provided. In random search, we specify the number of iterations and it randomly samples parameter combinations. In this process, it might miss some important combination of parameters which could have returned maximum accuracy, who knows.
Now, we have the final parameters. Let’s check the list of parameters and CV accuracy.
Let’s build the random forest model now and check its accuracy.
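The random forest workflow above can be sketched as follows; parameter bounds and `maxit` are illustrative:

```r
rf <- makeLearner("classif.randomForest", predict.type = "response",
                  par.vals = list(ntree = 200, mtry = 3))

rf_param <- makeParamSet(
  makeIntegerParam("ntree", lower = 50, upper = 500),
  makeIntegerParam("mtry", lower = 3, upper = 10),
  makeIntegerParam("nodesize", lower = 10, upper = 50)
)

# random search over 50 iterations, 3-fold CV
rancontrol <- makeTuneControlRandom(maxit = 50L)
set_cv <- makeResampleDesc("CV", iters = 3L)
rf_tune <- tuneParams(learner = rf, resampling = set_cv, task = trainTask,
                      par.set = rf_param, control = rancontrol, measures = acc)

rf_tune$x   # best parameters
rf_tune$y   # CV accuracy

# retrain with the best parameters and predict
rf.tree <- setHyperPars(rf, par.vals = rf_tune$x)
rforest <- train(rf.tree, trainTask)
rfmodel <- predict(rforest, newdata = cd_test)
```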
No new story to cheer about. This model too returned an accuracy of 79.14%. So, try using grid search instead of random search, and tell me in the comments if your model improved.
## 5. SVM

Support Vector Machine (SVM) is also a supervised learning algorithm used for regression and classification problems. In general, it creates a hyperplane in n-dimensional space to separate the data based on the target class. Let's step away from tree algorithms for a while and see if this algorithm can bring us some improvement. Since most of the steps are similar to those performed above, I don't think understanding this code will be a challenge for you anymore.
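A sketch with the kernlab-backed learner; the candidate values for `C` and `sigma` are illustrative:

```r
ksvm <- makeLearner("classif.ksvm", predict.type = "response")

pssvm <- makeParamSet(
  makeDiscreteParam("C", values = 2^c(-8, -4, -2, 0, 2, 4, 8)),
  makeDiscreteParam("sigma", values = 2^c(-8, -4, 0, 4, 8))
)

set_cv <- makeResampleDesc("CV", iters = 3L)
ctrl <- makeTuneControlGrid()
res <- tuneParams(ksvm, task = trainTask, resampling = set_cv,
                  par.set = pssvm, control = ctrl, measures = acc)

t.svm   <- setHyperPars(ksvm, par.vals = res$x)
par.svm <- train(t.svm, trainTask)
predict.svm <- predict(par.svm, newdata = cd_test)
```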
This model returns an accuracy of 77.08%. Not bad, but lower than our highest score. Don't feel hopeless here. This is core machine learning: ML doesn't work well unless it gets good variables. Maybe you should think longer about the feature engineering aspect and create more useful variables. Let's do boosting now.
## 6. GBM (Gradient Boosting)

Now you are entering the territory of boosting algorithms. GBM performs sequential modeling, i.e. after one round of prediction it checks for incorrect predictions, assigns them relatively more weight, and predicts on them again until they are predicted correctly.
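A GBM sketch with random search, following the same mlr pattern; the parameter ranges are illustrative:

```r
g.gbm <- makeLearner("classif.gbm", predict.type = "response")

gbm_par <- makeParamSet(
  makeDiscreteParam("distribution", values = "bernoulli"),
  makeIntegerParam("n.trees", lower = 100, upper = 1000),
  makeIntegerParam("interaction.depth", lower = 2, upper = 10),
  makeIntegerParam("n.minobsinnode", lower = 10, upper = 80),
  makeNumericParam("shrinkage", lower = 0.01, upper = 1)
)

set_cv <- makeResampleDesc("CV", iters = 3L)
rancontrol <- makeTuneControlRandom(maxit = 50L)
final_gbm <- tuneParams(g.gbm, task = trainTask, resampling = set_cv,
                        par.set = gbm_par, control = rancontrol, measures = acc)

final  <- setHyperPars(g.gbm, par.vals = final_gbm$x)
to.gbm <- train(final, trainTask)
pr.gbm <- predict(to.gbm, newdata = cd_test)
```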
The accuracy of this model is 78.47%. GBM performed better than SVM, but couldn’t exceed random forest’s accuracy. Finally, let’s test XGboost also.
## 7. XGBoost

XGBoost is considered better than GBM because of its in-built capabilities, including first- and second-order gradients, parallel processing and the ability to prune trees. A typical implementation of xgboost requires you to convert the data into a matrix. With mlr, that is not required. As I said in the beginning, a benefit of using the mlr package is that you can follow the same set of commands to implement different algorithms.
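A sketch of the xgboost learner; note that xgboost itself needs numeric inputs, so if your task still contains factors you may need `createDummyFeatures(trainTask)` first. Parameter ranges are illustrative:

```r
set.seed(1001)
xg_set <- makeLearner("classif.xgboost", predict.type = "response")
xg_set$par.vals <- list(objective = "binary:logistic",
                        eval_metric = "error", nrounds = 250)

xg_ps <- makeParamSet(
  makeIntegerParam("max_depth", lower = 3, upper = 20),
  makeNumericParam("eta", lower = 0.001, upper = 0.5),
  makeNumericParam("subsample", lower = 0.10, upper = 0.80),
  makeNumericParam("min_child_weight", lower = 1, upper = 5)
)

set_cv <- makeResampleDesc("CV", iters = 3L)
rancontrol <- makeTuneControlRandom(maxit = 100L)
xg_tune <- tuneParams(xg_set, task = trainTask, resampling = set_cv,
                      par.set = xg_ps, control = rancontrol, measures = acc)

xg_new  <- setHyperPars(xg_set, par.vals = xg_tune$x)
xgmodel <- train(xg_new, trainTask)
predict.xg <- predict(xgmodel, newdata = cd_test)
```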
Terrible XGBoost. This model returns an accuracy of 68.5%, even lower than QDA. What could have happened? Overfitting. The model returned a CV accuracy of ~80%, but the leaderboard score declined drastically because the model couldn't predict correctly on unseen data.
## What can you do next? Feature Selection?

For improvement, let's do this. Until here, we've used `trainTask` for model building. Let's now use our knowledge of important variables: take the top 6 important variables and train the models on them. You can expect some improvement. To create a task selecting important variables, do this:
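A sketch using mlr's filter interface; depending on your mlr version, the random forest filter may be named "rf.importance" or "randomForest.importance":

```r
# keep only the 6 features ranked highest by random forest importance
top_task <- filterFeatures(trainTask, method = "rf.importance", abs = 6)
```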
So, I've asked this function to get me the top 6 important features using random forest importance. Now, replace `trainTask` with this new task in the model-building code above. Also, try to create more features. The current leaderboard winner is at ~81% accuracy. If you have followed me till here, don't give up now.
## End Notes

The motive of this article was to get you started with machine learning techniques. These techniques are commonly used in industry today, so make sure you understand them well. Don't use these algorithms as black-box approaches; I've provided links to resources for the underlying theory. What happened above happens a lot in real life: you try many algorithms but don't see an improvement in accuracy. But you shouldn't give up. As a beginner, you should explore other ways to improve accuracy. Remember, no matter how many wrong attempts you make, you just have to be right once. You might have to install packages while loading these models, but that's a one-time task. If you followed this article completely, you are ready to build models. All you have to do now is learn the theory behind them. Did you find this article helpful? Did you try the improvement methods I listed above? Which algorithm gave you the maximum accuracy? Share your observations / experience in the comments below.