Below are some notes about courses and studies on machine learning, a discipline within artificial intelligence that, through algorithms, gives machines the ability to identify patterns in massive data and perform predictive analysis; in other words, to make predictions!
We have previously seen the categories of supervised, unsupervised, and reinforcement learning, the main machine learning algorithms, and the process of building predictive models. Now we'll put all of this into practice in the Python language through its main packages and build some machine learning models.
Study Notebook
Visit the Jupyter Notebook to see the concepts we will cover here about preprocessing, machine learning algorithm training, model evaluation, and prediction on new data during the composition of the Machine Learning template. The goal here is to provide an introduction to the topic.
This notebook contains a template of the code needed to create the main Machine Learning algorithms. Within scikit-learn, we have several algorithms ready. Just adjust the parameters, feed them with the data, do the training, produce the model and finally make predictions.
Summarizing: for linear regression, we import linear_model from the sklearn library and prepare the training and test data. Then we create the predictive model by instantiating the LinearRegression algorithm, a class of the linear_model package, into an object called linear. Next, we train the algorithm with the fit function and use score to evaluate the model. Finally, we print the coefficients and make new predictions with the model.
# Import modules
from sklearn import linear_model
# Create training and test subsets
x_train = train_dataset_predictor_variables
y_train = train_dataset_predicted_variable
x_test = test_dataset_predictor_variables
# Create linear regression object
linear = linear_model.LinearRegression()
# Train the model with training data and check the score
linear.fit(x_train, y_train)
linear.score(x_train, y_train)
# Collect coefficients
print('Coefficient: \n', linear.coef_)
print('Intercept: \n', linear.intercept_)
# Make predictions
predicted_values = linear.predict(x_test)
In this case, the only thing that changes from linear regression to logistic regression is the algorithm we’re going to use. We changed LinearRegression to LogisticRegression.
# Import modules
from sklearn.linear_model import LogisticRegression
# Create training and test subsets
x_train = train_dataset_predictor_variables
y_train = train_dataset_predicted_variable
x_test = test_dataset_predictor_variables
# Create logistic regression object
model = LogisticRegression()
# Train the model with training data and check the score
model.fit(x_train, y_train)
model.score(x_train, y_train)
# Collect coefficients
print('Coefficient: \n', model.coef_)
print('Intercept: \n', model.intercept_)
# Make predictions
predicted_values = model.predict(x_test)
Once again, we change only the algorithm; this time to a decision tree. We use DecisionTreeRegressor for regression or DecisionTreeClassifier for classification:
# Import modules
from sklearn import tree
# Create training and test subsets
x_train = train_dataset_predictor_variables
y_train = train_dataset_predicted_variable
x_test = test_dataset_predictor_variables
# Create a Decision Tree Regressor object (for regression)...
model = tree.DecisionTreeRegressor()
# ...or a Decision Tree Classifier object (for classification)
model = tree.DecisionTreeClassifier()
# Train the model with training data and check the score
model.fit(x_train, y_train)
model.score(x_train, y_train)
# Make predictions
predicted_values = model.predict(x_test)
This time we use the GaussianNB algorithm:
# Import modules
from sklearn.naive_bayes import GaussianNB
# Create training and test subsets
x_train = train_dataset_predictor_variables
y_train = train_dataset_predicted_variable
x_test = test_dataset_predictor_variables
# Create GaussianNB object
model = GaussianNB()
# Train the model with training data
model.fit(x_train, y_train)
# Make predictions
predicted_values = model.predict(x_test)
In this case, we use the SVC class of the svm module. If it were SVR, it would be a regressor:
# Import modules
from sklearn import svm
# Create training and test subsets
x_train = train_dataset_predictor_variables
y_train = train_dataset_predicted_variable
x_test = test_dataset_predictor_variables
# Create SVM Classifier object
model = svm.SVC()
# Train the model with training data and check the score
model.fit(x_train, y_train)
model.score(x_train, y_train)
# Make predictions
predicted_values = model.predict(x_test)
The KNeighborsClassifier algorithm has a hyperparameter called n_neighbors that we can adjust to tune the algorithm.
# Import modules
from sklearn.neighbors import KNeighborsClassifier
# Create training and test subsets
x_train = train_dataset_predictor_variables
y_train = train_dataset_predicted_variable
x_test = test_dataset_predictor_variables
# Create KNeighbors Classifier object
model = KNeighborsClassifier(n_neighbors = 6) # default value = 5
# Train the model with training data
model.fit(x_train, y_train)
# Make predictions
predicted_values = model.predict(x_test)
# Import modules
from sklearn.cluster import KMeans
# Create training and test subsets
x_train = train_dataset_predictor_variables
y_train = train_dataset_predicted_variable
x_test = test_dataset_predictor_variables
# Create KMeans object
k_means = KMeans(n_clusters = 3, random_state = 0)
# Train the model with training data (unsupervised, so no target variable is needed)
k_means.fit(x_train)
# Make predictions
predicted_values = k_means.predict(x_test)
# Import modules
from sklearn.ensemble import RandomForestClassifier
# Create training and test subsets
x_train = train_dataset_predictor_variables
y_train = train_dataset_predicted_variable
x_test = test_dataset_predictor_variables
# Create Random Forest Classifier objects
model = RandomForestClassifier()
# Train the model with training data
model.fit(x_train, y_train)
# Make predictions
predicted_values = model.predict(x_test)
# Import modules
from sklearn import decomposition
# Create training and test subsets
x_train = train_dataset_predictor_variables
y_train = train_dataset_predicted_variable
x_test = test_dataset_predictor_variables
# Create PCA decomposition object (k is the desired number of components)
pca = decomposition.PCA(n_components = k)
# Create Factor Analysis decomposition object
fa = decomposition.FactorAnalysis()
# Reduce the dimensionality of the training set using PCA
reduced_train = pca.fit_transform(x_train)
# Reduce the dimensionality of the test set using the same PCA
reduced_test = pca.transform(x_test)
# Import modules
from sklearn.ensemble import GradientBoostingClassifier
# Create training and test subsets
x_train = train_dataset_predictor_variables
y_train = train_dataset_predicted_variable
x_test = test_dataset_predictor_variables
# Creating Gradient Boosting Classifier object
model = GradientBoostingClassifier(n_estimators = 100,
learning_rate = 1.0, max_depth = 1, random_state = 0)
# Training the model with training data
model.fit(x_train, y_train)
# Make predictions
predicted_values = model.predict(x_test)
Therefore, our job will be to turn each of these algorithm blocks into a project: defining a business problem, pre-processing the data, training the algorithm, adjusting hyperparameters, verifying results, and iterating through this process until we reach an accuracy satisfactory enough to make the desired predictions.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
In the last tutorial, we completed the Data Pre-Processing step, covering techniques for variable transformation and selection, dimensionality reduction, and sampling for machine learning.
Now we can move on to the next steps within the Data Science process. We'll apply the rest of the model-building process with various Regression Algorithms to understand how to use machine learning with the Python language. Later, we will discuss the Classification algorithms.
We will not go into detail about the algorithms. Instead, the purpose here will be to understand the detailed process of building the Machine Learning model, training models, model evaluation, and prediction.
See the Jupyter Notebook for the concepts we’ll cover on building machine learning models and my LinkedIn profile for other Data Science articles and tutorials.
We previously worked with classification algorithms, and now we will address the regression algorithms. Both classification and regression are subcategories of supervised learning. When we deliver input data and output data to the algorithm, we predict classes in the classification, and in the regression, we expect numerical values.
The process of constructing the model is the same regardless of the algorithm. What will change in essence is the algorithm used and the metric for evaluating the model, the rest of the process is standard. The techniques may be slightly different depending much more on the data set we are working with, but the process will be the same regardless of the algorithm.
Let’s create a predictive model that can predict the price of homes based on some variables (characteristics) on several homes in a Boston neighborhood. Then, based on a series of attributes, we will indicate a numeric value through regression.
We need metrics to evaluate the outcome of a regression model, and the choice of algorithm helps define which metric we use to measure its performance. Note that scikit-learn does not implement all of these performance metrics:
Mean Squared Error (MSE) — Average Square Error
Root Mean Squared Error (RMSE) — Square Root MSE
Mean Absolute Error (MAE) — Average Absolute Error
R Squared (R²) — Coefficient of Determination
Adjusted R Squared (R²) — R Adjusted
Mean Square Percentage Error (MSPE)
Mean Absolute Percentage Error (MAPE)
Root Mean Squared Logarithmic Error (RMSLE)
from sklearn.metrics import mean_squared_error
Maybe it’s the most accessible metric to understand. N is the number of observations in the dataset, the sum, Yi of the historical values that have already been collected, and y^ is the model’s prediction. We square it so we don’t have negative values.
The algorithm is fed X (the input predictor variables) and Y (the output target variable) during training. The algorithm learns the mathematical relationships and makes a prediction defined as ŷ.
After the prediction, we calculate the difference between the model's forecast and the historical value of the target variable. This calculation returns an error rate: the mean squared error. From the value of the MSE, we can verify whether or not the model performs well; the smaller the error, the better the model.
It is perhaps the simplest and most common metric for regression evaluation, and, on its own, probably the least informative. The MSE measures the average squared error of our predictions: for each point, it calculates the squared difference between the prediction and the actual value of the target variable, then averages those values.
The higher this value, the worse the model. Of course, this value will never be negative, since we square the individual prediction errors, and it would be zero for a perfect model.
# MSE - Mean Squared Error
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Divides the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
# Creating model
model = LinearRegression()
# Training model
model.fit(X_train, Y_train)
# Making predictions
Y_pred = model.predict(X_test)
# Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is::", mse)
Once we have the predictions in Y_pred, we call mean_squared_error, which applies the MSE. The function receives as parameters Y_test, the actual values of Y, and Y_pred, the values we calculated in the prediction step.
We have an MSE of 28.53 for this model. Ideally, we would apply pre-processing, variable transformation, and standardization to this dataset to reduce the MSE. The smaller the MSE, the better our model.
We use the MSE or RMSE depending on the type of interpretation we want for the final result. We should compare two models using the same metric.
To get the RMSE, calculate the square root of the MSE that we computed earlier with mean_squared_error.
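A minimal sketch, assuming mse holds the value returned by mean_squared_error in the example above:

import numpy as np
# RMSE is the square root of the MSE computed earlier
rmse = np.sqrt(mse)
print("The RMSE of the model is:", rmse)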
from sklearn.metrics import mean_absolute_error
In some situations, we will use absolute values, usually when we have outliers in the dataset. For this, we use the mean absolute error: the average of the absolute differences between forecasts and actual values. Thus, instead of the squared errors of the MSE, we calculate with absolute values.
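Written out, the standard MAE formula is:

MAE = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|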
The value of 0 indicates no error — the perfect prediction is very rare to happen. Our job as Data scientists is to reduce this rate as much as possible.
# MAE - Mean Absolute Error
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.linear_model import LinearRegression
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Divides the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
# Creating model
model = LinearRegression()
# Training model
model.fit(X_train, Y_train)
# Making predictions
Y_pred = model.predict(X_test)
# Metric Result
mae = mean_absolute_error(Y_test, Y_pred)
print("The MAE of the model is", mae)
With the Linear Regression, we have the MAE of 3.45, but be careful — we can’t compare the MAE to the MSE of 28.5! We should compare different models but with equal metrics.
from sklearn.metrics import r2_score
The advantage of R² is that it returns a coefficient that goes from 0 to 1. Thus, the higher the value, the better the model, unlike the error metrics above, where lower is better.
This metric reflects the level of accuracy of the predictions relative to the values observed through the r2_score.
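For reference, the standard definition that r2_score implements is:

R^2 = 1 - \frac{\sum_{i}\left(y_i - \hat{y}_i\right)^2}{\sum_{i}\left(y_i - \bar{y}\right)^2}

where \bar{y} is the mean of the observed values.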
# R²
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Divides the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
# Creating model
model = LinearRegression()
# Training model
model.fit(X_train, Y_train)
# Making predictions
Y_pred = model.predict(X_test)
# Metric Result
r2 = r2_score(Y_test, Y_pred)
print("The R2 of the model is:", r2)
from sklearn.linear_model import LinearRegression
Linear Regression is the most straightforward algorithm of all, where we have two main variants of the regression:
Simple Linear Regression: an input variable
Multiple Linear Regression: Many Input Variables
Regression assumes that the data follow a Normal Distribution, that the variables are relevant to the construction of the model, and that they are not collinear, that is, not variables with high correlation; it is up to the Data Scientist to feed the algorithm with really relevant variables.
# Linear Regression
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import LinearRegression
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Divides the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
# Creating model
model = LinearRegression()
# Training model
model.fit(X_train, Y_train)
# Making predictions
Y_pred = model.predict(X_test)
# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)
Here we deal with multiple linear regression since we have more than one predictor variable. We use the LinearRegression algorithm of the linear_model module, load the data, place the predictor variables in X and the target variable in Y, randomly divide the data into training and test sets, and then create the Linear Regression model and train it on the relationships in the training data.
Finally, we make the predictions. Once the predictions are made, we put them into the MSE metric to calculate the error rate of the forecasts.
With this Linear Regression algorithm, we have an MSE of 28.53 without pre-processing the data. Can we improve this error rate just by changing the algorithm? We could also apply normalization, standardization, variable transformation, variable selection, and cross-validation, but to keep the focus on the process, we will change only the algorithm.
from sklearn.linear_model import Ridge
Ridge is a Linear Regression algorithm where the Loss Function is modified to minimize the complexity of the model. A Loss Function is the cost function or error function.
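For reference, scikit-learn's Ridge minimizes the least-squares cost plus an L2 penalty on the coefficients, with alpha controlling the penalty strength:

\min_w \; \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_2^2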
When we build the model, we need to automate the process. To automate it, we use the Gradient Descent algorithm, and Gradient Descent needs a Cost Function to optimize; reducing the Cost Function consequently reduces the model's error rate.
Machine Learning is an optimization problem. We want to optimize the cost function, that is, reduce the error rate of the model. Therefore, we have the Linear Regression algorithm, the Optimization algorithm, and a Cost Function, which together train the model to find the best mathematical relationship between input and output data.
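As an illustrative sketch of this optimization loop (a simplification, not scikit-learn's internal implementation; the learning rate and iteration count are arbitrary example values), gradient descent on the MSE cost of a linear model looks like this:

import numpy as np

def gradient_descent(X, y, lr = 0.01, n_iter = 1000):
    # Start with all coefficients at zero
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(n_iter):
        y_pred = X.dot(w)                         # current predictions
        gradient = (2 / n) * X.T.dot(y_pred - y)  # gradient of the MSE cost
        w -= lr * gradient                        # step against the gradient
    return w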
# Ridge Regression
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Divides the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
# Creating model
model = Ridge()
# Training model
model.fit(X_train, Y_train)
# Making predictions
Y_pred = model.predict(X_test)
# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)
The MSE of the Ridge Regression model is 29.29. That is, we could not improve the model's performance; on the contrary, we made it slightly worse by changing the algorithm.
from sklearn.linear_model import Lasso
Lasso (Least Absolute Shrinkage and Selection Operator) Regression is a modification of linear regression, and like Ridge Regression, the Loss Function is modified to minimize model complexity.
The algorithm changes the penalty term (L1 instead of L2), which tends to shrink some coefficients exactly to zero and yields a simpler model.
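For reference, scikit-learn's Lasso objective replaces the squared (L2) penalty with the absolute-value (L1) penalty, which is what allows coefficients to shrink exactly to zero:

\min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \lVert w \rVert_1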
# Lasso Regression
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Lasso
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Divides the data into training and testing with the train_test_split()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
# Creating model
model = Lasso()
# Training model
model.fit(X_train, Y_train)
# Making predictions
Y_pred = model.predict(X_test)
# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)
The MSE with Lasso Regression was 33.39. By changing the algorithm, we increased the error rate. Again, we use the MSE to make this comparison.
from sklearn.linear_model import ElasticNet
ElasticNet is a form of regularized regression that combines the properties of Ridge and Lasso regression. The objective is to minimize the complexity of the model by penalizing it with a combination of the absolute values (L1) and the squares (L2) of the coefficients.
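In scikit-learn's formulation, ElasticNet mixes both penalties, with the l1_ratio parameter ρ controlling the balance between them:

\min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2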
Therefore, all the algorithms seen so far are variations of Linear Regression: the Ridge and Lasso modifications, and now ElasticNet.
# ElasticNet Regression
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import ElasticNet
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Divides the data into training and testing with train_test_split()
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
# Creating model
model = ElasticNet()
# Training model
model.fit(X_train, Y_train)
# Making predictions
Y_pred = model.predict(X_test)
# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)
from sklearn.neighbors import KNeighborsRegressor
We can use KNN for both classification and regression; KNeighborsRegressor is the algorithm for regression.
# KNN
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Divides the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
# Creating model
model = KNeighborsRegressor()
# Training model
model.fit(X_train, Y_train)
# Making predictions
Y_pred = model.predict(X_test)
# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)
The MSE with the KNN Regressor was 47.70; the cost function got much worse. KNN is a much simpler algorithm, and perhaps it is not optimal for this dataset.
from sklearn.tree import DecisionTreeRegressor
We can also use it in both classification and regression categories.
# Classification and Regression Trees
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Divides the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
# Creating model
model = DecisionTreeRegressor()
# Training model
model.fit(X_train, Y_train)
# Making predictions
Y_pred = model.predict(X_test)
# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)
The MSE with CART was 30.03, already a clear improvement over KNN's error. Generally, decision trees perform excellently.
from sklearn.svm import SVR
We use the SVC for classification and the SVR for regression. The rest is the same thing.
# Support Vector Machine
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.svm import SVR
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Divides the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.33, random_state = 5)
# Creating model
model = SVR()
# Training model
model.fit(X_train, Y_train)
# Making predictions
Y_pred = model.predict(X_test)
# Metric result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)
The MSE with SVR was 93.21, the worst performance so far. SVM is a much more complex algorithm, and because no processing was done on the data, it performs poorly on the raw data as-is.
We need to treat the data a little better first: a more complex algorithm that delivers good results requires much more pre-processing.
Optimizing a regression model follows the same rules for classification, with no significant difference.
All machine learning algorithms are parameterized, which means you can adjust predictive model performance by tuning parameters.
The goal is to find the best combination of parameters for each machine learning algorithm. This process is also called hyperparameter optimization, and scikit-learn offers two methods for automating it:
from sklearn.model_selection import GridSearchCV
This method methodically performs combinations between all algorithm parameters, creating a grid.
Let’s try this method using the Ridge Regression algorithm to see how we can optimize this algorithm in practice.
# Grid Search Parameter Tuning
# Import modules
from pandas import read_csv
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Setting the values that will be tested
alpha_values = np.array([1,0.1,0.01,0.001,0.0001,0])
grid_values = dict(alpha = alpha_values)
# Creating model
model = Ridge()
# Creating grid
grid = GridSearchCV(estimator = model, param_grid = grid_values)
grid.fit(X, Y)
# Print the result of the best parameter for the algorithm
print("Best Model Parameters:", grid.best_estimator_)
We create a grid with the parameters we want to try and put them in a dictionary called grid_values. We instantiate the Ridge model and then call GridSearchCV to test the combinations of parameters.
The output will be the best parameters for the Ridge algorithm with GridSearchCV.
from sklearn.model_selection import RandomizedSearchCV
This method generates samples of algorithm parameters from a uniform random distribution for a fixed number of iterations.
A model is built and tested for each combination of parameters.
This example shows that an alpha value very close to 1 presents the best results.
# Random Search Parameter Tuning
# Import modules
from pandas import read_csv
import numpy as np
from scipy.stats import uniform
from sklearn.linear_model import Ridge
from sklearn.model_selection import RandomizedSearchCV
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Setting parameters
grid_values = {'alpha': uniform()}
seed = 7
# Creating model
model = Ridge()
iterations = 100
rsearch = RandomizedSearchCV(estimator = model,
param_distributions = grid_values,
n_iter = iterations,
random_state = seed)
rsearch.fit(X, Y)
# Result
print("Best Model Parameters:", rsearch.best_estimator_)
Therefore, we compared the models according to the metrics and chose the one that has the best value.
import pickle
# Saving result
# Import modules
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
import pickle
# Loading data
file = 'http://lib.stat.cmu.edu/datasets/boston'
columns = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO','B', 'LSTAT', 'MEDV']
data = read_csv(file, delim_whitespace = True, names = columns)
array = data.values
# Separating the array into input and output components
X = array[:,0:13]
Y = array[:,13]
# Setting parameters
test_size = 0.35
seed = 7
# Divides the data into training and testing
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = test_size, random_state = seed)
# Creating model
model = Ridge()
# Training model
model.fit(X_train, Y_train)
# Saving model
file = 'final_regression_model.sav'
pickle.dump(model, open(file, 'wb'))
print("Model saved!")
# Loading model
final_regressor_model = pickle.load(open(file, 'rb'))
print("Model saved!")
# Making Predictions
Y_pred = final_regressor_model.predict(X_test)
# Metric Result
mse = mean_squared_error(Y_test, Y_pred)
print("The MSE of the model is:", mse)
Here we had an overview of the machine learning process, that is, building the models; the focus was not to detail how each algorithm works. Instead, our goal was to understand the process.
The data scientist’s job is to master as much as possible everything we’ve seen from pre-processing, model selection, performance metrics, model optimization, and forecasting.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
See the Jupyter Notebook for the concepts we'll cover on building machine learning models and LinkedIn for other Data Science and Machine Learning tutorials.
Ensemble, in general, means a group of things that are usually seen as a whole.
We have three main Ensemble categories:
Bagging: used for building multiple models (typically of the same type) from different subsets of the training dataset.
Boosting: used for constructing multiple models (typically of the same type), where each model learns to correct the errors generated by the previous model within the sequence of created models.
Voting: used for building multiple models (typically of different types), where simple statistics (such as the average) combine the predictions; a sketch of this category follows below.
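As a minimal sketch of the third category, assuming x_train, y_train, and x_test placeholders like those in the templates above, scikit-learn's VotingClassifier combines models of different types and lets them vote on the final prediction:

from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
# Three models of different types voting on each prediction
voting = VotingClassifier(estimators = [
    ('lr', LogisticRegression()),
    ('dt', DecisionTreeClassifier()),
    ('nb', GaussianNB())],
    voting = 'hard') # 'hard' = majority vote on the predicted classes
voting.fit(x_train, y_train)
predicted_values = voting.predict(x_test)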
Ensemble methods are proven to be powerful methods to improve the accuracy and robustness of supervised, semi-supervised and unsupervised solutions.
Previously we saw a type of Ensemble method that is considered quite sophisticated, gradient boosting. The Gradient Boosting method unites the Boosting technique and gradient descent to predict the residuals of each of the base estimators. In other words, the algorithm creates a sequence of base estimators. Then, it indicates the residue for each of them so that the following estimator is more accurate and can reduce the residuals successively.
In the case of Gradient Boosting, even in classification models, the base estimators are regression trees. From now on, we will build a classifier with gradient boosting. After all, it is a potent model, and it contains some machine learning concepts. We will cover Gradient Boosting concepts of overfitting, regularization, tunning hyperparameters, and Stochastic Gradient Boosting.
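As an illustrative sketch of this residual-fitting idea for regression with squared loss (a simplification, not scikit-learn's actual implementation):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def simple_gradient_boosting(X, y, n_estimators = 100, learning_rate = 0.1):
    # Start from a constant prediction: the mean of the target
    prediction = np.full(len(y), y.mean())
    trees = []
    for _ in range(n_estimators):
        residuals = y - prediction            # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth = 3)
        tree.fit(X, residuals)                # each new tree learns the current residuals
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return trees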
First, let’s create a Gradient Boosting Classifier from a regressor base estimator.
We start by importing the make_hastie_10_2 function to create a mass of data to exemplify the algorithm, the train_test_split function of the sklearn model_selection module to divide the data into training and test sets, and finally, the GradientBoostingClassifier algorithm.
from sklearn.datasets import make_hastie_10_2
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
X, y = make_hastie_10_2(n_samples = 5000)
X_train, X_test, y_train, y_test = train_test_split(X, y)
We create the GradientBoostingClassifier classifier with 200 base estimators and a maximum depth of 3 for each tree that serves as a base estimator:
est = GradientBoostingClassifier(n_estimators = 200,max_depth = 3)
We trained the model by presenting the training data to the classifier:
est.fit(X_train, y_train)
# Output: GradientBoostingClassifier(n_estimators=200)
pred = est.predict(X_test)
We scored the model with the accuracy metric for the classifier:
acc = est.score(X_test, y_test)
print('Accuracy: %.4f' % acc)
est.predict_proba(X_test)[0]
# array([0.82638636, 0.17361364])
When we print the probabilities with the predict_proba function, for the record at index [0], our model predicts that the first class has an 82% chance of occurring, while the second class has a 17% chance.
In practice, our model believes there is an 82% probability that index record 0 belongs to one class and a 17% probability that it belongs to the other.
GradientBoostingClassifier(n_estimators=200)
Gradient Boosting Classifier default parameters
If we want to view each of the gradient boosting base estimators, we call the estimators_ attribute of the est object. Here we look at the first estimator out of the 200 we have in the gradient boosting classifier:
est.estimators_[0, 0]
DecisionTreeRegressor(criterion='friedman_mse', max_depth=3, random_state=RandomState(MT19937))
Although we have a classifier at the end, GradientBoosting was created internally in a series of Regression Trees as the base estimator. We’re learning from the errors of the previous models.
The most critical hyperparameters when working with GradientBoosting, regardless of whether you use the classifier or the regressor, are typically n_estimators (the number of base estimators), learning_rate (how much each tree contributes), max_depth (the depth of each base tree), and subsample (the fraction of samples per tree, which gives Stochastic Gradient Boosting).
We can adjust these parameters according to the need for the business problem or the dataset.
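For example, one common way to tune n_estimators is to monitor the error after each additional base estimator with staged_predict and keep the best point. A minimal sketch, assuming the est model and the X_test / y_test split from the example above:

import numpy as np
# Error rate of the ensemble after each boosting stage
errors = [np.mean(y_test != y_pred) for y_pred in est.staged_predict(X_test)]
best_n = np.argmin(errors) + 1
print('Best number of estimators:', best_n)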
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
This article will cover one of the most advanced and most widely used algorithms in analytical applications. This is an extensive subject, as we have several algorithms and various techniques for working with decision trees.
On the other hand, these algorithms are among the most powerful in Machine Learning and are easy to interpret. So, let’s start by defining what decision trees are and their representation through machine learning algorithms.
For decision tree learning models, we will study some algorithms such as C4.5, C5.0, CART, and ID3. In addition, there are some specialized types of decision trees, which we will cover later in this chapter.
The main specialization of decision trees is RandomForest, which is nothing more than a collection of decision trees. We can use RandomForest for attribute selection, i.e., we can use decision trees for Machine Learning models themselves and apply feature selection techniques to prepare our dataset for other machine learning algorithms.
Finally, we will create models, make predictions, study the parameters and pre-processing details of decision trees, and interpret the results of predictive models.
When creating decision trees, we can have trees with lots of branches and leaves, and at some point, we will have to stop the construction of the tree or make adjustments to reduce the number of decision points in the predictive model.
This machine learning technique is easy to interpret; that is, we can quickly explain the result of a Decision Tree model, RandomForest, or even an Ensemble method, unlike techniques such as Artificial Neural Networks or Deep Learning, whose results are challenging to interpret.
Decision Trees are known as one of the most powerful and widely used machine learning modeling techniques. Decision Trees can naturally induce rules that can be used for data classification or to make predictions.
A decision tree is a decision support tool. Graphically, it has the shape of an upside-down tree, where the root is at the top and the leaves are at the bottom.
The concept behind the decision trees is straightforward. First, we define the ruleset, and for each rule, there is a decision that we must make.
We do this in our lives during the day, where the big challenge is making the computer understand all these rules and automatically decide one way or another.
Decision trees classify data instances by traversing a tree structure from root to leaves. For a decision to occur, the flow starts from the root, which is the starting point. From there, conditions or condition checks, called nodes, determine the next step of the flow. Finally, the decision itself takes place at the leaves.
Translating all this into algorithm language, the nodes represent the attributes, the branches represent the values the features can take, and each condition check tests an attribute's value.
Therefore, the root and the nodes are the variables to evaluate, the branches that bind the nodes are the allowable values or paths to follow in the decision-making process, and the leaves are the outputs.
Decision Trees can be used for classification and regression issues. We have an algorithm that allows us to build classification models and regression models at our disposal — the decision will depend on the target variable we are trying to predict.
When we create classification and regression models with decision trees, we create a classification tree or regression tree depending on the target variable we want to predict. In this way, we apply the decision trees algorithm, RandomForest, or a combination of the Ensemble method.
The key to building machine learning models based on decision trees is knowing how to split attributes. For example, consider a table with input data and output data representing a supervised learning problem, with characteristics A and B. Each combination of attributes has a corresponding output.
In this sense, it works like a logic gate: it receives two input signals and returns one output for each combination.
To create a decision tree representing this table, you would choose either of the inputs for the root node and create the branches according to the allowable values. At the end of each branch, we would have the other input and, finally, the leaves.
Why use attribute A instead of attribute B to use as a root? How many different trees can we build? These and other questions are crucial for building decision trees. Therefore, we have several possible combinations of these attributes until we get to the predictions. That’s the decision tree model.
We have techniques for deciding which attribute to start with and in which sequence to place the other attributes; a small sketch of the first two follows the list. We can use:
Information Gain and Entropy
Gini Index
Gain Ratio
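A minimal sketch of how the first two criteria are computed from the class labels at a node (an illustration, not a library implementation):

import numpy as np

def entropy(labels):
    # Entropy: -sum(p * log2(p)) over the class proportions
    _, counts = np.unique(labels, return_counts = True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    # Gini index: 1 - sum(p^2) over the class proportions
    _, counts = np.unique(labels, return_counts = True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)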
Building a Decision Tree is not as simple as it sounds. First, we have a set of data that we must divide by branches of the Trees to have the best combinations and reach an ideal result.
While interpreting a Decision Tree is easy, putting together this whole process is not so simple. Therefore, we need to start by studying information gain, Entropy, the Gini Index, and applying Pruning.
It is common to make decisions without knowing why we are deciding in a certain way. It is also common to check other people’s decisions and not know what has guided them.
We need a formal method for constructing decision trees that model, that is, represent a specific inferential process. To build the trees, we need a data set and a set of rules.
From a set of rules, the construction of the tree is immediate. It is enough to place the restrictions in a hierarchical form, that is, in the order in which they are applied, and from the variables involved, the tree that would be its graphic representation is drawn. Decision tree algorithms do this whole process.
What Decision Tree algorithms do is define a set of rules, using information gain, entropy, and the Gini index, and apply them to the data to construct a sequence that represents the inference supporting the algorithm's decision-making, whether classifying a given data point or predicting a numeric value in regression problems. Therefore, with Decision Trees, we derive a set of rules and apply them to the data.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
We will discuss what solutions are available in the market for building machine learning models. We can build machine learning models in two main ways:
Developing the entire algorithm from scratch using a programming language.
Using a ready-made framework, where key algorithms are already implemented.
This theme causes controversy in the Data Science community because many prefer to develop the algorithm from scratch and others prefer a ready-made framework. Neither option is better than the other; it always depends on the ultimate goal.
To develop an algorithm from scratch, we need to have much experience in Machine Learning, know a lot of mathematics and statistics, and above all, be an excellent programmer — I am not.
Once we know how to masterfully build the model and perform the entire process of data preprocessing, algorithm training, model evaluation, and making new predictions, then, with more experience, we can start considering development from scratch.
When we talk about the programming language, we have some that are the most used among dozens of languages available on the market:
Python
R
Scala
Java
JavaScript
Go
C++ / C#
Among the frameworks with key algorithms already implemented, the main options are:
Scikit-learn (Python)
Caret (R)
TensorFlow (Python, R, Java, C++)
Apache Mahout (Python, Java)
Spark MLlib (Scala, Java, Python, R)
H2O (Java, Python)
Weka (Java, Python)
PyTorch, CNTK, MXNet (Python, C++, Java)
There is no escaping a programming language. Even with the algorithms ready, the frameworks require creating a model, a script, or a set of tasks using the language supported by the framework.
The Python language offers two main advantages over all other solutions. First, it is a commonly used language and works with most of these solutions. Second, Python has one of the most potent free machine learning solutions: Scikit-learn.
Scikit-learn is a complete framework: a library that has much of what is needed in Machine Learning. It was made available in 2007 and has since become one of the most popular open-source tools for Machine Learning. It covers:
Classification;
Regression;
Dimensionality Reduction;
Clustering;
Selection of Attributes;
Data Pre-Processing;
Evaluation of Models.
This library is designed as an extension of the SciPy library and is based on NumPy and matplotlib. Through NumPy, we can efficiently operate on large, multidimensional arrays of data. So we take our CSV, txt, or database table, load it into a multidimensional array with NumPy, and process it with Scikit-learn.
Through matplotlib, we can use data visualization tools; SciPy provides support for scientific computing, and Pandas includes the data structures used to build predictive models.
This whole set of packages mentioned above makes up the PyData Stack — Python data stack because they form a complete data analysis platform entirely free of charge.
The Scikit-learn library has excellent descriptive and conceptual documentation, is intuitive, offers access to multiple datasets without the need to import them to try the library, has a BSD license for commercial purposes, and is very reliable; it can be used in professional projects to create models directly from Scikit-learn.
These were some of the many reasons for choosing Scikit-learn for machine learning applications with Python.
We are in the midst of a revolution provided by Big Data. According to recent estimates, about 2.5 quintillion bytes are generated per day across the planet. Much of this data is not being effectively used because of the limited human capacity to analyze this exponential flood of data.
Much of this data and its relationships is beyond our power of understanding; that is, someone or something needs to do the work for us. That’s exactly why Machine Learning is becoming increasingly popular.
For the first time in history, we have a lot of data available and processing capacity, which allows us to explore this data without the need for human intervention — we train an algorithm, build the model, and from there, it will do the work for us.
Considering a career in Machine Learning and learning everything possible on this subject is one of the smartest decisions we can make in our professional career. Soon, Machine Learning will be present in all business areas and all activities of our lives. The professionals who know how to work with these technologies will certainly be the creators of this new world.
Companies such as Google, Facebook, Amazon, Apple, IBM, Microsoft, and the other world technology giants that invest billions of dollars in research and development are devoting their efforts to increasingly offering machine learning-based solutions.
Machine Learning brings so many new possibilities that we will soon be dependent on Machine Learning in the same way we are already dependent on computers and the internet.
Autonomous cars, the internet of things, machine learning: technology is advancing ever faster, and a not-so-distant future foresees computers that act and react like humans. That is the proposal of cognitive computing: intelligent machines that learn behaviors, languages, and regionalisms to interact according to what they can interpret from the speech of their human interlocutors.
This artifact is powerful in a market where more and more consumers expect to be served better and preferably in a short period of time. Cognitive computing is based on machine learning concepts, so these are not concepts of a distant reality; companies are already using cognitive computing solutions.
The Machine Learning process begins with defining a business problem; then comes the routine of collecting data, exploratory analysis, and preprocessing, selecting variables to improve model performance, and choosing among a wide range of classification, regression, or ensemble algorithms (for more complex solutions), using specific libraries such as scikit-learn, the main Python library for machine learning.
Thank you.
The idea behind an Ensemble method is to use a group of models to enhance the accuracy of predictions. Then, use an evaluation strategy by averaging the scores with the Bagging method or assigning weights to each estimator’s outputs with the Boosting method.
Therefore, an Ensemble method aims to group machine learning models and achieve a higher level of accuracy. However, great power requires great responsibility, and we will have to deal with a more significant number of model parameters. Before, we had only the parameters of a single machine learning model, and now we have groups of models that together represent an Ensemble model.
See the Jupyter Notebook for the concepts we’ll cover on building machine learning models, my Medium and LinkedIn for other Data Science and Machine Learning tutorials.
We have to work now with the parameters of the base estimator plus the parameters of the Ensemble model — finding the best combination of these parameters is not a simple task; we will need to work with hyperparameter optimization strategies.
Before anything else, we need to discern what parameters are and what are hyperparameters.
Every machine learning model has input parameters that allow the customization of the model. These parameters are also called hyperparameters.
In practice, the terms parameters and hyperparameters are often used interchangeably, and part of our job as Data Scientists is to find the best combination of hyperparameters for each model.
Functions in programming represent machine learning algorithms, and each function has customization parameters, which are precisely what we call hyperparameters. It is also common for people to refer to the model's coefficients (found at the end of training) as parameters.
In short, when we train a Machine Learning model, the result of model training is a set of numbers! These numbers are the coefficients that are commonly called parameters.
So, we have parameters to feed and increment a function, and that’s what we call algorithm input hyperparameters. Then, after we train the model, the result of the model is a set of numbers that are the parameters or output coefficients.
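A quick illustration of the distinction, assuming X_train and Y_train are already defined as in the regression examples: fit_intercept is an input hyperparameter we set, while coef_ and intercept_ are the parameters (coefficients) found by training:

from sklearn.linear_model import LinearRegression
# Hyperparameter: set by us before training
model = LinearRegression(fit_intercept = True)
model.fit(X_train, Y_train)
# Parameters (coefficients): produced by the training process
print('Learned coefficients:', model.coef_)
print('Learned intercept:', model.intercept_)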
Our goal is to optimize the hyperparameters to achieve satisfactory results in the model predictions and to achieve at least the proposed accuracy to solve the problem in question assertively.
estim_base = KNeighborsClassifier(
algorithm='auto',
leaf_size=30,
metric='minkowski',
metric_params=None,
n_jobs=None,
n_neighbors=5,
p=2,
weights='uniform')
BaggingClassifier(
base_estimator=estim_base,
bootstrap=True,
bootstrap_features=False,
max_features=0.5,
max_samples=0.5,
n_estimators=10,
n_jobs=None,
oob_score=False,
random_state=None,
verbose=0,
warm_start=False)
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
Ensemble methods allow us to work with groupings of decision trees and of other machine learning models, creating a single method with multiple base estimators: a much more powerful super algorithm to make our best predictions.
See the Jupyter Notebook for the concepts we’ll cover on building machine learning models and my Medium for other Data Science articles and tutorials.
Bagging is used for building multiple models (typically of the same type) from different subsets in the training dataset. Therefore, Bagging is an ensemble method that allows us to create multiple models of the same kind.
As one of the fundamental parameters, we have base_estimator, an individual model that we put into the Bagging method. If we do not set the base_estimator parameter for the BaggingClassifier algorithm, scikit-learn uses a Decision Tree:
For example, we can create ten decision trees and train them all together. In the end, we will have a single Bagging model; and it need not use decision trees, since any other type of machine learning algorithm can serve as the base.
A Bagging classifier is an ensemble meta-estimator that fits copies of the base classifier, each on a random subset of the original dataset. It then aggregates their predictions (by vote or by average) to form a final prediction. That is the definition of the ensemble method.
Such a meta-estimator can typically be used to reduce the variance of an estimator (for example, a decision tree), introducing randomization into its construction procedure and making an ensemble (set) from it.
It is advantageous to build a model that groups several other models because this reduces some problems, although it brings issues of a different type. We want to create a model capable of making predictions with the highest level of accuracy.
First, we import some packages and functions. The BaggingClassifier is the function that contains the Bagging algorithm, present in the ensemble package of the sklearn library.
As a base estimator, we will not use the default decision tree but rather the KNN method through the KNeighborsClassifier algorithm;
We import load_digits, the function that loads the dataset we will use;
The scale function will allow you to pre-process the data by placing it on the same scale;
We will apply cross_val_score, i.e., cross-validation to be able to calculate the accuracy of the model;
The warnings import disables frequent version alert messages.
# Importing libraries
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_digits
from sklearn.preprocessing import scale
from sklearn.model_selection import cross_val_score
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
The load_digits function contains a set of digits that were scanned and grouped into this dataset. The load_digits function will be loaded into the digits object:
# Loading data
digits = load_digits()
Let’s view the digits through the matplotlib library, format the chart output with plt.gray, point an image of the set, and call the image with the plt.show:
import matplotlib.pyplot as plt
%matplotlib inline
plt.gray()
plt.matshow(digits.images[5])
plt.show()
<Figure size 432x288 with 0 Axes>
See that we have a grid, an array that is an image with rows and columns. Each of these squares represents a pixel with different colors.
Where a square is black, there is no part of the digit in that region. Where we have white or shaded squares, part of the digit is present.
What we want, after all, is to take the multi-digit dataset and present it to a Machine Learning model that will learn and be able to make predictions.
Next, we'll put the data on the same scale. If we look closely, each square has a different number: each square represents a grayscale value, which is the format of black-and-white images.
However, we cannot deliver this way to the algorithm — the algorithm does not know this. Therefore, we take all the values and put them on the same scale — a prevalent preprocessing task.
# Putting all data on the same scale
data = scale(digits.data)
Here we divide the input and output data, where X represents the attributes, the pixels of the image, and y represents the label, that is, the indication that the set in question is the number 5.
# Predictor variables and target variable
X = data
y = digits.target
So, we have several pixels that together form a label.
We create our classifier with the Bagging method. The BaggingClassifier algorithm will create 10 KNN models using samples from our dataset. We know there will be ten models because the default value of n_estimators is ten.
# Classifier Construction
bagging = BaggingClassifier(KNeighborsClassifier(),max_samples = 0.5, max_features = 0.5)
bagging
That way, we have our classifier with Bagging — a fairly simple procedure. Below we can go through the parameters of the KNN:
BaggingClassifier(base_estimator=KNeighborsClassifier(
algorithm='auto',
leaf_size=30,
metric='minkowski',
metric_params=None,
n_jobs=None,
n_neighbors=5,
p=2,
weights='uniform'),
bootstrap=True, bootstrap_features=False, max_features=0.5, max_samples=0.5, n_estimators=10, n_jobs=None, oob_score=False, random_state=None, verbose=0, warm_start=False)
As Data Scientists, we can adjust the parameters of both the KNN algorithm above and the BaggingClassifier algorithm itself. That is, until now we had to worry only about the parameters of one model; now we care about the parameters of the base model and the ensemble method simultaneously.
Next, with the classifier ready, we will do the training using cross-validation to do various tests with the bagging model and multiple subsets of X and y.
In the end, we may have a slightly more accurate assessment of the classification level of the model.
?cross_val_score
We call the cross_val_score. We pass the bagging classifier that we created earlier and X and y.
# Model score
scores = cross_val_score(bagging, X, y)
Since we have multiple score values, we average these scores:
# Average score
mean = scores.mean()
print(mean)
0.9449045416765015
Therefore, we achieved roughly 94% accuracy under cross-validation using the Ensemble Bagging method with the KNN base_estimator for classification.
Building models with the Ensemble Method is not much different from building an individual model. The difference is that we only have more parameters to adjust and create the best possible version of the model.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
The algorithms we will see are supervised learning. First, we present the input data and the output data, and the algorithm learns the relationship between input and output, creating the model. When we present new input data, the algorithm will be able to predict the results.
Objectively, when we work with supervised learning, we have two families of algorithms: regression algorithms and classification algorithms.
We use regression algorithms when our goal is to predict a numeric value — predicting the value of a house in a given neighborhood, or forecasting the sales value for the next month.
In other situations, we want to predict a category. For example, whether or not a person has a disease, that is, a final binary answer {yes, no} or whether or not a person should receive credit from the bank based on their history as a customer, the output is once again a simple {yes, no}.
Therefore, in most of our machine learning tasks, we will work with Regression or Classification algorithms. When we are at a slightly more advanced level, we work with unsupervised learning and reinforcement learning in particular cases.
Let's start with one of the simplest machine learning algorithms that exist. Some even say that it is not necessarily a machine learning algorithm but simply a statistical analysis task, the Linear Regression algorithm, whose goal is to predict a numerical value.
Linear Regression is used to estimate real values: the cost of a house, the number of phone calls, total sales, etc. These predictions are made based on continuous numeric variables. We establish a relationship between dependent and independent variables, fitting the best line between these variables.
The independent variable is located on the X-axis, and the dependent variable is on the y axis. Our goal is to find the regression line that allows us to make predictions. That is, when we have the X input values (number of rooms in a house), we will look on the red line for the corresponding value of y (house price) — based on the value of the independent variable, we can predict the value of the dependent variable.
Linear Regression can be of two types: simple linear regression, with only one independent variable, or multiple linear regression, with more than one independent variable.
Another fundamental algorithm in Machine Learning is Logistic Regression, especially for those who want to work with Artificial Intelligence. Careful, don't be fooled by the name! Although it takes the name regression, it is a classification algorithm.
In this chart, we have data points in blue and in red. The red points represent 0, and the blue points represent 1, according to the legend. We want to find the line that separates the data into two categories. Fed with new input data, the model can then make the classification, that is, decide whether the input belongs to the blue category or the red category.
Another vital machine learning algorithm is the decision tree. Starting from the first variable, the algorithm makes successive decisions until it can perform the classification at the end. We have a set of input variables and an output variable that, in this case, represents two possibilities {yes, no}. The decision tree will find the best way to reach the final classification.
Decision Tree is a robust algorithm and, in general, has outstanding accuracy and is very easy to understand.
If we have a tree, we can also have a forest. We take several decision tree algorithms and make them compete with each other. Then we feed them with input data. These various decision tree algorithms work together to find the best output. Finally, a vote is taken to select the best possible solution among all the algorithms that worked in parallel.
In short, this is Random Forest — one of the most accurate machine learning algorithms. Furthermore, beyond building the model, we can also use this algorithm to select variables for the model itself.
It's a real work of art translated into an accurate algorithm. The Support Vector Machines (SVM) algorithm can be used for regression or classification problems, and we can create a classifier model or a regressor model.
Generally, we work with SVM when we have nonlinearly separable data. For example, imagine that we have a dataset with two variables and a linear separation between them, that is, clear boundaries between the classes. If this exists in the data, we can use linear regression without significant difficulties; the regression can do this linear division.
However, in some situations, the variables are not linearly separable. So, the SVM algorithm creators have created another dimension in the data — it's a bit abstract, but it's genius.
The idea is as if we had two variables and created a third dimension. We lift the data into the higher dimension to separate it even when it is not linearly separable, then bring it back to the original dimension and present the final classifier.
Another famous algorithm is Naive Bayes, based on Bayes' theorem, which assumes independence between the predictor variables. Naive Bayes is a probabilistic algorithm; that is, it draws on the essence of probability theory. It usually achieves pretty good accuracy, but it is hard to tune.
Another algorithm, KNN, is also considered one of the simplest in Machine Learning. It can be used for classification tasks or regression tasks, depending only on the parameters used when constructing the model.
It simply measures the Euclidean distance between data points and, from the nearest neighbors, assigns the data point to a class or predicts its final value.
This algorithm is so simple that it is challenging to consider it Machine Learning, but in the end it does what it proposes to do — creates a model and makes predictions.
The algorithms here are unsupervised learning. We present only input data, and the data are grouped by similar characteristics into clusters.
K-Means is the main representative algorithm of unsupervised learning. We deliver a large volume of data to the algorithm. The algorithm doesn't know what the output is — neither does the data scientist — and on its own it performs a cluster segmentation, which is ideal for customer segmentation.
We have an unsupervised learning category that is algorithms for dimensionality reduction. The Principal Component Analysis (PCA) is one of the prominent algorithms.
Imagine that we have a dataset with 300 columns, that is, 300 variables. Naturally, it would not be easy to train an algorithm with so many variables. So instead, we reduce dimensionality through principal components, where each component encapsulates the information of a group of variables.
Therefore, instead of training an algorithm with 300 variables, we train with 3, 4, or 5 components — each component is a vector that summarizes the information of a group of variables.
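As a brief illustration of the idea, here is a minimal PCA sketch with scikit-learn — the data and the number of components are made up for demonstration:
# Reducing 300 variables to 5 principal components
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 300)  # illustrative data: 100 examples, 300 variables
pca = PCA(n_components=5)
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 5)
print(pca.explained_variance_ratio_)  # share of information kept by each component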
These are the algorithms of choice in competitions. They are a little more complex to train and require more computational capacity, but they achieve very high precision.
One of the leading representatives of this category is XGBoost — an algorithm that consumes a lot of computing power but achieves excellent results.
The objective was to present a brief general summary of standard algorithms in Machine Learning.
Thank you.
In this article, we will cover the most commonly used clustering algorithm, the K-Means algorithm. The K-Means algorithm has been studied for several decades and serves as the basis for many other more sophisticated and better-designed clustering techniques.
If we understand the simple principles behind K-Means, we will be able to learn any other clustering algorithm. K-Means bears much resemblance to the KNN algorithm, a supervised learning algorithm to which we deliver both input and output data, whereas K-Means is a clustering algorithm to which we provide only input data. The inner workings of the two algorithms are very similar.
The K-Means algorithm is a partitional clustering algorithm that assigns each of the n data examples to one of the k clusters, where k is a number previously determined by the data scientist.
The goal is to minimize the differences within each cluster and maximize the differences between the clusters. Unless the values of k and n are very small, it is not feasible to compute the optimal clusters over all possible combinations of examples. Instead, the algorithm uses a heuristic process that finds locally optimal solutions; that is, it begins with an initial guess for the cluster assignments and then slightly modifies the assignments to see whether the changes improve homogeneity within the clusters.
When we begin training the clustering algorithm, it starts by guessing random cluster centroids and gradually discovers where the data points in each group lie. At each training step, it calculates the distance from the data points to the centroids and, if necessary, moves the centroid of each cluster, repeating the process until the data points are assigned to the groups according to the previously defined value of k.
Creating models with the K-Means algorithm essentially involves two phases. First, examples are assigned to an initial set of k clusters; then the assignments are updated by adjusting the cluster boundaries according to the examples currently in each group. This update-and-assignment process occurs multiple times until the changes no longer improve the cluster fit. At that point, each cluster finds its boundary and the process terminates.
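To make the assign-and-update idea concrete, here is a minimal sketch with scikit-learn's KMeans — the dataset and the value of k are illustrative:
# K-Means on illustrative data with k = 3
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)  # 300 points around 3 centers
kmeans = KMeans(n_clusters=3, random_state=42)  # k is defined by the data scientist
labels = kmeans.fit_predict(X)  # runs the assignment/update iterations
print(kmeans.cluster_centers_)  # final centroids
print(labels[:10])              # cluster assigned to the first 10 points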
The operation of K-Means is quite simple and has seen no significant novelty since its conception; it is Big Data that is revolutionizing the whole process in Machine Learning. The K-Means algorithm was already used in the past, of course, but we didn't have as much data as we have today to show its full value.
As the volume of data increases, K-Means can do more accurate and more intuitive work simply by calculating mathematical distances between data points and assigning them to clusters. The algorithm seeks the shortest possible distance between the data points within a group and the greatest possible distance between the groups.
Like KNN, K-Means treats attribute values as coordinates in a multidimensional space of attributes. For our example, there are only two attributes:
We can represent the attribute space as a two-dimensional scatter diagram. The K-Means algorithm starts by choosing K points in the attribute space to serve as cluster centers. These centers are like catalysts that stimulate the remaining examples to fall or not into a given cluster. Often points are chosen by selecting K random examples from the training dataset.
Since we hope to identify 3 clusters in the example above, we will work with K equal to 3, so three randomly selected data points will serve as centroids, one per cluster. The colored shapes indicate these points. Although the 3 cluster centers here are spaced well apart, this is not always the case. Since they are randomly selected, the three centers could just as easily have been three adjacent points.
Because the K-Means algorithm is susceptible to cluster centers' initial position, the random chance can substantially impact the final set of clusters, meaning how we initialize training directly affects the model's outcome.
We can modify the K-Means algorithm to use different methods for choosing the initial centers. For example, we could select random values occurring anywhere in the attribute space rather than just selecting from the values observed in the data.
Another option is to skip this step completely. By randomly assigning each example to a cluster, the algorithm can advance immediately to the update phase. Each of these approaches adds a particular bias to the final set of clusters; in practice, the Data Scientist has a direct influence on the outcome of any Machine Learning model.
Therefore, how we initialize the training influences the final result of the algorithm. Furthermore, the chosen number of clusters k and all the preprocessing applied to the data also influence the final outcome, as for any machine learning model.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
Fraud has been a critical issue for financial services institutions. And as global transactions continue to increase, so does the danger. Fortunately, Artificial Intelligence has enormous potential to reduce financial fraud.
With automated fraud detection tools getting more intelligent and machine learning becoming more powerful, the outlook should improve exponentially. Therefore, we’re going to investigate Artificial Intelligence and the Future of Financial Fraud Detection.
The McAfee security firm estimates that cybercrime currently costs the global economy about $600 billion, or 0.8% of global gross domestic product. One of the most prevalent forms of cybercrime is credit card fraud, exacerbated by the growth of online transactions. In addition, the speed with which financial losses can occur due to credit card fraud makes intelligent fraud detection techniques increasingly important.
Due to the availability of large volumes of customer data, along with up-to-date transactional data as transactions occur, Artificial Intelligence can be used to efficiently identify patterns of irregular credit card behavior for specific customers.
Cybersecurity companies can focus on implementing Deep Learning to create fingerprints of users and transactions.
For example, by identifying the relationships between data points and reducing them to their principal components, the data can be grouped using mathematical models into clusters (groupings by similar characteristics), so that a user's pattern of behavior can be compared with that of other users in the same cluster at any given time.
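A minimal sketch of that idea — scale the data, reduce it to principal components, then group it into clusters — assuming a purely illustrative feature matrix in place of real transaction data:
# Hypothetical behavior-clustering pipeline for transactions
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

transactions = np.random.rand(1000, 20)  # stand-in for transaction features (amount, hour, ...)
scaled = StandardScaler().fit_transform(transactions)
components = PCA(n_components=3).fit_transform(scaled)
clusters = KMeans(n_clusters=5, random_state=0).fit_predict(components)
# A transaction far from its cluster's typical behavior could then be flagged for review.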
An added advantage of a more sophisticated model is its potential ability to use a wide variety of data points (as Mastercard has already done) to continuously adjust different customers and transactions across clusters best suited for accurate comparison. So as a customer’s life circumstances and spending habits change, the model automatically adjusts what it sees as potentially fraudulent transactions. As a result, we can reduce actual fraudulent transactions and minimize false positive fraud flags.
False positives occur regularly with anti-fraud measures based on traditional rules, where the system signals anything outside a certain standard.
For example, if we plan a trip abroad and start buying airline tickets and accommodations, this may cause a fraud notice. However, as previously described, a smarter system that can better understand the underlying patterns of human behavior could use new customer data (travel purchases) to match a different group of users (vacation travelers).
It can then test our behavior against transactions typical of the new cluster of users — vacation travelers, in this example — before automatically generating a fraud flag on the account.
The potential for electronic fraud is increasing with the increased use of advanced technology and the global nature of many transactions. Therefore, it is clear that it is imperative to use the most advanced techniques to combat cybercrime.
The most exciting thing for those who hope to reduce fraudulent activity further is that we now see a new generation of algorithms based on how people think. For example, we can cite Convolutional Neural Networks, based on the visual cortex — a small segment of cells sensitive to specific regions of the visual field. Neural networks use images directly as input, functioning similarly to the visual cortex. This means they can extract elementary visual features such as oriented edges, endpoints, and corners.
This new development in Artificial Intelligence makes algorithms that were already intelligent infinitely smarter. For example, this technology can study an individual’s spending data and determine, based on this information, whether they performed the most recent transaction on their credit card or if someone else was using the credit card data.
The significant potential lies in the ability of neural networks to learn relationships from modeled data, as mentioned in this World Academy of Science study — implementing this type of solution to contain cybercrime can drastically reduce economic losses.
Fraud has occurred throughout human history and has become more complex and challenging to stop as technology evolves. Fortunately, we are now in a position where we can leverage technology — especially new neural networks — to identify these fraudulent activities and stop them before they cause damage.
Achieving this will reduce banks' overall costs and improve their reputation with customers, who are likely to be more loyal to an institution that better protects their money. Banks can even pass on some of the cost savings from reduced customer fraud in the form of lower transaction fees or reduced interest rates. So ultimately, AI is likely to create a radical change across the banking industry, leading to reduced cybercrime and happier customers. That's truly a win-win situation.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
Deep Learning Book
Mastercard rolls out artificial intelligence across its global network.
AI for fraud detection: beyond the hype
How to Fight Fraud with Artificial Intelligence and Intelligent Analytics
Artificial Intelligence And The Future Of Financial Fraud Detection
Artificial Intelligence in Fraud Detection
Machine Learning and AI for Fraud Prevention
Machine Learning for Fraud Detection — Modern Applications and Risks
Take any data science book or course online. I bet the only machine learning library covered is scikit-learn. Of course, it's a great place to start, but these days we need something more automated, something that saves us time.
PyCaret is a library that requires little code and makes us more productive. With less time spent coding, we can focus on business problems. In addition, this library allows us to prototype quickly and efficiently from our notebook environment of choice.
It is a simple, easy-to-use machine learning library that helps you perform end-to-end machine learning experiments with far fewer lines of code.
Once the PyCaret package is installed, we can start our imports:
from pycaret.classification import *
from pycaret.datasets import get_data
The first is the recommended way to start with classification tasks — although some may not be comfortable with the import * syntax (avoid it if possible). The second import allows us to use the built-in datasets.
We will use the diabetes dataset and the get_data to load the dataset. Next, we need to make some minor adjustments and tell PyCaret what the target variable is:
diabetes = get_data('diabetes')
exp_clf = setup(diabetes, target='Class variable')
Running this code will result in the following:
This Diabetes DataFrame is long and reports much information about the data. So now we can get on with the machine learning part.
This step is elementary. Just type in the following:
compare_models()
The function compare_models does ALL of this. It also highlights the cells with the highest scores. Logistic regression seems to achieve the best accuracy, but the XGBoost algorithm performs best overall (as usual) — so that's the algorithm we'll use to create the model:
xgb = create_model('xgboost')
At this link you will find all the abbreviations of the Machine Learning models, in case you want to use something other than XGBoost. Next, let's see visually how our model performs.
We can use plot_model() to visualize model performance:
plot_model(xgb)
It shows the ROC curve chart by default, but this is simple to adjust. To use another performance metric, we can plot the confusion matrix:
plot_model(xgb, 'confusion_matrix')
Or the classification report:
plot_model(xgb, 'class_report')
Here is the link to all possible visualizations. Next, let's continue with the interpretation of the model.
SHAP, or Shapley additive explanations, is a way to explain the outputs of a machine learning model. We can use it to see which features are most critical:
interpret_model(xgb)
The above plot ranks the features by the sum of SHAP value magnitudes over all samples and uses SHAP values to show the distribution of the impact each feature has on the model output. The color represents the feature value — red being high and blue being low.
In a nutshell, higher plasma glucose concentration levels lead to higher chances (due to red color) for diabetes.
PyCaret automatically split the data into training and test parts (70:30) after loading, so we don't need to break the data manually.
Now we can evaluate the model on data never seen before:
predictions = predict_model(xgb)
The above code produces the following results:
Having the values in hand, let's see how to save and load the model.
Before we can save the model, we need to finalize it:
finalize_model(xgb)
save_model(xgb, 'diabetes_xgboost')
The model is saved in pickle format.
model = load_model('diabetes_xgboost')
And with that, the model is loaded and ready to use again.
And there we have it. I hope you have found this helpful. Thank you for reading. 🐼
We’ll see the process of building machine learning models quickly! There is a set of activities that we will always carry out during the construction of a predictive model.
These four steps involve technical exercises, mathematical and statistical procedures, programming, and business knowledge.
It is essential to know this process because each step requires different tools, techniques, and procedures — nothing more than a Data Scientist’s routine work.
Fundamentally we have four main steps in the process of building machine learning models:
Data preprocessing;
Algorithm training;
Evaluation of the model;
New predictions.
Occasionally, some algorithms will require certain adjustments in this construction process, especially unsupervised learning algorithms, but the vast majority of supervised algorithms follow this highly recurrent process.
This first step is where we prepare the data. We collect the data from a source according to the definition of the business problem we are working on, preprocess it, and divide it into subsets: training data and test data.
Some of the activities performed in data preprocessing are:
Transformation and treatment of variables;
Selection of relevant attributes of the dataset;
Dimensionality reduction with encapsulation of variables;
Sampling.
Depending on the business problem and the dimensionality of the dataset, we will perform some of these preprocessing tasks.
Once we have created the training and test subsets, we will use the training subset to train the machine learning algorithm. In this learning stage, we must select 1 out of more than 60 algorithms, train the algorithm, create several models with different adjustments and different hyperparameters, and select the best-performing model version.
We can also apply cross-validation to make the process even more accurate, use performance metrics to compare model versions during training, and optimize the model through mathematical techniques.
After the training stage is completed, we will evaluate the most prominent model. We evaluate the model by presenting it with the other subset we created, the test data. We do this because the model was trained with the training data; all the model has ever seen is the training subset. Therefore, to ensure that the model performs well, we have to present it with data it has not seen before — the test data.
We use the test data to evaluate the model because it received the same preprocessing as the training data, came from the same data source, and we also know the expected results for the test data. That is, we can measure the performance of the model.
We don't use data other than the test data for this assessment because we don't yet know its outcomes. For the test data, we know where it came from, how it was processed, and what the correct results are. Therefore, we can evaluate the model with this set that was reserved exclusively for testing it.
When we finish the evaluation, we can present new data — neither training data nor test data; it is the data we want to use to make predictions with the model.
For example, if we want to predict the price of houses in a given region, we first collect historical data from the properties {size, number of rooms, number of bathrooms}. With this data, we preprocess, eventually reduce dimensionality, train the algorithm, create the model, and use the test data to evaluate the model's performance.
That’s all we saw in the first three stages. From here, we collect data from other homes, data that we haven’t worked on before; we’ll use this new data to put the model into practice and make price predictions for new homes.
This is the standard process of building machine learning models — it may seem simple when we look at the theory, but putting the whole process into practice is very laborious. These four steps involve a series of technical activities, procedures, mathematics, statistics, programming, and business knowledge. Depending on the volume of data, Hadoop and Spark may be involved, with access to data in a Data Lake, data recorded in HDFS, and so on.
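As a minimal sketch of these four steps with scikit-learn — the dataset and the choice of algorithm here are illustrative:
# 1. Data preprocessing: scale the data and split into training and test subsets
from sklearn.datasets import load_diabetes
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
scaler = StandardScaler().fit(X_train)
X_train, X_test = scaler.transform(X_train), scaler.transform(X_test)
# 2. Algorithm training: fit the model on the training subset
model = LinearRegression().fit(X_train, y_train)
# 3. Evaluation of the model: score it on data it has never seen
print('Test R^2:', model.score(X_test, y_test))
# 4. New predictions: in practice this would be fresh data; here we reuse test rows to illustrate the call
print(model.predict(X_test[:3]))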
And there we have it. I hope you have found this useful. Thank you for reading.
The increasing use of Machine Learning models has made the Data Analysis process much easier and less chaotic when it comes to extracting and processing complex Big Data sets.
Data engineers and developers now use ML to make more precise and assertive decisions. As the popularity of Machine Learning algorithms increases, there is a growing demand for efficient and versatile tools like Scikit-Learn — Knowledge of this platform has become an essential requirement for professional data scientists and ML engineers.
Scikit-Learn meets the needs of beginners in the field as well as those who solve supervised learning problems. In this article, we will cover what Scikit-Learn is, its key features and applications, and explain how this library works in practice with examples.
Scikit-Learn is a free, open-source library for Machine Learning in Python. It provides an efficient selection of resources for statistical modeling, data analysis, and mining, as well as support for supervised and unsupervised learning. Considered one of the most versatile and popular solutions in the market, it is built on interaction with other Python libraries, including NumPy, SciPy, and Matplotlib.
With tools for model fitting, selection, and evaluation, as well as data pre-processing, Scikit-Learn is considered the most useful and robust library for Machine Learning in Python.
As a high-level library, it allows for defining predictive data models in just a few lines of code. If you are looking for an introduction to ML, Scikit-Learn is well-documented, relatively easy to learn and use. Some of the main algorithms available in the library include:
Linear Regression is used in various areas, such as sales forecasting, trend analysis, and price prediction.
It is a model that seeks to establish a linear relationship between the independent variables and a continuous dependent variable. The goal of linear regression is to find the equation that best describes the relationship between the variables, in order to predict values of the dependent variable for new values of the independent variables.
The equation of linear regression is a straight line that represents the relationship between the variables. It is possible to find the best regression line using the method of least squares, which minimizes the sum of the squares of the differences between the predictions of the line and the actual values of the dependent variable.
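To make the least-squares idea concrete, here is a small sketch that fits a line with NumPy — the data points are made up for illustration:
# Least squares fit of a straight line: minimizes the sum of squared differences
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])  # roughly y = 2x + 1 with noise
slope, intercept = np.polyfit(x, y, deg=1)
print('slope:', slope, 'intercept:', intercept)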
For example, it can be used to predict the price of a house based on its characteristics, such as area, number of rooms, and location, or to predict the sales of a product based on investment in advertising and time of year.
from sklearn.linear_model import LinearRegression
import pandas as pd
# Load the dataset
data = pd.read_csv("house_prices.csv")
# separates independent (features) and dependent (prices) variables
X = data.drop("price", axis=1)
y = data["price"]
# create the linear regression model
model = LinearRegression()
# fit the model to the data
model.fit(X, y)
# perform a prediction for a new set of features
new_house = [[1500, 3, 2]]  # area, rooms, bathrooms
price = model.predict(new_house)
print("Expected price for the new house:", price)
It is important to remember that it is necessary to do an exploratory analysis of the data and properly pre-process it before applying a machine learning model.
In addition, there are many other techniques and regression models available in the library, each with its own advantages and disadvantages.
Linear Regression can also be expanded to multiple regression models, which include more than one independent variable, or to nonlinear regression models, which use other forms of equations to model the relationship between the variables.
Logistic Regression is a supervised learning algorithm used for binary classification problems, such as detecting spam, predicting university admission based on a set of attributes, or detecting credit card fraud.
It is used to find the relationship between one or more independent variables and the probability of a particular class being chosen.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# load the iris dataset
iris = load_iris()
# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)
# create the logistic regression model
logreg = LogisticRegression()
# fit the model to the training data
logreg.fit(X_train, y_train)
# predict the target values for the test data
y_pred = logreg.predict(X_test)
# print the accuracy score of the model
print("Accuracy:", logreg.score(X_test, y_test))
This code loads the iris dataset, splits it into training and testing sets, creates a logistic regression model, fits the model to the training data, and predicts the target values for the test data. Finally, it prints the accuracy score of the model.
Logistic regression can also be extended to multiclass classification problems using an approach called “One-vs-All,” where multiple logistic regressions are trained for each class and then the class with the highest probability is chosen.
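As a minimal sketch of that One-vs-All idea (also called One-vs-Rest), here using scikit-learn's OneVsRestClassifier wrapper on the same iris data:
# One logistic regression is trained per class; the class with the highest probability wins
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000))
ovr.fit(X_train, y_train)
print('Accuracy:', ovr.score(X_test, y_test))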
A decision tree can be used to determine the diagnosis of a disease based on symptoms, medical history, and test results, to predict purchasing based on web browsing behavior, or to assess the creditworthiness of a candidate based on their financial and employment information.
Decision Trees are built from a training dataset, where each node in the tree asks a question about an attribute of the dataset, and the answer determines which path to follow. At the end of the tree, each leaf represents a class or a regression value.
To build the tree, the algorithm recursively divides the data into smaller subsets based on criteria of impurity, such as entropy or the Gini index, until all samples at a node belong to the same class or present a homogeneous value for a regression variable.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier
# load the iris dataset
iris = load_iris()
# separate the features (independent variables) and target (dependent variable)
X = iris.data
y = iris.target
# create a Decision Tree classifier
clf = DecisionTreeClassifier()
# fit the model to the data
clf.fit(X, y)
# use the model to make predictions
new_observation = [[5.2, 3.1, 4.2, 1.5]] # a new observation to predict
prediction = clf.predict(new_observation)
print("Prediction for the new observation:", prediction)
This code uses the load_iris function from scikit-learn to load the famous Iris dataset, which consists of 150 observations of iris flowers, with four features for each observation (sepal length, sepal width, petal length, and petal width), and a target variable indicating the species of each flower (setosa, versicolor, or virginica).
The code then separates the features and target from the dataset and creates a DecisionTreeClassifier object, which is fit to the data using the fit method. Finally, a new observation is used to make a prediction with the predict method, and the result is printed to the console.
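The impurity criterion mentioned earlier can also be chosen when constructing the tree. A small sketch — the parameter values here are illustrative, not tuned:
# Use entropy instead of the default Gini index, and cap the depth to limit overfitting
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf_entropy = DecisionTreeClassifier(criterion='entropy', max_depth=3)
clf_entropy.fit(iris.data, iris.target)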
Random Forest is used in various classification and regression problems, such as sales forecasting, sentiment analysis, fraud detection, medical diagnosis, and many others.
This algorithm uses multiple decision trees to perform classification or regression, training each tree on a different random subset of the data and input variables, and combining the predictions of the trees to produce a single prediction.
Each tree in the Random Forest is built using a technique of random sampling of training data, where each tree is trained on a random subset of the input data. This process is known as “bagging” and helps to avoid overfitting, as the Random Forest has a large variety of models to predict the response.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
# generate a random dataset
X, y = make_classification(n_features=4, random_state=0)
# create a random forest classifier with 100 estimators
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# fit the model to the data
rf.fit(X, y)
# predict the class of a new observation
new_observation = [[-2, 2, -1, 1]]
print("Predicted class:", rf.predict(new_observation))
This code generates a random dataset, creates a RandomForestClassifier object with 100 estimators, fits the model to the data, and predicts the class of a new observation.
The main advantage of Random Forest is its ability to handle complex and high-dimensional problems, producing accurate predictions even on datasets with many features. Additionally, it allows for the interpretation of results, as it is possible to evaluate the relative importance of each variable in decision making.
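For instance, a fitted forest exposes a feature_importances_ attribute; a one-line check reusing the rf model trained above:
# relative importance of each input variable in the forest's decisions
print("Feature importances:", rf.feature_importances_)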
SVM is a powerful supervised algorithm for classification that can be used in a variety of applications, such as image classification, text classification, fraud detection, medical diagnosis, and pattern recognition, among others.
The algorithm involves finding the hyperplane that best separates the input data classes. The hyperplane is defined as the surface that maximizes the distance between the two classes, called the margin.
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import svm
# Load the iris dataset
iris = datasets.load_iris()
# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)
# Create an SVM classifier with a linear kernel
clf = svm.SVC(kernel='linear')
# Train the SVM classifier on the training set
clf.fit(X_train, y_train)
# Make predictions on the test set
y_pred = clf.predict(X_test)
# Print the accuracy of the classifier
accuracy = clf.score(X_test, y_test)
print("Accuracy:", accuracy)
In this example, we first load the iris dataset and split it into training and testing sets.
We then create an SVM classifier with a linear kernel and train it on the training set. Finally, we make predictions on the test set and print the accuracy of the classifier.
The main advantage of SVM is its ability to separate classes in high-dimensional and non-linearly separable data. Additionally, SVM is relatively robust to outliers and can handle problems with a large number of independent variables. However, choosing the kernel and its parameters can be a challenge, and training time may be longer than for other classification algorithms.
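To illustrate the kernel choice, the same data can be fit with a non-linear RBF kernel; C and gamma are SVC's standard regularization and kernel-width parameters, and the values below are illustrative rather than tuned:
# RBF kernel for data that is not linearly separable
clf_rbf = svm.SVC(kernel='rbf', C=1.0, gamma='scale')
clf_rbf.fit(X_train, y_train)
print("RBF kernel accuracy:", clf_rbf.score(X_test, y_test))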
Naive Bayes is a supervised machine learning algorithm used in classification and text analysis problems. It is based on Bayes’ theorem and the assumption of conditional independence between input (x) variables.
The algorithm assumes that each input variable is independent of the others, meaning that the presence or absence of a particular feature does not affect the probability of the presence or absence of other features.
Naive Bayes is used in various applications such as sentiment analysis, text categorization, spam detection, document classification, among others. It is particularly effective in problems with many independent variables, where other machine learning algorithms may not be able to handle the high dimensionality.
from sklearn.naive_bayes import GaussianNB
import numpy as np
# training data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
# create Naive Bayes classifier and fit to the data
clf = GaussianNB()
clf.fit(X, Y)
# make a prediction for a new data point
new_point = [[0, 0]]
prediction = clf.predict(new_point)
print("Prediction:", prediction)
In this example, we’re using the Gaussian Naive Bayes classifier to classify data points into one of two classes. We create a training dataset with two features (x and y coordinates) and their corresponding class labels, and then fit the classifier to this data. Finally, we make a prediction for a new data point with coordinates (0, 0). The classifier predicts that this new data point belongs to class 1.
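Because the prediction comes from Bayes' theorem, the classifier can also report the posterior probability of each class; a quick sketch reusing the clf and new_point from above:
# posterior probability for each class, in the order given by clf.classes_
print("Classes:", clf.classes_)
print("Probabilities:", clf.predict_proba(new_point))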
Naive Bayes is fast, efficient, and easy to implement. It requires a relatively small training set to estimate the probabilities of the input and output data, and scikit-learn provides variants for different data types (for example, GaussianNB for continuous features and CategoricalNB for categorical ones). One of the disadvantages of Naive Bayes is the assumption of conditional independence, which may not be realistic in some cases.
KNN is a supervised machine learning algorithm used in classification and regression problems. The algorithm consists of finding the K nearest neighbors to a new input data point, from a training data set. Then, the algorithm classifies the new data point according to the majority class of the K nearest neighbors.
The value of K is a hyperparameter that can be adjusted to improve the accuracy of the algorithm. A small K value may result in a classification that is more sensitive to noise in the data set, while a large K value may smooth decision boundaries and reduce the effect of noise.
KNN is used in various applications, such as pattern recognition, image analysis, anomaly detection, product recommendation, among others. It is particularly useful in problems with few independent variables and a large amount of training data.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# load the iris dataset
iris = load_iris()
# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3)
# create a kNN classifier with k=3
knn = KNeighborsClassifier(n_neighbors=3)
# fit the classifier to the training data
knn.fit(X_train, y_train)
# predict the classes of the testing set
y_pred = knn.predict(X_test)
# print the accuracy of the classifier
accuracy = knn.score(X_test, y_test)
print("Accuracy:", accuracy)
This code loads the iris dataset, splits it into training and testing sets, creates a kNN classifier with k=3, fits the classifier to the training data, and then predicts the classes of the testing set. Finally, it prints the accuracy of the classifier on the testing set.
One of the main disadvantages of KNN is the need to store all the training data, which can make the algorithm slow and consume a lot of memory on large data sets. In addition, choosing the value of K can be a challenge, and the algorithm may have difficulties handling input data with many independent variables.
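A common way to address the choice of K is to compare cross-validated accuracy across candidate values; a minimal sketch using cross_val_score on the iris data loaded above (the candidate range is arbitrary):
from sklearn.model_selection import cross_val_score
# evaluate several candidate values of K with 5-fold cross-validation
for k in [1, 3, 5, 7, 9]:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), iris.data, iris.target, cv=5)
    print("k =", k, "mean accuracy:", scores.mean())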
Gradient Boosting is used in various applications, such as time series forecasting, fraud detection, image classification, among others.
It is especially useful in problems with high-dimensional data and a wide range of features, as is common in text analysis problems.
Gradient Boosting has many advantages, such as high accuracy and the ability to handle complex datasets. It is a highly flexible algorithm that can be used with a wide range of loss functions. Variants of the algorithm also accommodate categorical and missing data; in scikit-learn, for example, HistGradientBoostingClassifier supports missing values natively.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# generate a random binary classification dataset
X, y = make_classification(n_samples=1000, n_features=10, n_informative=5, random_state=42)
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# create a gradient boosting classifier with default parameters
clf = GradientBoostingClassifier()
# train the model on the training data
clf.fit(X_train, y_train)
# make predictions on the test data
y_pred = clf.predict(X_test)
# calculate the accuracy of the model
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
In this example, we first generate a random binary classification dataset using the make_classification function from scikit-learn.
We then split the data into training and testing sets using the train_test_split function.
Next, we create a GradientBoostingClassifier with default parameters, fit it on the training data, and make predictions on the test data.
Finally, we calculate the accuracy of the model using the accuracy_score function from scikit-learn.
However, the implementation of Gradient Boosting can be complex, and parameter tuning can be challenging. Additionally, the algorithm can be slow on large datasets and may struggle to handle imbalanced data.
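The parameter tuning mentioned above is usually tackled with a systematic search; a hedged sketch using GridSearchCV over a couple of common GradientBoostingClassifier parameters, reusing the training split from above (the grid values are illustrative):
from sklearn.model_selection import GridSearchCV
# small illustrative grid over learning rate and tree depth
param_grid = {'learning_rate': [0.05, 0.1], 'max_depth': [2, 3]}
search = GridSearchCV(GradientBoostingClassifier(random_state=42), param_grid, cv=3)
search.fit(X_train, y_train)
print("Best parameters:", search.best_params_)
print("Best CV accuracy:", search.best_score_)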
Artificial Neural Networks (ANNs) are machine learning algorithms inspired by the functioning of the human brain. They are composed of multiple layers of interconnected neurons that are capable of learning from data and performing tasks such as classification, regression, pattern recognition, among others.
Each neuron in an ANN receives a set of inputs, applies a non-linear transformation, and produces an output. The layers of neurons in an ANN are organized into an architecture, which can be of various types, such as fully connected, convolutional, recurrent, among others.
ANNs are used in various machine learning applications, such as speech recognition, image recognition, natural language processing, among others. They can be particularly useful in tasks that involve non-linear and high-dimensional data.
from sklearn.neural_network import MLPClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
# load the dataset
data = load_iris()
# split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=1)
# create the neural network model (max_iter raised so training converges on this small dataset)
model = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000)
# train the model on the training data
model.fit(X_train, y_train)
# evaluate the model on the testing data
accuracy = model.score(X_test, y_test)
print("Accuracy:", accuracy)
In this example, we load the iris dataset and split it into training and testing sets.
Then we create an MLPClassifier model with a single hidden layer of 10 neurons, and train it on the training data.
Finally, we evaluate the accuracy of the model on the testing data.
However, ANNs can be computationally intensive and require large amounts of data for training. Choosing the correct architecture and parameters is critical to achieving good performance, and the interpretability of the results can be a challenge. Additionally, ANNs may suffer from overfitting in small or complex datasets, and it can be difficult to explain how decisions are made.
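One practical mitigation for these training difficulties is to standardize the inputs, since MLPs are sensitive to feature scales; a minimal sketch chaining a scaler and the classifier in a pipeline, reusing the split from above:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
# standardizing the features typically makes MLP training faster and more stable
pipeline = make_pipeline(StandardScaler(), MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000))
pipeline.fit(X_train, y_train)
print("Accuracy with scaling:", pipeline.score(X_test, y_test))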
Principal Component Analysis is a dimensionality reduction technique used to identify the main variables in a dataset.
It is used to find a subset of variables that explain most of the variability in the original data. PCA seeks to transform a set of correlated variables into a new set of uncorrelated variables, called principal components.
PCA is widely used in machine learning applications, especially in data pre-processing and exploratory data analysis. It can be used to identify patterns in the data, identify outliers, reduce the dimensionality of the data, and for data visualization in low-dimensional spaces.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
# Load the iris dataset
iris = load_iris()
# Apply PCA to the dataset
pca = PCA(n_components=2)
X_pca = pca.fit_transform(iris.data)
# Plot the first two principal components
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=iris.target)
plt.xlabel('First Principal Component')
plt.ylabel('Second Principal Component')
plt.show()
This code loads the iris dataset and applies PCA to reduce the data to only two principal components.
It then plots these two components, colored by the species of the iris. This is a simple example of using PCA for dimensionality reduction and visualization.
One of the main advantages of PCA is the ability to reduce the dimensionality of the data without losing much information.
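How much information is retained can be checked directly through the explained variance ratio of the pca object fitted above:
# fraction of the total variance captured by each principal component
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Total variance retained:", pca.explained_variance_ratio_.sum())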
However, interpreting the principal components can be difficult, especially when many variables are involved. Additionally, PCA captures only linear correlations and is sensitive to the scale of the variables, so standardizing the data beforehand is usually recommended.
LDA is a supervised machine learning technique used for data classification.
It seeks to find a linear combination of the independent variables that best separates the data classes. LDA assumes that the data is normally distributed and that the covariances are equal for all classes.
LDA is often used for dimensionality reduction, as it can project the data into a lower-dimensional space that best separates the classes (with at most one fewer dimensions than the number of classes).
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
# Load the iris dataset
iris = load_iris()
# Separate the features and target variable
X = iris.data
y = iris.target
# Create an instance of the LinearDiscriminantAnalysis class
lda = LinearDiscriminantAnalysis()
# Fit the LDA model to the data
lda.fit(X, y)
# Transform the data to the new coordinate system
X_lda = lda.transform(X)
# Print the first three rows of the transformed data
print(X_lda[:3])
In this code, we load the iris dataset and separate the features and target variable.
Then, we create an instance of the LinearDiscriminantAnalysis class, fit the LDA model to the data, and transform the data to the new coordinate system.
Finally, we print the first three rows of the transformed data. LDA has been used in various applications, including pattern recognition, image processing, fraud detection, among others.
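Since LDA is itself a classifier, the fitted model can also predict classes directly; a short sketch reusing the lda object (evaluated on the training data only, for illustration):
# classify the data with the fitted LDA model and check the training accuracy
print("Predicted classes for the first three rows:", lda.predict(X[:3]))
print("Training accuracy:", lda.score(X, y))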
However, LDA has some limitations, such as the assumption of normality and equality of covariances, which may not hold true in some datasets. Additionally, LDA is less robust in data with many outliers or imbalanced in terms of the number of samples per class.
K-Means is widely used in various fields such as customer segmentation, image analysis, and document clustering.
For example, it can be used to group customers into different segments based on their purchasing characteristics, such as age, gender, and purchase history, or to segment images of a field of stars into groups of stars with similar characteristics, such as brightness and color.
from sklearn.cluster import KMeans
import numpy as np
# Create some example data
X = np.array([[1, 2], [1, 4], [1, 0],
[4, 2], [4, 4], [4, 0]])
# Create a KMeans model with 2 clusters (random_state makes the result reproducible)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
# Fit the model to the data
kmeans.fit(X)
# Print the cluster assignments
print(kmeans.labels_)
This code creates a 2-dimensional dataset and uses KMeans to cluster the data into 2 clusters. The resulting cluster assignments are printed to the console.
One of the main limitations of k-means is the need to pre-define the number of clusters (k) to be found, which can be a problem in some cases.
Additionally, k-means assumes that the shapes of the clusters are spherical and that the variances between clusters are equal, which is not always true in practice. There are other clustering techniques, such as hierarchical clustering, that may be more suitable in some cases.
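A common heuristic for choosing k is the “elbow method,” which inspects the within-cluster sum of squares (exposed by scikit-learn as the inertia_ attribute) for a range of k values; a minimal sketch on the toy X array above:
# inertia always drops as k grows; look for the "elbow" where the improvement levels off
for k in range(1, 6):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print("k =", k, "inertia:", model.inertia_)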
These are just a few examples of the machine learning algorithms available in the library. Scikit-Learn also offers a variety of utilities and functions for pre-processing and model evaluation, as well as advanced features such as hyperparameter tuning and machine learning workflow pipelines.
In conclusion, Scikit-Learn is one of the leading Machine Learning libraries in Python, offering a wide range of algorithms and tools for predictive modeling and data analysis. It is worth remembering that each algorithm has its own limitations and assumptions, so it is important to choose the appropriate technique for the problem at hand and to ensure that the input data meets the requirements of the chosen algorithm.