Principal Component Analysis (PCA) is a dimensionality reduction technique used to reduce the number of variables in a dataset while retaining as much variance as possible. It works by finding new axes, called principal components, that capture the most significant features of the data. These components are orthogonal and ranked by the amount of variance they explain. PCA is useful when you need to reduce the complexity of a dataset for visualization, clustering, or to avoid overfitting in machine learning models.
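The steps above (center, find orthogonal axes, rank by variance) can be sketched with an eigen-decomposition of the covariance matrix; this is a minimal NumPy sketch on hypothetical data, not a production implementation:

```python
import numpy as np

# Toy data: 5 samples, 3 correlated features (hypothetical values).
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 1.9],
    [2.2, 2.9, 0.7],
    [1.9, 2.2, 0.9],
    [3.1, 3.0, 0.4],
])

# 1. Center the data so each feature has zero mean.
Xc = X - X.mean(axis=0)

# 2. Eigen-decompose the covariance matrix; eigenvectors are the
#    principal components (orthogonal new axes).
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)

# 3. Sort components by explained variance, descending.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Project onto the top 2 components to reduce dimensionality.
X_reduced = Xc @ eigvecs[:, :2]
explained = eigvals / eigvals.sum()
print(X_reduced.shape)  # (5, 2)
```

The `explained` ratios show how much variance each component captures, which is how one decides how many components to keep.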
Clustering is an unsupervised learning technique in machine learning used to group similar data points together based on their characteristics or features. The goal of clustering is to find patterns or natural groupings in the data without having predefined labels. Each cluster formed contains items that are more similar to each other than to those in other clusters.
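A common way to form such groupings is k-means, which alternates between assigning points to the nearest centroid and moving each centroid to the mean of its points. The sketch below, on two hypothetical well-separated blobs, is illustrative only:

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Minimal k-means: alternate nearest-centroid assignment and update."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids

# Two well-separated blobs (hypothetical data).
X = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]])
labels, centroids = kmeans(X, k=2)
print(labels)  # the first three points share one label, the last three the other
```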
Association Rule Mining (ARM) is a data mining technique used to discover interesting relationships or patterns among a large set of variables in transactional data. The purpose of ARM is to find strong associations or correlations among different items in a dataset. ARM is typically used in market basket analysis, where the goal is to find patterns like "if a customer buys item A, they are likely to buy item B."
Support: The frequency of an itemset appearing in transactions.
Confidence: The likelihood that item B appears in transactions that contain item A.
Lift: A measure of how much more likely item B is to be observed in transactions that contain item A than would be expected if A and B were independent.
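These three metrics can be computed directly from their definitions on a small set of transactions; the items and counts below are hypothetical:

```python
# Toy market-basket transactions (hypothetical).
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Rule: bread -> milk
sup_a  = support({"bread"})           # P(bread)
sup_b  = support({"milk"})            # P(milk)
sup_ab = support({"bread", "milk"})   # P(bread and milk)

confidence = sup_ab / sup_a           # P(milk | bread)
lift = confidence / sup_b             # > 1 means positive association

print(f"support={sup_ab:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Here the rule's support is 0.60 and its confidence 0.75; the lift is below 1, meaning bread buyers are actually slightly less likely than average to buy milk in this toy data.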
The Naïve Bayes (NB) algorithm is a probabilistic classifier based on applying Bayes' theorem with the assumption of conditional independence between features. NB is commonly used for text classification, spam detection, sentiment analysis, and recommendation systems because it is fast and effective with a relatively small amount of data. Naïve Bayes assumes that the presence of a particular feature in a class is independent of any other feature, which simplifies computations and often provides robust performance.
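The independence assumption lets the class posterior factor into a product of per-word probabilities. This is a minimal sketch of a word-count Naïve Bayes spam filter with Laplace smoothing; the corpus and labels are hypothetical:

```python
from collections import Counter
from math import log

# Tiny labeled corpus (hypothetical spam filter).
docs = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting schedule today", "ham"),
    ("project schedule update", "ham"),
]

# Count words per class and class frequencies (the priors).
word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in docs:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for c in word_counts.values() for w in c}

def log_posterior(text, label):
    """log P(label) + sum_w log P(w | label), with Laplace (+1) smoothing."""
    counts = word_counts[label]
    total = sum(counts.values())
    lp = log(class_counts[label] / len(docs))
    for w in text.split():
        lp += log((counts[w] + 1) / (total + len(vocab)))
    return lp

def classify(text):
    return max(("spam", "ham"), key=lambda c: log_posterior(text, c))

print(classify("free money"))       # spam
print(classify("schedule update"))  # ham
```

Working in log space avoids numerical underflow from multiplying many small probabilities, which is standard practice for NB.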
Decision Trees (DTs) are a supervised learning method used for classification and regression. They split data into branches based on feature values, constructing a tree-like structure where each internal node represents a decision rule and each leaf node represents an outcome or prediction. DTs are widely used in classification tasks because they are interpretable, handle non-linear relationships well, and do not require feature scaling.
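The core step a tree repeats at every node is choosing the split that most reduces impurity. A minimal sketch of that single-split search, using Gini impurity on hypothetical data (a full tree would recurse on each side):

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(X, y):
    """Find the (feature, threshold) minimizing weighted Gini impurity."""
    best = (None, None, float("inf"))
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left  = [y[i] for i, row in enumerate(X) if row[f] <= t]
            right = [y[i] for i, row in enumerate(X) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best[2]:
                best = (f, t, score)
    return best

# Hypothetical data: feature 1 perfectly separates the two classes.
X = [[1.0, 0.2], [2.0, 0.3], [1.5, 0.9], [2.5, 1.1]]
y = ["a", "a", "b", "b"]
feature, threshold, score = best_split(X, y)
print(feature, threshold, score)  # 1 0.3 0.0
```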
Linear regression is a statistical method that models the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. It predicts continuous values by minimizing the difference between the actual and predicted values. The best-fit line is determined by minimizing the mean squared error.
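Minimizing the mean squared error has a closed-form least-squares solution; the sketch below fits a line to hypothetical noise-free data generated from y = 2x + 1:

```python
import numpy as np

# Hypothetical data generated from y = 2x + 1 (no noise).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Closed-form least squares: add an intercept column, solve X @ beta = y.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, slope = beta
print(intercept, slope)  # approximately 1.0 and 2.0
```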
Logistic regression is a classification technique that models the probability of a binary outcome using the logistic sigmoid function. Unlike linear regression, logistic regression outputs probabilities between 0 and 1, and it is commonly used when the target variable is categorical, typically binary.
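The model is typically fit by gradient descent on the log-loss; this is a minimal NumPy sketch on a hypothetical 1-D dataset where class 1 corresponds to positive x:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-D data: class 1 when x > 0.
x = np.array([-2.0, -1.5, -1.0, 1.0, 1.5, 2.0])
y = np.array([0, 0, 0, 1, 1, 1], dtype=float)
X = np.column_stack([np.ones_like(x), x])  # intercept + feature

# Batch gradient descent on the log-loss.
w = np.zeros(2)
for _ in range(2000):
    p = sigmoid(X @ w)
    w -= 0.5 * X.T @ (p - y) / len(y)

probs = sigmoid(X @ w)
print(probs.round(2))  # low for negative x, high for positive x
```

Note the outputs are probabilities; a hard class prediction is obtained by thresholding, usually at 0.5.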
Support Vector Machines (SVMs) are supervised machine learning models primarily used for classification tasks. SVMs are linear separators at their core, but they can be extended to perform nonlinear classification using kernel functions.
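The linear-separator core can be illustrated by subgradient descent on the hinge loss with L2 regularization (the primal linear-SVM objective); this is a rough sketch on hypothetical separable data, not a tuned solver:

```python
import numpy as np

# Linearly separable toy data (hypothetical); labels in {-1, +1}.
X = np.array([[-2.0, -1.0], [-1.5, -0.5], [-1.0, -1.5],
              [1.0, 1.5], [1.5, 0.5], [2.0, 1.0]])
y = np.array([-1, -1, -1, 1, 1, 1], dtype=float)

# Subgradient descent on the regularized hinge loss:
# minimize lam/2 * ||w||^2 + mean(max(0, 1 - y * (X w + b))).
w, b, lam, lr = np.zeros(2), 0.0, 0.01, 0.1
for _ in range(500):
    margins = y * (X @ w + b)
    mask = margins < 1  # points inside or violating the margin
    w -= lr * (lam * w - (y[mask, None] * X[mask]).sum(axis=0) / len(X))
    b -= lr * (-(y[mask]).sum() / len(X))

pred = np.sign(X @ w + b)
print(pred)  # matches y on this separable toy data
```

A kernelized SVM would replace the dot products with a kernel function (e.g. RBF) to separate data that is not linearly separable in the original space.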
Ensemble learning is a machine learning technique where multiple models, often called weak learners or base models, are combined to produce a stronger, more accurate model. Common approaches include bagging, boosting, and stacking.
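The simplest combination rule is majority voting; the sketch below combines three hypothetical rule-based classifiers, each with a different threshold, into one voted prediction:

```python
from collections import Counter

# Three weak rule-based classifiers (hypothetical) voting on whether
# a number is "large"; the majority vote is the ensemble prediction.
classifiers = [
    lambda x: "large" if x > 10 else "small",
    lambda x: "large" if x > 8 else "small",
    lambda x: "large" if x > 12 else "small",
]

def ensemble_predict(x):
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

print(ensemble_predict(9))   # small (two of three vote small)
print(ensemble_predict(11))  # large (two of three vote large)
```

Real ensembles (random forests, gradient boosting) train the base models on resampled or reweighted data rather than hand-picking thresholds, but the aggregation idea is the same.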