The mode is the value that appears most frequently in a dataset.
Example:
cat, cat, dog, dog, dog, elephant
The mode is dog, as it occurs 3 times, which is more than any other value.
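A quick way to compute the mode in Python (a minimal sketch using collections.Counter):

```python
from collections import Counter

values = ["cat", "cat", "dog", "dog", "dog", "elephant"]
mode, count = Counter(values).most_common(1)[0]  # most frequent value and its count
print(mode, count)  # dog 3
```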
We now try to figure out which feature (Age or Smoker) is the best predictor of cancer, so it can serve as the root of a decision tree.
Age       Smoker   Target
Teenager  Yes      No Cancer
Elderly   No       Cancer
Adult     Yes      Cancer
Elderly   Yes      Cancer
Adult     No       No Cancer
Elderly   No       Cancer
Adult     Yes      Cancer
Elderly   No       No Cancer
Entropy quantifies the impurity or randomness in the dataset:
Entropy = −sum(p_i · log2(p_i)), where p_i is the proportion of class i in the dataset.
From the data:
Cancer: 5 occurrences
No Cancer: 3 occurrences
The probabilities:
p(Cancer)=5/8, p(No Cancer)=3/8
Entropy(Target) = −(5/8 · log2(5/8) + 3/8 · log2(3/8)) ≈ 0.95
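As a sanity check, a minimal Python sketch of the same calculation (the target list below just transcribes the table above):

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((labels.count(v) / n) * math.log2(labels.count(v) / n)
                for v in set(labels))

target = ["No Cancer", "Cancer", "Cancer", "Cancer",
          "No Cancer", "Cancer", "Cancer", "No Cancer"]
print(round(entropy(target), 2))  # 0.95
```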
To calculate the entropy for features, we split the dataset based on each feature and calculate the weighted sum of the entropies of the subsets.
Feature: Age
Split into groups: Teenager, Adult, and Elderly.
Teenager: 1 (No Cancer)
Adult: 3 (2 Cancer, 1 No Cancer)
Elderly: 4 (3 Cancer, 1 No Cancer)
For each group, calculate its entropy and weight it by the proportion of the dataset it represents.
Entropy_Age = (Entropy_Teenager · n_Teenager + Entropy_Adult · n_Adult + Entropy_Elderly · n_Elderly) / total number of people
Entropy_Teenager = −(1 · log2(1)) = 0 (a pure group has zero entropy; the 0 · log2(0) term is taken as 0 by convention)
Entropy_Adult = −(2/3) · log2(2/3) − (1/3) · log2(1/3) ≈ 0.92
Entropy_Elderly = −(3/4) · log2(3/4) − (1/4) · log2(1/4) ≈ 0.81
Now combine them:
Entropy_Age = (1 Teenager · 0 + 3 Adults · 0.92 + 4 Elderly · 0.81) / 8 people ≈ 0.75
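Continuing the sketch above (it reuses entropy() and target from the previous snippet), a weighted-entropy helper reproduces this number; the age list transcribes the table:

```python
def weighted_entropy(feature, labels):
    """Weighted average entropy of the label subsets induced by a feature."""
    n = len(labels)
    total = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        total += len(subset) / n * entropy(subset)
    return total

age = ["Teenager", "Elderly", "Adult", "Elderly",
       "Adult", "Elderly", "Adult", "Elderly"]
print(round(weighted_entropy(age, target), 2))  # 0.75
```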
Feature: Smoker
Split into groups: Yes and No.
Yes: 4 (3 Cancer, 1 No Cancer)
No: 4 (2 Cancer, 2 No Cancer)
For each group:
Yes: −(3/4) · log2(3/4) − (1/4) · log2(1/4) ≈ 0.81
No: −(2/4) · log2(2/4) − (2/4) · log2(2/4) = 1
Now combine them:
Entropy_Smoker = (4 Smokers · 0.81 + 4 Non-smokers · 1) / 8 people ≈ 0.91
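The same helper reproduces the Smoker entropy:

```python
smoker = ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No"]
print(round(weighted_entropy(smoker, target), 2))  # 0.91
```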
To calculate the information gain for each feature:
Information Gain=Entropy(Target)−Entropy(Feature)
For Age:
Information Gain(Age)=0.95−0.75=0.2
For Smoker:
Information Gain(Smoker) = 0.95 − 0.91 = 0.04
So, the root node of the decision tree would be Age.
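Putting it together, a short continuation that computes both gains and picks the root, reusing the functions and lists from the sketches above:

```python
gains = {
    "Age": entropy(target) - weighted_entropy(age, target),
    "Smoker": entropy(target) - weighted_entropy(smoker, target),
}
print(gains)                      # {'Age': 0.204..., 'Smoker': 0.048...}
print(max(gains, key=gains.get))  # Age -> chosen as the root node
```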
Logistic regression predicts the probability of an outcome (e.g., belonging to class 1) based on a linear model and a sigmoid transformation.
Limitations of Logistic Regression:
Logistic regression assumes linearity between features and the log-odds, which may not hold true in all cases.
The method can struggle with highly imbalanced datasets or when features are highly correlated.
Linear Model: The linear relationship is defined as:
y=ax+b
Here, a is the coefficient (slope), b is the intercept, and x is the input data.
Sigmoid Transformation: The raw output y is transformed into a probability using the sigmoid function:
p = 1/(1 + e^(−y))
This maps y into the range (0, 1).
Thresholding: A threshold (typically 0.5) is applied to classify the output:
p≥0.5 Classify as 1 (positive class).
p<0.5 Classify as 0 (negative class).
x     Linear model (y = 5x − 10)   Sigmoid (probability)   Threshold (classification)
2     y = 0                        0.5                     1
3     y = 5                        0.993                   1
1     y = −5                       0.007                   0
2.5   y = 2.5                      0.924                   1
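A minimal sketch of this pipeline in Python, assuming the coefficients a = 5 and b = −10 implied by the table:

```python
import math

def predict(x, a=5.0, b=-10.0, threshold=0.5):
    y = a * x + b                      # linear model
    p = 1 / (1 + math.exp(-y))        # sigmoid transformation
    return y, p, int(p >= threshold)  # thresholded classification

for x in [2, 3, 1, 2.5]:
    y, p, label = predict(x)
    print(x, y, round(p, 3), label)
# 2 0.0 0.5 1
# 3 5.0 0.993 1
# 1 -5.0 0.007 0
# 2.5 2.5 0.924 1
```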
CI = sample mean ± Z · std / sqrt(n), where n is the sample size and std is the sample standard deviation
Z is the value from the standard normal distribution corresponding to the confidence level (e.g., Z = 1.96 for 95%).
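A small sketch of this formula, using made-up sample values purely for illustration:

```python
import math

def confidence_interval(samples, z=1.96):
    """CI for the mean: mean +/- z * std / sqrt(n), using the sample std (n - 1)."""
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / (n - 1))
    margin = z * std / math.sqrt(n)
    return mean - margin, mean + margin

print(confidence_interval([4.8, 5.1, 5.0, 4.9, 5.2]))  # ~(4.861, 5.139)
```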