This study uses behavioral data to predict whether a person is an introvert or an extrovert. The dataset includes 7 behavioral attributes. We used Python to analyze the data and applied different models to find patterns and make predictions.
We used three main methods:
Naive Bayes
Decision Tree
Random Forest
Source: Kaggle - Extrovert vs. Introvert Behavior Data by Rakesh Kapilavai
Data Size: 2,900 participants
To prepare the data for analysis:
Missing values were removed to ensure clean input for modeling.
Categorical responses (Yes/No) were encoded numerically (1/0).
Data was split into a training set (70%), used to fit the models, and a testing set (30%), used to evaluate their predictions.
Data summary after cleaning process:
Total Data: 2,477
Training Data: 1,733
Testing Data: 744
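The cleaning and splitting steps above can be sketched in pure Python. The column name below (Stage_fear) is a hypothetical stand-in for one of the 7 behavioral attributes, and the rows are synthetic, not taken from the Kaggle file:

```python
import random

# Synthetic stand-in for the survey data; real column names may differ.
base = [
    {"Stage_fear": "Yes", "Personality": "Introvert"},
    {"Stage_fear": "No", "Personality": "Extrovert"},
    {"Stage_fear": None, "Personality": "Introvert"},  # missing answer
]
raw = [dict(r) for r in base for _ in range(10)]  # 30 synthetic rows

# 1. Remove rows with missing values.
clean = [r for r in raw if None not in r.values()]

# 2. Encode Yes/No responses as 1/0.
encode = {"Yes": 1, "No": 0}
for r in clean:
    r["Stage_fear"] = encode[r["Stage_fear"]]

# 3. Split 70% training / 30% testing.
random.seed(0)
random.shuffle(clean)
cut = int(len(clean) * 0.7)
train, test = clean[:cut], clean[cut:]
print(len(clean), len(train), len(test))
```

In practice the same three steps are usually done with pandas (`dropna`, `map`) and scikit-learn's `train_test_split`; the sketch only shows the logic.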
Naive Bayes calculates the probability that a person is an introvert or an extrovert given their behavior, using Bayes' theorem.
Formula:
P(C | X) = P(X | C) · P(C) / P(X)
where C is the class (introvert or extrovert) and X is the set of observed behaviors. Under the "naive" assumption that behaviors are independent given the class, P(X | C) = P(x1 | C) · P(x2 | C) · … · P(x7 | C).
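As a toy illustration of Bayes' rule, consider a single binary behavior ("drained after socializing", 1 = Yes) with made-up counts, not the study's data:

```python
# Toy data: (feature value, class) pairs with invented frequencies.
data = [
    (1, "Introvert"), (1, "Introvert"), (1, "Introvert"), (0, "Introvert"),
    (0, "Extrovert"), (0, "Extrovert"), (1, "Extrovert"), (0, "Extrovert"),
]

def posterior(x, cls):
    """Unnormalized P(class | x) = P(x | class) * P(class)."""
    in_cls = [f for f, c in data if c == cls]
    prior = len(in_cls) / len(data)              # P(class)
    likelihood = in_cls.count(x) / len(in_cls)   # P(x | class)
    return likelihood * prior

# Normalize over both classes to get actual probabilities for x = 1.
scores = {c: posterior(1, c) for c in ("Introvert", "Extrovert")}
total = sum(scores.values())
probs = {c: s / total for c, s in scores.items()}
print(probs)  # {'Introvert': 0.75, 'Extrovert': 0.25}
```

With several behaviors, the likelihoods for each behavior are simply multiplied together, which is exactly the naive independence assumption.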
A decision tree chooses which feature to split on using Information Gain or the Gini Index: at each step it picks the question (behavior) whose answer best separates introverts from extroverts.
Formula Entropy (for Information Gain):
H(S) = − Σ p_i · log2(p_i)
where p_i is the proportion of class i in the set S.
Formula Information Gain:
IG(S, A) = H(S) − Σ_v (|S_v| / |S|) · H(S_v)
where S_v is the subset of S for which attribute A takes value v.
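A small worked example of entropy and information gain, using made-up labels (I = introvert, E = extrovert) and a hypothetical "stage fear" split:

```python
from math import log2

def entropy(labels):
    """H(S) = -sum of p_i * log2(p_i) over the classes present in S."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs)

# Parent node: 4 introverts and 4 extroverts (maximum entropy = 1.0).
labels = ["I", "I", "I", "I", "E", "E", "E", "E"]
yes_branch = ["I", "I", "I", "E"]   # stage fear = Yes
no_branch = ["I", "E", "E", "E"]    # stage fear = No

# IG = entropy before the split minus the weighted entropy after it.
gain = entropy(labels) - (
    len(yes_branch) / len(labels) * entropy(yes_branch)
    + len(no_branch) / len(labels) * entropy(no_branch)
)
print(round(gain, 3))
```

The tree would compute this gain for every candidate behavior and split on the one with the highest value.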
Random Forest improves accuracy by combining many decision trees, each trained on a random sample of the data; the final prediction is the majority vote of the trees.
Formula:
ŷ = mode(T_1(x), T_2(x), …, T_n(x))
where T_i(x) is the prediction of the i-th tree.
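The bootstrap-and-vote idea can be sketched with one-rule "stumps" standing in for full decision trees; the feature and labels below are invented for illustration:

```python
import random
from collections import Counter

# Feature: 1 = "drained after socializing"; labels are made up.
data = ([(1, "Introvert")] * 6 + [(0, "Extrovert")] * 6
        + [(1, "Extrovert"), (0, "Introvert")])  # two noisy rows

def train_stump(sample):
    """Tiny 'tree': predict the majority label per feature value."""
    by_value = {0: [], 1: []}
    for x, y in sample:
        by_value[x].append(y)
    rule = {v: Counter(ys).most_common(1)[0][0] if ys else "Introvert"
            for v, ys in by_value.items()}
    return lambda x: rule[x]

# Train each stump on a bootstrap sample (drawn with replacement).
random.seed(1)
forest = [train_stump(random.choices(data, k=len(data))) for _ in range(25)]

def predict(x):
    votes = Counter(tree(x) for tree in forest)  # majority vote = mode
    return votes.most_common(1)[0][0]

print(predict(1), predict(0))
```

Real implementations (e.g. scikit-learn's `RandomForestClassifier`) grow full trees and also randomize the features considered at each split, but the voting step is the same.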
A confusion matrix is a table used to evaluate the performance of a classification model (like Naive Bayes, Decision Tree, or Random Forest).
It compares the model’s predictions with the actual results.
Taking "extrovert" as the positive class:
True Positive (TP): Correctly predicted as extrovert
True Negative (TN): Correctly predicted as introvert
False Positive (FP): Predicted extrovert, but actually introvert
False Negative (FN): Predicted introvert, but actually extrovert
These counts, and the metrics derived from them, help us understand how well each model performs at predicting whether someone is an introvert or an extrovert.
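The standard metrics follow directly from the four confusion-matrix counts. The numbers below are illustrative only (chosen to sum to the 744 test cases), not the study's reported results:

```python
# Illustrative confusion-matrix counts, not actual results.
tp, tn, fp, fn = 320, 310, 60, 54  # sums to the 744 test cases

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of predicted extroverts, how many were right
recall = tp / (tp + fn)      # of actual extroverts, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Reporting precision and recall alongside accuracy matters when one class is more common than the other, since a model can score high accuracy by mostly predicting the majority class.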