Welcome to Foundation of Data Science Laboratory
Assignment 4:
Hypothesis Testing
A hypothesis test is a statistical method used to make inferences or draw conclusions about a population based on sample data. It involves two competing statements:
Null Hypothesis (H0): This is a statement of no effect or no difference, serving as the default assumption.
Alternative Hypothesis (H1 or Ha): This represents what you want to prove, indicating an effect or a difference.
The process typically involves the following steps:
Formulate the hypotheses: Define H0 and H1.
Choose a significance level (α): Commonly set at 0.05, this determines the threshold for rejecting H0.
Collect data: Gather sample data relevant to the hypotheses.
Calculate a test statistic: Use the sample data to compute a statistic that measures how far the sample result is from what H0 predicts.
Make a decision: Compare the test statistic to a critical value or use a p-value to decide whether to reject or fail to reject H0.
If you reject H0, it suggests that there is enough evidence to support H1. If you fail to reject H0, it means there is insufficient evidence to support the alternative hypothesis.
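As a concrete sketch of these steps, the following Python snippet uses SciPy's one-sample t-test on made-up measurements to test H0: population mean = 50 against a two-sided alternative at α = 0.05:

    import numpy as np
    from scipy import stats

    # Steps 1-2: H0: mean = 50, H1: mean != 50, significance level alpha = 0.05
    alpha = 0.05

    # Step 3: sample data (illustrative values, not real measurements)
    sample = np.array([51.2, 49.8, 52.4, 50.9, 48.7, 53.1, 50.5, 51.8])

    # Step 4: compute the test statistic and its p-value
    t_stat, p_value = stats.ttest_1samp(sample, popmean=50)

    # Step 5: decide by comparing the p-value to alpha
    if p_value < alpha:
        print(f"t = {t_stat:.3f}, p = {p_value:.3f}: reject H0")
    else:
        print(f"t = {t_stat:.3f}, p = {p_value:.3f}: fail to reject H0")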
Libraries Used:
Scikit-learn is a popular choice for machine learning and model evaluation, but it is not designed for classical hypothesis testing. Several other libraries are commonly used for this purpose:
1. SciPy: This library provides a wide range of statistical functions, including various hypothesis tests (e.g., t-tests, chi-squared tests, ANOVA).
2. Statsmodels: This library is specifically designed for statistical modeling and includes tools for hypothesis testing, regression analysis, and more detailed statistical tests.
3. Pandas: While primarily a data manipulation library, it provides descriptive statistics and makes it easy to prepare data for tests run with SciPy or Statsmodels.
4. NumPy: While not focused on hypothesis testing, it provides foundational numerical functions that can support statistical calculations.
5. R: Although not a Python library, R is a language specifically designed for statistical analysis and offers a wide range of hypothesis testing methods.
In data science, various statistical tests can be performed to assess hypotheses about data. Here's a list of some common tests, each followed by a short, illustrative code sketch:
T-test:
One-sample t-test: Compares the mean of a single sample to a known value.
Independent two-sample t-test: Compares the means of two independent groups.
Paired sample t-test: Compares means from the same group at different times.
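A minimal sketch of all three variants with scipy.stats, on made-up data:

    from scipy import stats

    group_a = [5.1, 4.9, 5.6, 5.0, 5.3]   # illustrative measurements
    group_b = [4.7, 4.8, 5.0, 4.6, 4.9]
    before  = [82, 75, 90, 68, 77]         # illustrative paired readings
    after   = [85, 79, 93, 70, 80]

    print(stats.ttest_1samp(group_a, popmean=5.0))  # one-sample: is the mean 5.0?
    print(stats.ttest_ind(group_a, group_b))        # independent two-sample
    print(stats.ttest_rel(before, after))           # paired sample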
Z-test:
Used when the sample size is large (typically n > 30) or when the population variance is known.
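SciPy has no dedicated z-test function, but Statsmodels provides one; a sketch testing H0: mean = 100 on an illustrative sample:

    from statsmodels.stats.weightstats import ztest

    # Illustrative sample repeated to reach n = 50 (large-sample setting)
    sample = [102, 98, 101, 105, 99, 97, 103, 100, 104, 96] * 5
    z_stat, p_value = ztest(sample, value=100)
    print(z_stat, p_value)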
Chi-squared test:
Tests hypotheses about categorical variables: the goodness-of-fit test checks whether observed counts match an expected distribution, and the test of independence checks whether two categorical variables are related.
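For instance, a test of independence on a 2x2 contingency table of made-up counts:

    from scipy.stats import chi2_contingency

    # Rows: two groups; columns: two outcome categories (illustrative counts)
    table = [[30, 10],
             [20, 40]]
    chi2, p_value, dof, expected = chi2_contingency(table)
    print(chi2, p_value, dof)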
ANOVA (Analysis of Variance):
Compares means across three or more groups.
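A one-way ANOVA sketch with scipy.stats, comparing three made-up groups:

    from scipy.stats import f_oneway

    g1 = [6.1, 5.8, 6.4, 6.0]   # illustrative group measurements
    g2 = [5.5, 5.9, 5.6, 5.4]
    g3 = [6.8, 6.5, 7.0, 6.7]
    f_stat, p_value = f_oneway(g1, g2, g3)
    print(f_stat, p_value)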
Mann-Whitney U test:
A non-parametric test for comparing two independent groups.
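For example, with illustrative samples:

    from scipy.stats import mannwhitneyu

    u_stat, p_value = mannwhitneyu([12, 15, 9, 20, 18], [8, 11, 7, 14, 10])
    print(u_stat, p_value)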
Wilcoxon signed-rank test:
A non-parametric test for comparing two related samples.
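A sketch on made-up paired measurements:

    from scipy.stats import wilcoxon

    before = [82, 75, 90, 68, 77, 85]   # illustrative paired readings
    after  = [85, 79, 93, 70, 80, 88]
    stat, p_value = wilcoxon(before, after)
    print(stat, p_value)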
Kruskal-Wallis test:
A non-parametric alternative to ANOVA for comparing three or more groups.
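For example, on three illustrative groups:

    from scipy.stats import kruskal

    h_stat, p_value = kruskal([7, 9, 6], [12, 14, 11], [5, 4, 6])
    print(h_stat, p_value)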
F-test:
Used to compare the variances of two populations.
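SciPy does not ship a ready-made two-sample variance F-test, so a common approach is to build one from the F distribution. A sketch assuming both samples are approximately normal:

    import numpy as np
    from scipy.stats import f

    x = np.array([20.1, 19.8, 21.2, 20.5, 19.9, 20.7])  # illustrative samples
    y = np.array([18.9, 22.3, 17.5, 23.0, 19.2, 21.8])

    f_stat = np.var(x, ddof=1) / np.var(y, ddof=1)       # ratio of sample variances
    dfn, dfd = len(x) - 1, len(y) - 1
    # Two-sided p-value from the F distribution
    p_value = 2 * min(f.cdf(f_stat, dfn, dfd), f.sf(f_stat, dfn, dfd))
    print(f_stat, p_value)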
Correlation tests:
Pearson correlation: Measures the linear relationship between two continuous variables.
Spearman rank correlation: A non-parametric measure of rank correlation.
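Both are one-liners in scipy.stats, shown here on made-up paired observations:

    from scipy.stats import pearsonr, spearmanr

    x = [1, 2, 3, 4, 5, 6]
    y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.3]
    print(pearsonr(x, y))    # linear correlation and its p-value
    print(spearmanr(x, y))   # rank correlation and its p-value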
Regression analysis:
Tests the relationship between dependent and independent variables, including hypothesis tests on coefficients.
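A sketch with Statsmodels on synthetic data; the fitted model's summary reports a t-test for each coefficient (H0: coefficient = 0):

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    x = rng.normal(size=50)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)  # y depends on x by construction

    X = sm.add_constant(x)          # add an intercept column
    model = sm.OLS(y, X).fit()
    print(model.summary())          # coefficients, t-statistics, p-values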
These are just a few examples. The choice of test depends on the data characteristics, such as distribution, sample size, and whether the data is categorical or continuous.
Confusion Matrix
A confusion matrix is a performance measurement tool for classification models in machine learning. It helps evaluate how well a model performs by summarizing the results of predictions made by the model. Here's how it works and what it's used for:
For a binary classification problem, a confusion matrix typically has four components:
True Positives (TP): The number of correct positive predictions.
True Negatives (TN): The number of correct negative predictions.
False Positives (FP): The number of incorrect positive predictions (also known as Type I error).
False Negatives (FN): The number of incorrect negative predictions (also known as Type II error).
A confusion matrix supports several tasks:
Performance Evaluation:
It allows you to calculate various metrics like accuracy, precision, recall, and F1-score.
Identifying Errors:
By examining TP, TN, FP, and FN, you can understand where your model is making mistakes and which classes are being confused.
Model Comparison:
It provides a clear overview that helps compare different models or tuning parameters based on their classification performance.
Threshold Adjustment:
You can analyze how changing the decision threshold affects the counts of TP, TN, FP, and FN, helping in fine-tuning the model.
Multi-class Classification:
For problems with more than two classes, confusion matrices can be expanded to show the counts for each class, allowing for detailed performance analysis across all classes.
Common metrics derived from these counts are:
Accuracy: (TP + TN) / (TP + TN + FP + FN)
Precision: TP / (TP + FP) (also called positive predictive value)
Recall: TP / (TP + FN) (also called sensitivity or true positive rate)
F1 Score: 2 × (Precision × Recall) / (Precision + Recall)
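The following sketch computes the matrix and all four metrics with scikit-learn on made-up labels:

    from sklearn.metrics import (confusion_matrix, accuracy_score,
                                 precision_score, recall_score, f1_score)

    y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]   # illustrative ground truth
    y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]   # illustrative predictions

    # For binary labels, confusion_matrix returns [[TN, FP], [FN, TP]]
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("TP, TN, FP, FN:", tp, tn, fp, fn)
    print("Accuracy :", accuracy_score(y_true, y_pred))
    print("Precision:", precision_score(y_true, y_pred))
    print("Recall   :", recall_score(y_true, y_pred))
    print("F1 score :", f1_score(y_true, y_pred))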
Overall, the confusion matrix is a valuable tool for diagnosing the performance of classification models and guiding improvements.