Role of hypothesis testing in machine learning

Introduction

Layman's explanation

For ML training, sample collection is very important. But how do you know whether your sample is representative of the whole population? A representative sample is one that is drawn without bias from the population of interest. If you want to know how to validate sample data, this document can help.

Technical explanation

Hypothesis testing is the process of drawing inferences or conclusions about the overall population by conducting statistical tests on a sample.

Testing types

Null hypothesis

The null hypothesis is the default claim that is put to the test. For example: a person (say, Amit) is innocent (did not commit the crime).

Alternate hypothesis

The alternate hypothesis is the complement of the null hypothesis. So, in the example above, the alternate hypothesis is that Amit is not innocent.

Significance of null hypothesis

Approach

- Set the Hypothesis
- Set the significance level (the criterion for a decision)
- Compute the test statistics
- Make a decision
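
The four steps above can be sketched in code. This is a minimal illustration using a two-sided one-sample z-test, not a definitive implementation: the function name, the sample values, the hypothesised mean mu0 and the known population standard deviation sigma are all assumptions made up for the example.

```python
from math import sqrt
from statistics import NormalDist, mean


def one_sample_z_test(sample, mu0, sigma, alpha=0.05):
    """Two-sided one-sample z-test; sigma is the (assumed known) population standard deviation."""
    # Step 1: set the hypotheses.
    #   H0: population mean == mu0,  H1: population mean != mu0
    # Step 2: set the significance level; it fixes the critical value.
    z_critical = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 3: compute the test statistic.
    n = len(sample)
    z = (mean(sample) - mu0) / (sigma / sqrt(n))
    # Step 4: make a decision.
    reject_h0 = abs(z) > z_critical
    return z, z_critical, reject_h0


# Hypothetical measurements, invented for illustration only.
sample = [52.1, 49.8, 53.4, 51.0, 50.7, 52.9, 51.5, 50.2, 53.1, 51.8]
z, z_critical, reject_h0 = one_sample_z_test(sample, mu0=50.0, sigma=2.0)
print(f"z = {z:.3f}, critical value = {z_critical:.3f}, reject H0: {reject_h0}")
```

Fixing alpha before looking at the data is the point of step 2: the decision rule should not be tuned after the statistic is known.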

Errors in hypothesis testing

Type I error

A Type I error is the rejection of a null hypothesis that is actually true. For example, an innocent person (Amit) is convicted.

Type II error

A Type II error is the failure to reject a null hypothesis that is actually false. For example, a guilty person is not convicted.
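
To make the two error types concrete, the sketch below computes their probabilities for a one-sided z-test. It assumes a known population standard deviation and a specific true mean mu1 under the alternate hypothesis; the function name and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def error_rates(mu0, mu1, sigma, n, alpha=0.05):
    """Type I and Type II error probabilities for a one-sided z-test (H1: mean > mu0)."""
    std_err = sigma / sqrt(n)
    # H0 is rejected when the sample mean exceeds this threshold.
    threshold = mu0 + NormalDist().inv_cdf(1 - alpha) * std_err
    # Type I error: H0 is true (mean == mu0) but the sample mean lands above the threshold.
    type_1 = 1 - NormalDist(mu0, std_err).cdf(threshold)
    # Type II error: the true mean is mu1 but the sample mean lands below the threshold.
    type_2 = NormalDist(mu1, std_err).cdf(threshold)
    return type_1, type_2


# Hypothetical setting: H0 mean 50, true mean 51, sigma 2, sample size 25.
type_1, type_2 = error_rates(mu0=50.0, mu1=51.0, sigma=2.0, n=25)
print(f"Type I error rate = {type_1:.3f}, Type II error rate = {type_2:.3f}")
```

By construction the Type I error rate equals the significance level, while the Type II error rate depends on how far the true mean is from mu0 and on the sample size.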

Statistical tools

z-test and t-test

These tests are used to compare quantitative data, for example to check whether two samples came from the same population.

The z-statistic measures how much an observed statistic differs from the value expected under the null hypothesis, expressed in units of standard error.

The sigma used in the z-statistic is not the standard deviation of the observed data, but the standard deviation of the population, which must be known in advance.
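
For reference, the z-statistic for a sample mean is commonly written as below. The symbols are notation chosen for this sketch rather than taken from the original page: x-bar is the sample mean, mu_0 the mean stated by the null hypothesis, sigma the population standard deviation and n the sample size.

```latex
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
```

The denominator, sigma divided by the square root of n, is the standard error of the sample mean.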

The t-test is a modified version of the z-test in which the mean and standard deviation are computed from the sample. So, the t-test does not need the population variance as input.
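
A minimal sketch of the one-sample t-statistic follows, assuming hypothetical data and a hypothetical null mean mu0. It uses only the Python standard library, so it stops at the statistic and the degrees of freedom; turning them into a p-value requires the Student's t distribution, which a statistics package such as SciPy provides.

```python
from math import sqrt
from statistics import mean, stdev


def one_sample_t_statistic(sample, mu0):
    """Return the t-statistic and degrees of freedom for H0: population mean == mu0."""
    n = len(sample)
    s = stdev(sample)  # sample standard deviation (n - 1 in the denominator)
    t = (mean(sample) - mu0) / (s / sqrt(n))
    return t, n - 1


# Hypothetical measurements, invented for illustration only.
sample = [52.1, 49.8, 53.4, 51.0, 50.7, 52.9, 51.5, 50.2, 53.1, 51.8]
t_stat, dof = one_sample_t_statistic(sample, mu0=50.0)
print(f"t = {t_stat:.3f} with {dof} degrees of freedom")
```

Note that the only inputs are the sample and the hypothesised mean; no population variance is passed in.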

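The job of the p-value (refer to the diagram above) is to decide whether we should reject the null hypothesis or fail to reject it. If the p-value is smaller than the chosen significance level, the null hypothesis is rejected.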
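
As an illustration of that decision rule, the sketch below converts a z-statistic into a two-sided p-value and compares it with a significance level. The function names and the default alpha of 0.05 are assumptions for the example.

```python
from statistics import NormalDist


def two_sided_p_value(z):
    """Probability, under H0, of a z-statistic at least as extreme as the one observed."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


def decide(p_value, alpha=0.05):
    """Apply the decision rule: reject H0 only when the p-value is below alpha."""
    return "reject H0" if p_value < alpha else "fail to reject H0"


for z in (0.5, 2.6):
    p = two_sided_p_value(z)
    print(f"z = {z}: p-value = {p:.4f} -> {decide(p)}")
```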

Chi-square test

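The chi-square test is used to compare categorical variables from a single population, for example to check whether two categorical variables are independent of each other.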
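
A small sketch of the Pearson chi-square test of independence on a 2x2 table is shown below. The counts are hypothetical, no continuity correction is applied, and the p-value shortcut via erfc is valid only for one degree of freedom, which is the case for a 2x2 table.

```python
from math import erfc, sqrt


def chi_square_2x2(table):
    """Pearson chi-square statistic and p-value for a 2x2 contingency table."""
    (a, b), (c, d) = table
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    total = a + b + c + d
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    # With 1 degree of freedom, P(chi-square > x) = erfc(sqrt(x / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows are two user groups, columns are clicked / not clicked.
chi2, p_value = chi_square_2x2([[30, 20], [20, 30]])
print(f"chi-square = {chi2:.3f}, p-value = {p_value:.4f}")
```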

F-test

The F-test for linear regression tests whether any of the independent variables in a multiple linear regression model is significant, that is, whether the model explains the data better than a model with no independent variables.
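
One common way to compute the overall F-statistic is from the coefficient of determination (R squared) of the fitted model. The sketch below assumes the regression includes an intercept; the function name and the example numbers are hypothetical.

```python
def overall_f_statistic(r_squared, n_samples, n_predictors):
    """Overall F-statistic of a linear regression with an intercept.

    Compares the fitted model against an intercept-only model.
    """
    df_model = n_predictors
    df_residual = n_samples - n_predictors - 1
    # Explained variance per predictor divided by unexplained variance per residual degree of freedom.
    f_stat = (r_squared / df_model) / ((1 - r_squared) / df_residual)
    return f_stat, df_model, df_residual


# Hypothetical fit: R squared of 0.5 with 2 predictors on 23 observations.
f_stat, df_model, df_residual = overall_f_statistic(0.5, n_samples=23, n_predictors=2)
print(f"F = {f_stat:.2f} on ({df_model}, {df_residual}) degrees of freedom")
```

The statistic is then compared with the F distribution with (df_model, df_residual) degrees of freedom to obtain a p-value.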

Central limit theorem use

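According to the central limit theorem, the distribution of the sample mean approaches a normal distribution as the sample size grows, whatever the shape of the population distribution (provided its variance is finite). The t-test relies on this property. The z-test uses the relation between the population variance and the variance of the sample mean (the population variance divided by the sample size) that accompanies this theorem.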
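
The theorem can be checked with a quick simulation. The sketch below draws repeated samples from an exponential distribution, which is clearly not normal, and looks at how the sample means behave. The distribution, sample size, number of trials and seed are arbitrary choices for the illustration.

```python
import random
from math import sqrt
from statistics import mean, stdev


def sample_means(n, trials, seed=0):
    """Means of `trials` independent samples of size n from an exponential(1) population."""
    rng = random.Random(seed)
    return [mean([rng.expovariate(1.0) for _ in range(n)]) for _ in range(trials)]


n = 50
means = sample_means(n, trials=2000)
# The exponential(1) population has mean 1 and standard deviation 1.
print(f"mean of sample means = {mean(means):.3f} (population mean is 1)")
print(f"std of sample means  = {stdev(means):.3f} (sigma / sqrt(n) is {1 / sqrt(n):.3f})")
```

The spread of the sample means should come out close to the population standard deviation divided by the square root of n, which is the quantity the z-test uses as its standard error.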

Role in machine learning

- The t-test/z-test can be used to check whether two samples are drawn from different populations.
- The F-test can be used to check whether making your linear regression model more complex, i.e. adding more variables to it, actually improves it.
- Before launching a new feature, a hypothesis test can be used to assess the likely impact of the feature on customers. For example, Netflix can analyse whether a new feature increases user view time.
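
The feature-launch case can be sketched as a two-sample test. The example below is a hypothetical A/B comparison of view time: the data are simulated, the group sizes and means are invented, and the z-test uses the sample standard deviations as a large-sample approximation. It is not a description of how any particular company runs its experiments.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def two_sample_z_test(control, treatment):
    """One-sided large-sample z-test of H1: mean(treatment) > mean(control)."""
    std_err = sqrt(stdev(control) ** 2 / len(control) + stdev(treatment) ** 2 / len(treatment))
    z = (mean(treatment) - mean(control)) / std_err
    p_value = 1 - NormalDist().cdf(z)
    return z, p_value


# Simulated daily view time in minutes (hypothetical numbers).
rng = random.Random(42)
control = [rng.gauss(60, 10) for _ in range(500)]    # users without the new feature
treatment = [rng.gauss(64, 10) for _ in range(500)]  # users with the new feature

z, p_value = two_sample_z_test(control, treatment)
print(f"z = {z:.2f}, p-value = {p_value:.4g}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

Here the null hypothesis is that the feature does not increase view time, so rejecting it is evidence in favour of the launch.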

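Points to remember

1. The aim of hypothesis testing is to reject the null hypothesis. If it cannot be rejected, that does not mean the null hypothesis is accepted as true; it only means the data did not provide enough evidence against it.

2. The t-test and z-test assume that the population follows a normal distribution, or that the sample is large enough for the central limit theorem to apply. Other criteria also apply.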
