[TBD] Basics of statistics with machine learning

Terminology

Statistics

Gathering, describing and analysing data

Descriptive statistics

A way of summarising data so that it can be visualised and later used as a basis for prediction.

Descriptive statistics doesn't use probability. (Ref here)

- Univariate
  - Location - Mean, median
  - Variation - Standard deviation, range
- Bivariate
  - Co-variance, correlation
- Multi-variate
  - Linear regression

Population

Complete set of possible values

Sample

Subset of population

Mean

The arithmetic average: the sum of the values divided by the number of values

Standard deviation

It is the square root of the average squared difference from the mean value

Mean absolute deviation

It is the sum of the absolute distances from the mean value divided by the number of values. It is popular in machine learning

Ref: https://youtu.be/Vfo5le26IhY?t=3974 (Prof Abhinanda Sarkar)

Median

Middle number once values are sorted
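
The four univariate measures defined above can be computed with the Python standard library. This is a minimal sketch; the `values` list is a made-up sample, and the standard deviation shown is the population form (dividing by the number of values).

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample, chosen for round results

mean = statistics.mean(values)        # sum of values / number of values -> 5
median = statistics.median(values)    # middle of the sorted values -> 4.5

# Population standard deviation: square root of the mean squared
# difference from the mean -> 2.0
std_dev = statistics.pstdev(values)

# Mean absolute deviation: mean of the absolute differences from the mean -> 1.5
mad = sum(abs(x - mean) for x in values) / len(values)

print(mean, median, std_dev, mad)
```

With an even number of values, as here, the median is the average of the two middle values.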

Distribution

It is the graph that represents how the values of a characteristic are spread across a population. For example, the frequency of each possible watermelon length makes a distribution

Sampling distribution

The distribution of a statistic computed over many samples. For example, the distribution of sample means is a sampling distribution

Unbiased sampling

Sampling in which every item has an equal random chance of being selected.

- A sample taken with a systematic process tends to be biased, because not every item gets a fair chance of being selected.

Law of large numbers

In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed [Ref here]
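
The law can be seen in a small simulation. This is only a sketch, assuming a fair six-sided die whose expected value is 3.5; the function name `mean_of_die_rolls` is invented for the illustration.

```python
import random

random.seed(0)  # fixed seed, only so the sketch is repeatable

def mean_of_die_rolls(n_rolls):
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    rolls = [random.randint(1, 6) for _ in range(n_rolls)]
    return sum(rolls) / n_rolls

# The average should drift towards the expected value 3.5 as n grows
for n in (10, 100, 10_000, 100_000):
    print(n, mean_of_die_rolls(n))
```

Small runs can land far from 3.5, while the large runs should sit very close to it.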

Be careful

- Not every survey is a statistical survey. A statistical survey has to fulfil certain criteria
- A person can run a survey that appears statistical but in reality is not
- A person can process survey data so that it says what the person wants, not what is actually true

Statistical criteria

- The sample should be taken in such a way that it represents the population
  - For example, in a telephone company feedback survey, if the sample is taken from a specific city where people are happy with the service, the survey result will show positive feedback. If the survey is taken from another city with different economic conditions, the result will be different.
- Prediction depends on the data at hand
  - The distribution of the data drives the prediction
  - Imagine that the data did not see a car coming and told a person that it is safe to cross the road. What will happen to the person? Similarly, incomplete data leads to wrong predictions.

Mean vs median

Choice depends on the problem at hand

- Outlier
  - A single very large value can skew the mean. For example, one person with a very high salary can make the average salary unrealistic
  - The median is hardly affected by a single very large value
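
The effect of one outlier on the two measures can be checked directly. The salary figures below are invented for the illustration.

```python
import statistics

salaries = [30_000, 32_000, 35_000, 38_000, 40_000]  # hypothetical salaries
with_outlier = salaries + [1_000_000]                # one very high earner added

print(statistics.mean(salaries), statistics.median(salaries))
# mean 35000, median 35000

print(statistics.mean(with_outlier), statistics.median(with_outlier))
# the mean jumps to about 195833, the median only moves to 36500
```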

Statistical tips

If the distribution does not have many outliers, use the mean. Otherwise the median is better.
Also consider the problem in context. For example,
- If you want to know the income of a typical Indian, use the median. If you are interested in per capita income, use the mean.
- If you want to know how much time a typical visitor spends on your website, use the median
Mean and median carry different kinds of information
- If you are analysing one day of your own browsing history,
  - the median tells you the time you spent on a typical site
  - if the mean is much higher than the median, it tells you that you spent significantly more time on a few websites.

Co-variance

Inter-relation between variables.
- Example: height versus weight
It tells how one variable is related to another variable

- If it is positive, both tend to move in the same direction
- If it is negative, they tend to move in opposite directions
- If it is zero, there is no linear relationship. The variables can still be related in a non-linear way, so zero co-variance does not prove independence
- Risk: when the units of X and Y are different, the value is hard to interpret.
- Co-variance(X,X) = Square(Std deviation), i.e. the variance of X
- Co-variance retains the units of X and Y. For example, a length can be in mm, km etc. A change of unit changes the co-variance value
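
A short sketch of these properties, using the population form of co-variance (dividing by the number of values). The height and weight figures are invented, and `covariance` is a helper written for this note, not a library function.

```python
import statistics

def covariance(xs, ys):
    """Population co-variance: mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

heights_cm = [150, 160, 170, 180, 190]  # hypothetical data
weights_kg = [50, 58, 66, 77, 84]

# Positive: taller people in this sample are also heavier
print(covariance(heights_cm, weights_kg))            # 174.0

# Changing the unit of height (cm to m) changes the value
heights_m = [h / 100 for h in heights_cm]
print(covariance(heights_m, weights_kg))             # about 1.74

# Co-variance of a variable with itself is its variance
print(covariance(heights_cm, heights_cm), statistics.pstdev(heights_cm) ** 2)

# Zero co-variance without independence: y is fully determined by x (y = x^2)
xs = [-2, -1, 0, 1, 2]
ys = [x * x for x in xs]
print(covariance(xs, ys))                            # 0.0
```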

Correlation

Correlation = co-variance divided by the product of the standard deviation of X and the standard deviation of Y

- Correlation removes the units from the co-variance, so a change of unit does not affect the correlation value. It always lies between -1 and +1

Ref: https://youtu.be/Vfo5le26IhY?t=10837
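
The unit independence of correlation can be checked with a few lines. This is a sketch with invented height and weight data; `covariance` and `correlation` are helpers written for this note.

```python
from math import sqrt

def covariance(xs, ys):
    """Population co-variance: mean of (x - mean_x) * (y - mean_y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

def correlation(xs, ys):
    """Co-variance divided by the product of the two standard deviations."""
    std_x = sqrt(covariance(xs, xs))
    std_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

heights_cm = [150, 160, 170, 180, 190]  # hypothetical data
weights_kg = [50, 58, 66, 77, 84]
heights_mm = [h * 10 for h in heights_cm]

# Co-variance scales with the unit, correlation does not
print(covariance(heights_cm, weights_kg), covariance(heights_mm, weights_kg))
print(correlation(heights_cm, weights_kg), correlation(heights_mm, weights_kg))
```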

Univariate analysis

Analyse the data one variable at a time

Statistics in Machine Learning

PAC learning

Probably -> From Statistics

Approximately -> From machine learning

Correct

Learning

MAD

Mean absolute deviation is used in machine learning

Least squares

This is about minimising the sum of squared distances between the predictions and the actual values. The predictor can be a line, or a more complex model such as a neural network or an SVM

Ref: https://youtu.be/Vfo5le26IhY?t=14270

To minimise it, differential calculus can be used: take the derivative of the squared error with respect to each parameter and set it to zero.
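
For the simplest case, a straight line y = slope * x + intercept, setting the two derivatives to zero gives a closed-form answer. The sketch below assumes that single-variable case; `fit_line` and the data points are made up for the illustration.

```python
def fit_line(xs, ys):
    """Least-squares line through the points (xs[i], ys[i]).

    Setting the derivatives of sum((y - (slope * x + intercept))**2)
    with respect to slope and intercept to zero gives:
        slope     = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)**2)
        intercept = mean_y - slope * mean_x
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical noisy data lying roughly on y = 2x
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(fit_line(xs, ys))
```

More complex models usually have no closed form, and the same squared error is then minimised numerically, for example by following the derivative step by step.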

Training

Train the system on labelled data. Least squares can be used for training.

Statistical learning

Probability of occurrence

Consider a salesperson who has a 10% chance of selling a product to any given customer. On a typical day, what is the chance of selling the product to two customers? The answer also depends on the number of customers visited that day.

This question is similar to finding 2 defectives out of 3 samples.

Ref: https://youtu.be/Vfo5le26IhY?t=17283
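
One way to work this out is the binomial formula. The sketch assumes that every customer visit is independent with the same 10% chance of a sale and that "two customers" means exactly two; the customer counts and the 10% defect rate in the second part are assumptions added for the illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Chance of exactly 2 sales, for different numbers of customers visited
for n_customers in (5, 10, 20):
    print(n_customers, binomial_pmf(2, n_customers, 0.10))

# Same structure: exactly 2 defectives out of 3 samples,
# assuming a hypothetical 10% defect rate -> 3 * 0.1**2 * 0.9 = 0.027
print(binomial_pmf(2, 3, 0.10))
```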

Mutually exclusive event

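If two events cannot occur at the same time, they are called mutually exclusive. This is different from independence, where the occurrence of one event does not change the probability of the other.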

For example, being defective and not defective are mutually exclusive.

Conditional probability

Probability of an event given the knowledge that another event has already occurred. For example, the probability of selling a computer peripheral knowing that the customer is an IT professional

Ref: https://youtu.be/Vfo5le26IhY?t=18445

Marginal probability

Marginal probability is the probability of an event irrespective of the outcome of another variable.

Ref: https://machinelearningmastery.com/joint-marginal-and-conditional-probability-for-machine-learning/
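
Marginal and conditional probabilities can both be read off a table of counts. The counts below are invented purely to illustrate the IT professional example; they are not from the lecture or the article.

```python
# Hypothetical counts of customers, by profession and purchase outcome
counts = {
    ("IT professional", "bought"): 30,
    ("IT professional", "did not buy"): 20,
    ("other", "bought"): 10,
    ("other", "did not buy"): 40,
}
total = sum(counts.values())

# Marginal probability: P(bought), irrespective of profession
p_bought = sum(
    n for (_, outcome), n in counts.items() if outcome == "bought"
) / total

# Marginal probability: P(IT professional), irrespective of outcome
p_it = sum(
    n for (profession, _), n in counts.items() if profession == "IT professional"
) / total

# Joint probability: P(IT professional and bought)
p_it_and_bought = counts[("IT professional", "bought")] / total

# Conditional probability: P(bought | IT professional)
p_bought_given_it = p_it_and_bought / p_it

print(p_bought, p_it, p_it_and_bought, p_bought_given_it)
```

In this made-up table, knowing that the customer is an IT professional raises the chance of a sale from 0.4 to 0.6.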

Multiplication rule

If two events are independent, then P(A and B) = P(A)P(B)

Bayes theorem

P(Spam | congrats) = [P(congrats | Spam) P(Spam)] / [P(congrats | Spam) P(Spam) + P(congrats | not Spam) P(not Spam)]

Ref: https://en.wikipedia.org/wiki/Bayes%27_theorem
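
A numeric sketch of the spam example. All three input probabilities are invented for the illustration, and `posterior_spam` is a hypothetical helper name.

```python
def posterior_spam(p_word_given_spam, p_word_given_not_spam, p_spam):
    """P(Spam | word) by Bayes' theorem."""
    numerator = p_word_given_spam * p_spam
    # Total probability of seeing the word, in spam or in non-spam mail
    evidence = numerator + p_word_given_not_spam * (1 - p_spam)
    return numerator / evidence

# Hypothetical inputs: 20% of mail is spam, "congratulations" appears in
# 50% of spam and in 5% of non-spam mail
print(posterior_spam(0.50, 0.05, 0.20))  # about 0.714
```

Even though only 20% of mail is spam under these assumptions, seeing the word raises the probability of spam to roughly 71%.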

Binomial distribution

Example use: estimating the probable number of escalations in a customer care centre. Here N is given, so the result can be viewed as a proportion out of N

For N = 7 trials, the expected number of successes is 7 * 0.6 = 4.2, where 0.6 is the probability of success

Ref: https://youtu.be/Vfo5le26IhY?t=23423
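
The expected value of 4.2 can be cross-checked against the full distribution for N = 7 and p = 0.6. A minimal sketch using the standard binomial formula:

```python
from math import comb

n, p = 7, 0.6

# Probability of exactly k successes, for k = 0..7
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

for k, prob in enumerate(pmf):
    print(k, round(prob, 4))

# Expected number of successes: sum of k * P(k), which equals n * p
expected = sum(k * prob for k, prob in enumerate(pmf))
print(expected)  # approximately 4.2
```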

Poisson distribution

Here there is no fixed N. The distribution is described by the rate at which events occur.

Example:

- Find the total number of fraud cases.
- Find the total number of cracks in a bottle.
- How many eggs a chicken can lay.
- If, on average, 2 customers arrive at a bank per minute (on a busy day), what is the probability that 4 customers arrive in a given minute? What is the probability that 3 customers arrive in a given minute?

Ref: https://youtu.be/Vfo5le26IhY?t=24007
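
The bank question can be answered with the Poisson formula P(k) = e^(-rate) * rate^k / k!. A minimal sketch, assuming arrivals follow a Poisson distribution with a rate of 2 per minute:

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """Probability of exactly k events when the average is `rate` per interval."""
    return exp(-rate) * rate**k / factorial(k)

rate_per_minute = 2
print(poisson_pmf(4, rate_per_minute))  # about 0.0902
print(poisson_pmf(3, rate_per_minute))  # about 0.1804
```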

Prof Abhinanda Sarkar lecture (Indian Statistical Institute)

https://www.youtube.com/watch?v=Vfo5le26IhY&t=10116s

Practical input for handling probability cases

- Identify the distribution first
  - If the mean and standard deviation are given and the distribution is normal, then that is enough data to compute probabilities.
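
For the normal case, Python's `statistics.NormalDist` (available from Python 3.8) turns a mean and a standard deviation into probabilities. The mean of 100 and standard deviation of 15 are arbitrary values chosen for this sketch.

```python
from statistics import NormalDist

scores = NormalDist(mu=100, sigma=15)  # hypothetical mean and standard deviation

# P(X <= 115): one standard deviation above the mean, about 0.8413
p_below_115 = scores.cdf(115)

# P(85 < X < 115): within one standard deviation, about 0.6827
p_between = scores.cdf(115) - scores.cdf(85)

# P(X > 130): more than two standard deviations above, about 0.0228
p_above_130 = 1 - scores.cdf(130)

print(p_below_115, p_between, p_above_130)
```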

Statistics tips

- Collect the right sample to get a feel for the data. (Ref: https://youtu.be/Kh8oTnHzugE?t=294)
  - The sample should be random (unbiased)
- The right sample can also help to understand the distribution?

The larger the sample size, the smaller the sampling error
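
A small simulation makes the last point concrete: the means of larger random samples scatter less around the population mean. The population, sample sizes and the helper name `spread_of_sample_means` are all made up for this sketch.

```python
import random
import statistics

random.seed(1)  # fixed seed, only so the sketch is repeatable

# Hypothetical population: 100,000 values with mean near 50, std deviation near 10
population = [random.gauss(50, 10) for _ in range(100_000)]

def spread_of_sample_means(sample_size, n_samples=500):
    """Standard deviation of the means of repeated random samples."""
    means = []
    for _ in range(n_samples):
        sample = random.sample(population, sample_size)
        means.append(sum(sample) / sample_size)
    return statistics.pstdev(means)

# The spread should shrink roughly with the square root of the sample size
for size in (25, 100, 400):
    print(size, round(spread_of_sample_means(size), 3))
```

The list of sample means collected inside the function is itself an approximation of the sampling distribution of the mean.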

Reference

https://www.youtube.com/watch?v=bXrvHkbByik&list=PLnVYEpTNGNtXTmmcpa60hHL-LT4Hynoss