Gathering, describing and analysing data
A method of describing data so that the data can be visualised and later used as the basis for prediction.
Descriptive statistics doesn't use probability. (Ref here)
Univariate
Location - Mean, median
Variation - Standard deviation, range
Bivariate
Co-variance, correlation
Multi-variate
Linear regression
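Population: the complete set of possible values
Sample: a subset of the population
Mean: the average of the values
Standard deviation: the square root of the average of the squared differences from the mean
Mean absolute deviation: the sum of the absolute distances from the mean, divided by the number of values. It is popular in machine learning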
Ref: https://youtu.be/Vfo5le26IhY?t=3974 (Prof Abhinandan Sarkar)
Median: the middle value once the values are sorted
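The measures described above can be checked by hand on a small data set. This is a minimal sketch: the list `values` is made-up data chosen so the arithmetic comes out round, and the standard deviation is the population form (dividing by the number of values).

```python
import math
import statistics

# Hypothetical data, chosen only so the results are easy to verify by hand.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)

# Mean: the average of the values.
mean = sum(values) / n

# Standard deviation: square root of the average squared difference from the mean.
std_dev = math.sqrt(sum((v - mean) ** 2 for v in values) / n)

# Mean absolute deviation: average absolute distance from the mean.
mean_abs_dev = sum(abs(v - mean) for v in values) / n

# Median: the middle value once the values are sorted.
median = statistics.median(values)

print(mean, std_dev, mean_abs_dev, median)
```

For this data the mean is 5, the standard deviation is 2, the mean absolute deviation is 1.5 and the median is 4.5.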
Distribution: the graph that represents the values of a characteristic in a population. For example, the frequency of watermelons at each possible length makes a distribution
Sampling distribution: the distribution of sample means
Random sampling: sampling in which each item has a random chance of being selected.
A sample taken through a systematic process can be biased, because not every item has been given a chance of being selected.
In probability theory, the law of large numbers (LLN) is a theorem that describes the result of performing the same experiment a large number of times. According to the law, the average of the results obtained from a large number of trials should be close to the expected value and will tend to become closer to the expected value as more trials are performed [Ref here]
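A small simulation can illustrate the law of large numbers. The sketch below assumes a fair six-sided die, whose expected value is 3.5; the function name `average_of_die_rolls` and the fixed seed are illustrative choices, not part of the original notes.

```python
import random

def average_of_die_rolls(n_rolls, seed=0):
    """Average of n_rolls simulated rolls of a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total = 0
    for _ in range(n_rolls):
        total += rng.randint(1, 6)
    return total / n_rolls

# The average should drift towards the expected value 3.5 as the count grows.
for n in (10, 1_000, 100_000):
    print(n, average_of_die_rolls(n))
```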
Not every survey is a statistical survey. A statistical survey must fulfil certain criteria
A person can run a survey that appears to be statistical but in reality is not
A person can process survey data so that it says what the person wants, not what is actually true
The sample should be taken in such a way that it represents the population
For example, in a telephone company feedback survey, if the sample is taken from a specific city where people are happy with the service, the survey result will show positive feedback. On the other hand, if the survey is taken in another city with different economic conditions, the result will be different.
Prediction is dependent on the data at hand
The distribution of the data drives the prediction
Suppose the data did not see a car coming and told a person that it was safe to cross the road. What would happen to that person? Similarly, incomplete data leads to wrong predictions.
Choice depends on the problem at hand
Outlier
A single very large value can skew the mean. For example, one person with a very high salary can make the average salary unrealistic
The median is not affected by a single very large value
If the distribution does not have many outliers, use the mean. Otherwise the median is the better choice.
Also consider the problem in context. For example,
If you want to know the income of most Indians, use the median. If you are interested in per capita income, use the mean.
If you want to know how much time most visitors spend on your website, use the median
Mean and median carry different kinds of information
If you are analysing one day of your own browsing history,
then the median tells you the time you spent on most of the sites you visited
If the mean is much higher than the median, it tells you that you spent significantly more time on a few websites.
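The effect of a single outlier on the mean and the median can be seen with a few numbers. This is only a sketch: the salary figures are invented for illustration.

```python
import statistics

# Hypothetical salaries; the last entry added below is the single very large value.
salaries = [30_000, 32_000, 35_000, 38_000, 40_000]
with_outlier = salaries + [1_000_000]

print("mean without outlier:  ", statistics.mean(salaries))
print("median without outlier:", statistics.median(salaries))
print("mean with outlier:     ", statistics.mean(with_outlier))
print("median with outlier:   ", statistics.median(with_outlier))
```

Without the outlier the mean and the median are both 35,000. With it, the median moves only to 36,500 while the mean jumps to roughly 195,833, which no longer describes a typical salary.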
Interrelation among multiple variables.
An example is height versus weight
It tells how one variable is related to another
If it is positive, both variables tend to move in the same direction
If it is negative, the variables tend to move in opposite directions
If it is zero, there is no linear relationship, but the variables may still be related in a non-linear way
Risk: if the units of X and Y are different, the value cannot be interpreted easily.
Co-variance(X,X) = Square(Std deviation)
Co-variance retains the units of X and Y. For example, a length may be in mm, km etc. A change of unit changes the co-variance value
Correlation = co-variance divided by the product of the standard deviation of X and the standard deviation of Y
Correlation removes the units from the co-variance. So the choice of unit does not affect the correlation value
Ref: https://youtu.be/Vfo5le26IhY?t=10837
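The difference between co-variance and correlation under a change of unit can be sketched in a few lines. The height and weight figures are made up, and `covariance` and `correlation` are helper names introduced here; they use the population form (dividing by the number of values).

```python
import math

def covariance(xs, ys):
    """Population co-variance of two equally long lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Co-variance divided by the product of the two standard deviations."""
    std_x = math.sqrt(covariance(xs, xs))  # Co-variance(X,X) is the variance of X
    std_y = math.sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)

# Hypothetical measurements.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 58, 66, 77, 85]
heights_mm = [h * 10 for h in heights_cm]  # same heights, different unit

print(covariance(heights_cm, weights_kg), covariance(heights_mm, weights_kg))
print(correlation(heights_cm, weights_kg), correlation(heights_mm, weights_kg))
```

Switching from cm to mm multiplies the co-variance by 10, while the correlation stays the same.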
Analyse the data one variable at a time
Probably -> From Statistics
Approximately -> From machine learning
Correct
Learning
This is about minimising the sum of the squared distances between the predictions and the actual values. The prediction can come from a line, or from a more complex model such as a neural network or an SVM
Ref: https://youtu.be/Vfo5le26IhY?t=14270
To minimise it, derivatives from differential calculus can be used.
Train the system on labelled data. Least squares can be used for training.
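For the simplest case, a straight line, setting the derivatives of the squared-error sum to zero gives a closed-form answer. The sketch below assumes that case only; `fit_line` is a name introduced here and the data points are invented.

```python
def fit_line(xs, ys):
    """Least-squares slope and intercept for y = slope * x + intercept.

    These formulas come from setting the derivatives of the sum of squared
    errors with respect to slope and intercept to zero.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical labelled data that roughly follows y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
print(slope, intercept)
```

Models such as neural networks have no closed-form solution like this, so the same derivatives are typically used step by step (gradient descent) instead.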
Consider a salesperson who has a 10% chance of selling a product to each customer. On a typical day, what is the chance of selling the product to two customers? The answer also depends on the number of customers visited that day.
This question is similar to finding 2 defectives in 3 samples.
Ref: https://youtu.be/Vfo5le26IhY?t=17283
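One way to answer this kind of question is the binomial formula, assuming each customer is an independent trial with the same 10% chance. The helper `binomial_pmf` and the customer counts tried below are illustrative assumptions.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials,
    each with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# 2 defectives out of 3 samples, with a 10% defect chance.
print(binomial_pmf(2, 3, 0.10))

# The salesperson case: the answer changes with the number of customers visited.
for n_customers in (5, 10, 20):
    print(n_customers, binomial_pmf(2, n_customers, 0.10))
```

For 3 trials the result is 3 * 0.1 * 0.1 * 0.9 = 0.027.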
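If two events cannot occur at the same time, they are called mutually exclusive. This is not the same as independence: when one of two mutually exclusive events occurs, the other cannot, so they are dependent.
For example, an item being defective and not being defective are mutually exclusive.
Conditional probability is the probability of an event given the knowledge that another event has already occurred. For example, the probability of selling a computer peripheral knowing that the customer is an IT professional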
Ref: https://youtu.be/Vfo5le26IhY?t=18445
Marginal probability is the probability of an event irrespective of the outcome of another variable.
Ref: https://machinelearningmastery.com/joint-marginal-and-conditional-probability-for-machine-learning/
If two events are independent, then P(A and B) = P(A)P(B)
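The product rule for independent events can be verified by counting outcomes. This sketch assumes two fair dice; the events "first die is even" and "the sum is 7" are chosen here only because they happen to be independent.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def probability(event):
    """Fraction of outcomes for which event(outcome) is true."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

def first_is_even(o):
    return o[0] % 2 == 0

def sum_is_seven(o):
    return o[0] + o[1] == 7

p_a = probability(first_is_even)
p_b = probability(sum_is_seven)
p_a_and_b = probability(lambda o: first_is_even(o) and sum_is_seven(o))

# Conditional probability P(B|A) = P(A and B) / P(A).
p_b_given_a = p_a_and_b / p_a

print(p_a, p_b, p_a_and_b, p_b_given_a)
```

Here P(A) = 1/2, P(B) = 1/6 and P(A and B) = 1/12, so P(A and B) = P(A)P(B), and knowing A does not change the probability of B.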
P(A|B) = [P(B|A)P(A)]/P(B) = [P(B|A)P(A)]/ [P(B|A)P(A) + P(B| not A)P(not A)]
P(Spam | congrats) = [P(congrats | Spam) P(Spam)] / [P(congrats | Spam) P(Spam) + P(congrats | not Spam) P(not Spam)]
Ref: https://en.wikipedia.org/wiki/Bayes%27_theorem
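The spam example can be worked through numerically. All three input probabilities below are invented for illustration, and `bayes_posterior` is a helper name introduced here.

```python
def bayes_posterior(p_b_given_a, p_a, p_b_given_not_a):
    """P(A|B) from Bayes' theorem, expanding P(B) over A and not A."""
    p_not_a = 1 - p_a
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    return p_b_given_a * p_a / p_b

# Hypothetical figures: 20% of mail is spam, the word "congrats" appears
# in 50% of spam and in 5% of non-spam mail.
p_spam = 0.20
p_word_given_spam = 0.50
p_word_given_not_spam = 0.05

p_spam_given_word = bayes_posterior(p_word_given_spam, p_spam, p_word_given_not_spam)
print(p_spam_given_word)
```

With these figures the result is 0.10 / (0.10 + 0.04), roughly 0.71, so seeing the word raises the probability of spam from 20% to about 71%.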
Example use: finding the probable number of escalations in customer care. Here N is given, so the result can be viewed as a proportion out of N
For N = 7 trials, the expected number of successes is 7 * 0.6 = 4.2, where 0.6 is the probability of success
Ref: https://youtu.be/Vfo5le26IhY?t=23423
Here a fixed N is not meaningful. The distribution is described by the rate at which events occur.
Example:
Find the total number of fraud cases.
Find the total number of cracks in a bottle.
How many eggs a chicken lays
If, on average, 2 customers arrive at a bank per minute (on a busy day), what is the probability that 4 customers arrive in a given minute? What is the probability that 3 customers arrive in a given minute?
Ref: https://youtu.be/Vfo5le26IhY?t=24007
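The bank question can be answered with the Poisson formula, assuming arrivals follow a Poisson distribution with a rate of 2 per minute. `poisson_pmf` is a helper name introduced here.

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """Probability of exactly k events in an interval when events occur
    at the given average rate per interval."""
    return exp(-rate) * rate ** k / factorial(k)

rate_per_minute = 2  # average arrivals per minute, from the example

p_four = poisson_pmf(4, rate_per_minute)
p_three = poisson_pmf(3, rate_per_minute)
print(p_four, p_three)
```

Under that assumption the probability of 4 arrivals in a minute is about 0.09 and the probability of 3 arrivals is about 0.18.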
https://www.youtube.com/watch?v=Vfo5le26IhY&t=10116s
Identify the distribution first
If the mean and standard deviation are given and the distribution is normal, that is enough information to compute probabilities.
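As a sketch of how a mean and a standard deviation are enough for a normal distribution, Python's standard library class `statistics.NormalDist` can compute the probabilities. The mean of 100 and standard deviation of 15 are hypothetical values.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

# Probability of a value below 115 (one standard deviation above the mean).
p_below_115 = dist.cdf(115)

# Probability of a value within one standard deviation of the mean.
p_within_one_sd = dist.cdf(115) - dist.cdf(85)

print(p_below_115, p_within_one_sd)
```

These come out to about 0.84 and 0.68, matching the usual rule that roughly 68% of a normal distribution lies within one standard deviation of the mean.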
Collect the right sample to get a feel for the data. (Ref: https://youtu.be/Kh8oTnHzugE?t=294)
Sample should be random (unbiased)
Can the right sample also help to understand the distribution?
The larger the sample, the smaller the sampling error
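The link between sample size and sampling error can be illustrated by drawing many random samples and looking at how much their means vary. This is a simulation sketch under assumed conditions: the population is uniform between 0 and 100, and `spread_of_sample_means` is a name introduced here.

```python
import random
import statistics

def spread_of_sample_means(sample_size, n_samples=500, seed=0):
    """Standard deviation of the means of n_samples random samples,
    each of the given size, drawn from a uniform(0, 100) population."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(n_samples):
        sample = [rng.uniform(0, 100) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return statistics.pstdev(means)

# The spread of the sample means shrinks as the sample size grows.
for size in (10, 100, 1000):
    print(size, spread_of_sample_means(size))
```

The list of means built inside the function is an empirical version of the distribution of sample means, and its spread is what shrinks as the sample gets larger.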
https://www.youtube.com/watch?v=bXrvHkbByik&list=PLnVYEpTNGNtXTmmcpa60hHL-LT4Hynoss