16th Apr 2022
Machine learning Vs Statistical learning
Machine learning needs hyperparameter tuning
Packages
Pandas
Seaborn
Matplotlib
scipy
Pandas
Read file
mypd.read_excel
Convert data to DataFrame format
For analysis among categorical variables (cross tabulation)
mypd.crosstab
Preprocessing
Missing value
Mean
Median
Combining data
Merge
Append
Scaling
Min-max
Z-transform
Use sklearn for both of the above
Each non-categorical column is done with transform
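A minimal sketch of the two scaling methods above using sklearn. The column names and values are made up for illustration and are not from the course data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical data: two numeric columns on very different scales.
df = pd.DataFrame({"height_cm": [150.0, 160.0, 170.0, 180.0],
                   "load_kg": [40.0, 55.0, 70.0, 100.0]})

# Min-max scaling maps each column to the range [0, 1].
minmax = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)

# Z-transform gives each column mean 0 and (population) standard deviation 1.
zscore = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns)

print(minmax)
print(zscore)
```

Categorical columns would be left out of the DataFrame passed to the scaler.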
Visualization
Pyplot
Box plot
Pie plot
Bar plot
Scatter
Seaborn
Pair plot
Use hue if response is categorical
Table processing
Remove columns
iloc
Remove rows
iloc
Select rows
iloc
.columnname
Hypothesis test
T Test
Paired T test
Normality test
Q-Q plot -> Check based on percentiles (quantiles). If the expected bell shape and the actual shape of the data coincide, then the data is bell shaped (normal), otherwise not
Shapiro-Wilk test
For all, if p-value > 0.05, then H0 is not rejected (informally, accepted), else H0 is rejected
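The tests listed above can be tried with scipy as in the sketch below. The two samples are invented numbers, and the 0.05 threshold is the one used in these notes.

```python
from scipy import stats

# Hypothetical productivity readings for the same 8 workers on two shifts.
day = [52.1, 49.8, 51.5, 50.9, 53.0, 50.2, 51.7, 52.4]
night = [47.9, 48.5, 46.8, 49.1, 47.2, 48.8, 47.5, 48.0]

# Independent two-sample t test. H0: both means are equal.
t_ind, p_ind = stats.ttest_ind(day, night)

# Paired t test. H0: the mean of the pairwise differences is zero.
t_rel, p_rel = stats.ttest_rel(day, night)

# Shapiro-Wilk normality test. H0: the sample comes from a normal distribution.
w_stat, p_norm = stats.shapiro(day)

alpha = 0.05
for name, p in [("t test", p_ind), ("paired t test", p_rel),
                ("Shapiro-Wilk", p_norm)]:
    decision = "reject H0" if p < alpha else "do not reject H0"
    print(f"{name}: p = {p:.4f} -> {decision}")
```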
Analysis of variance
Anova
Cross tabulation
Applied when both features are categorical
Similar to ANOVA, but uses only counts
Chi-square test
Identify Relation between two features
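A small sketch of cross tabulation followed by a chi-square test, assuming two made-up categorical columns (`shift` and `defect`). It uses pandas `crosstab` and scipy `chi2_contingency`.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: 50 day-shift and 50 night-shift items, with defect flags.
df = pd.DataFrame({
    "shift": ["day"] * 50 + ["night"] * 50,
    "defect": ["yes"] * 10 + ["no"] * 40 + ["yes"] * 30 + ["no"] * 20,
})

# Cross tabulation: counts for every combination of the two categories.
table = pd.crosstab(df["shift"], df["defect"])
print(table)

# Chi-square test. H0: the two categorical features are independent.
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.5f}, dof = {dof}")
```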
Correlation analysis
Between two variables
Via scatter plot
Via measuring linear relationship
Pearson’s correlation coefficient
If r > 0 -> +ve correlation
If r < 0 -> -ve correlation
If r ~ 0 -> no linear correlation
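The three cases above can be checked numerically. This is a sketch with invented arrays, using `numpy.corrcoef` to get Pearson's r.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson's correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pos = 2 * x + 1                              # rises with x
y_neg = -3 * x + 10                            # falls with x
y_none = np.array([2.0, 1.0, 3.0, 1.0, 2.0])   # no linear trend with x

r_pos = pearson_r(x, y_pos)
r_neg = pearson_r(x, y_neg)
r_none = pearson_r(x, y_none)
print(r_pos, r_neg, r_none)
```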
Module-7
Naive Bayes classifier
It works on large amounts of data
Module 6: Supervised learning
Video3: Bagging and Bootstrapping
Bootstrap is a process where samples are picked in a memoryless manner
Selection with replacement
It is used for random-forest sample collection
Regression - Average
Classification - Majority of vote
Bagging vs random forest
Bagging considers all p predictors
Random forest considers a random subset of about sqrt(p) predictors at each split
All of the above are called ensemble models
In a general ensemble model, different models can be used, e.g. K-NN, CART, etc
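A rough sketch of the bootstrap idea, assuming a tiny made-up data set. Each "model" here is just the sample mean, to keep the example short; in a real random forest each bootstrap sample would train a tree.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
data = np.array([3.0, 5.0, 7.0, 9.0, 11.0, 13.0])

# Bootstrap: draw samples of the same size WITH replacement.
n_models = 200
boot_means = []
for _ in range(n_models):
    sample = rng.choice(data, size=len(data), replace=True)
    boot_means.append(sample.mean())

# Regression: the ensemble output is the average of the individual outputs.
bagged_estimate = float(np.mean(boot_means))

# Classification: the ensemble output is the majority vote.
votes = np.array([1, 0, 1, 1, 0])   # hypothetical class votes from 5 models
majority = int(np.bincount(votes).argmax())

print(bagged_estimate, majority)
```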
============
Module 6: Supervised learning
Video1: CART (Classification and Regression tree)
DecisionTreeClassifier
DecisionTreeRegressor
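A minimal sketch of the two sklearn classes named above, on made-up one-feature data. `max_depth` is set here as one way to limit partitioning; the value 2 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Hypothetical data: one feature, response changes at x = 5.
X = np.arange(10, dtype=float).reshape(-1, 1)
y_class = (X.ravel() >= 5).astype(int)          # categorical response
y_reg = np.where(X.ravel() >= 5, 20.0, 10.0)    # continuous response

# Limiting depth restricts how finely the tree can partition the data.
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y_class)
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y_reg)

class_pred = clf.predict([[2.0], [7.0]])
reg_pred = reg.predict([[2.0], [7.0]])
print(class_pred, reg_pred)
```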
Module 5: Unsupervised learning
Video 1: Cluster analysis
Agglomerative clustering
Hierarchical. Number of groups is not given
k-means clustering
Number of groups is given
Most of the time, both will be used
Agglomerative to identify the number of clusters
Similarity measure
Euclidean Distance is popular
Method
Single linkage
k-means for forming clusters
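The two-step approach above could look like the sketch below. The points are invented, and picking the cluster count from the largest jump in merge distance is one simple heuristic assumed here, not necessarily the rule used in the course.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two well-separated groups.
points = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])

# Step 1: agglomerative clustering, single linkage, Euclidean distance.
Z = linkage(points, method="single", metric="euclidean")
merge_distances = Z[:, 2]

# Heuristic: stop merging just before the largest jump in merge distance.
gaps = np.diff(merge_distances)
n_clusters = len(points) - (int(np.argmax(gaps)) + 1)
hier_labels = fcluster(Z, t=n_clusters, criterion="maxclust")

# Step 2: k-means with the number of clusters found above.
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(points)
print(n_clusters, hier_labels, kmeans.labels_)
```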
=============
Module 5: Unsupervised learning
Video 1: Factor analysis
Factor analysis
It is unsupervised learning method
It is useful if variables/features are inter-related
If all variables are related, then a smaller number of factors can represent the same number of variables
Correlation matrix helps to identify factors
min-max scaling is used for ML, z-transform for statistical methods
For large number of variables, corr matrix will not be useful
Use Bartlett test
This test compares the correlation matrix with the identity matrix
Factor analysis is based on eigenvalues
Only factors whose eigenvalue is >= 1.0 are retained
After this, the loadings decide the factors
Rotation is needed in case the loadings are not near the factor lines (axes)
Orthogonal rotation
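The eigenvalue rule above can be sketched with numpy alone. The data is simulated: six variables driven by two hidden factors, which is an assumption made only for this example. Loadings and rotation are not shown.

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 500

def noisy(factor):
    """A variable equal to the hidden factor plus a little noise."""
    return factor + 0.3 * rng.normal(size=n)

# Two hidden factors, each driving three observed variables.
f1 = rng.normal(size=n)
f2 = rng.normal(size=n)
X = np.column_stack([noisy(f1), noisy(f1), noisy(f1),
                     noisy(f2), noisy(f2), noisy(f2)])

# Eigenvalues of the correlation matrix, largest first.
corr = np.corrcoef(X, rowvar=False)
eigenvalues = np.linalg.eigvalsh(corr)[::-1]

# Retain only factors whose eigenvalue is >= 1.0.
n_factors = int((eigenvalues >= 1.0).sum())
print(eigenvalues.round(3), n_factors)
```

The eigenvalues of a correlation matrix add up to the number of variables, so an eigenvalue of 1.0 means the factor explains as much as one original variable.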
================
https://drive.google.com/file/d/1rX0qXkDhp3ArmpJf921nVkSpUIyWPBGT/view
Video 5: Binary Logistic Regression 1
Used when response variable Y is categorical and also binary
Uses the logistic (inverse logit) function to map the output to a probability between 0 and 1
If value is close to 0, consider as 0
If value is close to 1, consider as 1
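A tiny sketch of the mapping described above. The 0.5 cut-off is the common default and is assumed here.

```python
import math

def logistic(z: float) -> float:
    """Map any real number z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_class(z: float, threshold: float = 0.5) -> int:
    """Return 1 if the probability exceeds the threshold, else 0."""
    return 1 if logistic(z) > threshold else 0

for z in (-2.0, 0.0, 2.0):
    print(z, round(logistic(z), 3), predict_class(z))
```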
TODO: How to compute the sensitivity etc
Q. For binary logistic regression, in what case will we not need an intercept?
===========
https://drive.google.com/file/d/1Z7slh4kn1zVrRqwF6TVIphS_5kmrx-Zd/view
Video 4: Dummy Variable Regression
=============
Video 3: Linear Regression 2
Libraries
sklearn
cross_val_score
Cross-validation
Remove data subset for validation
Leave one out cross validation
Remove only one data point for validation
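A sketch of both validation schemes with sklearn `cross_val_score`. The data is simulated, and 5 folds is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, LeaveOneOut

# Hypothetical data: y is roughly 3x + 2 with some noise.
rng = np.random.default_rng(seed=1)
X = rng.uniform(0, 10, size=(30, 1))
y = 3.0 * X.ravel() + 2.0 + rng.normal(scale=0.5, size=30)

model = LinearRegression()

# k-fold cross-validation: each fold is held out once for validation.
kfold_scores = cross_val_score(model, X, y, cv=5, scoring="r2")

# Leave-one-out: each single data point is held out once.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")

print(kfold_scores.mean(), -loo_scores.mean())
```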
From https://drive.google.com/file/d/1uRiUZ6Fgkabrr6obgifgYdHvtZhMjhHv/view (Video 2: Linear Regression 1)
Libraries
scatter_matrix
For plotting scatter matrix
statsmodels
Stats based models
ols - Ordinary least squares regression
scipy
for stats library
Linear regression
Correlation analysis can identify strength of relationship. Regression analysis can identify what kind of relationship exists
Linear
Non-Linear
Simple linear regression
Uses least square estimation method for minimizing error
Multi-variate linear regression uses multiple variables
Based on the p-value in the model.summary output, a variable can be dropped. The p-value is for the hypothesis that the variable's coefficient is zero
Model properties
Significance
If the model is significant, then its result is better than the result without any model (for example, guesswork)
If the p-value of the F-statistic is < 0.05, then the model is significant
Accuracy
Checks if predicted value is closer to actual result
R-square, adjusted R-square, mean squared error, root mean squared error
R-square tells how much of the total variation in the data the model can explain
Adequacy
Residuals (actual-predicted) will follow normal distribution (with mean=0) for good model
In perfect case, all residuals will be zero, but it is not realistic expectation
Generalizable
Model can predict well with new values
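The accuracy and adequacy measures above can be computed by hand, as in this sketch. It uses numpy least squares on made-up numbers rather than statsmodels, so it shows only the metrics, not the p-values from model.summary.

```python
import numpy as np

# Hypothetical data, roughly y = 2x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least squares fit of y = b0 + b1 * x.
X = np.column_stack([np.ones_like(x), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted = X @ coef
residuals = y - predicted            # actual - predicted

ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((y - y.mean()) ** 2))
n, p = len(y), 1                     # p = number of predictors

r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
mse = ss_res / n
rmse = mse ** 0.5

print(coef, r2, adj_r2, mse, rmse, residuals.mean())
```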
ISI ML course exam
=========
16th April 2022
=========
Next step:
Teach others
Advanced session?
It will be planned
Forecasting
NLP
L1/L2 regularisation
ISI takes the consultancy work where ISI will give training/guidance (University collaboration)
There can be MOU for a year about training/guidance and also project work
Contact person -> Baby sir
Current year, ISI is giving training to Airbus
Doubt clarification
Mail to Boby sir. He will mention the time for video session
What is ROC curve? Read about this ……. Mentioned by Saurabh
Books
Introduction to statistical Learning with Applications in R - Springer
Elements of Statistical learning
Statistical learning from Regression perspective - Beck
Pattern recognition and Machine learning - Bishop
Tips for result validation
Look at the predicted class and also the predicted class probability before relying on a prediction
Natural classifier Vs SVM
SVM is natural classifier which allows misclassification
For natural classifier, all points must be classified correctly. SVM will use epsilon to ignore misclassified points
Support Vector Machine
Can be done on numeric variable as well as categorical variable
It is used for classification
It uses the maximal margin classifier
Means that the distance from the classifier boundary to the closest point is maximum
Support vectors
These are the points which lie on or within the margin
Finding maximum margin is an optimisation problem
It uses linear equation
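A minimal linear SVM sketch with sklearn `SVC`, on invented separable points. `C=1.0` is the library default; the margin width formula 2/||w|| applies to the linear case.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable points in two classes.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# The separating line is w . x + b = 0, and the margin width is 2 / ||w||.
w = clf.coef_[0]
margin_width = 2.0 / np.linalg.norm(w)

svm_pred = clf.predict([[1.5, 1.5], [5.5, 5.5]])
print(clf.support_vectors_, margin_width, svm_pred)
```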
Q. What about noise in case of calculating margin?
Ans- It will have an impact. It should be taken care of during data pre-processing, as part of clean-up. This is true for all ML models. Since getting 100% noise-free data is not possible, we should go back to data clean-up if accuracy is not good on real-time data.
Q. What about multi-class classification?
Ans- yes. Methods are available using SVM
Q. What is purpose of Kernel in SVM?
Ans - It provides better flexibility and also better result.
Q. What about non-linear separation?
Ans - SVM supports non-linear
Q. Is SVM still in use?
Ans - yes. For classification, SVM gives best result most of the times.
Q. When to use logistic regression and when to use SVM?
Ans - Logistic regression is good for linear equation. For non-linear, SVM works better
Q. Can SVM be used for regression as well?
Ans - Yes
Q. Is scaling needed before drawing scatter plot?
Ans - No.
Q. Is a scatter plot the same as a pair plot?
Ans - Yes. A pair plot is a grid of scatter plots, one for each pair of variables
Q. Sir, are we calculating accuracy on the training data itself?
Ans - Yes. It is to see whether any noise is causing issues. A high value is expected
Q. Sir, in my team, box plot itself is used for training the model. Is it normal practice?
Ans -
Q. Sir, how probabilities are calculated?
Ans - Need to understand
Q. Sir, how useful it is to read research papers for day-to-day work in industry?
Ans - In industry, only breakthrough papers are good. Most papers are the small improvement and so, might not be useful for industry
Q. How academic research is different than industry?
Ans - In industry, result is important and so only significant improvement is considered. In academic, any small improvement can also be submitted as research paper.
========
9th April 2022
========
CART (Classification And Regression Tree)
It works by partitioning the area and so, it creates a tree structure
Partitioning too much can result in overfitting
Statistical learning Vs Machine learning
Machine learning techniques have better accuracy
Statistical learning techniques are interpretable
Statistical learning techniques make assumptions, for example about the data distribution
CART vs Regression
CART has better accuracy compared to Linear Regression
This happens because CART can capture non-linear relationships as well.
CART is ML model and so, no assumption about data distribution
Classification vs Regression
If response is categorical, it is called classification problem
If response is continuous, it is called regression problem
Q. How confidence is built in ML models considering that it is not interpretable
Ans - Confidence comes based on measurement of model accuracy and generalisation capability
Q. Where statistical models are important?
Ans - At those places, where interpretable result is expected. For example, in medical system, it is important to know how the prescription worked or else side effect can happen unknowingly
Q. What is meant by interpretability?
Ans - It tells which feature is more important and which is less
=========
3rd April 2022
=========
Unsupervised learning
Factor analysis
It is a dimensionality reduction technique
Large number of correlated variables will be converted to a manageable number of uncorrelated or independent factors
It groups similar columns
Computation approach
For N variables, make N linear relations
Using Eigen value analysis of them, we will get factors
Max number of factors can be N
We can perform interpretation as well
Cluster analysis
Example -> Group similar responses in a multi-question survey
Similar rows are grouped together
Items in same group/cluster should be similar. And also, items in different clusters must be dissimilar
Types
Hierarchical cluster
K-means cluster
Q. How the importance of a variable decided? Manual intervention needed?
Ans- Factor analysis can do it based on the data alone.
Q. What are harms if we don’t perform dimension reduction? Any impact on model accuracy as well?
Ans- It impacts the training time. Model accuracy will not get impacted since model will take care of it.
Q. Sir, Is it must to do the analysis on whole data? How much data will be good?
Ans-
=================
26th March 2022
=================
Supervised learning and unsupervised learning
Statistical learning techniques
Statistical learning techniques are easy to interpret, but accuracy may not be as good
Example
Regression technique
Machine learning techniques
machine learning techniques are more accurate, but not easy to interpret
Regression
Y can be
numeric or
it can be non-numeric (Categorical) -> true/false
Ols (Ordinary Least square) regression
X and Y both must be numeric
Binary logistic regression
X will be numeric and Y will be non-numeric
Output 0 if p(Y) < 0.5, 1 if p(Y) > 0.5
Logistic function is used since it gives better result
Properties of good model
Model significance
LLR (log-likelihood ratio test) p-value < 0.05
Model accuracy
Good if Pseudo R-square > 0.6 and accuracy > 80 %
Confusion/prediction matrix is used for this
% of cases which are correctly predicted
Model generalizability
Handling unknown data
For this, test is done on new data
Confusion matrix
Sensitivity or Recall
Out of total +ves, how many are actually detected as positive
Specificity
Out of total -ves, how many are actually detected as negative
Precision
Out of total detected positives, how many are really positives
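The three definitions above, plus accuracy, worked through on a made-up set of ten predictions.

```python
# Hypothetical actual and predicted labels (1 = positive, 0 = negative).
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

pairs = list(zip(actual, predicted))
tp = sum(1 for a, p in pairs if a == 1 and p == 1)   # true positives
fn = sum(1 for a, p in pairs if a == 1 and p == 0)   # false negatives
tn = sum(1 for a, p in pairs if a == 0 and p == 0)   # true negatives
fp = sum(1 for a, p in pairs if a == 0 and p == 1)   # false positives

sensitivity = tp / (tp + fn)     # recall: detected positives / all positives
specificity = tn / (tn + fp)     # detected negatives / all negatives
precision = tp / (tp + fp)       # real positives / all detected positives
accuracy = (tp + tn) / len(pairs)

print(tp, fn, tn, fp)
print(sensitivity, specificity, precision, accuracy)
```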
Q. Sir, when will be advanced course announced?
Ans -
Q. How to get density for each t in order to plot t-statistic graph?
Q. For unsupervised learning, how detection on trained model can happen? For example K-means model
Q. Sir, for all classification based problem, y will be non-numeric(Categorical)?
Ans - Yes. Similarly, numeric based regression problems are called value based problem
Q. Sir, can regression equation be non-linear?
Ans - Yes. For example, it can be polynomial. However, computation will be complex. Python can handle such complexity. For non-linear case, machine learning models are popular since for regression, we need to arrive at formula which is difficult to do
Q. 9.799e-223 -> What e is here?
Ans - It is 9.799 * 10^(-223), i.e. scientific notation
Q. Sir, how to understand "Model significance" in layman's terms?
Ans - It tells whether using the model is better than not using a model (tossing a coin etc)
Q. Do we need to ensure that each X feature should be independent to each other?
Ans -> Yes. X1 and X2 must be linearly independent (https://en.wikipedia.org/wiki/Multicollinearity). It is important for OLS regression, and not as important for logistic regression (Why?)
Q. Sir, linear independence between variables can be checked using pair plot?
Ans ->
Q. How to handle linearly dependent variables?
Ans ->
Q. Sir, what is difference between Logit (statsmodels.api) and LogisticRegression (sklearn.linear_model) python packages?
Ans -> They are similar in functionality, but output is different
Q. What is default ratio for cross-validation?
Ans - 10%
=================
=================
20th March 2022
Read about Statistical English
Ols -> Ordinary Least Square
Example of multi-variate hypothesis test
Compare average productivity for day shift with night shift
Analysis of variance (ANOVA)
Used to compare means across more than two groups
H0: Mean1 = Mean2 = Mean3 = ….. = MeanK
H1: H0 is not true
Data should be properly stacked
Response should be mentioned as separate column
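A sketch of one-way ANOVA on stacked data with scipy `f_oneway`. The shift names and productivity numbers are invented for illustration.

```python
import pandas as pd
from scipy.stats import f_oneway

# Stacked layout: one column for the group, one column for the response.
stacked = pd.DataFrame({
    "shift": ["morning"] * 5 + ["evening"] * 5 + ["night"] * 5,
    "productivity": [52, 54, 53, 55, 51,
                     50, 49, 51, 50, 48,
                     44, 46, 45, 43, 47],
})

# Split the response by group and test H0: all group means are equal.
groups = [g["productivity"].to_numpy() for _, g in stacked.groupby("shift")]
f_stat, p_value = f_oneway(*groups)

print(f"F = {f_stat:.2f}, p = {p_value:.6f}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```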
Q. Sir, could you please explain "Reject H0"? Is it the degree of confidence?
Ans: - Yes. "Reject H0 if p < 0.05" means that H0 should be rejected when the p-value is below 0.05, i.e. at the 95% confidence level
Q. Do all groups need the same number of values?
Ans - No, since it works on averages
Q. Sir, do we need to scale before performing between-group comparison?
Ans- All groups should be measured with the same metric. For example, J2EE and C++ should be measured using the same metric
Q. Sir, Any good book?
Ans -> Statistics by Freedman. He is a top-notch statistician, and he writes in an easy-to-read manner. https://homepages.dcc.ufmg.br/~assuncao/EstatCC/Slides/Extra/FPPExpObs.pdf
Below Qs from assignment
Q. Sir, is it possible to measure correlation between categorical variables?
Ans -> Yes, using the chi-square test. A normal scatter plot should not be used, since a categorical variable normally has only a few distinct values
Q. Where to use pie-chart?
Ans -> For categorical variables
Q. What is difference between bar plot and histogram?
Ans -> Use bar plot for categorical variable, for numeric value, use histogram
===========
Assignment1
For height and load, I am confused which one I should use-> Histogram or pie-chart or both?
height -> Should I use Histogram or pie-chart?
What is difference between bar plot and histogram?
Go through video about Boxplot
============
To resume -> Scaling
Kubernetes infra side projects
Go through https://www.coursera.org/learn/stanford-statistics/home/week/1
Week3
Hypothesis testing
Purpose to validate claim
For example,
average cycle time is reduced to 24%
On the average, time to disperse cash is 5 mins
It was started by Fisher, who validated a lady's claim that she could tell whether the milk was added to the cup before or after the tea
This is known as the lady tasting tea problem https://www.youtube.com/watch?v=I9KsLCc-eiQ
Common Examples
Mean equal to a specific value
First ensure that sample data is taken properly and represents the population
Crude approach is to check if the sample mean is close to the claimed mean
However, closeness is relative, so if we multiply each value by 1000 (scaling), the threshold changes
Also, if we change the data set, values can change
So, using the absolute difference in means is not proper.
So, we need to standardise the difference. This gives the t-statistic used in the t-test
The standardised difference follows a t-distribution, provided the data is approximately normal
Also, changing sample can change the average
This will be taken care of by T-test
Two means are same or not
Two variances are same or not
Approach for testing
Null hypothesis (H0)
Alternate hypothesis (H1)
H0 and H1 should be mutually exclusive and collectively exhaustive
Testing approach
T-Statistics or test-statistic
Constraints to be satisfied for hypothesis testing
Claim should be quantifiable. Below can’t be validated
Every cheque will be validated correctly
Claim should be over population, not on specific case
A person will die at exactly this date
Shape/distribution of the population should be known
There are hypothesis tests which can be used to check this
There are also hypothesis tests for which knowledge of the shape is not needed (called non-parametric tests). However, the performance of such techniques is not as good as that of parametric tests (with known shape)
Multiple alternate hypothesis is not allowed
This is not limitation since multiple alternate can be converted to 0-1 problem
It is mandatory that H0 must contain =
For example average cycle time is less than 24 hours
H0 : Av(Cycle time) >=24
H1: Av(Cycle time) < 24
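The cycle time example above can be sketched as a one-sample, one-sided t-test. The ten readings are invented, and the `alternative="less"` argument assumes scipy 1.6 or newer. The t-statistic is also computed by hand to show the standardised difference.

```python
import math
from scipy import stats

# Hypothetical cycle times in hours. Claim: average is less than 24.
cycle_times = [22.5, 23.1, 21.8, 24.2, 22.9, 23.4, 22.0, 23.7, 22.6, 23.0]
mu0 = 24.0

# Standardised difference: (sample mean - claimed mean) / standard error.
n = len(cycle_times)
mean = sum(cycle_times) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in cycle_times) / (n - 1))
t_manual = (mean - mu0) / (sd / math.sqrt(n))

# H0: mean >= 24, H1: mean < 24.
t_scipy, p_value = stats.ttest_1samp(cycle_times, popmean=mu0,
                                     alternative="less")

print(f"t = {t_manual:.3f}, p = {p_value:.5f}")
print("reject H0" if p_value < 0.05 else "do not reject H0")
```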
Q. Sir, any real-life case which can’t be tested using test hypothesis approach?
Ans - Any case which satisfies constraints can be tested.
Q. Sir, any real-life case where test hypothesis is not needed?
Ans - Any case where all data (population) is available
Q. What sample size will be good for hypothesis testing?
Ans- More is better, but 15-20 will be good
Q. How to be sure that sampling truly follows population distribution?
Ans- Use hypothesis testing . Slight variance is fine
Q. How to handle noise in case of determining population distribution?
Ans -
Q. How to know which distribution/shape is followed by problem at hand?
Ans- Use hypothesis testing . Slight variance is fine
Q. Sir, will it not be difficult to validate exact value?
Ans- No. Use T-test
Q. Sir, I was thinking to take claim always as H1. I was thinking that it will make easier to prove using proof by contradiction
Ans -
Q. If we have million sample data, then should we use t-statistics on full data?
Ans - In such case, it might be that data covers all cases. So, simple mean can be fine instead of t-stats for mean based hypothesis test
Q. Sir, will p and alpha be fixed?
Ans - Alpha defines the significance level (degree of confidence) and is generally 0.05 (95% confidence)
Q. Can there be any case where no decision can be made?
Ans-
Q. Sir, using 0.05 alpha is for handling noise in this test?
Ans- Not necessarily. It tells that 95% cases, value will be close to mean
Q. Is it possible to have 100% confidence in hypothesis test?
Ans - No. In real life data, not possible since data may not have true representation of population and also noises etc
Q. What is assumption about t-test for data distribution?
Ans - Approximately normal. If not so, we shouldn’t use this test
=======
Week2
TODO: Ask for model based noise removal example after all classes
Data processing
Handle noisy data
Fill in missing values
Scale columns to common form
It is not mandatory.
Feature engineering
Feature selection
Feature generation or feature extraction
Q. Sir, Same scaling should be used for each column?
Ans - Yes. If scaling done, then same. For example, all features should have same scale 0-1
Q. Is there any case where columns don’t need scaling? Or it is mandatory?
Not mandatory. It is needed depending on the algorithm. For example, ANN needs scaling since it uses batch/iterative processing for training. So, scaling needed for all Algorithms where batch processing is needed
Regression doesn’t need scaling since it doesn’t use batch processing.
Q. How to handle missing value for non-numeric data?
Use mode (fill with most frequent)
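A short pandas sketch of filling missing values: median for a numeric column and mode for a non-numeric column. The column names and values are made up.

```python
import pandas as pd

# Hypothetical data with one missing numeric and one missing categorical value.
df = pd.DataFrame({
    "age": [25.0, None, 31.0, 40.0],
    "city": ["Pune", "Delhi", None, "Pune"],
})

# Numeric column: fill with the median (the mean works the same way).
df["age"] = df["age"].fillna(df["age"].median())

# Non-numeric column: fill with the mode (most frequent value).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```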
Q. What is the performance of filling in missing values based on a model?
Ans - Predict the value based on the model. It is not foolproof. The requirement is that the variable/feature should be correlated with the other variables. Ideally variables should be independent, so this approach will not work after feature selection
Q. How to handle the noisy data?
It is not straightforward. It can be done using domain knowledge. We can rectify based on the property of data as per the domain
Q. What is the normalisation purpose?
For ML, normalisation is same as scaling. This name is misnomer. In stats, normalisation is for transforming data to bell curve
There are two methods. One is z-transform and another is min-max
Q. Do we need scaling for categorical data?
No need.
Q. when to use min-max and when to use the z-transform, any guidance will be great
For statistical method, use z-transform. For ML technique, use min-max. For example, ANN will use min-max
============
Assignment - 25% to 30% mark
Exam -> At the end of the course (one hour multiple choice questions using google form)
Min -> 75% marks to pass
Every week -> Wednesday course video will be uploaded
Saturday half-day session (about 3 hours, 10 AM to 1 PM) for doubt clarification, and a couple of sessions will be gone through. Live video will be uploaded thereafter
Alternate Saturday N Sunday class for weeks
Students
Working professionals
Working in ML and enhance career - 60%
Product development
Expecting JOB in ML - 10%
Just for curiosity -> 10%
Students - 20%
Students Profile
Statistics
Math
Engineering
Professor
Session 1
Jupyter notebook keys
tab
shift + enter
shift + tab
pandas -> package for data processing
Percentile -> The value below which the given percentage of the data falls
Skew -> Skewness shows the shift of the bulk of the data from the centre line. For example, if most of the data is on the left side of the mean line and the long tail is on the right, the data is positively (right) skewed; if most of the data is on the right with a long tail on the left, it is negatively skewed
The kurtosis function gives the kurtosis of the data, meaning the peakedness of the data. For example, the covid 3rd wave had higher peakedness than the 1st covid wave
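A small sketch of percentile, skew and kurtosis on two invented arrays, using numpy and scipy. Note that scipy `kurtosis` returns excess kurtosis by default (0 for a normal distribution).

```python
import numpy as np
from scipy.stats import skew, kurtosis

# Hypothetical data: one array with a long right tail, one symmetric array.
right_tailed = np.array([1, 1, 2, 2, 2, 3, 3, 4, 10, 20], dtype=float)
symmetric = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# 50th percentile: half of the values fall below it.
p50 = np.percentile(symmetric, 50)

# A long right tail gives positive skew; symmetric data gives skew near 0.
skew_right = skew(right_tailed)
skew_sym = skew(symmetric)

# A sharp peak with heavy tails gives higher kurtosis than flat data.
kurt_right = kurtosis(right_tailed)
kurt_sym = kurtosis(symmetric)

print(p50, skew_right, skew_sym, kurt_right, kurt_sym)
```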
Boxplot -> Can be useful to detect outliers in data
https://medium.com/@maxtingle/10-jupyter-notebook-extensions-making-my-lyfe-easier-f40139a334ce -> Jupyter notebook extensions