Machine learning training ISIBang

16th Apr 2022

Machine learning Vs Statistical learning

Machine learning needs hyper parameters tuning

Packages

Pandas
Seaborn
Matplotlib
scipy

Pandas

Read file
1. mypd.read_excel
Convert data to DataFrame format
For analysis of association among categorical variables
1. mypd.crosstab

Preprocessing

Missing value
1. Mean
2. Median
Combining data
1. Merge
2. Append (pd.concat in pandas 2.x, where DataFrame.append was removed)
Scaling
1. Min-max
2. Z-transform
3. Use sklearn for both of the above
4. Each non-categorical column is scaled using transform
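
The preprocessing steps above can be sketched as follows. This is a minimal illustration, not the course's own code: the DataFrame, its column names (height, load, shift) and the lookup table are made up, and pd.concat stands in for append.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data; column names are assumptions for illustration only.
df = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 180.0],
    "load": [10.0, np.nan, 30.0, 40.0],
    "shift": ["day", "night", "day", "night"],
})

# Missing values: fill one column with its mean, the other with its median.
df["height"] = df["height"].fillna(df["height"].mean())
df["load"] = df["load"].fillna(df["load"].median())

# Combining data: merge joins on a key column, concat stacks rows (append).
lookup = pd.DataFrame({"shift": ["day", "night"], "supervisor": ["A", "B"]})
merged = df.merge(lookup, on="shift")
stacked = pd.concat([df, df], ignore_index=True)

# Scaling is applied only to the non-categorical columns.
numeric_cols = ["height", "load"]
minmax_scaled = MinMaxScaler().fit_transform(df[numeric_cols])  # each column mapped to [0, 1]
z_scaled = StandardScaler().fit_transform(df[numeric_cols])     # each column: mean 0, std 1
```

Both scalers learn their parameters in fit and apply them in transform, so the same fitted scaler can later be reused on new data.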

Visualization

Pyplot
1. Box plot
2. Pie plot
3. Bar plot
4. Scatter

Seaborn
1. Pair plot
  1. Use hue if response is categorical

Table processing

Remove columns
1. iloc
Remove rows
1. iloc
Select rows
1. iloc
2. .columnname
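
A small sketch of the table operations listed above, on a made-up DataFrame. The drop call at the end is a name-based alternative that is not in the notes.

```python
import pandas as pd

# Hypothetical table; column names a, b, c are placeholders.
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10, 20, 30, 40], "c": [5, 6, 7, 8]})

without_first_col = df.iloc[:, 1:]   # remove the column at position 0
without_last_row = df.iloc[:-1, :]   # remove the last row
rows_1_to_2 = df.iloc[1:3]           # select rows at positions 1 and 2
col_b = df.b                         # select a column by name, same as df["b"]

# Name-based alternative for removing a column (an addition, not from the notes).
dropped = df.drop(columns=["c"])
```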

Hypothesis test

T Test
Paired T test
Normality test
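1. Q-Q plot -> Check based on percentiles (quantiles). If the observed quantiles coincide with those expected for a bell shape (points fall on a straight line), then the data is bell shaped, otherwise not
2. Shapiro-Wilk test
For all, if p-value > 0.05, then H0 is not rejected (treated as accepted), else rejected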
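
A minimal sketch of these tests with scipy.stats, on synthetic data generated here for illustration (the before/after measurements are assumptions, not course data).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
before = rng.normal(loc=50, scale=5, size=30)         # synthetic measurements
after = before + rng.normal(loc=2, scale=1, size=30)  # same units, shifted by about 2

# Q-Q check: r close to 1 means the quantiles lie close to a straight line.
(_, _), (slope, intercept, r) = stats.probplot(before, dist="norm")

# Shapiro-Wilk: H0 = the sample comes from a normal distribution.
w_stat, p_normal = stats.shapiro(before)

# Paired t-test: H0 = the mean difference between the paired values is zero.
t_stat, p_paired = stats.ttest_rel(before, after)

alpha = 0.05
reject_paired = p_paired < alpha  # True means H0 is rejected
```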

Analysis of variance

Anova
Cross tabulation
1. Applied when both features are categorical
2. Similar to ANOVA, but we use only counts
Chi-square test
1. Identify Relation between two features
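
As an illustration, cross tabulation and the chi-square test can be combined like this. The shift/defect data is invented for the example.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: two categorical features, one row per observation.
df = pd.DataFrame({
    "shift": ["day"] * 50 + ["night"] * 50,
    "defect": ["yes"] * 10 + ["no"] * 40 + ["yes"] * 30 + ["no"] * 20,
})

# Cross tabulation keeps only the counts for each combination of categories.
table = pd.crosstab(df["shift"], df["defect"])

# Chi-square test: H0 = the two features are independent (no relation).
chi2, p_value, dof, expected = chi2_contingency(table)
related = p_value < 0.05
```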

Correlation analysis

1. Between two variables
  1. Via scatter plot
  2. Via measuring linear relationship
    1. Pearson’s correlation coefficient
      1. If r>0 -> +ve correlation
      2. If r<0, -ve correlation
      3. If r ~ 0, no correlation
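
A short sketch of the three cases using scipy.stats.pearsonr on made-up values. The third case is chosen to show that r close to 0 only rules out a linear relationship.

```python
import numpy as np
from scipy.stats import pearsonr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_pos, _ = pearsonr(x, 2 * x + 1)      # increasing straight line
r_neg, _ = pearsonr(x, -3 * x + 10)    # decreasing straight line
r_zero, _ = pearsonr(x, (x - 3) ** 2)  # symmetric curve: related, but not linearly
```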

Module-7

Naive Bayes classifier
1. It works well on large amounts of data

Module 6: Supervised learning

Video3: Bagging and Bootstrapping

Bootstrap is a process where samples are picked in a memoryless manner
1. Selection with replacement
It is used for random-forest sample collection; predictions of the individual trees are then combined
1. Regression - Average
2. Classification - Majority vote
Bagging vs random forest
1. Bagging considers all p predictors at each split
2. Random forest considers a random subset of about sqrt(p) predictors at each split
All of the above are called ensemble models
1. In a general ensemble model, different models can be used, e.g. k-NN, CART, etc.
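
A rough sketch of bagging versus random forest with scikit-learn, on a synthetic dataset generated for the example. The dataset sizes and n_estimators=50 are arbitrary choices, not values from the course.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=300, n_features=16, n_informative=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bootstrap: row indices drawn with replacement, so some rows repeat.
rng = np.random.default_rng(0)
bootstrap_idx = rng.integers(0, len(X_train), size=len(X_train))

# Bagging: decision trees (the default base model) on bootstrap samples, all p predictors.
bagging = BaggingClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Random forest: bootstrap samples, about sqrt(p) random predictors per split.
forest = RandomForestClassifier(n_estimators=50, max_features="sqrt", random_state=0).fit(X_train, y_train)

bagging_acc = bagging.score(X_test, y_test)  # majority vote over the trees
forest_acc = forest.score(X_test, y_test)
```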

============

Module 6: Supervised learning

Video1: CART (Classification and Regression tree)

DecisionTreeClassifier
DecisionTreeRegressor
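
A minimal usage sketch of the two classes on the iris data bundled with scikit-learn. The max_depth=3 limit is an illustrative choice to keep the tree from partitioning too finely.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X, y = load_iris(return_X_y=True)

# Classification tree: the response is the (categorical) species.
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
train_accuracy = clf.score(X, y)

# Regression tree: predict the last column (continuous) from the first three.
reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X[:, :3], X[:, 3])
predicted = reg.predict(X[:5, :3])
```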

Module 5: Unsupervised learning

Video 1: Cluster analysis

Agglomerative clustering
1. Hierarchical. The number of groups is not given
k-means clustering
1. The number of groups is given
Most of the time, both are used
1. Agglomerative to identify the number of clusters
  1. Similarity measure
    1. Euclidean Distance is popular
  2. Method
    1. Single linkage
2. k-means for forming clusters
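
The two-step approach above could look roughly like this. The data is synthetic with three well separated groups, and choosing the cut at the largest jump in merge distance is one common heuristic assumed here, not necessarily the rule used in the course.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: three well separated groups (centers are assumptions).
X, _ = make_blobs(n_samples=90, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=0.5, random_state=0)

# Step 1: agglomerative clustering with Euclidean distance and single linkage.
Z = linkage(X, method="single", metric="euclidean")
merge_dist = Z[:, 2]        # distance at which each merge happened
gaps = np.diff(merge_dist)  # jump between consecutive merges
# Cutting the tree just before the largest jump gives the number of clusters.
n_clusters = len(X) - (int(np.argmax(gaps)) + 1)

# Step 2: k-means with that number of clusters forms the final groups.
kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
labels = kmeans.labels_
```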

=============

Module 5: Unsupervised learning

Video 1: Factor analysis

Factor analysis

It is an unsupervised learning method
It is useful if variables/features are inter-related
1. If the variables are related, then a smaller number of factors can represent the same set of variables
Correlation matrix helps to identify factors
Min-max scaling is used for ML, z-transform for statistical methods
For a large number of variables, the correlation matrix is hard to inspect directly
1. Use Bartlett's test
  1. This test compares the correlation matrix with the identity matrix
Factor analysis based on eigenvalues
1. Only factors whose eigenvalue is >= 1.0 are retained
2. After this, loadings will decide the factors
3. Rotation is needed in case loadings are not near the factor lines (axes)
  1. Orthogonal rotation
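
A hedged sketch of the first two steps with NumPy and SciPy only. The data is synthetic (two hidden factors driving six variables, an assumption made for the example), Bartlett's test is written out from its standard formula, and the eigenvalues are those of the correlation matrix. Loadings and rotation are not shown.

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n = 200
f1, f2 = rng.normal(size=n), rng.normal(size=n)  # two hypothetical hidden factors

def noisy(factor):
    """Observed variable = hidden factor plus a little independent noise."""
    return factor + 0.3 * rng.normal(size=n)

X = np.column_stack([noisy(f1), noisy(f1), noisy(f1),
                     noisy(f2), noisy(f2), noisy(f2)])

R = np.corrcoef(X, rowvar=False)  # correlation matrix, p x p
p = R.shape[0]

# Bartlett's test: H0 = the correlation matrix is the identity matrix.
stat = -(n - 1 - (2 * p + 5) / 6) * np.log(np.linalg.det(R))
dof = p * (p - 1) / 2
p_value = chi2.sf(stat, dof)

# Retain factors whose eigenvalue is >= 1.0.
eigenvalues = np.sort(np.linalg.eigvalsh(R))[::-1]
n_factors = int(np.sum(eigenvalues >= 1.0))
```

A small p_value says the variables are correlated enough for factor analysis to be worthwhile.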

================

https://drive.google.com/file/d/1rX0qXkDhp3ArmpJf921nVkSpUIyWPBGT/view

Video 5: Binary Logistic Regression 1

Used when response variable Y is categorical and also binary
1. Uses the logistic (inverse logit) function to produce an output between 0 and 1
  1. If value is close to 0, consider as 0
  2. If value is close to 1, consider as 1
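
To illustrate, the sketch below fits scikit-learn's LogisticRegression on invented data (hours studied versus pass/fail) and then reproduces its probabilities by hand with the logistic function.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied (X) and pass = 1 / fail = 0 (y).
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0], [4.5], [5.0]])
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

model = LogisticRegression().fit(X, y)

def logistic(z):
    """Map any real number to the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

z = model.intercept_ + X @ model.coef_.ravel()  # linear part (the logit)
prob = logistic(z)                              # P(Y = 1)
pred = (prob > 0.5).astype(int)                 # close to 1 -> 1, close to 0 -> 0
```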

TODO: How to compute the sensitivity etc

Q. For binary logistic regression, in which case do we not need an intercept?

===========

https://drive.google.com/file/d/1Z7slh4kn1zVrRqwF6TVIphS_5kmrx-Zd/view

Video 4: Dummy Variable Regression

=============

Video 3: Linear Regression 2

Libraries

sklearn
1. cross_val_score

Cross-validation

Remove data subset for validation
Leave one out cross validation
1. Remove only one data point for validation
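
A brief sketch of both ideas with cross_val_score, using the diabetes data bundled with scikit-learn. The subset of 60 rows and the choice of 5 folds are arbitrary. Leave-one-out is scored with mean squared error because R-square is not defined on a single held-out point.

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_diabetes(return_X_y=True)
X, y = X[:60], y[:60]  # small subset so leave-one-out stays fast

model = LinearRegression()

# k-fold: each of the 5 subsets is held out once for validation.
kfold_scores = cross_val_score(model, X, y, cv=5)

# Leave-one-out: each single data point is held out once.
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut(),
                             scoring="neg_mean_squared_error")
loo_mse = -loo_scores.mean()
```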

From https://drive.google.com/file/d/1uRiUZ6Fgkabrr6obgifgYdHvtZhMjhHv/view (Video 2: Linear Regression 1)

Libraries

scatter_matrix
1. For plotting scatter matrix
statsmodels
1. Stats based models
  1. ols - Ordinary least squares regression
scipy
1. for stats library

Linear regression

Correlation analysis can identify strength of relationship. Regression analysis can identify what kind of relationship exists
1. Linear
2. Non-Linear
Simple linear regression
1. Uses least square estimation method for minimizing error
2. Multi-variate linear regression uses multiple variables
  1. Based on p-value of model.summary output, a variable can be dropped. p-value is for the hypothesis that the variable's coefficient is zero
Model properties
1. Significance
  1. If model is significant, then its result is better than the result without any model (for example, guesswork)
  2. If p-value of F-statistic < 0.05, then model is significant
2. Accuracy
  1. Checks if predicted value is close to actual result
    1. R-square, Adjusted R-square, Mean square error, Root mean square error
      1. R-square tells how much of the total variation in the data the model can explain
3. Adequacy
  1. Residuals (actual-predicted) will follow normal distribution (with mean=0) for good model
    1. In perfect case, all residuals will be zero, but it is not realistic expectation
4. Generalizable
  1. Model can predict well with new values

ISI ML course exam

=========

16th April 2022

=========

Next step:

Teach others
Advanced session?
1. It will be planned
  1. Forecasting
  2. NLP
  3. L1/L2 regularisation
ISI takes the consultancy work where ISI will give training/guidance (University collaboration)
1. There can be MOU for a year about training/guidance and also project work
2. Contact person -> Boby sir
3. Current year, ISI is giving training to Airbus
Doubt clarification
1. Mail to Boby sir. He will mention the time for video session
What is ROC curve? Read about this ……. Mentioned by Saurabh

Books

An Introduction to Statistical Learning with Applications in R - Springer
The Elements of Statistical Learning
Statistical Learning from a Regression Perspective - Berk
Pattern Recognition and Machine Learning - Bishop

Greetings. Kindly share your linkedin URL or add me in your Linkedin network.

My linkedin URL is https://www.linkedin.com/in/dpkumar/

Tips for result validation

Check the predicted class and also the predicted class probability before relying on a prediction

Natural classifier Vs SVM
- SVM is a natural classifier which allows misclassification
- For a natural classifier, all points must be classified correctly. SVM uses epsilon (slack) to tolerate misclassified points

Support Vector Machine

Can be done on numeric variable as well as categorical variable
It is used for classification
It uses the maximal margin classifier
1. Means that the distance from the classifier to the closest point is maximum
Support vectors
1. These are points which lie on or within the margin
Finding maximum margin is an optimisation problem
1. It uses linear equation
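
A small sketch of the maximum margin idea with scikit-learn's SVC on six hand-picked points. The very large C is an assumption used to approximate a classifier that allows no misclassification.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, linearly separable points in two classes.
X = np.array([[1.0, 1.0], [2.0, 1.0], [1.0, 2.0],
              [4.0, 4.0], [5.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)  # large C: almost no misclassification allowed

w = clf.coef_[0]       # the classifier is the line w . x + b = 0
b = clf.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin lines
support_vectors = clf.support_vectors_  # the points that fix the margin
```

Lowering C lets more points fall inside the margin, which is how misclassification is tolerated.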

Q. What about noise in case of calculating margin?

Ans- It will have an impact. It should be taken care of during data pre-processing as clean-up. This is true for all ML models. Since getting 100% noise-free data is not possible, we should revisit data clean-up if accuracy is not good on real-time data.

Q. What about multi-class classification?

Ans- yes. Methods are available using SVM

Q. What is purpose of Kernel in SVM?

Ans - It provides better flexibility (non-linear decision boundaries) and often a better result.

Q. What about non-linear separation?

Ans - SVM supports non-linear
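
As a rough illustration of the kernel's role, the sketch below compares a linear and an RBF kernel on synthetic concentric circles, which no straight line can separate. The accuracy shown is on the training data itself.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Synthetic data: one class forms a ring around the other.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear_acc = SVC(kernel="linear").fit(X, y).score(X, y)  # straight-line boundary
rbf_acc = SVC(kernel="rbf").fit(X, y).score(X, y)        # curved boundary via the kernel
```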

Q. Is SVM still in use?

Ans - yes. For classification, SVM gives best result most of the times.

Q. When to use logistic regression and when to use SVM?

Ans - Logistic regression is good when the relationship is linear. For non-linear, SVM works better

Q. Can SVM be used for regression as well?

Ans - Yes

Q. Is scaling needed before drawing scatter plot?

Ans - No.

Q. Is scatter-plot same as pair-plot?

Ans - Yes. A pair plot is a grid of scatter plots, one for each pair of variables

Q. Sir, are we calculating accuracy on the training data itself?

Ans - Yes. It is to see whether any noise is causing issues. The expected value is high

Q. Sir, in my team, box plot itself is used for training the model. Is it normal practice?

Ans -

Q. Sir, how probabilities are calculated?

Ans - Need to understand

Q. Sir, how useful it is to read research papers for day-to-day work in industry?

Ans - In industry, only breakthrough papers are useful. Most papers are small improvements and so might not be useful for industry

Q. How is academic research different from industry?

Ans - In industry, the result is important and so only significant improvement is considered. In academia, any small improvement can also be submitted as a research paper.

========

9th April 2022

========

CART (Classification And Regression Tree)

It works by partitioning the area and so, it creates a tree structure
- Partitioning too much can result in overfitting

Statistical learning Vs Machine learning

Machine learning techniques have better accuracy
Statistical learning techniques are interpretable
Statistical learning techniques make assumptions, such as about the data distribution

CART vs Regression

CART can have better accuracy than Linear Regression
1. This happens since CART can capture non-linear relationships as well.
CART is an ML model and so makes no assumption about data distribution

Classification vs Regression

If response is categorical, it is called classification problem
If response is continuous, it is called regression problem

Q. How is confidence built in ML models, considering that they are not interpretable?

Ans - Confidence comes based on measurement of model accuracy and generalisation capability

Q. Where are statistical models important?

Ans - In places where an interpretable result is expected. For example, in a medical system, it is important to know how the prescription worked, or else side effects can happen unknowingly

Q. What is meant by interpretability?

Ans - It tells which feature is more important and which is less

=========

3rd April 2022

=========

Unsupervised learning

Factor analysis
- It is a dimensionality reduction technique
- Large number of correlated variables will be converted to a manageable number of uncorrelated or independent factors
- It groups similar columns
- Computation approach
  - For N variables, make N linear combinations
  - Using eigenvalue analysis of them, we will get factors
  - Max number of factors can be N
- We can perform interpretation as well
Cluster analysis
- Example -> Group similar responses in a multi-question survey
- Similar rows are grouped together
- Items in same group/cluster should be similar. And also, items in different clusters must be dissimilar
- Types
  - Hierarchical cluster
  - K-means cluster

Q. How is the importance of a variable decided? Is manual intervention needed?

Ans- Factor analysis can do it based on the data alone.

Q. What is the harm if we don’t perform dimension reduction? Any impact on model accuracy as well?

Ans- It impacts the training time. Model accuracy will not get impacted since model will take care of it.

Q. Sir, Is it must to do the analysis on whole data? How much data will be good?

Ans-

=================

26th March 2022

=================

Supervised learning and unsupervised learning

Statistical learning techniques

- Statistical learning techniques are easy to interpret, but accuracy is usually lower
- Example
  - Regression technique

Machine learning techniques

- machine learning techniques are more accurate, but not easy to interpret

Regression

Y can be
- numeric or
- it can be non-numeric (Categorical) -> true/false
OLS (Ordinary Least Squares) regression
- X and Y both must be numeric
Binary logistic regression
- X will be numeric and Y will be non-numeric
- Output 0 if p(Y) < 0.5, 1 if p(Y) > 0.5
- Logistic function is used since it keeps the predicted probability between 0 and 1 and gives better result

Properties of good model

Model significance
1. LLR (Log likelihood ratio test) p-value < 0.05
Model accuracy
1. Good if Pseudo R-square > 0.6 and accuracy > 80 %
2. Confusion/prediction matrix is used for this
3. % of cases which are correctly predicted
Model generalizability
1. Handling unknown data
2. For this, test is done on new data

Confusion matrix

Sensitivity or Recall
- Out of total +ves, how many are actually detected as positive
Specificity
- Out of total -ves, how many are actually detected as negative
Precision
- Out of total detected positives, how many are really positives
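
These three measures can be read off the confusion matrix. A minimal sketch with made-up labels (1 = positive, 0 = negative):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted classes.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

# For labels 0/1, ravel() gives: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

sensitivity = tp / (tp + fn)  # recall: detected positives out of all actual positives
specificity = tn / (tn + fp)  # detected negatives out of all actual negatives
precision = tp / (tp + fp)    # real positives out of all detected positives
accuracy = (tp + tn) / (tp + tn + fp + fn)
```

Here sensitivity is 3/4, specificity is 4/6, precision is 3/5 and accuracy is 7/10.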

Q. Sir, when will be advanced course announced?

Ans -

Q. How to get density for each t in order to plot t-statistic graph?

Q. For unsupervised learning, how detection on trained model can happen? For example K-means model

Q. Sir, for all classification based problem, y will be non-numeric(Categorical)?

Ans - Yes. Similarly, numeric based regression problems are called value based problem

Q. Sir, can regression equation be non-linear?

Ans - Yes. For example, it can be polynomial. However, computation will be complex. Python can handle such complexity. For the non-linear case, machine learning models are popular since for regression we need to arrive at a formula, which is difficult to do

Q. 9.799e-223 -> What e is here?

Ans - It is 9.799 × 10^(-223)

Q. Sir, how to understand “Model significance” in layman's terms?

Ans - It tells whether using the model is better than not using a model (tossing a coin etc)

Q. Do we need to ensure that each X feature should be independent to each other?

Ans -> Yes. X1 and X2 must be linearly independent (https://en.wikipedia.org/wiki/Multicollinearity). It is important for OLS regression, not much important for logistic regression (Why?????)

Q. Sir, linear independence between variables can be checked using pair plot?

Ans ->

Q. How to handle linearly dependent variables?

Ans ->

Q. Sir, what is difference between Logit (statsmodels.api) and LogisticRegression (sklearn.linear_model) python packages?

Ans -> They are similar in functionality, but output is different

Q. What is default ratio for cross-validation?

Ans - 10%

=================

20th March 2022

Read about Statistical English

Ols -> Ordinary Least Square

Example of multi-variate hypothesis test

Compare average productivity for day shift with night shift

Analysis of variance (ANOVA)

Used to compare means across more than two groups

H0: Mean1 = Mean2 = Mean3 = ….. = MeanK

H1: H0 is not true

Data should be properly stacked
- Response should be mentioned as separate column
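
A minimal sketch of one-way ANOVA on stacked data with scipy.stats.f_oneway. The shift names and productivity values are invented for the example.

```python
import pandas as pd
from scipy.stats import f_oneway

# Stacked layout: one row per observation, response in its own column.
df = pd.DataFrame({
    "shift": ["morning"] * 5 + ["evening"] * 5 + ["night"] * 5,
    "productivity": [52, 55, 53, 54, 56, 50, 49, 51, 52, 48, 40, 42, 41, 43, 39],
})

# One array of responses per group.
groups = [g["productivity"].to_numpy() for _, g in df.groupby("shift")]

# H0: all group means are equal. H1: H0 is not true.
f_stat, p_value = f_oneway(*groups)
reject_h0 = p_value < 0.05
```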

Q. Sir, could you please explain Reject H0. Is it degree of confidence?

Ans: - Yes. Reject H0 if p < 0.05 means that you should reject H0 if the p-value is below 0.05

Q. Do all variables need the same number of values?

Ans - No since it works on average

Q. Sir, do we need to scale before performing between group comparison?

Ans- All groups should have the same metric. For example, J2EE and C++ should be measured with the same metric

Q. Sir, Any good book?

Ans -> Statistics by Freedman. He is a top-notch statistician in the world. Also, he writes books in an easy to read manner. https://homepages.dcc.ufmg.br/~assuncao/EstatCC/Slides/Extra/FPPExpObs.pdf

Greetings. HR contacted me. However some to and fro happening . I hope, HR will get time to connect next week. I will get back

Below Qs from assignment

Q. Sir, is it possible to do correlation between categorical variables?

Ans -> Yes. Use the Chi-square test. A normal scatter plot should not be used since a categorical variable normally has only a few distinct values

Q. Where to use pie-chart?

Ans -> For categorical variables

Q. What is difference between bar plot and histogram?

Ans -> Use bar plot for categorical variable, for numeric value, use histogram

===========

Assignment1

For height and load, I am confused which one I should use-> Histogram or pie-chart or both?

height -> Should I use Histogram or pie-chart?

What is difference between bar plot and histogram?

Go through video about Boxplot

============

To resume -> Scaling

Kubernetes infra side projects

Go through https://www.coursera.org/learn/stanford-statistics/home/week/1

Week3

Hypothesis testing

Purpose to validate claim
- For example,
  - average cycle time is reduced to 24%
  - On the average, time to disperse cash is 5 mins
- It was started by Fisher, who validated a lady's claim that she could tell whether milk was added to the cup before or after the tea
  - This is known as the lady tasting tea problem https://www.youtube.com/watch?v=I9KsLCc-eiQ
Common Examples
- Mean equal to a specific value
  - First ensure that sample data is taken properly and represents the population
  - Crude approach is to check if sample mean is close to the claimed mean
    - However closeness is relative and so, if we multiply each value by 1000 (scaling), the threshold changes
    - Also, if we change the data set, values can change
    - So, using the absolute difference in means is not proper.
  - So, we need to standardise the difference. It is called the t-statistic (T-test)
    - This standardised difference follows a t-distribution when the data is approximately normal
  - Also, changing sample can change the average
    - This will be taken care of by T-test
- Two means are same or not
- Two variances are same or not
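
For the first case (mean equal to a specific value), the standardised difference can be computed by hand and checked against scipy. The sample values are invented, in the spirit of the 5 minute cash dispersal example.

```python
import numpy as np
from scipy import stats

# Hypothetical sample of times in minutes; the claimed mean is 5.
sample = np.array([5.2, 4.8, 5.5, 5.1, 4.9, 5.3, 5.0, 4.7, 5.4, 5.1])
mu0 = 5.0

# t = (sample mean - claimed mean) / (sample std / sqrt(n))
n = len(sample)
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

t_scipy, p_value = stats.ttest_1samp(sample, mu0)

# Multiplying every value (and the claimed mean) by 1000 does not change t.
t_scaled, _ = stats.ttest_1samp(sample * 1000, mu0 * 1000)
```

The raw difference in means would change by a factor of 1000 after rescaling, while the t-statistic stays the same, which is why the standardised form is used.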
Approach for testing
- Null hypothesis (H0)
- Alternate hypothesis (H1)
- H0 and H1 should be mutually exclusive and collectively exhaustive

Testing approach

T-Statistics or test-statistic

Constraints to be satisfied for hypothesis testing

Claim should be quantifiable. Below can’t be validated
- Every cheque will be validated correctly
Claim should be over population, not on specific case
- A person will die at exactly this date
Shape/distribution of the population should be known
- There are hypothesis tests which can be used for this
- There are hypothesis tests for which shape knowledge is not needed (called non-parametric tests). However, the performance of such techniques is not as good as parametric tests (with shape)
Multiple alternate hypotheses are not allowed
- This is not a limitation since multiple alternates can be converted to a 0-1 problem
It is mandatory that H0 must contain =
- For example, the claim is that average cycle time is less than 24 hours
- H0 : Av(Cycle time) >=24
- H1: Av(Cycle time) < 24
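
The cycle time example can be tested as a one-sided t-test. This sketch uses invented cycle times and the alternative argument of scipy.stats.ttest_1samp, which is available in SciPy 1.6 and later.

```python
import numpy as np
from scipy import stats

# Hypothetical cycle times in hours.
cycle_time = np.array([22.5, 23.1, 21.8, 24.2, 22.9, 23.4, 22.0, 23.8, 22.7, 23.0])

# H0: Av(Cycle time) >= 24, H1: Av(Cycle time) < 24
t_stat, p_value = stats.ttest_1samp(cycle_time, popmean=24, alternative="less")

alpha = 0.05
reject_h0 = p_value < alpha  # True supports the claim in H1
```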

Q. Sir, any real-life case which can’t be tested using test hypothesis approach?

Ans - Any case which satisfies constraints can be tested.

Q. Sir, any real-life case where test hypothesis is not needed?

Ans - Any case where all data (population) is available

Q. What sample size will be good for hypothesis testing?

Ans- More is better, but 15-20 will be good

Q. How to be sure that sampling truly follows population distribution?

Ans- Use hypothesis testing . Slight variance is fine

Q. How to handle noise in case of determining population distribution?

Ans -

Q. How to know which distribution/shape is followed by problem at hand?

Ans- Use hypothesis testing . Slight variance is fine

Q. Sir, will it not be difficult to validate exact value?

Ans- No. Use T-test

Q. Sir, I was thinking to take claim always as H1. I was thinking that it will make easier to prove using proof by contradiction

Ans -

Q. If we have million sample data, then should we use t-statistics on full data?

Ans - In such case, it might be that data covers all cases. So, simple mean can be fine instead of t-stats for mean based hypothesis test

Q. Sir, will p and alpha be fixed?

Ans - Alpha is the significance level, which defines the degree of confidence. Generally it is 0.05 (95% confidence)

Q. Can there be any case where no decision can be made?

Ans-

Q. Sir, using 0.05 alpha is for handling noise in this test?

Ans- Not necessarily. It tells that in 95% of cases, the value will be close to the mean

Q. Is it possible to have 100% confidence in hypothesis test?

Ans - No. In real-life data it is not possible, since the data may not truly represent the population and may also contain noise

Q. What is assumption about t-test for data distribution?

Ans - Approximately normal. If not so, we shouldn’t use this test

=======

Week2

TODO: Ask for model based noise removal example after all classes

Data processing

Handle noisy data

Fillup missing values

Scale columns to common form

It is not mandatory.

Feature engineering

Feature selection

Feature generation or feature extraction

Q. Sir, Same scaling should be used for each column?

Ans - Yes. If scaling done, then same. For example, all features should have same scale 0-1

Q. Is there any case where columns don’t need scaling? Or it is mandatory?

Not mandatory. It is needed depending on the algorithm. For example, ANN needs scaling since it uses batch/iterative processing for training. So, scaling is needed for all algorithms where batch processing is needed
Regression doesn’t need scaling since it doesn’t use batch processing.

Q. How to handle missing value for non-numeric data?

Use mode (fill with most frequent)

Q. What is the performance of filling in missing values based on a model?

Ans - Predict the value based on the model. It is not foolproof. The requirement is that the variable/feature should be correlated with the other variables. Ideally variables should be independent, and so this approach will not work after feature selection

Q. How to handle the noisy data?

It is not straightforward. It can be done using domain knowledge. We can rectify based on the property of data as per the domain

Q. What is the normalisation purpose?

For ML, normalisation is same as scaling. This name is a misnomer. In stats, normalisation is for transforming data to a bell curve
There are two methods. One is z-transform and another is min-max

Q. Do we need scaling for categorical data?

No need.

Q. when to use min-max and when to use the z-transform, any guidance will be great

For statistical method, use z-transform. For ML technique, use min-max. For example, ANN will use min-max

============

Assignment - 25% to 30% mark

Exam -> At the end of the course (one hour multiple choice questions using google form)

Min -> 75% marks to pass

Every week -> Wednesday course video will be uploaded

Saturday half day session (about 3 hours, 10 AM to 1 PM) on doubt clarification, and a couple of sessions will be gone through. Live video will be uploaded thereafter

Alternate Saturday N Sunday class for weeks

Students

Working professionals

Working in ML and enhance career - 60%

Product development

Expecting JOB in ML - 10%

Just for curiosity -> 10%

Students - 20%

Students Profile

Statistics

Math

Engineering

Professor

My WhatsApp number is 9449 616 739. Sharing as I got request for my contact number. You can message me.

Session 1

Jupyter notebook keys

tab

shift + enter

shift + tab

Pandas -> package for data processing

Percentile -> The value below which the given percentage of the data falls

Skew -> Skewness shows the shift of most of the data away from the centre line. For example, if most of the data is on the left side of the mean line with a long tail to the right, it is positively skewed; if most of the data is on the right side with a long tail to the left, it is negatively skewed

kurtosis function gives the kurtosis of the data, meaning the peakedness of the data. For example, the COVID 3rd wave had higher peakedness compared to the 1st COVID wave

Boxplot -> Can be useful to detect outliers in data
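
A quick sketch of these summary measures in pandas, on a made-up series with one deliberate outlier. The 1.5 × IQR rule is the usual box plot whisker convention, assumed here.

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3, 3, 3, 4, 4, 5, 20])  # 20 is a deliberate outlier

p90 = s.quantile(0.90)  # 90th percentile
skewness = s.skew()     # > 0: long tail to the right
kurt = s.kurt()         # excess kurtosis (0 for a normal distribution)

# Box plot rule: points beyond 1.5 * IQR from the quartiles are outliers.
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
```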

https://medium.com/@maxtingle/10-jupyter-notebook-extensions-making-my-lyfe-easier-f40139a334ce -> Extensions for Jupyter notebook

9373222939 - Sumit Baj

9945266001 - Vijayakumar Baraguru

Vishal +91-8867679039

Arun Jyothi Reddy - 9912144614

Ramu 7204568893

Ajat@9845083994

8917494115 - Jiten

9742558667 - B H SREENIDHI

Bhaskar: 9964748666

Ayush Srivastava - +91-8888835845

Aditya - 9635755785

Those who wish to join the Whatsapp Group: join through the link : https://chat.whatsapp.com/IczwePn2bcIL9UCx5NphU0

Greetings. Whenever you can on or before Tuesday, please share the JIRA items for Sprint 14.2 as discussed. Pre-populated tasks are mentioned at . Wish to mention
