Basic Statistics for Deep Learning

Statistical Measures

Example: Consider the multiset of data:

2, 1, 3, 2, 2, 14

In a frequency table it can be written as:

Value      1  2  3  14
Frequency  1  3  1   1

The data value 2 has a weight of 3 as it occurs three times, i.e. its frequency is 3.


Mean: It is the average value, i.e. the sum of the data values divided by the number of values. For the given data, mean = (2+1+3+2+2+14)/6 = 24/6 = 4.


Probability: The probability is the likelihood of something happening. It is always between 0 (impossible) and 1 (certain).

The probability of rolling a 7 with a single fair six-sided die is 0 (impossible), whereas the probability of death is 1 (certain).

Let's consider a fair coin tossing experiment. We want to find the probability of getting a head (an event). With one coin toss, a head can happen only once, while the total number of possible outcomes is two: head or tail.

Pr(head) = Number of ways event can happen/Total number of outcomes of the experiment = 1/2 = 0.5

So, there is a 50/50 chance of getting either head or tail.

Let's associate each data value with a probability in the example multiset (2,1,3,2,2,14) above. The probabilities sum to 1.

Probability of 2, i.e. Pr(2) = 3/6, and for each of the other values it is 1/6. So, 3/6 + 3*(1/6) = 6/6 = 1.

Expected Value: It is the mathematical expectation, i.e. the probability-weighted average value: E[X] = Σ xi*Pr(xi). For the given data, E[X] = 2*(3/6) + 1*(1/6) + 3*(1/6) + 14*(1/6) = 24/6 = 4, the same as the mean.

Median: Median is the number that separates the higher half of a data sample, a population, or a probability distribution, from the lower half.

Let's arrange the dataset in ascending order, i.e. 1, 2, 2, 2, 3, 14.

There are two middle values, in the 3rd and 4th positions from the left, which act as separators. For a dataset with an even number of values, the median is the mean of these two middle values at the (n/2)th and (n/2 + 1)th positions, i.e.

(2 + 2)/2 = 2

For a dataset with an odd number of values, the median is the value at the ((n + 1)/2)th position. The median is more robust to outliers than the mean.

Mode: Mode is the value that appears most often in the dataset, i.e. the element with the highest frequency. In the given data, the frequency of 2 is the highest (3), and hence 2 is the mode. Note that for a normal distribution, the mean, median, and mode are the same.

Range: Range in a data set is the difference between the largest and the smallest value. In the given data set, it is:

14 - 1 = 13

Variance: Variance measures how far the numbers in a dataset are spread out. A small variance implies that the data points tend to be close to the mean, and vice versa. In the given dataset, the deviations of the numbers from the mean (4) are:

Deviation from mean of X = {1-4 = -3, 2-4 = -2 (three times), 3-4 = -1, 14-4 = 10}

In terms of the expected value, variance can be written as:

Var(X) = E[(X - E[X])^2] = E[X^2] - (E[X])^2

If all the numbers had been identical, there would be no deviation and hence the variance would be zero. Squaring the deviations (instead of, say, taking absolute values) is a mathematical convenience: a squared function is easy to differentiate and integrate (e.g. to find minima or maxima), which makes it convenient inside proofs and when solving equations analytically. Because of the square, measures such as RMSE (Root Mean Square Error) put more weight on outliers, i.e. outliers are magnified by squaring.

Standard Deviation: It quantifies the amount of dispersion in a dataset. It is the square root of the variance. Unlike the variance, it is expressed in the same units as the data. The standard deviation is commonly used to measure confidence in statistical conclusions. For the given dataset, the sample variance is (9 + 3*4 + 1 + 100)/(6 - 1) = 122/5 = 24.4, so SD = sqrt(24.4) ≈ 4.94.
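As a quick cross-check, these measures can be reproduced in base R (a minimal sketch; the vector name x is just for illustration):

x = c(2, 1, 3, 2, 2, 14)                   # the example multiset
mean(x)                                    # 4
median(x)                                  # 2
as.numeric(names(which.max(table(x))))     # mode = 2 (highest frequency)
max(x) - min(x)                            # range = 13
var(x)                                     # sample variance (divides by n - 1) = 24.4
sd(x)                                      # sample standard deviation ~ 4.94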


Moments: A moment is a quantity times a power of its distance from a fixed reference point. If the data points represent a probability density, then the zeroth moment is the total probability (i.e. one), the first moment is the mean, the second central moment is the variance, the third standardized moment is the skewness, and the fourth standardized moment is the kurtosis. Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. Kurtosis is the degree of peakedness of the distribution, defined as the normalized form of the fourth central moment of the distribution.

Quartiles and Inter Quartile Range (IQR): Like the median, which separates the dataset into two halves, quartiles divide the dataset into four equal parts with three points Q1, Q2, and Q3, where Q2 is the median. The IQR is the difference between Q3 and Q1.

IQR = Q3 - Q1

In the given dataset, Q2 = 2. For the lower half of the data, 1, 2, 2, Q1 = median = 2. For the upper half, 2, 3, 14, Q3 = 3. Then, IQR = 3 - 2 = 1. The IQR may be used to characterize the data when there may be outliers that skew it. The fences for the outliers are calculated using the following formulas:

Lower Fence = Q1 - 1.5*IQR = 2 - 1.5 = 0.5

Upper Fence = Q3 + 1.5*IQR = 3 + 1.5 = 4.5

Since 14 lies above the upper fence, it is an outlier for the given dataset.
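A quick check in R (a minimal sketch; R's default quantile() interpolation gives slightly different quartiles than the median-of-halves method above, but 14 is flagged as an outlier either way):

x = c(1, 2, 2, 2, 3, 14)
q = quantile(x, c(0.25, 0.75))       # Q1 and Q3 (default interpolation)
iqr = IQR(x)                         # Q3 - Q1
lower = q[1] - 1.5 * iqr             # lower fence
upper = q[2] + 1.5 * iqr             # upper fence
x[x < lower | x > upper]             # 14 is flagged as an outlier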

Odds, Odds Ratio, Log of Odds: For an event with a given probability p, the corresponding odds can be thought of as the number of successes you expect to get for every failure on average. Then,

Odds = p/(1 - p)

The Odds Ratio (OR) is the ratio of two odds. The log of odds is a transformation from probabilities to log odds that maps the restricted range [0, 1] to (-inf, inf). It is also easy to interpret and understand. It is represented by the logit function, logit(p) = ln(p/(1 - p)).

Titanic Example (http://pages.uoregon.edu/aarong/teaching/G4075_Outline/node15.html)

What were the overall odds (male and female combined) of surviving the Titanic if 38% of passengers survived?

O(survival) = 0.38/(1 - 0.38) = 0.61

So, for every single death on the Titanic, there were on average 0.61 survivors.

Now, let's look at the odds ratio of survival of men and women on the Titanic, where 19% of men and 73% of women survived.

O(women) = 0.73/(1 - 0.73) = 2.70 & O(men) = 0.19/(1 - 0.19) = 0.23

OR = O(women)/O(men) ≈ 11.5

It indicates that women were about 11 times as likely as men to survive the Titanic.
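The same odds and odds ratio can be computed directly in R (a minimal sketch using the survival rates quoted above; the helper function odds() is just for illustration):

odds = function(p) p / (1 - p)       # odds corresponding to a probability
odds(0.38)                           # overall odds of survival ~ 0.61
o_women = odds(0.73)                 # ~ 2.70
o_men   = odds(0.19)                 # ~ 0.23
o_women / o_men                      # odds ratio ~ 11.5, i.e. about 11 times as likely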

Probability Distribution

It is the list of all values that the random variable can assume with their corresponding probabilities. It is the set of all possible outcomes of the random phenomenon being observed. 

For rolling a fair die, each of the outcomes 1 through 6 has probability 1/6, so the expected value is the sum 1*(1/6) + 2*(1/6) + ... + 6*(1/6).

E[X] = 3.5, Var(X) ≈ 2.92 & SD ≈ 1.7078
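A minimal R sketch of these die statistics:

outcomes = 1:6
p = rep(1/6, 6)                      # fair die: each face has probability 1/6
EX = sum(outcomes * p)               # expected value = 3.5
VarX = sum((outcomes - EX)^2 * p)    # variance = 35/12 ~ 2.92
sqrt(VarX)                           # standard deviation ~ 1.7078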

Normal (Gaussian) Distribution or Bell Curve

It is a continuous probability distribution that arises as the limiting distribution in the Central Limit Theorem. There is symmetry between the left and right, i.e. the data are distributed equally around the central value, the mean. The value of the normal density diminishes and is practically zero when X lies more than a few standard deviations away from the mean. The mean, median, and mode are all equal in a normal distribution, as shown in Fig. 1.

Fig. 1. Normal Distribution [3]

 

The data in a normal distribution are distributed such that nearly all of the data (99.7%) lie within 3 standard deviations of the mean.

This is empirically called the three-sigma rule. The 68-95-99.7 rule, as shown in Fig. 2, is widely used in statistics.

 Fig. 2: Normal Distribution demonstrating 68-95-99.7 Rule [1]

Central Limit Theorem: Given certain conditions, the arithmetic mean of a sufficiently large number of iterations of independent random variables, each with a well-defined expected value and variance, will be approximately normally distributed, regardless of the underlying distribution.

Example:

Let's roll n fair 6-sided dice where n is very large. Then the distribution of the sum of the rolled numbers, as a result of these random events, will be well approximated by a normal distribution, in line with the Central Limit Theorem. Let's create the probability distribution of the sum of outcomes.

Let's consider two dice, A and B, i.e. n = 2. Then the sample space has 36 outcomes, as shown in Table 1 below.

Table 1: Sample space of outcomes from rolling two dice [1]

Total sum of all possible outcomes is 252 which gives the mean of 252/36 = 7.

Let's calculate the probability of each outcome out of all possible outcomes.

The sums 2 and 12 have the lowest probability, 1/36, and the sum 7 has the highest probability, 6/36 = 1/6. Similarly, the probability distribution for larger values of n can be calculated, which results in the probability distributions shown in Fig. 3.

Fig. 3: Probability density of the sum of outcomes from rolling n dice [1]
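The distribution of the sum of two dice can be enumerated in R (a minimal sketch):

sums = outer(1:6, 1:6, "+")          # all 36 outcomes of rolling dice A and B
table(sums) / 36                     # Pr(sum): 2 and 12 have 1/36, 7 has 6/36
mean(sums)                           # mean of the sums = 7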

Bernoulli Distribution

It is the probability distribution of a random variable which takes the value 1 with success probability p and the value 0 with failure probability q = 1 - p.

If X is such a random variable, then Pr(X=1) = p & Pr(X=0) = 1 - p = q. In the two-dice example above, the success probability for the outcome 7 is 1/6, so the failure probability is 1 - 1/6 = 5/6.

As a probability mass function,

f(k; p) = p if k = 1 & 1 - p if k = 0

This can be re-written as:

f(k; p) = p^k * (1 - p)^(1 - k) for k in {0, 1}, where k = 1 is considered a success.

Mean or Expected value = E[X] = Pr(X=1)*1 + Pr(X=0)*0 = p*1 + q*0 = p

Variance = E[X^2] - (E[X])^2 = p - p^2 = p*q


Binomial Distribution

In the Bernoulli distribution, there is a single trial, i.e. n = 1. In the Binomial Distribution, there are n trials, each with two possible outcomes (success or failure). The probability of getting exactly k successes in n trials is given in terms of the three quantities k, n, and p by:

f(k; n, p) = (n k) * p^k * (1 - p)^(n - k)

where (n k), pronounced "n choose k", is

(n k) = n!/(k!(n - k)!)

(n k) is also the coefficient of the x^k term in the polynomial expansion of the binomial power (1 + x)^n, where k = 0 to n. Listing these coefficients for successive values of n results in Pascal's triangle.

Mean = n* Mean(Bernoulli) = np & variance = n* Var(Bernoulli) = npq

Example: What is the number of possible 5-card hands from a 52-card deck?

Here, k = 5 and n = 52

(n k) = 52!/(5!(52 - 5)!) = (52*51*50*49*48)/(5*4*3*2*1) = 2,598,960
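In R, this combination can be checked with the built-in choose() function:

choose(52, 5)                        # 2598960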

Example: A fair coin is tossed 10 times. What is the probability that exactly 6 heads occur, and what is the expected number of heads (i.e. # of heads expected in 10 coin tosses)?

Here,

number of trials (n) = 10

p(head) = 0.5; k (success) = 6

p(6; 10, 0.5) = (10 6) * 0.5^6 * (1 - 0.5)^(10 - 6) = (10*9*8*7)/(4*3*2*1) * (0.015625*0.0625) = 210 * 0.0009765625 = 0.205078125

using R (see Section: using Software),

> dbinom(6,10,.5)

[1] 0.2050781

E(head) = n*(Pr(X=head)*1 + Pr(X=tail)*0) = 10*0.5 = 5. So, we expect to get heads 5 times in 10 coin tosses.
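The expected number of heads can also be checked numerically from the full binomial distribution (a minimal sketch):

k = 0:10
sum(k * dbinom(k, 10, 0.5))          # expected number of heads = 5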

Example: Find the mean (expected value) and variance of the number of sixes when rolling 30 dice.

The probability of success, i.e. rolling a six, is p = 1/6, and 1 - p = 5/6.

E[6] = n*p = 30*(1/6) = 5

Var(6) = npq = 30*(1/6)*(5/6) = 25/6
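A minimal R sketch verifying np and npq from the full binomial distribution:

k = 0:30
p = dbinom(k, 30, 1/6)               # Pr(exactly k sixes in 30 rolls)
sum(k * p)                           # mean = np = 5
sum((k - 5)^2 * p)                   # variance = npq = 25/6 ~ 4.17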

Standard Error or Standard Error of the Mean (SEM): In reality there is a true mean and a true standard deviation, and these are unknown. A sample mean deviates from the actual (true) mean of a population; this deviation is the standard error. The standard error is inversely proportional to the square root of the sample size: the larger the sample, the smaller the standard error, because the sample statistic approaches the true value. The standard error estimates the standard deviation of the sample mean from the sample standard deviation s as SE = s/sqrt(n).

In a normal distribution, most values (in fact about 95% of them) lie between μ − 2σ and μ + 2σ. As the distribution of sample means has mean m and standard deviation equal to the SE, there is a 95% chance that the sample mean lies within 2SE of μ, which amounts to saying that there is a 95% chance that μ lies between m ± 2SE. This interval is called the Confidence Interval (CI). For the example dataset, m = 4 and SE = 4.94/sqrt(6) ≈ 2.0167, so

CI = (4 - 2*2.0167, 4 + 2*2.0167) = (-0.0334, 8.0334)

Regressions

Regression is a statistical process for estimating the relationship among variables: a dependent variable and one or more independent variables (predictors). The dependent variable represents the outcome or variation, whereas the independent variables represent the inputs or causes, i.e. the potential reasons for variation in the output.

In a function, Y = f(X) = X, Y is the dependent variable which depends on X.

Regression analysis helps us understand how the typical value of the dependent variable changes when any one of the independent variables is varied while the other variables are held fixed.

Linear Regression

Here the model depends linearly on its unknown parameters, i.e. the predicted value is a linear function of the predictors.

Case study [4]: 

Five randomly selected students took a math aptitude test before beginning their statistics course. Based on their aptitude scores and subsequent statistics grades (see the data under "Calculate Regression" in the Using Software section), the department of statistics has three questions:

1. Which linear regression equation best predicts the statistics grade from the math aptitude score?

2. If a student scored 80 on the aptitude test, what statistics grade would we expect?

3. How well does the regression equation fit the data?

(Note: see the section "Calculate Regression" under Using Software for the calculations using R.)

The graph "Statistics Grades Vs Aptitude Scores" looks like the following:

Let's try the linear regression model, which gives the predicted value as:

y' = b + m*x

where Y => Statistics Grade; X => Aptitude Test score; b => y-intercept; m => slope of the fitted line

The predicted values from the model will deviate slightly from the observed (actual) values, since we fit a straight line that has the least total distance from the data points. The difference between the actual value yi and the predicted value yi' from the model is called the residual.

Let's compute the deviation of each point in the y direction, square each one, and add up the squares. The best-fitting line is the line for which this sum of squares is smallest. This method is called the Method of Least Squares, so the problem boils down to a minimization problem.

Find the values of m and b that minimize the quantity E, i.e.

E = Σ (yi - (m*xi + b))^2

Setting the partial derivative of E with respect to b to zero, the value of b can be calculated as:

b = ȳ - m*x̄

Similarly, the value of m is calculated as:

m = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)^2

Plugging in the values, we get

m = 0.644 & b = 26.768

It is common practice to represent the problem with a cost function or loss function for the average sum of squared errors, as shown:

J(m, b) = (1/(2n)) * Σ (yi' - yi)^2

Here, the division by 2 is introduced to simplify the derivatives of the quadratic terms. Though it looks different from the function minimized above, it does not affect the values of m and b that minimize the error.

1. The Linear Regression Equation becomes: y = 26.768 + 0.644x

2. If the aptitude score is 80 i.e. x = 80, his/her estimated statistics grade will be:

y = 26.768 + 0.644*80 = 78.288

Residuals: Residuals (fitting deviations) measure the deviation of an observed value from its predicted value.

For the aptitude score 80, the observed statistics grade is 70 but the value predicted by the linear regression model is 78.288. Hence, residual = 70 - 78.288 = -8.288.
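Assuming the fitted model reg from the "Calculate Regression" section below, the prediction and residual for an aptitude score of 80 can be reproduced as a quick sketch:

new = data.frame(AptitudeTestScore = 80)   # reg comes from lm() under "Calculate Regression"
predict(reg, new)                          # ~ 78.29
70 - predict(reg, new)                     # residual ~ -8.29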

Coefficient of Determination (r^2): It determines how well the data fit the statistical model, i.e. how well the observed data are replicated by the model. It is the quotient of the variance of the fitted values and the variance of the observed values of the dependent variable. A value of 1 indicates a perfect fit and 0 indicates no fit.

3. The value r^2 = 0.480 indicates that about 48% of the variation in statistics grades can be explained by the relationship to math aptitude scores. This is a fair fit in the sense that it would substantially improve an educator's ability to predict a student's performance before they take the statistics class.

Degree of Freedom (DF): The number of independent pieces of information that go into the estimate of a parameter is called the degrees of freedom. The concept is central to estimating statistics of populations from samples. For example, the sample variance has n - 1 degrees of freedom, since it is computed from n independent values minus the 1 parameter estimated as an intermediate step, the sample mean.

In this linear regression example, it is the difference between the number of observations in the training sample (5) and the number of parameters used in the model (intercept & slope, i.e. 2). Basically, you lose one degree of freedom for each parameter estimated before estimating the residual standard deviation. Therefore,

DF = 5 - 2 = 3

Hypothesis Testing: A sample is used to test which statement about a population parameter is most likely. The Null Hypothesis (H0) assumes that whatever we are trying to prove does not happen (i.e. some effect equals zero). The Alternative Hypothesis (HA), on the other hand, is the hypothesis that supports whatever we are trying to prove. The hypotheses are based on the available information and the investigator's beliefs about the population parameters.

H0: The Aptitude test score has no effect on the Statistics grade

HA: The Aptitude test score has a positive effect on Statistics grade

See the figure below for different methods of Hypothesis Testing:

Level of Statistical Significance: It is a threshold below which the null hypothesis is rejected. Typically alpha = 0.05, which indicates a 5% risk of rejecting the null hypothesis when it is actually true: if, assuming the null hypothesis is true, there is a 5% or smaller chance of seeing a result as extreme as the observed one, the null hypothesis is rejected and the alternative hypothesis accepted. In the regression example, the p-value is 0.1945 (see the regression summary under Using Software), which is above 0.05, so the null hypothesis cannot be rejected.

P-Value: The p-value is a statistical measure that helps determine whether the results are statistically significant. It is the probability of obtaining an effect at least as extreme as the one in your sample data, assuming that the null hypothesis is true.

Example: A company prints baseball cards and claims that 30% of the cards are rookies, 60% veterans, and 10% all-stars. Let's calculate the p-value using the 0.05 (alpha) level of significance, i.e. below this p-value the claim (the null hypothesis) will be rejected.

Suppose a random sample of 100 cards has 50 rookies, 45 veterans, and 5 all-stars.

There are 3 categories, i.e. k = 3, and df = k - 1 = 3 - 1 = 2.

The Observed counts are:

O(rookie) = 50, O(veteran) = 45, and O(allstar) = 5

Based on the claim, out of 100 cards the expected counts are:

E(rookie) = 30, E(veteran) = 60, and E(allstar) = 10

The chi-squared statistic is Σ (O - E)^2/E = (50-30)^2/30 + (45-60)^2/60 + (5-10)^2/10 = 13.33 + 3.75 + 2.5 = 19.58. This chi-squared value is large, which indicates that the observed and expected values are far apart. Looking at the chi-squared distribution table for df = 2, the p-value for chi-squared > 19.58 is 5.592e-05, which is less than the cutoff of 0.05. Hence, the null hypothesis (the claim) is rejected. See the R commands under "Chi-Squared Test" in the Using Software section.
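The same statistic can be computed by hand in R (a minimal sketch; chisq.test() under "Chi-Squared Test" gives the same result):

observed = c(50, 45, 5)
expected = c(0.3, 0.6, 0.1) * 100                  # expected counts under the claim
chisq = sum((observed - expected)^2 / expected)    # 19.58
pchisq(chisq, df = 2, lower.tail = FALSE)          # p-value ~ 5.6e-05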

Logistic Regression

There are many situations where the outcomes are binary (e.g. a student is in the honors class or not) or categorical. For a binary outcome Y, we want to model the conditional probability Pr(Y=1|X=x) as a function of x.

In terms of the logit function, it becomes:

logit(p(x)) = ln( p(x)/(1 - p(x)) )

Modeling this log odds as a linear function of x, the logistic regression model becomes:

ln( p(x)/(1 - p(x)) ) = β0 + β1*x

Then, the probability becomes:

p(x) = e^(β0 + β1*x)/(1 + e^(β0 + β1*x))

Conditional Probability: It is the probability of an event given that another event has occurred.

In Table 1, the probability of getting 2 on die A is

Pr(A=2) = 6/36 = 1/6

Similarly,

Pr(A+B <= 5) = 10/36

So, the probability of getting 2 on die A, given that it is revealed that A+B <= 5 (which restricts the sample space to 10 outcomes), is:

Pr(A=2 | A+B <= 5) = 3/10
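This conditional probability can be checked by enumerating the 36 outcomes in R (a minimal sketch):

A = rep(1:6, each = 6)                         # value of die A for each of the 36 outcomes
B = rep(1:6, times = 6)                        # value of die B
sum(A == 2) / 36                               # Pr(A = 2) = 1/6
sum(A + B <= 5) / 36                           # Pr(A + B <= 5) = 10/36
sum(A == 2 & A + B <= 5) / sum(A + B <= 5)     # Pr(A = 2 | A + B <= 5) = 3/10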

Case Study [5]: 

Consider an outcome variable "hon" indicating whether a student is in the honors class or not, and the predictor variables female (gender), math, and read (reading score). The scatter plot, Honors Vs Math, shows the categorical outcomes of 0 (No) or 1 (Yes).

 Case I: with No predictor variable; only the intercept i.e. Y = β0

Check the Frequency table for honors.


hon freq 

 0  151 

 1  49

The overall probability of being in the honors class is 49/200 = 0.245. So, the odds of being in the honors class = 0.245/(1 - 0.245) = 0.3245.

The log odds = ln(0.3245) = -1.1255. In other words, the intercept (β0 = -1.1255) from the model with no predictor variables is the estimated log odds of being in the honors class for the whole population of interest.
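The intercept can be verified by hand (a minimal sketch using the counts from the frequency table above):

p = 49 / 200                 # overall probability of being in the honors class
odds = p / (1 - p)           # ~ 0.3245
log(odds)                    # ~ -1.1255, the intercept of the no-predictor model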

Case II: with a single dichotomous predictor variable i.e. Y = β0 + β1*female

Check the frequency table of hon by the female predictor (columns: 0 => male; 1 => female):

 

 hon   0   1

  0   74  77

  1   17  32

Calculate the odds of being in the Honors Class:

Pr(honors | male) = 17/91

O(male) = (17/91)/(1 - 17/91) = (17/91)/(74/91) = 17/74 = 0.23

O(female) = 32/77 = 0.42

ln(O(male)) = ln(17/74) = -1.4709 (β0 ≈ -1.4709 from the model)

ln(O(female)) = ln(32/77) = -0.8781

The ratio of the odds for females to the odds for males = (32/77)/(17/74) = 1.809, i.e. the odds for females are about 81% higher than the odds for males.

β1 = ln(1.809) = 0.5928, which matches the coefficient for female from the model.

The intercept of -1.471 is the log odds for males since male is the reference group (female = 0). The coefficient for females is the log of odds ratio between the female group and male group: log(1.809) = .593
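A quick hand check of these quantities in R (a minimal sketch using the cross-tabulated counts):

o_male   = 17 / 74           # odds of honors for males
o_female = 32 / 77           # odds of honors for females
log(o_male)                  # ~ -1.47, the model intercept
log(o_female / o_male)       # ~ 0.593, the coefficient for female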

Case III: with a single continuous Predictor variable i.e. Y = β0 + β1*math

The plot with the fitted logistic regression curve is shown below:

The intercept in this model (β0 = -9.79394) corresponds to the log odds of being in an honors class when math is at the hypothetical value of zero. In other words, the odds of being in an honors class when the math score is zero is exp(-9.793942) = .00005579 which is very low.

Substituting the values of β0 and β1 and plugging math scores of 54 and 55 into the logit equation, we find that for a one-unit increase in math score we expect to see about a 17% increase in the odds of being in an honors class. This 17% increase does not depend on the value at which math is held.
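Assuming the Case III fit mylogit = glm(hon ~ math, ...) from the Using Software section, this can be verified as a quick sketch:

b = coef(mylogit)                         # intercept and math coefficient
p54 = plogis(b[1] + b[2] * 54)            # predicted probability at math = 54
p55 = plogis(b[1] + b[2] * 55)            # predicted probability at math = 55
(p55 / (1 - p55)) / (p54 / (1 - p54))     # odds ratio = exp(b[2]) ~ 1.17, i.e. ~17% increase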

Case IV: with all predictors 

Holding math and reading at a fixed value, the odds of getting into an honors class for females (female = 1) over the odds for males (female = 0) is exp(0.979948) = 2.66, i.e. the odds for females are 166% higher than the odds for males. The coefficient for math says that, holding female and reading at a fixed value, we will see a 13% increase in the odds of getting into an honors class for a one-unit increase in math score, since exp(0.1229589) = 1.13.

Maximum Likelihood estimation (MLE)

MLE is a method of estimating the parameters of a statistical model given only the sample data (observed data) from the overall population. It finds the value of the parameters which makes the observed data most likely to have occurred. It is useful in cases where it is infeasible or expensive to measure each individual parameter.

Example [6]: 30%, 45%, and 50% of birds from locations A, B, and C, respectively, are infected with West Nile virus. There is a sample of 100 birds, taken from either location A, B, or C, with 40 cases of the virus. Given this observed data, which value of p (the infection probability parameter) makes the observed data most likely to have occurred?

From the R calculations for the binomial distribution given below (see Section: Using Software), the likelihood is maximized at p = 0.45 among the three candidate values. So, the sample most likely contains birds from location B.

> dbinom(40,100,.45)

[1] 0.0488029

> dbinom(40,100,.5)

[1] 0.01084387

> dbinom(40,100,.3)

[1] 0.008490169

The likelihood function in terms of p can be written as

L(p) = (n k) * p^k * (1 - p)^(n - k)

And, it is maximized when

d(L(p))/dp = 0

which gives,

 p = k/n = 40/100 = 0.4

So, the maximum likelihood estimate of p is p' = 0.4, where the likelihood is maximized:

> dbinom(40,100,.4)

[1] 0.08121914
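The same estimate can be obtained numerically (a minimal sketch maximizing the likelihood over p):

lik = function(p) dbinom(40, 100, p)                        # binomial likelihood of the observed data
optimize(lik, interval = c(0, 1), maximum = TRUE)$maximum   # ~ 0.4 = k/n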

Using MLE we can determine the values of β0 and β1 of the logistic regression that make the observed data most likely to have occurred. In the logistic regression example, the probability of honors p(xi), given that the math grade equals xi, is given by:

p(xi) = e^(β0 + β1*xi)/(1 + e^(β0 + β1*xi))

Since p is a function of x, and each observed yi is either 0 or 1, the likelihood of each data point in terms of the Bernoulli distribution becomes [7]:

L(β0, β1) = Π p(xi)^yi * (1 - p(xi))^(1 - yi)

Since p(x) is a function of β0 and β1, in logarithmic form it becomes:

l(β0, β1) = Σ [ yi*ln(p(xi)) + (1 - yi)*ln(1 - p(xi)) ]

For maximum likelihood,

d(l)/d(βj) = 0

Taking the derivative with respect to one component of β, say βj, it becomes

Σ (yi - p(xi))*xij = 0

That's a transcendental equation, and there is no closed-form solution. We can, however, solve it approximately using numerical methods. Using the Newton-Raphson method, the update becomes:

βn+1 = βn - f’(βn)/f’’(βn)

For details on the Newton-Raphson method, please see the Reference Guide.
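As an illustration of the Newton-Raphson update, here is a minimal sketch applied to the simpler one-parameter binomial log-likelihood from the MLE example above (not to the full logistic regression):

k = 40; n = 100
dl  = function(p) k / p - (n - k) / (1 - p)          # first derivative of the log-likelihood
ddl = function(p) -k / p^2 - (n - k) / (1 - p)^2     # second derivative
p = 0.2                                              # starting guess
for (i in 1:10) p = p - dl(p) / ddl(p)               # Newton-Raphson updates
p                                                    # converges to the MLE k/n = 0.4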

Using Software

Request a compute node

srun --pty bash

Load the appropriate module

module load R

Calculate the statistical measures:

The psych library is available in R/3.2.0. You may need to install the psych library by following the Software Installation Guide.

module load R/3.2.0

R

> library(psych)

> mydata = c(2, 1, 3, 2, 2, 14)

> describe(mydata)

  vars n mean   sd median trimmed  mad min max range skew kurtosis   se

1    1 6    4 4.94      2       4 0.74   1  14    13 1.31    -0.16 2.02

Calculate the Confidence Interval (CI) with a 95% confidence level, which implies the 97.5th percentile at the upper tail:

> mydata = c(1,2,2,2,3,14)

> E = qnorm(0.975)*describe(mydata)$se   # for t distribution it is qt(.975, df=n-1)

> E

[1] 3.952459

> ci = c(-E,E) + describe(mydata)$mean

> ci

[1] 0.04754095 7.95245905

Using student t-distribution:

> t.test(mydata)

        One Sample t-test

data:  mydata

t = 1.9835, df = 5, p-value = 0.1041

alternative hypothesis: true mean is not equal to 0

95 percent confidence interval:

 -1.18383  9.18383

sample estimates:

mean of x

        4

Calculate Regression:

Copy the CSV file "regression.csv" to your home directory.

Calculate the coefficient "b" (intercept) and the slope "m" (=> AptitudeTestScore):

module load R

R

> aptitude = read.table("regression.csv", sep=",", header=T)

> head(aptitude)

  Student AptitudeTestScore StatisticsGrade

1       1                95              85

2       2                85              95

3       3                80              70

4       4                70              65

5       5                60              70

> reg = lm( StatisticsGrade ~ AptitudeTestScore, data=aptitude )

> reg

Call:

lm(formula = StatisticsGrade ~ AptitudeTestScore, data = aptitude)

Coefficients:

      (Intercept)  AptitudeTestScore

          26.7808             0.6438

Now, fit the line on the scatter plot.

> par(cex=.8)

> plot(aptitude$AptitudeTestScore, aptitude$StatisticsGrade, xlab="Aptitude Test Score", ylab="Statistics Grade")

> abline(reg)

You will see the scatter plot with the fitted line as shown:

Calculate the "Coefficient of Determination (r^2) for the linear regression

> summary(reg)$r.squared

[1] 0.4803218

See the Regression Summary

> summary(reg)

Call:

lm(formula = StatisticsGrade ~ AptitudeTestScore, data = aptitude)

Residuals:

     1      2      3      4      5

-2.945 13.493 -8.288 -6.849  4.589

Coefficients:

                  Estimate Std. Error t value Pr(>|t|)

(Intercept)        26.7808    30.5182   0.878    0.445

AptitudeTestScore   0.6438     0.3866   1.665    0.194

Residual standard error: 10.45 on 3 degrees of freedom

Multiple R-squared:  0.4803,    Adjusted R-squared:  0.3071

F-statistic: 2.773 on 1 and 3 DF,  p-value: 0.1945

Chi-Squared Test:

> observed = c(50,45,5)

> expectedProb = c(0.3,0.6,.1)

chisq.test(observed, p=expectedProb)

        Chi-squared test for given probabilities

data:  observed

X-squared = 19.583, df = 2, p-value = 5.592e-05

Logistic Regression

Plot:

> data = read.table("sample.csv", sep=",", header=T)

> plot(data$math, data$hon, main="Regression Plot", xlab="Math Score", ylab="Honors")

Logit function:

Case I: no predictor

> mylogit = glm(hon ~ 1, family=binomial(link=logit), data=data)
> summary(mylogit)

Call:
glm(formula = hon ~ 1, family = binomial(link = logit), data = data)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.7497 -0.7497 -0.7497 -0.7497  1.6772

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.1255     0.1644  -6.845 7.62e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 222.71  on 199  degrees of freedom
Residual deviance: 222.71  on 199  degrees of freedom
AIC: 224.71

Number of Fisher Scoring iterations: 4

Case II: one dichotomous predictor

> mylogit = glm(hon ~ female, family=binomial(link=logit), data=data)
> summary(mylogit)

Call:
glm(formula = hon ~ female, family = binomial(link = logit), data = data)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-0.8337 -0.8337 -0.6431 -0.6431  1.8317

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -1.4709     0.2690  -5.469 4.53e-08 ***
female        0.5928     0.3414   1.736   0.0825 .
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 222.71  on 199  degrees of freedom
Residual deviance: 219.61  on 198  degrees of freedom
AIC: 223.61

Number of Fisher Scoring iterations: 4

Case III: one continuous predictor

> mylogit = glm(hon ~ math, family=binomial(link=logit), data=data)

Plotting with logistic Regression:

> plot(data$math,data$hon,main="Math Score Vs Honors",xlab="Math Scores",ylab="Probability of Honors")
> curve(predict(mylogit,data.frame(math=x),type="resp"),add=TRUE)

Note: add=TRUE adds the curve to the existing plot.

Case IV: all predictors

> mylogit = glm(hon ~ female + math + read, family=binomial(link=logit), data=data)

or

> mylogit = glm(hon ~ ., family=binomial(link=logit), data=data)

Odds Ratio and CI:

> exp(cbind(OR = coef(mylogit), confint(mylogit)))
Waiting for profiling to be done...
                   OR     2.5 %    97.5 %
(Intercept) 0.2297297 0.1312460 0.3792884
female      1.8090145 0.9362394 3.5929859

Count:

> library(plyr)

> count(data,'female')
  female freq
1      0   91
2      1  109
> count(data,'hon')
  hon freq
1   0  151
2   1   49

> xtabs(~ hon + female, data=data)
   female
hon  0  1
  0 74 77
  1 17 32

References:

[1] Wikipedia - https://www.wikipedia.org

[2] Log of Odds: http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm

[3] https://www.mathsisfun.com

[4] Linear Regression Tutorial - http://stattrek.com/regression/regression-example.aspx?tutorial=ap

[5] Logistic Regression - http://www.ats.ucla.edu/stat/mult_pkg/faq/general/odds_ratio.htm

[6] Maximum Likelihood Estimation (MLE) - http://www.wright.edu/~thaddeus.tarpey/ES714glm.pdf

[7] MLE (Logistic) - http://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf