If you’re just beginning to learn about regression analysis, a simple linear is the first type of regression you’ll come across in a stats class. Linear regression is the most widely used statistical technique; it is a way to model a relationship between two sets of variables. The result is a linear regression equation that can be used to make predictions about data.
Most software packages and calculators can calculate linear regression. For example:
TI-83.
Excel.
You can also Find a linear regression by hand.
Before you try your calculations, you should always make a scatter plot to see if your data roughly fits a line. Why? Because regression will always give you an equation, and it may not make any sense if your data follows an exponential model.
Etymology
“Linear” means line. The word Regression came from a 19th-Century Scientist, Sir Francis Galton, who coined the term “regression toward mediocrity” (in modern language, that’s regression to the mean. He used the term to describe the phenomenon of how nature tends to dampen excess physical traits from generation to generation (like extreme height)
Why use Linear Relationships?
Linear relationships, i.e. lines, are easier to work with and most phenomenon are naturally linearly related. If variables aren’t linearly related, then some math can transform that relationship into a linear one, so that it’s easier for the researcher (i.e. you) to understand.
What is Simple Linear Regression?
You’re probably familiar with plotting line graphs with one X axis and one Y axis. The X variable is sometimes called the independent variable and the Y variable is called the dependent variable. Simple linear regression plots one independent variable X against one dependent variable Y. Technically, in regression analysis, the independent variable is usually called the predictor variable and the dependent variable is called the criterion variable. However, many people just call them the independent and dependent variables. More advanced regression techniques (like multiple regression) use multiple independent variables.
Regression analysis can result in linear or nonlinear graphs. A linear regression is where the relationships between your variables can be described with a straight line. Non-linear regressions produce curved lines.(**)
Regression analysis is almost always performed by a computer program, as the equations are extremely time-consuming to perform by hand.
As this is an introductory article, I kept it simple. But there’s actually an important technical difference between linear and nonlinear, that will become more important if you continue studying regression. For details, see the article on nonlinear regression.
How to Find a Linear Regression Equation: Overview
Regression analysis is used to find equations that fit data. Once we have the regression equation, we can use the model to make predictions. One type of regression analysis is linear analysis. When a correlation coefficient shows that data is likely to be able to predict future outcomes and a scatter plot of the data appears to form a straight line, you can use simple linear regression to find a predictive function. If you recall from elementary algebra, the equation for a line is y = mx + b. This article shows you how to take data, calculate linear regression, and find the equation y’ = a + bx. Note: If you’re taking AP statistics, you may see the equation written as b0 + b1x, which is the same thing (you’re just using the variables b0 + b1 instead of a + b.
The Linear Regression Equation
Linear regression is a way to model the relationship between two variables. You might also recognize the equation as the slope formula. The equation has the form Y= a + bX, where Y is the dependent variable (that’s the variable that goes on the Y axis), X is the independent variable (i.e. it is plotted on the X axis), b is the slope of the line and a is the y-intercept.
Describe Summaton formula: here
The first step in finding a linear regression equation is to determine if there is a relationship between the two variables. This is often a judgment call for the researcher. You’ll also need a list of your data in x-y format (i.e. two columns of data—independent and dependent variables).
Warnings:
Just because two variables are related, it does not mean that one causes the other. For example, although there is a relationship between high GRE scores and better performance in grad school, it doesn’t mean that high GRE scores cause good grad school performance.
If you attempt to try and find a linear regression equation for a set of data (especially through an automated program like Excel or a TI-83), you will find one, but it does not necessarily mean the equation is a good fit for your data. One technique is to make a scatter plot first, to see if the data roughly fits a line before you try to find a linear regression equation.
Step 1: Make a chart of your data, filling in the columns in the same way as you would fill in the chart if you were finding the Pearson’s Correlation Coefficient.
Step 2: Use the following equations to find a and b.
Step 3: Insert the values into the equation.
y’ = a + bx
In Logistic Regression, we don’t directly fit a straight line to our data like in linear regression. Instead, we fit a S shaped curve, called Sigmoid, to our observations.
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
Sometimes logistic regressions are difficult to interpret; the Intellectus Statistics tool easily allows you to conduct the analysis, then in plain English interprets the output
Types of Questions Binary Logistic Regression Can Answer
How does the probability of getting lung cancer (yes vs. no) change for every additional pound a person is overweight and for every pack of cigarettes smoked per day?
Do body weight, calorie intake, fat intake, and age have an influence on the probability of having a heart attack (yes vs. no)?
Binary Logistic Regression Major Assumptions:
The dependent variable should be dichotomous in nature (e.g., presence vs. absent).
There should be no outliers in the data, which can be assessed by converting the continuous predictors to standardized scores, and removing values below -3.29 or greater than 3.29.
There should be no high correlations (multicollinearity) among the predictors. This can be assessed by a correlation matrix among the predictors. Tabachnick and Fidell (2013) suggest that as long correlation coefficients among independent variables are less than 0.90 the assumption is met.
At the center of the logistic regression analysis is the task estimating the log odds of an event. Mathematically, logistic regression estimates a multiple linear regression function defined as:
Overfitting. When selecting the model for the logistic regression analysis, another important consideration is the model fit. Adding independent variables to a logistic regression model will always increase the amount of variance explained in the log odds (typically expressed as R²). However, adding more and more variables to the model can result in overfitting, which reduces the generalizability of the model beyond the data on which the model is fit.
Reporting the R2. Numerous pseudo-R2 values have been developed for binary logistic regression. These should be interpreted with extreme caution as they have many computational issues which cause them to be artificially high or low. A better approach is to present any of the goodness of fit tests available; Hosmer-Lemeshow is a commonly used measure of goodness of fit based on the Chi-square test.