Reference: https://towardsdatascience.com/linear-regression-explained-1b36f97b7572
Linear regression is a statistical method used to analyze the relationship between two or more variables. It is often used to predict the value of one variable based on the values of other variables. In its simplest form, linear regression assumes that there is a linear relationship between the independent variable(s) and the dependent variable. The goal of linear regression is to find the best-fit line that represents this relationship.
The best-fit line is found by minimizing the sum of the squared differences between the predicted values and the actual values. Once the best-fit line is found, it can be used to make predictions about the dependent variable for new values of the independent variable(s).
Linear regression is widely used in various fields such as finance, economics, and engineering. It is a powerful tool for analyzing data and making predictions. By understanding linear regression, you can gain valuable insights into the relationships between variables and use this knowledge to make informed decisions.
To find the best-fit line that represents this relationship, linear regression uses a method called ordinary least squares (OLS) regression. This method involves finding the line that minimizes the sum of the squared differences between the predicted values and the actual values. In other words, we want to find the line that is as close as possible to all of the data points.
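As a minimal sketch of this idea (assuming scikit-learn and NumPy are available, and using made-up data purely for illustration), the fit below finds the line that minimizes the sum of squared differences between predictions and observations:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: one independent variable x and one dependent variable y.
x = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # shape (n_samples, 1)
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

# Ordinary least squares: finds the line minimizing the sum of squared residuals.
model = LinearRegression()
model.fit(x, y)

print("slope (m):", model.coef_[0])
print("intercept (b):", model.intercept_)

# Predict the dependent variable for a new value of the independent variable.
print("prediction at x = 6:", model.predict([[6.0]])[0])
```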
To calculate the best-fit line using OLS regression, we first plot the data points on a graph with the independent variable(s) on the x-axis and the dependent variable on the y-axis. OLS then finds the single line through the data points that is as close as possible to all of the points, where closeness is measured by the squared vertical distances between the points and the line.
The equation for a straight line is typically given as y = mx + b, where y is the dependent variable, x is the independent variable, m is the slope of the line, and b is the y-intercept (the point at which the line crosses the y-axis). To find the best-fit line, we need to find the values of m and b that minimize the sum of the squared differences between the predicted values and the actual values. Once we have found the best-fit line, we can use it to make predictions about the dependent variable for new values of the independent variable(s).
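For a single independent variable, the values of m and b that minimize the sum of squared differences have a closed form: m is the sum of the products of the deviations of x and y from their means, divided by the sum of the squared deviations of x, and b is the mean of y minus m times the mean of x. A small NumPy sketch (same made-up data as above, for illustration only) makes this concrete:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])

x_mean, y_mean = x.mean(), y.mean()

# Closed-form OLS solution for a single predictor:
#   m = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)**2)
#   b = y_mean - m * x_mean
m = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b = y_mean - m * x_mean

predictions = m * x + b
sse = np.sum((y - predictions) ** 2)   # the quantity OLS minimizes

print(f"m = {m:.3f}, b = {b:.3f}, sum of squared errors = {sse:.3f}")
```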
Regression coefficients, also known as regression weights or beta coefficients, are values that represent the change in the dependent variable for a one-unit change in the independent variable. In other words, they tell us how much the dependent variable is expected to change when the independent variable changes by one unit. In simple linear regression, the regression coefficient is the slope (m) in the equation y = mx + b; when there are several independent variables, each one has its own coefficient.
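With more than one independent variable, each coefficient is the expected change in the dependent variable for a one-unit change in that predictor, holding the other predictors fixed. A hedged sketch with two made-up predictors (the "true" weights 3.0 and -1.5 are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Made-up data: y depends on two predictors with true weights 3.0 and -1.5.
X = rng.normal(size=(100, 2))
y = 3.0 * X[:, 0] - 1.5 * X[:, 1] + 2.0 + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X, y)

# Each coefficient: expected change in y for a one-unit change in that predictor,
# holding the other predictor fixed.
print("coefficients:", model.coef_)      # close to [3.0, -1.5]
print("intercept:", model.intercept_)    # close to 2.0
```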
The four main components of linear regression are:
The dependent variable: This is the variable that we are trying to predict. In linear regression, it is also known as the response variable.
The independent variable(s): These are the variable(s) that we use to predict the dependent variable. In linear regression, they are also known as the predictor variable(s) or the explanatory variable(s).
The best-fit line: This is the line that represents the relationship between the independent variable(s) and the dependent variable. It is calculated using the method of least squares, which finds the line that minimizes the sum of the squared differences between the predicted values and the actual values.
The error term: This is the difference between the predicted values and the actual values. In linear regression, we assume that the error term is normally distributed with a mean of zero and a constant variance. This assumption allows us to make statistical inferences about the regression coefficients and to calculate confidence intervals and hypothesis tests.
Together, these four components make up the linear regression model. The model can be used to make predictions about the dependent variable for new values of the independent variable(s) and to understand the relationship between the variables.
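One way to see the error term at work (a rough diagnostic sketch, assuming the same kind of made-up data as above) is to fit the model and inspect the residuals, which should be roughly centered at zero with no obvious pattern if the assumptions hold:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Made-up data: a linear signal plus normally distributed noise.
X = rng.uniform(0, 10, size=(200, 1))
y = 1.7 * X[:, 0] + 4.0 + rng.normal(scale=1.0, size=200)

model = LinearRegression().fit(X, y)
residuals = y - model.predict(X)   # estimates of the error term: actual minus predicted

print("residual mean (should be close to 0):", residuals.mean())
print("residual std (estimate of the constant variance):", residuals.std())
```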
Linear regression is a powerful tool for analyzing data and making predictions, but it has some limitations. Here are some of the main limitations of linear regression and some ways to address them:
Linearity assumption: Linear regression assumes that there is a linear relationship between the independent variable(s) and the dependent variable. If the relationship is not linear, linear regression may not be the best method to use. One way to address this limitation is to use polynomial regression or other non-linear regression methods that can model non-linear relationships.
Outliers: Linear regression is sensitive to outliers, which are data points that are far away from the rest of the data. Outliers can have a large impact on the regression coefficients and the best-fit line. One way to address this limitation is to use robust regression methods that are less sensitive to outliers, such as the least absolute deviation (LAD) method or the Huber method.
Multicollinearity: Linear regression assumes that the independent variables are not highly correlated with each other. If the independent variables are highly correlated, this can lead to unstable regression coefficients and make it difficult to interpret the model. One way to address this limitation is to use methods such as principal component regression or ridge regression, which can handle multicollinearity.
Overfitting: Linear regression can be prone to overfitting, which occurs when the model is too complex and fits the noise in the data as well as the signal. This can lead to poor predictions on new data. One way to address this limitation is to use regularization methods, such as ridge regression or lasso regression, which discourage overfitting by adding a penalty on the size of the regression coefficients (a short sketch comparing these approaches appears after this list).
Normality assumption: Linear regression assumes that the error term is normally distributed with a mean of zero and a constant variance. If this assumption is violated, the coefficient estimates themselves remain unbiased, but the standard errors, confidence intervals, and hypothesis tests based on them can be unreliable. One way to address this limitation is to use methods such as generalized linear models or robust regression, which can handle non-normal errors.
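The sketch below (an illustrative comparison, not a recipe; it assumes scikit-learn and uses deliberately contaminated, made-up data) shows how Huber regression, ridge, and lasso can be dropped in as alternatives to plain OLS when outliers, multicollinearity, or overfitting are a concern:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor, Ridge, Lasso

rng = np.random.default_rng(2)

# Made-up data with two highly correlated predictors and a few outliers.
x1 = rng.normal(size=80)
x2 = x1 + rng.normal(scale=0.05, size=80)        # near-duplicate of x1 -> multicollinearity
X = np.column_stack([x1, x2])
y = 2.0 * x1 + 1.0 * x2 + rng.normal(scale=0.5, size=80)
y[:5] += 25.0                                    # a handful of large outliers

models = {
    "OLS": LinearRegression(),
    "Huber (robust to outliers)": HuberRegressor(),
    "Ridge (handles multicollinearity)": Ridge(alpha=1.0),
    "Lasso (regularization / shrinkage)": Lasso(alpha=0.1),
}

for name, model in models.items():
    model.fit(X, y)
    print(f"{name}: coefficients = {np.round(model.coef_, 2)}")
```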