1. Concepts & Definitions
1.1. Linear regression: Concepts and equations
1.2. Linear regression: Numerical example
1.3. Correlation is no causation
1.4. Dummy and categorical variables
1.5. Multiple linear regression
1.6. Dummy multiple linear regression
2. Problem & Solution
2.1. Predicting Exportation & Importation Volume
In simple words, linear regression can be defined as a mathematical model that establishes the relationship between a variable y and one or more independent variables through a linear function. Typically, linear regression is used to determine a "line of best fit" that passes through a set of data points. This best-line-fitting task can be solved with the least squares method.
The basic form is known as simple linear regression, a statistical technique that can be used to understand the quantitative relationship between two variables, x (predictor) and y (response), through a linear equation of the form y = β0 + β1x. The task of linear regression is to predict future values of y using a sample of previously observed (x, y) pairs.
Now, in mathematical terms, let (x1, y1), (x2, y2), ..., (xn, yn) be a set of n pairs of data, tabulated such that each independent (input) variable xi is associated with a dependent (output) variable yi, for all i = 1, ..., n. We want to find the linear function f(x) that establishes the linear relationship between the input variable and the output variable. Given an input xi, this function predicts the output ŷi employing the following equation.
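ŷi = f(xi) = β0 + β1xi, where β0 is the intercept and β1 is the slope of the fitted line.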
Alternatively, the equation for the linear function f(x) can be written using a polynomial formula since linear regression is a special case of polynomial regression.
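In that case the model can be written as f(x) = β0 + β1x + β2x² + ... + βmx^m, and simple linear regression corresponds to the special case m = 1.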
The predicted output should be a real number (4.13, 57.342, 13089.21, etc.), such as the price of a house, a stock's share price after a certain period of time, the expected sales revenue for a certain period, the expected customs duty amount next month, and so on.
To determine the linear function f(x), it is necessary to find the coefficients β0 and β1. For this purpose, one must find the coefficients that minimize the sum of the squared differences (εi) between the observed output values (yi) and the outputs (ŷi) predicted by the model for the given input values (xi). All these elements and their relations are summarized in the following chart.
The previous figure can be expressed in mathematical terms by the following optimization problem.
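minimize, over β0 and β1, the sum of squared errors Σεi² = Σ(yi − ŷi)² = Σ(yi − β0 − β1xi)², where the sums run over i = 1, ..., n.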
Setting the partial derivatives with respect to the coefficients β0 and β1 to zero leads to the following equations for obtaining them from the tabulated pairs of data (x1, y1), (x2, y2), ..., (xn, yn).
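β1 = Sxy / Sxx
β0 = ȳ − β1x̄ = (Σyi − β1Σxi) / n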
Where:
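Sxy = Σxiyi − (Σxi)(Σyi)/n,
Sxx = Σxi² − (Σxi)²/n,
x̄ = Σxi/n and ȳ = Σyi/n are the sample means of the observed x and y values.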
The calculation formulas for Sxy and Sxx require only the summary statistics Σxi, Σyi, Σxiyi, and Σxi² (the sums of the values in the columns of the data table) to be calculated. After finding the coefficients, it is possible to extract some metrics that help to understand how well the fitted equation explains the relation between inputs and outputs.
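For example, the coefficients can be computed directly from these column sums. The short Python sketch below does this for a small, purely illustrative data set; the function name fit_simple_linear_regression is an arbitrary choice.

def fit_simple_linear_regression(x, y):
    # Estimate beta0 and beta1 with the least squares formulas above,
    # using only the column sums Σxi, Σyi, Σxiyi and Σxi².
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_xx = sum(xi * xi for xi in x)
    s_xy = sum_xy - sum_x * sum_y / n    # Sxy
    s_xx = sum_xx - sum_x ** 2 / n       # Sxx
    beta1 = s_xy / s_xx
    beta0 = (sum_y - beta1 * sum_x) / n  # equals mean(y) - beta1 * mean(x)
    return beta0, beta1

# Purely illustrative data
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 8.1, 9.8]
beta0, beta1 = fit_simple_linear_regression(x, y)
print(beta0, beta1)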
The performance of a linear regression model can be evaluated using a variety of metrics. The most common metrics are presented in the next table.
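Such a table typically includes at least the following (the exact selection may vary):
Mean Squared Error: MSE = (1/n) Σ(yi − ŷi)²
Root Mean Squared Error: RMSE = √MSE
Mean Absolute Error: MAE = (1/n) Σ|yi − ŷi|
R-squared (coefficient of determination): R² = 1 − Σ(yi − ŷi)² / Σ(yi − ȳ)²
Adjusted R-squared: R²adj = 1 − (1 − R²)(n − 1)/(n − k − 1)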
Where:
n = the number of observations,
yi = the actual value of the dependent variable for the i-th observation,
ŷi = the predicted value of the dependent variable for the i-th observation,
k = the number of predictors.
For more information about the key metrics, their interpretation, pros and cons, and limitations, please refer to [6].
The function used to calculate the error εi between the actual output value yi and the predicted output value ŷi is known as the loss function. It is also often called the cost function or error function. The loss function measures how large the errors that the linear regression model makes on a data set are. The calculated error can be viewed as a distance between the predicted value ŷi and the actual value yi. The Mean Squared Error (MSE) is the most common loss function.
The coefficient of determination or R-squared is calculated using the ratio of the error sum of squares (SSE) and the total sum of squares (SST). This section explains how to calculate SSE and SST.
The error sum of squares SSE (or residual sum of squares) can be interpreted as a measure of how much variation in y is left unexplained by the model, or how much cannot be attributed to a linear relationship [1].
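In the notation used above, SSE = Σ(yi − ŷi)² = Σεi².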
A quantitative measure of the total amount of variation in the observed y values is given by the total sum of squares (SST). The SST is the sum of squared deviations of the observed y values about their sample mean ȳ, i.e., the total variation present when no predictors are taken into account.
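SST = Σ(yi − ȳ)², where ȳ is the sample mean of the observed y values.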
The ratio SSE/SST can be interpreted as the proportion of total variation that cannot be explained by the simple linear regression model.
Thus the R-squared metric (or coefficient of determination) is the proportion of the observed y variation that is explained by the model. The two quantities are related through the next equation.
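R² = 1 − SSE/SST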
It is important to note that R-squared is always a number between zero and one. The higher the R-squared, the more successful the linear regression model is at explaining y variation in terms of x variation.
The coefficient of determination can be written in a slightly different way by introducing a third sum of squares, which is the regression sum of squares (SSR):
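SSR = Σ(ŷi − ȳ)², so that SST = SSE + SSR.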
The regression sum of squares (SSR) could be interpreted as the amount of total variation that is explained by the model. With this new equation, the equation for R-squared can be expressed in a new form.
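R² = SSR/SST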
Using this new form, R-squared can be interpreted as the ratio of explained variation to total variation.
The next figure depicts the relationship between a linear regression equation and its quantitative measures, which make up the R-squared metric.
R-squared has an inherent problem: adding input variables causes R-squared to remain the same or increase (this is due to how R-squared is calculated mathematically). Therefore, R-squared will increase even if the additional input variables show no relationship with the output variable [4].
Essentially, the adjusted R-squared looks at whether additional input variables actually contribute to the model. The adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in a regression model. It is calculated as [5]:
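Adjusted R² = 1 − [(1 − R²)(n − 1) / (n − k − 1)]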
where:
R²: The R-squared of the model,
n: The number of observations,
k: The number of predictor variables.
Because R-squared always increases as more predictors are added to a model, adjusted R-squared can serve as a metric that shows how useful a model is once the number of predictors is taken into account.
Adjusted R-squared increases only if the new variable improves the model more than would be expected by chance, and decreases when the predictor improves the model less than would be expected by chance. This makes it a more reliable metric when comparing models of different complexities.
In this sense, adjusted R-squared should be used when comparing multiple regression models with different numbers of predictors [6].
Regression analysis is a form of inferential statistics. P-values in regression help determine whether the relationships observed in a sample also exist in the larger population [2, 3].
The P-value for each independent variable in linear regression tests the null hypothesis that the variable has no correlation with the dependent variable. If there is no correlation, then there is no association between changes in the independent variable and shifts in the dependent variable; in other words, there is not enough evidence to conclude that an effect exists at the population level.
If the P-value for a variable is less than the chosen significance level, then the sample data provide enough evidence to reject the null hypothesis for the entire population. The data support the hypothesis of a non-zero correlation: changes in the independent variable are associated with changes in the dependent variable at the population level. Such a variable is statistically significant and is likely a worthy addition to the regression model.
On the other hand, when the P-value in a regression is greater than the significance level, it indicates that there is not enough evidence in the sample to conclude that a non-zero correlation exists.
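For instance, assuming the data are held in NumPy arrays, one common way to obtain these P-values in Python is an ordinary least squares fit with statsmodels; the small arrays below are purely illustrative.

import numpy as np
import statsmodels.api as sm

# Purely illustrative data: y grows roughly linearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 4.1, 5.8, 8.2, 9.9, 12.1])

X = sm.add_constant(x)        # adds the intercept column for beta0
model = sm.OLS(y, X).fit()    # ordinary least squares fit
print(model.params)           # estimated coefficients beta0 and beta1
print(model.pvalues)          # P-value for each coefficient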
Linear regression is a widely used statistical technique that has a number of practical applications in the real world. Here are some common applications of linear regression:
In economics and finance, to predict stock prices based on historical data and various indicators,
In marketing, to analyze the impact of product marketing strategies, for example to understand how product sales depend on the advertising budget spent: Product sales = β0 + β1*(product advertising expenses), where β0 is the expected sales volume when advertising expenses are zero,
In healthcare, for example to predict healthcare costs for patients depending on age, gender, and pre-existing conditions: Healthcare cost = β0 + β1*(age) + β2*(gender) + β3*(condition),
In medical research, to understand the relationship between variables such as medication dosage and the patient's condition, e.g., blood pressure vs drug dosage: Blood pressure = β0 + β1*(drug dosage),
In education: to analyze the relationship between student performance and factors like study time, class attendance, and socioeconomic background: Student performance = β0 + β1*(study time) + β2*(class attendance) + β3*(socioeconomic background),
In agriculture, to understand how productivity depends on the amount of fertilizer applied and the amount of water: Productivity = β0 + β1*(amount of fertilizer) + β2*(amount of water).
References
[1] https://medium.com/@arunp77/regression-model-f411b4042445
[2] https://www.statology.org/statsmodels-linear-regression-p-value/
[3] https://statisticsbyjim.com/regression/interpret-coefficients-p-values-regression/
[4] https://corporatefinanceinstitute.com/resources/data-science/adjusted-r-squared/