Hypothesis 2:
H0: There is a linear relationship between energy, loudness, and acousticness, i.e.
Acousticness = a*energy + b*loudness + c
· Linear Regression:
Linear regression fits a dependent variable Y as a linear function of independent variables X. It assumes a linear relationship between X and Y, approximately normally distributed residuals, and homoscedasticity (constant residual variance). Estimation techniques such as ordinary least squares (OLS) and maximum likelihood are commonly used to fit the coefficients.
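The OLS estimator mentioned above can be sketched directly from its closed form, beta_hat = (X^T X)^{-1} X^T y. The data below is synthetic and purely illustrative, not the actual dataset:

```python
import numpy as np

# Generate illustrative data: y is a known linear function of x1, x2 plus noise.
rng = np.random.default_rng(0)
n = 100
x1 = rng.uniform(0, 1, n)
x2 = rng.uniform(-30, 0, n)
y = 2.0 * x1 - 0.5 * x2 + 1.0 + rng.normal(0, 0.1, n)

# Design matrix; the column of ones produces the intercept term c.
X = np.column_stack([x1, x2, np.ones(n)])

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# np.linalg.lstsq solves the same least-squares problem more stably.
beta_ls, *_ = np.linalg.lstsq(X, y, rcond=None)
```

In practice one would use `lstsq` (or a library such as statsmodels) rather than inverting X^T X explicitly, since the explicit inverse is numerically fragile when predictors are nearly collinear.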
Motivation:
In our dataset, there appears to be a connection between energy, loudness, and acousticness, so a linear model is used to fit the relationship.
Experiments:
1. Independent variables: energy, loudness.
2. Dependent variable: acousticness.
3. OLS is used to estimate the coefficients.
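The experiment above can be sketched end to end: fit the two-predictor OLS model, then compute R-squared and the coefficient t-statistics by hand. The data here is synthetic, generated to resemble the fitted model, and the variable names are only illustrative:

```python
import numpy as np

# Synthetic stand-in for the dataset (not the real data).
rng = np.random.default_rng(42)
n = 1000
energy = rng.uniform(0, 1, n)
loudness = rng.uniform(-30, 0, n)
acousticness = -0.662 * energy - 0.0159 * loudness + 0.591 + rng.normal(0, 0.1, n)

# Design matrix with intercept first, then OLS fit.
X = np.column_stack([np.ones(n), energy, loudness])
beta, *_ = np.linalg.lstsq(X, acousticness, rcond=None)
resid = acousticness - X @ beta

# R^2 = 1 - SS_res / SS_tot
ss_res = resid @ resid
ss_tot = np.sum((acousticness - acousticness.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

# t-statistic for each coefficient: beta_j / se(beta_j).
# Very large |t| corresponds to a p-value of essentially 0.
dof = n - X.shape[1]
sigma2 = ss_res / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
t_vals = beta / se
```

On synthetic data of this shape the t-statistics come out far above any significance threshold, mirroring the near-zero p-values reported in the results.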
Results:
The p-values for all coefficients are approximately 0, so the coefficients are statistically significant. The R-squared is 0.732, which means the fit is acceptable. Thus, there is a linear relationship between acousticness, energy, and loudness. From the table, it can be represented as
Acousticness = -0.662*energy - 0.0159*loudness + 0.591
Discussion of Multicollinearity in Statistics and Machine Learning
Considering multicollinearity is important in statistical regression analysis because, in the extreme, it directly bears on whether the coefficients are uniquely identified by the data. In regression, we are trying to understand the impact each independent variable has on the dependent variable. If there is strong multicollinearity, this is simply not possible, and no algorithm can fix it. For example, if studiousness is correlated with class attendance and with grades, we cannot tell what is truly causing the grades to go up: attendance or studiousness.
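A common diagnostic for the problem described above is the variance inflation factor, VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the remaining predictors. The sketch below uses made-up "attendance"/"studiousness" data echoing the example; names and thresholds are illustrative:

```python
import numpy as np

def vif(X):
    """Return one variance inflation factor per column of predictor matrix X."""
    n, k = X.shape
    out = []
    for j in range(k):
        y = X[:, j]
        # Regress column j on the other columns (plus an intercept).
        others = np.column_stack([np.delete(X, j, axis=1), np.ones(n)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

# Two strongly correlated predictors, as in the attendance/studiousness example.
rng = np.random.default_rng(7)
attendance = rng.normal(0, 1, 500)
studiousness = attendance + rng.normal(0, 0.1, 500)
X = np.column_stack([attendance, studiousness])
vifs = vif(X)
```

A rule of thumb is that VIF values above roughly 5-10 signal problematic multicollinearity; for the nearly duplicated predictors here the VIFs are far above that.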
However, in machine learning techniques that focus on predictive accuracy, all we care about is how well one set of variables predicts another. We care much less about the impact these variables have on each other.