Linear regression is frequently the "go-to", most practical approach to modelling the relationship between two or more variables. When we are convinced that a cause-and-effect relationship exists between variables, we designate one as the dependent variable and link it to one or more explanatory variables (or independent variables). A simplified breakdown of these concepts can be found in the Useful Statistical Concepts section. MIT OpenCourseWare also provides free yet state-of-the-art instruction in data analytics, ideal for small entrepreneurs. Their online course examines real-world examples of how analytics has been used to transform a business, including the Framingham Heart Study, Google, Twitter, IBM Watson, Netflix and, of course, predicting wine quality. Through these examples the course covers the following analytics methods: linear regression, logistic regression, trees, machine learning and AI. Perhaps Anne and Pete from our vineyard startup micro-enterprise will find some practical help in predicting wine quality using modelling. Below we provide some background to a set of statistical techniques that are both controversial and thought-provoking among the wine elite.
In March 1990, Orley Ashenfelter, a Princeton economics professor, claimed he could predict wine quality without tasting the wine. Ashenfelter used linear regression to reveal some of the nuances that can be overlooked when red wines are priced exclusively by nose and swirl. Red wines have been produced in the Bordeaux region of France in much the same way for hundreds of years, yet there are variations in quality and price from year to year that can often be quite significant. Orley noted that these quality differences had been depicted in the trade as a great mystery. In his thought-provoking and somewhat controversial paper, Orley demonstrated that the factors that drive fluctuations in wine vintage quality can be accounted for in a simple, quantitative, almost prosaic way. In brief, Orley showed that a simple statistical analysis could predict the quality of a vintage, and by extension its price, from the weather over its growing season. In the same paper, Orley showed how the aging of wine influences its price, and under what conditions it pays to purchase wines before they peak for drinking. He furnished an appraisal of this procedure and its level of success in predicting wine quality, and discussed the role this information has played in the evolution of the wine trade. Below, a useful introductory explanation from a 1992 ABC Good Morning America segment is provided.
Orley used linear regression to predict the price of wine, and he hoped that this technique would be more accurate than traditional methods. He described the task at hand as follows: "When a red Bordeaux wine is young it is astringent and most people will find it unpleasant to drink. As a wine ages it loses its astringency. Because Bordeaux wines taste better when they are older, there is an obvious incentive to store them until they have come of age. As a result, there is an active market for both younger and older wines. Traditionally, what has not been so obvious is exactly how good a wine will be when it matures. This ambiguity leaves room for speculation, and as a result, the price of the wine when it is first offered in its youth will often not match the price of the wine when it matures. The primary goal in this paper is to study how the price of mature wines may be predicted from data available when the grapes are picked, and then to explore the effect that this has on the initial and final prices of the wines. A secondary goal is to show how this straightforward hedonic method has now been used in many other grape growing regions to quantify the role the weather plays in determining the quality of wine vintages."
To understand how this task may be achieved, we follow the techniques and code explained by MIT OpenCourseWare. I have made a few minor edits and will demonstrate how the R code below can be loaded into RStudio Cloud, and how the data can be uploaded to the same platform.
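As a minimal sketch of the data-loading step, assuming the training data has been uploaded to the RStudio Cloud project as a file named wine.csv (the file name used in the MIT OpenCourseWare materials), it can be read into a data frame as shown below; the test set is read in the same way further down the script.
# Read in the training data (assumes the uploaded file is named wine.csv)
wine = read.csv("wine.csv")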
#https://ocw.mit.edu/courses/sloan-school-of-management/15-071-the-analytics-edge-spring-2017/linear-regression/the-statistical-sommelier-an-introduction-to-linear-regression/video-4-linear-regression-in-r/
str(wine)
summary(wine)
# Linear Regression (one variable)
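# AGST is the average temperature over the growing season for each vintage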
model1 = lm(Price ~ AGST, data=wine)
summary(model1)
# Sum of Squared Errors
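# model1$residuals holds observed minus fitted Price values; squaring and summing them gives the SSE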
model1$residuals
SSE = sum(model1$residuals^2)
SSE
# Linear Regression (two variables)
model2 = lm(Price ~ AGST + HarvestRain, data=wine)
summary(model2)
# Sum of Squared Errors
SSE = sum(model2$residuals^2)
SSE
# Linear Regression (all variables)
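# WinterRain and HarvestRain are rainfall measures, Age is the age of the vintage, and FrancePop is France's population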
model3 = lm(Price ~ AGST + HarvestRain + WinterRain + Age + FrancePop, data=wine)
summary(model3)
# Sum of Squared Errors
SSE = sum(model3$residuals^2)
SSE
# VIDEO 5
# Remove FrancePop
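# FrancePop is not significant in model3, so it is dropped and the model refitted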
model4 = lm(Price ~ AGST + HarvestRain + WinterRain + Age, data=wine)
summary(model4)
# VIDEO 6
# Correlations
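# Check pairwise correlations among the independent variables to look for multicollinearity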
cor(wine$WinterRain, wine$Price)
cor(wine$Age, wine$FrancePop)
cor(wine)
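# Age and FrancePop are very highly correlated, which explains why both looked insignificant in model3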
# Remove Age and FrancePop
model5 = lm(Price ~ AGST + HarvestRain + WinterRain, data=wine)
summary(model5)
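# Dropping Age as well weakens the model, so model4 (which keeps Age) is used for prediction below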
# VIDEO 7
# Read in test set
#wine_test = read.csv("wine_test.csv")  # uncomment if wine_test has not already been loaded
str(wine_test)
# Make test set predictions
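# predict() applies the fitted model4 coefficients to each row of the test data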
predictTest = predict(model4, newdata=wine_test)
predictTest
# Compute R-squared
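# Out-of-sample R-squared: SSE measures prediction error on the test set, SST the error of a baseline that always predicts the mean training-set price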
SSE = sum((wine_test$Price - predictTest)^2)
SST = sum((wine_test$Price - mean(wine$Price))^2)
1 - SSE/SST