It demonstrates knowledge and understanding of key concepts related to preparing and conducting a statistical analysis, the ability to apply that knowledge and understanding to prepare and conduct key aspects of a statistical analysis, and the ability to communicate the findings from the application of my knowledge and understanding in an acceptable manner.
I will use the dataset on student exam performance to illustrate this part of the portfolio. The dataset is available for download from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/student+performance), where you will find its description. It is also used in the following paper, which also provides a dataset descriptor:
P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th FUture BUsiness TEChnology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7. (https://repositorium.sdum.uminho.pt/bitstream/1822/8024/1/student.pdf)
In statistics, a population is the entire set from which a sample can be drawn. It can refer to a group of people, events, or any other observations; it is the complete group about which you want to draw a conclusion.
Take the example of the coronavirus pandemic: the whole world was affected and an enormous number of cases was confirmed. In this case, all the people confirmed to be infected with the virus represent the population.
A sample is a subset that can describe the characteristics of the whole population.
In the example above, researchers cannot examine every confirmed patient, so to evaluate their findings they randomly draw a sample of those patients, and that sample is used to describe the whole population.
A parameter is a measure that describes a population; it can be any summary measure for a particular variable. A statistic is the corresponding measure calculated from a sample.
For example, suppose the average age of everyone who died because of this virus is 53 years. If we instead take a sample and calculate the average, the result may differ, say 45 years. Here 53 years is the parameter and 45 years is the statistic.
The difference between the statistic and the parameter in the above example can be due to either of the following:
Sampling error: the sample average may be pulled away from the population mean simply because of which cases happen to be included, for example deaths linked to poor eating habits or other diseases, so the sample mean deviates from the expected population mean.
Selection bias: a classic example is electoral poll predictions. If a poll is run online, only the particular group of people who have internet access can respond, which biases the prediction.
Simple Random Sample: every member of the population has an equal chance of being included in the sample.
Stratified Random Sample: the population is divided into groups (strata) and members are selected from each group.
Cluster Random Sample: the population is divided into clusters, a number of clusters are selected at random, and the members of those clusters form the sample.
Convenience Sample: the sample is made up of whichever members are most readily available from various sources.
Voluntary Response Sample: the population is invited to take part and only a particular group joins, based on their voluntary response.
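To make the first two schemes concrete, here is a minimal R sketch; the data frame population and its region column are invented for illustration.
#Hypothetical population with a grouping column
set.seed(42)
population <- data.frame(id = 1:1000, region = sample(c("north", "south", "east"), 1000, replace = TRUE))
#Simple random sample: every member has an equal chance of selection
srs <- population[sample(nrow(population), 100), ]
#Stratified random sample: draw a fixed number of members within each group
strata <- split(population, population$region)
stratified <- do.call(rbind, lapply(strata, function(s) s[sample(nrow(s), 30), ]))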
The normal distribution is a probability distribution that shows how the data are placed around the mean: values near the mean are more frequent than values far from the mean. Its graphical representation is the familiar bell curve.
It is a distribution in which all the measures of central tendency are equal and the bell curve is symmetric about the centre. The total area under the curve is 1.
A critical value is a point on the distribution that is compared with the test statistic to decide whether to reject the null hypothesis.
If the test statistic is more extreme than the critical value, we can reject the null hypothesis.
The p-value quantifies how compatible the observed evidence is with the null hypothesis. It always lies between 0 and 1, and the smaller it is, the stronger the evidence against the null hypothesis.
Example: a p-value of 0.045 means that, if the null hypothesis were true, there would be only a 4.5 percent probability of obtaining results at least as extreme as those observed, which is rather low. A p-value of 0.8, on the other hand, indicates that results like these would be quite likely even if nothing were going on in the experiment. As a result, the lower the p-value, the more significant the results.
Whenever we report a parameter estimate it is good practice to accompany it with a confidence interval, a range within which the parameter is likely to lie.
Example: in the coronavirus example above, the average age of the deceased was 45 years in the sample, while for the whole population it was 53 years. It is better practice to report the statistic as a range, say [43, 53], rather than as the single number 45.
The width of the interval depends entirely on how variable the data are: the less the variation, the narrower the interval, and vice versa.
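As a small R sketch of this idea (the ages vector below is invented, standing in for a sample of ages):
#95% confidence interval for a mean from a hypothetical sample of ages
set.seed(1)
ages <- rnorm(50, mean = 45, sd = 10)
t.test(ages)$conf.int   #a more variable sample would give a wider interval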
Quantitative Variable: numerical data on which the usual mathematical operations can be performed. Quantitative variables are further classified into two types:
Continuous: variables that can take any value within a range. Example: distance, age, and so on.
Discrete: variables whose values can be counted. Example: the number of employees on a project, and so on.
Categorical Variable: these variables represent categories of some kind. They are sometimes coded as numbers, but those numbers stand for particular classes. They are further classified into three subtypes:
Binary: as the name suggests, the variable has only two options. Example: yes/no, head/tail, and so on.
Nominal: the categories have no particular order. Example: gender, caste, etc.
Ordinal: the categories have some kind of order. Example: feedback ratings, grades, etc.
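In R these variable types map naturally onto classes; a minimal sketch with invented values:
#Quantitative -> numeric; binary/nominal -> factor; ordinal -> ordered factor
age <- c(15, 16, 17)
internet <- factor(c("yes", "no", "yes"))
grade <- factor(c("C", "A", "B"), levels = c("C", "B", "A"), ordered = TRUE)
str(list(age = age, internet = internet, grade = grade))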
Understanding missing values is essential for managing data effectively. If missing values are not handled properly, we may end up drawing incorrect conclusions from the data: because of improper handling, the results obtained will differ from those we would get if the missing values were present.
We can either remove or replace them. If the number of cases with missing values is small, we may drop or ignore those cases from the analysis. A common rule of thumb in statistics is that if such cases make up less than 5% of the sample, we can drop them.
Missing values are of three sorts:
Missing Completely at Random (MCAR)
Missing at Random (MAR)
Missing Not at Random (MNAR)
MCAR exists when the missing values are randomly spread across all observations. This can be checked by dividing the data into two groups, one containing the observations with missing values and the other containing the observations without them. After dividing the data, the most common check, a t-test, is carried out to see whether there is any difference between the two groups.
MAR exists when the missing values are not randomly spread across all observations but are concentrated within one or more sub-samples. This pattern is more common than MCAR.
MNAR: if the missingness does not fit the MCAR or MAR patterns, it falls into the category of missing not at random (MNAR). Such missing data are not ignorable.
Deletion : Listwise Deletion, Pairwise Deletion
Single Imputation Methods : Single value Imputation, Regression Imputation
Cold deck : cold deck imputation is a less common approach that relies on external sources, such as values from a prior survey. It imputes missing values (the receivers) by using similar reported values from donors in the prior survey.
Multiple Imputation Methods : Expectation-Maximisation Algorithm, Maximum Likelihood (ML)
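A minimal sketch of the simpler strategies above (listwise deletion and single mean imputation), on an invented data frame dat:
#Invented data frame with missing values
dat <- data.frame(score = c(10, NA, 14, 9, NA), hours = c(2, 3, NA, 1, 4))
colSums(is.na(dat))        #how many values are missing per column
complete <- na.omit(dat)   #listwise deletion
dat$score[is.na(dat$score)] <- mean(dat$score, na.rm = TRUE)   #single (mean) imputation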
Hypothesis testing is the technique by which we use samples to learn about characteristics of a population; it is an orderly method for testing claims.
The null and alternative hypotheses are two mutually exclusive statements about a population. Hypothesis testing uses sample data to decide whether or not to reject the null hypothesis.
Null Hypothesis (H0)
It states that a population parameter (the mean, mode, standard deviation, and so on) is equal to a hypothesised value. It is essentially a claim based on previous studies or specialised knowledge.
Alternative Hypothesis (H1)
It states that the population parameter is greater than, less than, or not equal to the value hypothesised in H0. It is what you believe to be true or hope to demonstrate to be true.
The dimension of our dataset
> dim(df)
[1] 382 53
The structure of our data frame is as follows:
> str(df)
tibble [382 × 53] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ school : chr [1:382] "GP" "GP" "GP" "GP" ...
$ sex : chr [1:382] "F" "F" "F" "F" ...
$ age : num [1:382] 15 15 15 15 15 15 15 15 15 15 ...
$ address : chr [1:382] "R" "R" "R" "R" ...
$ famsize : chr [1:382] "GT3" "GT3" "GT3" "GT3" ...
$ Pstatus : chr [1:382] "T" "T" "T" "T" ...
$ Medu : num [1:382] 1 1 2 2 3 3 3 2 3 3 ...
$ Fedu : num [1:382] 1 1 2 4 3 4 4 2 1 3 ...
$ Mjob : chr [1:382] "at_home" "other" "at_home" "services" ...
$ Fjob : chr [1:382] "other" "other" "other" "health" ...
$ reason : chr [1:382] "home" "reputation" "reputation" "course" ...
$ nursery : chr [1:382] "yes" "no" "yes" "yes" ...
$ internet : chr [1:382] "yes" "yes" "no" "yes" ...
$ guardian.m : chr [1:382] "mother" "mother" "mother" "mother" ...
$ traveltime.m: num [1:382] 2 1 1 1 2 1 2 2 2 1 ...
$ studytime.m : num [1:382] 4 2 1 3 3 3 3 2 4 4 ...
$ failures.m : num [1:382] 1 2 0 0 2 0 2 0 0 0 ...
$ schoolsup.m : chr [1:382] "yes" "yes" "yes" "yes" ...
$ famsup.m : chr [1:382] "yes" "yes" "yes" "yes" ...
$ paid.m : chr [1:382] "yes" "no" "yes" "yes" ...
$ activities.m: chr [1:382] "yes" "no" "yes" "yes" ...
$ higher.m : chr [1:382] "yes" "yes" "yes" "yes" ...
$ romantic.m : chr [1:382] "no" "yes" "no" "no" ...
$ famrel.m : num [1:382] 3 3 4 4 4 4 4 4 4 4 ...
$ freetime.m : num [1:382] 1 3 3 3 2 3 2 1 4 3 ...
$ goout.m : num [1:382] 2 4 1 2 1 2 2 3 2 3 ...
$ Dalc.m : num [1:382] 1 2 1 1 2 1 2 1 2 1 ...
$ Walc.m : num [1:382] 1 4 1 1 3 1 2 3 3 1 ...
$ health.m : num [1:382] 1 5 2 5 3 5 5 4 3 4 ...
$ absences.m : num [1:382] 2 2 8 2 8 2 0 2 12 10 ...
$ mG1 : num [1:382] 7 8 14 10 10 12 12 8 16 10 ...
$ mG2 : num [1:382] 10 6 13 9 10 12 0 9 16 11 ...
$ mG3 : num [1:382] 10 5 13 8 10 11 0 8 16 11 ...
$ guardian.p : chr [1:382] "mother" "mother" "mother" "mother" ...
$ traveltime.p: num [1:382] 2 1 1 1 2 1 2 2 2 1 ...
$ studytime.p : num [1:382] 4 2 1 3 3 3 3 2 4 4 ...
$ failures.p : num [1:382] 0 0 0 0 0 0 0 0 0 0 ...
$ schoolsup.p : chr [1:382] "yes" "yes" "yes" "yes" ...
$ famsup.p : chr [1:382] "yes" "yes" "yes" "yes" ...
$ paid.p : chr [1:382] "yes" "no" "no" "no" ...
$ activities.p: chr [1:382] "yes" "no" "yes" "yes" ...
$ higher.p : chr [1:382] "yes" "yes" "yes" "yes" ...
$ romantic.p : chr [1:382] "no" "yes" "no" "no" ...
$ famrel.p : num [1:382] 3 3 4 4 4 4 4 4 4 4 ...
$ freetime.p : num [1:382] 1 3 3 3 2 3 2 1 4 3 ...
$ goout.p : num [1:382] 2 4 1 2 1 2 2 3 2 3 ...
$ Dalc.p : num [1:382] 1 2 1 1 2 1 2 1 2 1 ...
$ Walc.p : num [1:382] 1 4 1 1 3 1 2 3 3 1 ...
$ health.p : num [1:382] 1 5 2 5 3 5 5 4 3 4 ...
$ absences.p : num [1:382] 4 2 8 2 2 2 0 0 6 10 ...
$ pG1 : num [1:382] 13 13 14 10 13 11 10 11 15 10 ...
$ pG2 : num [1:382] 13 11 13 11 13 12 11 10 15 10 ...
$ pG3 : num [1:382] 13 11 12 10 13 12 12 11 15 10 ...
Missing values :
In our dataset there were no missing values; we checked for NA values.
Checking for any 0 values in our data frame
We can see that some of the columns contain zeros. On inspection, some variables can legitimately take the value zero, such as failures and absences, but for the marks it was not clear that a zero was a genuine grade. So, for all the values in the grade columns, we replaced the zeros with the mean of that particular grade.
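A sketch of how this check and replacement could be done in R; the exact code used may have differed, and here the zeros are replaced by the mean of the non-zero values of each grade column:
#Count zero values in every column
colSums(df == 0)
#Replace zeros in the grade columns with the mean of the non-zero values of that column
grade_cols <- c("mG1", "mG2", "mG3", "pG1", "pG2", "pG3")
for (g in grade_cols) {
  zero <- df[[g]] == 0
  df[[g]][zero] <- mean(df[[g]][!zero])
}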
Descriptive statistics are summary coefficients that describe the dataset at hand; with these measures we can gain insight into the data. There are two general categories of descriptive statistics: measures of central tendency and measures of variability (or dispersion). Measures of central tendency describe the data around the central point of the dataset; the common terms here are the mean (the average of the dataset), the median (the central value), and the mode (the most frequently occurring value in the data). Measures of variability describe the spread of the dataset, that is, how dispersed the data are; the common terms here are the variance and the standard deviation.
Let's understand this with a small example. If a dataset has a mean of 75, the mean alone tells us nothing about the spread of the data: individual values could lie far below or far above it. This is why we also need a measure such as the variance.
To visualise the measures of central tendency, a box plot is a convenient choice, as it displays these measures at a glance.
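For example, a minimal box plot of the first-period maths grade (assuming ggplot2 is available):
#Box plot showing median, quartiles and outliers for mG1
library(ggplot2)
ggplot(df, aes(x = "", y = mG1)) +
  geom_boxplot() +
  labs(x = "", y = "Maths grade, period 1 (mG1)")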
Normality in statistics describes how well the data fit the normal distribution curve. Many statistical procedures assume that the data are approximately normally distributed. A normal distribution means that the dataset is symmetric about its mean. Normality plays a major role in inferring the measures of distribution.
The area under the curve of the distribution gives an idea of the spread. In a normal distribution we use the empirical rule for the standard deviation: 68.27% of the data lie between -1 sd and +1 sd, 95.45% between -2 sd and +2 sd, and 99.73% between -3 sd and +3 sd. If the data do not follow this pattern, we can apply a transformation such as a power transform to move the data towards normality, or fall back on non-parametric methods.
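These percentages can be verified directly from the standard normal cumulative distribution function in R:
#Empirical rule from the standard normal CDF
pnorm(1) - pnorm(-1)   #about 0.6827: proportion within +/- 1 sd
pnorm(2) - pnorm(-2)   #about 0.9545: proportion within +/- 2 sd
pnorm(3) - pnorm(-3)   #about 0.9973: proportion within +/- 3 sd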
Here we can introduce the z-score (also known as the standard or normal score). With this value we can find the probability of observing data within a given region under the curve.
Steps:
Select the variable of your interest
Assess normality using a Q-Q plot and a histogram.
Generate summary statistics to check how far from normal the data are by computing standardised scores for skew and kurtosis.
Calculate the percentage of standardised scores within the acceptable range:
95% within +/- 1.96, and 99.7% within +/- 3.29 for larger samples.
#Create the histogram for mG1 and allocate it to a variable (requires ggplot2)
library(ggplot2)
gg <- ggplot(df, aes(x=mG1))
gg <- gg + labs(x="Marks scored maths 1")
gg <- gg + geom_histogram(binwidth=1, colour="black", aes(y=..density.., fill=..count..))
gg <- gg + scale_fill_gradient("Count", low="#DCDCDC", high="#7C7C7C")
#adding a normal curve
#use stat_function to compute a normal density for each value of mG1
#pass the mean and standard deviation
#use the na.rm parameter to say how missing values are handled
gg <- gg + stat_function(fun=dnorm, color="red", args=list(mean=mean(df$mG1, na.rm=TRUE), sd=sd(df$mG1, na.rm=TRUE)))
#to display the graph request the contents of the variable be shown
gg
qqnorm(df$mG1)
qqline(df$mG1, col=2) #show a line on the plot
pastecs::stat.desc(df$mG1, basic=F)
tpskew<-semTools::skew(df$mG1)
tpkurt<-semTools::kurtosis(df$mG1)
tpskew[1]/tpskew[2]
tpkurt[1]/tpkurt[2]
zmG1<- abs(scale(df$mG1))
FSA::perc(as.numeric(zmG1), 1.96, "gt")
FSA::perc(as.numeric(zmG1), 3.29, "gt")
gs <- ggplot(df, aes(x=mG2))
gs <- gs + labs(x="Maths grade, period 2 (mG2)")
gs <- gs + geom_histogram(binwidth=2, colour="black", aes(y=..density.., fill=..count..))
gs <- gs + scale_fill_gradient("Count", low="#DCDCDC", high="#7C7C7C")
gs <- gs + stat_function(fun=dnorm, color="red",args=list(mean=mean(df$mG2, na.rm=TRUE), sd=sd(df$mG2, na.rm=TRUE)))
gs
qqnorm(df$mG2)
qqline(df$mG2, col=2) #show a line on the plot
pastecs::stat.desc(df$mG2, basic=F)
tpskew<-semTools::skew(df$mG2)
tpkurt<-semTools::kurtosis(df$mG2)
tpskew[1]/tpskew[2]
tpkurt[1]/tpkurt[2]
The standardised skew and kurtosis values should fall within the range of +/- 2. If they do not, we calculate the proportion of standardised scores that fall outside the acceptable range.
Since at least 95% of the standardised scores lie within +/- 1.96, we can treat the data as approximately normally distributed.
Type 1 and Type 2 errors describe the ways a hypothesis test can go wrong, since the test is carried out without knowing whether the null hypothesis is actually true or false. Whenever hypothesis testing is performed, there is a chance of making one of these errors.
A Type 1 error means we reject the null hypothesis when it should not have been rejected. Example: there is no fire, but we pull the alarm anyway (a false alarm).
A Type 2 error means we fail to reject a null hypothesis that should have been rejected. Example: there is a fire, but we do not pull the fire alarm.
Statistical power is the probability that a test will detect an effect when one actually exists. In statistical hypothesis testing we set up a null hypothesis against which the test is done, run the experiment, and then either reject it or fail to reject it.
For instance, the null hypothesis for Pearson's correlation test is that there is no correlation between the two variables, and the null hypothesis for Student's t-test is that there is no difference between the means of the two populations.
Parametric test: used if the data can be considered close enough to a normal distribution.
Non-parametric test: used if the data cannot be considered close enough to a normal distribution.
Correlation describes the relationship between two variables: it measures the extent to which one variable varies with the other. The strength of this relationship is given by Pearson's correlation coefficient, whose value lies in the range -1 to +1. A value of -1 indicates a perfect negative correlation, 0 indicates no correlation, and +1 indicates a perfect positive correlation. This value gives evidence of correlation but says nothing about causality.
The formula for Pearson's coefficient is:
r = Σ(xᵢ − x̄)(yᵢ − ȳ) / ( √Σ(xᵢ − x̄)² · √Σ(yᵢ − ȳ)² )
Pearson Correlation for Parametric test
Spearman Rank Order Correlation or Kendall’s Tau for Non-Parametric Test
#CORRELATION TEST - PEARSON as data is normally distributed
stats::cor.test(df$mG1, df$mG2, method='pearson')
The relationship between the first-period grade and the second-period grade for maths was investigated using a Pearson correlation. A strong positive correlation was found (r = 0.88, n = 380, p < .001). Hence, there is evidence to reject the null hypothesis of no correlation.
Various tests are performed on data to identify statistical measures. A list of the tests are provided below:
This test assesses goodness of fit; in other words, it lets us quantify the gap between the expected and observed frequencies. A definition of this test would be:
“ The test helps us determine the probability of observed frequency of events given an expected frequency”
gmodels::CrossTable(df$Mjob, df$paid.m, fisher = TRUE, chisq = TRUE, expected = TRUE, sresid = TRUE, format = "SPSS")
A Chi-Square test for independence (with Yates' Continuity Correction) indicated a significant association between mothers' jobs and extra paid classes within the maths subject, χ2(1, n=4) = 9.19, p = .05, phi = 0.05.
For group = 2
Parametric: Independent T Test (One Tailed, Two Tailed)
Non Parametric: Mann-Whitney U test
The t-test is used for testing a hypothesis based on sample means when the population standard deviation is unknown (so a z-value cannot be used). The different t-tests are:
One-sample t-test: tests the null hypothesis that an unknown population mean differs from a specified value.
Independent two-sample t-test: the means of two independent samples are compared and the statistical difference between them is assessed.
Dependent (paired-sample) t-test: the two samples are dependent on each other. Two cases arise here: either the same sample is tested twice, or two samples are naturally paired together.
After the test, a t-score is calculated.
As the grouping variable has two groups - Yes/No - an independent (two-tailed) t-test is used.
Hypothesis
H0: There is no difference in Maths (mG1) scores between students who enrolled for activities and those who did not.
HA: There is a difference in Maths (mG1) scores between students who enrolled for activities and those who did not.
The t-test will tell us whether there is a statistically significant difference between the mean scores of those who enrolled for activities and those who did not.
#INDEPENDENT T TEST
mg1 = subset(df, select = c("mG1"))
by(df$mG1,df$activities.m,median)
by(df$mG1,df$activities.m,IQR)
psych::describeBy(df$mG1, df$activities.m, mat=TRUE)
car::leveneTest(mG1 ~ activities.m, data=df)
stats::t.test(mG1~activities.m,var.equal=TRUE,data=df)
#No statistically significant difference was found
res <- stats::t.test(mG1~activities.m,var.equal=TRUE,data=df)
#Calculate Cohen's d arithmetically
effcd=round((2*res$statistic)/sqrt(res$parameter),2)
#Using function from effectsize package
effectsize::t_to_d(t = res$statistic, res$parameter)
#Eta squared calculation
effes=round((res$statistic*res$statistic)/((res$statistic*res$statistic)+(res$parameter)),3)
effes
An independent-samples t-test was conducted to compare Maths first-period grade between students who did and did not take part in extracurricular activities. No significant difference in the scores for Maths first-period grade (M=10.86, SD=3.34) was found between students who had extracurricular activities (M=11.09, SD=3.22) and students who did not have extracurricular activities (M=10.60, SD=3.47), t(380)= -1.43, p = 0.15.
Cohen's d also indicated a very small effect size (0.07).
For group > 2
Parametric: ANOVA
Non Parametric: Kruskal-Wallis H test
As the grouping variable has more than two groups - Father/Mother/Other - a one-way ANOVA is used.
#Conduct ANOVA using the userfriendlyscience test oneway
#In this case we can use Tukey as the post-hoc test option since variances in the groups are equal
#If variances were not equal we would use Games-Howell
userfriendlyscience::oneway(as.factor(df$guardian.m),y=df$mG1,posthoc='Tukey')
res1<-userfriendlyscience::oneway(as.factor(df$guardian.m),y=df$mG1,posthoc='Tukey')
#use the aov function - same as one way but makes it easier to access values for reporting
res2<-stats::aov(mG1~ guardian.m, data = df)
res2
#Get the F statistic into a variable to make reporting easier
fstat<-summary(res2)[[1]][["F value"]][[1]]
fstat
#Get the p value into a variable to make reporting easier
aovpvalue<-summary(res2)[[1]][["Pr(>F)"]][[1]]
aovpvalue
#Calculate effect
aoveta<-sjstats::eta_sq(res2)[2]
aoveta
A one-way between-groups analysis of variance (ANOVA) was conducted to explore the impact of the guardian on maths first-period grade. Participants were divided into three groups: father, mother, and other. No statistically significant difference in the scores was found (M=11.24, SD=3.22 (father); M=10.83, SD=3.37 (mother); M=9.12, SD=3.11 (other)), F(2,379)=2.77, p=0.064.
Differential Effect
A differential effect arises when the relationship between the predicted variable (y) and a predictor (x) varies across groups within the population.
Interaction Effect
An interaction effect arises when the effect of one independent variable on the dependent variable depends on the level of another independent variable. Such effects are examined using techniques like regression and ANOVA.
Dummy Variables
Many of the factors we want to use for prediction are categorical, for example sex, religion, or income class. Since such values have no numeric scale, it makes no sense to talk about the effect of a one-unit increase in them as it would for a continuous variable. Instead, we encode each category as a dummy (indicator) variable, turning the categorical variable into a set of numeric variables.
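As a minimal sketch of how this works in R, converting a categorical column such as Mjob to a factor and inspecting the design matrix shows the indicator coding (the exact columns shown depend on the reference level):
#Dummy (indicator) coding: one 0/1 column per non-reference category
df$Mjob <- as.factor(df$Mjob)
head(model.matrix(~ Mjob, data = df))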
The main idea behind this model is to fit the best line describing the relationship between a continuous dependent variable and a predictor. The assumption is that the relationship can be described by a straight line of the form y = mx + c, where every unit change in x produces a fixed change in the dependent variable y. Here x is the predictor variable and y is the predicted variable. The correlation coefficient helps us build evidence of a relationship between the variables.
To conduct a simple regression, we first need to identify some measures from our dataset.
We then create a histogram, followed by a Q-Q plot, together with the mean and standard deviation. Before we fit a simple regression model it is worth investigating the nature of the relationship between the response variable and the predictor variable: look at the scatterplot and calculate the correlation between the two variables.
There are some basic assumptions to test this model:
1) The errors have constant variance (homoscedasticity).
2) The independent variables must not exhibit multicollinearity.
3) There must be a (linear) relationship between the dependent and independent variables.
To assess the model, Pearson's correlation coefficient is squared. An R2 of 1 means the independent variable explains 100% of the variance in the dependent variable; conversely, 0 means it explains none.
The important statistics involved are:
1) F-Statistic
Indicates whether the model as a whole predicts the dependent variable; its statistical significance is the significance of the model.
2) Regression coefficients (Beta values)
Measure the strength and direction of the relationships between the independent variables and the dependent variable.
3) Significance scores for the regression coefficients tell us whether the contribution of each variable is statistically significant.
4) R2 statistic or Adjusted R2 Statistic
Measures the model’s overall predictive power and the extent to which the variables explain the variation found in the dependent variable.
#We will allocate the histogram to a variable to allow us to manipulate it
gg <- ggplot(df, aes(x=mG1))
gg <- gg+ggtitle("Histogram for mG1")
#Change the label of the x axis
gg <- gg + labs(x="Marks scored maths 1")
#manage binwidth and colours
gg <- gg + geom_histogram(binwidth=0.1, colour="black", aes(y=..density.., fill=..count..))
gg <- gg + scale_fill_gradient("Count", low="#DCDCDC", high="#7C7C7C")
#adding a normal curve
#use stat_function to compute a normalised score for each value of mG1
#pass the mean and standard deviation
#use the na.rm parameter to say how missing values are handled
gg <- gg + stat_function(fun=dnorm, color="red",args=list(mean=mean(df$mG1, na.rm=TRUE), sd=sd(df$mG1, na.rm=TRUE)))
gg
#Create a qqplot
qqnorm(df$mG1, main="QQ Plot for mG1")
qqline(df$mG1, col=2) #show a line on the plot
#get summary statistics
mean(df$mG1)
sd(df$mG1)
length(df$mG1)
tpskew<-semTools::skew(df$mG1)
tpkurt<-semTools::kurtosis(df$mG1)
tpskew[1]/tpskew[2]
tpkurt[1]/tpkurt[2]
zmg1<- abs(scale(df$mG1))
FSA::perc(as.numeric(zmg1), 1.96, "gt")
FSA::perc(as.numeric(zmg1), 3.29, "gt")
#We will allocate the histogram to a variable to allow us to manipulate it
gg <- ggplot(df, aes(x=mG2))
gg <- gg+ggtitle("Histogram for mG2")
#Change the label of the x axis
gg <- gg + labs(x="marks for maths2")
#manage binwidth and colours
gg <- gg + geom_histogram(binwidth=0.1, colour="black", aes(y=..density.., fill=..count..))
gg <- gg + scale_fill_gradient("Count", low="#DCDCDC", high="#7C7C7C")
#adding a normal curve
#use stat_function to compute a normalised score for each value of mG2
#pass the mean and standard deviation
#use the na.rm parameter to say how missing values are handled
gg <- gg + stat_function(fun=dnorm, color="red",args=list(mean=mean(df$mG2, na.rm=TRUE), sd=sd(df$mG2, na.rm=TRUE)))
#to display the graph request the contents of the variable be shown
gg
#Create a qqplot
qqnorm(df$mG2, main="QQ Plot for mG2")
qqline(df$mG2, col=2) #show a line on the plot
mean(df$mG2)
sd(df$mG2)
length(df$mG2)
tpskew<-semTools::skew(df$mG2)
tpkurt<-semTools::kurtosis(df$mG2)
tpskew[1]/tpskew[2]
tpkurt[1]/tpkurt[2]
zmg2<- abs(scale(df$mG2))
FSA::perc(as.numeric(zmg2), 1.96, "gt")
FSA::perc(as.numeric(zmg2), 3.29, "gt")
#Explore relationship between mG1 and mG2
#Simple scatterplot of mG1 and mG2
#aes(x,y)
scatter <- ggplot2::ggplot(df, aes(x = mG1, y = mG2))
#Add a regression line
scatter + geom_point() + geom_smooth(method = "lm", colour = "Red", se = F) + labs(x = "Students score maths1", y = "Students score maths2")
Variables of interest: mG2 (dependent variable), mG1 (independent variable)
Hypothesis
H0: There is no relationship between mG1 and mG2.
HA: There is a relationship between mG1 and mG2.
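The simple regression model behind the figures quoted below can be fitted with lm(); a minimal sketch, where the object name model1 is my own:
#Simple linear regression: predict second-period maths grade from first-period grade
model1 <- lm(mG2 ~ mG1, data = df)
summary(model1)   #gives the coefficients and the Multiple R-squared quoted below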
NOTE
Multiple R-squared: 0.7904 is the proportion of the variance in mG2 explained by the model; that is, mG1 explains about 79% of the variance in mG2.
Equation: mG2 = 1.68410 + 0.86567*mG1
This model is an extension of the linear regression model. We have a dependent variable (y) relying on several independent variables.
Ex: Net profit of a company may depend on the price, brand, sales of its products.
In multiple regression some of the independent variables might be correlated with each other, so the model can suffer from multicollinearity: it is then not possible to change one variable without also affecting another.
Multiple Linear Regression - mG2 predicted by mG1 including dummy variable for paid.m to investigate a differential effect
model2 <- lm(mG2 ~ mG1 + paid.m, data = df)
anova(model2)
summary(model2)   #coefficients used in the equation below
Equation: mG2 = 1.699 + 0.866*mG1 - 0.037*paid.m(yes)
Adding one more variable, internet, to the above model:
model3<-lm(df$mG2~df$mG1+df$paid.m+df$internet)
anova(model3)
summary(model3)
Equation: mG2 = 1.264 + 0.862*mG1 - 0.099*paid.m(yes) + 0.602*internet(yes)
This model does not predict a numeric value; instead it estimates the probability that an observation belongs to a particular class or category. We classify the samples based on their probability of belonging to a category. When the output can only be yes or no, the possible outcomes are {0, 1} and the model is known as binary logistic regression. Suppose we have two input variables x1 and x2, and consider a point in the space (a, b), where a is the value of x1 and b is the value of x2.
Our linear predictor becomes: b0 + b1*a + b2*b.
There are three possibilities now:
The point (a, b) can lie on the positive side of the decision boundary, on the negative side, or on the boundary itself.
To map the outcome of this relationship onto a probability we work with the odds (and the odds ratios derived from them), defined by:
odds(X) = P(X) / (1 − P(X))
The baseline (null) model is the comparator against which any fitted model is judged. Its predictions are made purely from whichever category occurred most often in our dataset.
We aim to improve on this prediction by including our additional variables. An omnibus test of the model is used to check that the new model (with the explanatory variables included) is an improvement over the baseline model. It uses a chi-square test to check whether there is a significant difference between the baseline (null) model and the fitted model.
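One way to carry out this omnibus comparison in R is a likelihood-ratio (chi-square) test of the fitted model against an intercept-only null model. A minimal sketch, assuming the df data frame used throughout; the object names fitmodel and nullmodel are my own, and fitmodel simply mirrors the model fitted in the next section:
#Sketch: omnibus (chi-square) test of a fitted logistic model against the baseline model
df$paid.m <- as.factor(df$paid.m)   #the binomial response must be a factor (or 0/1)
fitmodel <- glm(paid.m ~ famsup.m + Fjob, data = df, family = binomial(link = logit))
nullmodel <- glm(paid.m ~ 1, data = df, family = binomial(link = logit))
anova(nullmodel, fitmodel, test = "Chisq")   #a significant chi-square indicates an improvement over the baseline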
Assumptions made for this model
1) Requires the dependent variable to be binary (or nominal with multiple categories, for multinomial logistic regression).
2) The observations must be independent of each other.
3) There must be little or no multicollinearity among the independent variables.
4) Typically requires a large sample size.
Variables of interest: paid.m (dependent variable), famsup.m (independent variable), Fjob (independent variable)
Hypothesis
H0: There is no significant prediction of a student opting for paid extra classes from the father's job and family support.
HA: There is a significant prediction of a student opting for paid extra classes from the father's job and family support.
df$Fjob <- as.factor(df$Fjob)
df$Mjob <- as.factor(df$Mjob)
df$famsup.m <- as.factor(df$famsup.m)
df$paid.m <- as.factor(df$paid.m)
df$schoolsup.m <- as.factor(df$schoolsup.m)
logmodel1 <- glm(paid.m ~ famsup.m+Fjob, data = df, na.action = na.exclude, family = binomial(link=logit))
#Full summary of the model
summary(logmodel1)
Adding one more variable, Mjob, to the above model:
logmodel2 <- glm(paid.m ~ famsup.m+Fjob+Mjob, data = df, na.action = na.exclude, family = binomial(link=logit))
#Full summary of the model
summary(logmodel2)
## odds ratios
cbind(Estimate=round(coef(logmodel2),4), OR=round(exp(coef(logmodel2)),4))
regclass::confusion_matrix(logmodel2)
#Check the assumption of linearity of independent variables and log odds using a Hosmer-Lemeshow test, if this is not statistically significant we are ok
generalhoslem::logitgof(df$paid.m, fitted(logmodel2))
The Hosmer-Lemeshow goodness-of-fit statistic revealed no problems with the assumption of linearity between the independent variables and the log odds of the model.
vifmodel<-car::vif(logmodel2)#You can ignore the warning messages, GVIF^(1/(2*Df)) is the value of interest
vifmodel
1/vifmodel
The tolerance and variance inflation factor measures were within acceptable limits (VIF < 2.5), Tarling (2008).
A binomial logistic regression analysis was conducted with whether a student took extra paid classes within the maths course as the outcome variable, and family educational support, father's job, and mother's job as predictors. The data met the assumption of independent observations. Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5) as outlined in Tarling (2008). The Hosmer-Lemeshow goodness-of-fit statistic did not indicate any issues with the assumption of linearity between the independent variables and the log odds of the model (χ2(n=7) = 3.278, p = 0.85).
Dimensionality reduction is introduced to remove unwanted multicollinearity between variables. The goal is to arrive at the right number of variables, which can be achieved by dimensional analysis. Some variables have no direct scale of measurement; these are called latent variables, and manifest (measured) variables are required to work with them.
This is where principal component analysis (PCA) comes into the picture. This method is used when working with a set of correlated manifest variables that are believed to measure the same underlying component.
We start off with some variables. After the process of dimensionality reduction, we end up with a smaller group of variables which are a representation of the original dataset.
The procedure would be to first look for correlations in the Correlation matrix.
In factor analysis and PCA, we look to reduce the R-matrix (correlation matrix) into a smaller set of uncorrelated dimensions. In the end we may see clusters of variables that load either positively or negatively on a component. PCA reduces a set of highly correlated manifest variables to a set of uncorrelated components,
arranged in descending order of importance, and we can then replace our manifest variables with these components.
The key assumption for dimensionality reduction is that a linear relationship exists between the latent variables and the manifest variables. The procedure to follow is:
Generate a correlation matrix: the test variables must correlate with each other; we must avoid multicollinearity and singularity and eliminate variables that cause concern.
Check whether the data are suitable: this can be done with Bartlett's test of sphericity, which tests that your variables are correlated by comparing the correlation matrix with a matrix of zero correlations; a significant p-value indicates that sufficient correlations exist. We then generate related statistics such as the Kaiser-Meyer-Olkin measure of sampling adequacy.
The last step is the dimension reduction itself: we need to check which components are relevant and must be kept, so we calculate the eigenvalues; Kaiser recommended keeping components with eigenvalues greater than 1.
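A sketch of these steps in R using the psych package; the subset of numeric columns chosen here is an assumption for illustration (the full analysis reported below used 26 items):
#Sketch of the dimension-reduction steps using the psych package
items <- df[, c("mG1", "mG2", "mG3", "pG1", "pG2", "pG3", "Dalc.m", "Walc.m", "freetime.m", "goout.m", "health.m")]
corMat <- cor(items)                             #step 1: correlation matrix
psych::cortest.bartlett(corMat, n = nrow(items)) #step 2: Bartlett's test of sphericity
psych::KMO(corMat)                               #Kaiser-Meyer-Olkin measure of sampling adequacy
pc <- psych::principal(items, nfactors = 4, rotate = "varimax")  #step 3: extract components
pc$values                                        #eigenvalues; keep components with values > 1 (Kaiser)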
NOTE: To understand these outputs, please refer to the R script attached in the R script section.
LINEAR REGRESSION REPORT:
A multiple regression was carried out to investigate whether first-period grade (maths), extra paid classes within maths, and internet access at home could significantly predict second-period grade for maths. Examination of the histogram, normal P-P plot of standardised residuals, and the scatterplot of the dependent variable (exam score) against the standardised residuals showed that some outliers existed. However, examination of the standardised residuals showed that none could be considered to have undue influence (95% within the limits of -1.96 to +1.96 and none with Cook's distance > 1, as outlined in Field (2013)). The scatterplot of standardised residuals showed that the data met the assumptions of homogeneity of variance and linearity. Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5) as outlined in Tarling (2008). The data also meet the assumption of non-zero variances of the predictors.
LOGISTIC REGRESSION REPORT:
A binomial logistic regression analysis was conducted with whether a student took extra paid classes within the maths course as the outcome variable, and family educational support, father's job, and mother's job as predictors. The data met the assumption of independent observations. Examination for multicollinearity showed that the tolerance and variance inflation factor measures were within acceptable levels (tolerance > 0.4, VIF < 2.5) as outlined in Tarling (2008). The Hosmer-Lemeshow goodness-of-fit statistic did not indicate any issues with the assumption of linearity between the independent variables and the log odds of the model (χ2(n=7) = 3.278, p = 0.85).
DIMENSION REDUCTION REPORT:
Principal component analysis (PCA) was conducted on the 26 items with orthogonal rotation (varimax). Bartlett's test of sphericity, χ2(325) = 15019.39, p < .001, indicated that correlations between items were sufficiently large for PCA. An initial analysis was run to obtain eigenvalues for each component in the data. Four components had eigenvalues over Kaiser's criterion of 1 and in combination explained 36.46% of the variance. The scree plot was slightly ambiguous and showed inflections that would justify retaining either 2 or 4 factors. Given the large sample size, and the convergence of the scree plot and Kaiser's criterion on four components, four components were retained in the final analysis. Component 1 represents grades in maths and Portuguese, component 2 alcohol consumption (workday and weekend), component 3 free time after school and going out with friends, and component 4 current health status. The grades in maths and Portuguese and the alcohol consumption items each had Cronbach's α = .85; the free time after school and going out with friends items had Cronbach's α = .74; and current health status had Cronbach's α = .99.
CHI SQUARE:
A Chi-Square test for independence (with Yates' Continuity Correction) indicated a significant association between mothers' jobs and extra paid classes within the maths subject, χ2(1, n=4) = 9.19, p = .05, phi = 0.05.
INDEPENDENT T TEST:
An independent-samples t-test was conducted to compare Maths first-period grade between students who did and did not take part in extracurricular activities. No significant difference in the scores for Maths first-period grade (M=10.86, SD=3.34) was found between respondents who had extracurricular activities (M=11.09, SD=3.22) and students who did not have extracurricular activities (M=10.60, SD=3.47), t(380)= -1.43, p = 0.15. Cohen's d also indicated a very small effect size (0.07).
ONE WAY ANOVA:
A one-way between-groups analysis of variance (ANOVA) was conducted to explore the impact of the guardian on maths first-period grade. Participants were divided into three groups: father, mother, and other. No statistically significant difference in the scores was found (M=11.24, SD=3.22 (father); M=10.83, SD=3.37 (mother); M=9.12, SD=3.11 (other)), F(2,379)=2.77, p=0.064.
CORRELATION:
The relationship between the first-period grade and the second-period grade for maths was investigated using a Pearson correlation. A strong positive correlation was found (r = 0.88, n = 380, p < .001). Hence, there is evidence to reject the null hypothesis of no correlation.