08 Statistics for Single Variable
Frequency Distribution for Categorical Variables
Load the sample data. We will use the diamonds data from the ggplot2 package.
> diamonds = ggplot2::diamonds
Look at the type of the "cut" column.
Frequency of each cut.
Relative frequency (proportion) of each cut.
You can also find the frequency of each category using the summary function.
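The commands for the steps above are not shown, so here is a minimal sketch of one way to run them in base R. The variable names cut_counts and cut_props are illustrative, and it assumes the ggplot2 package is installed.

```r
diamonds <- ggplot2::diamonds

# Type of the "cut" column (an ordered factor)
class(diamonds$cut)

# Frequency of each cut
cut_counts <- table(diamonds$cut)
cut_counts

# Relative frequency (proportion) of each cut; the values sum to 1
cut_props <- prop.table(cut_counts)
round(cut_props, 3)

# summary() on a factor returns the same counts as table()
summary(diamonds$cut)
```

prop.table simply divides each count by the total, so it gives proportions of a categorical variable rather than a density in the continuous sense.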
Descriptive Statistics for Quantitative Variables
Find the five-number summary (plus the mean) of a numeric column using the summary function.
Get more statistics on numeric data using the psych package.
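As a sketch of the two steps above, the price column is used here as the numeric variable; that choice is an assumption based on the t-test later in this section. It also assumes the ggplot2 and psych packages are installed.

```r
diamonds <- ggplot2::diamonds

# Min, quartiles, median, mean and max of price
summary(diamonds$price)

# describe() from psych adds n, sd, trimmed mean, mad, skew, kurtosis and se
price_stats <- psych::describe(diamonds$price)
price_stats
```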
T-test - Hypothesis Testing
> t.test(diamonds$price, alternative = "greater", mu = 3900)
Descriptive Statistics for Qualitative Variables
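Chi-square test: the statistic ranges from 0 to infinity. A value of 0 means the observed counts match the expected category proportions exactly; larger values indicate a worse fit.
One-dimensional goodness-of-fit test: it measures how well the observed distribution of a categorical variable matches a set of expected proportions. Compare the chi-square values for the following 2 scenarios.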
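The two scenarios are not included in the text, so the sketch below uses two hypothetical ones on the cut column: equal proportions for all cuts, and a made-up set of proportions (close_p) chosen to lie near the observed ones. Both sets of proportions are assumptions for illustration only. It assumes ggplot2 is installed.

```r
diamonds <- ggplot2::diamonds
observed <- table(diamonds$cut)

# Scenario 1 (assumed): every cut is equally likely
uniform_p <- rep(1 / length(observed), length(observed))
uniform_fit <- chisq.test(observed, p = uniform_p)

# Scenario 2 (assumed): proportions close to what the data shows
# Order follows the factor levels: Fair, Good, Very Good, Premium, Ideal
close_p <- c(0.03, 0.09, 0.22, 0.26, 0.40)
close_fit <- chisq.test(observed, p = close_p)

# A smaller statistic means a better fit to the expected proportions
uniform_fit$statistic
close_fit$statistic
```

The statistic for the second scenario should come out much smaller than for the first, because its expected proportions are far closer to the observed counts.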
A robust statistic is resistant to errors caused by deviations from its assumptions (for example, normality), such as a few extreme observations in the data.
The median is a robust measure of central tendency, while the mean is not. The median has a breakdown point of 50%, while the mean has a breakdown point of 0% (a single large observation can throw it off).
The median absolute deviation and interquartile range are robust measures of statistical dispersion, while the standard deviation and range are not.
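To make the breakdown point concrete, here is a small sketch on a made-up vector (x and x_bad are illustrative, not part of the original data). Corrupting a single observation moves the mean a long way but leaves the median where it was.

```r
x <- c(2, 4, 5, 7, 9)

# Replace the largest value with a wildly wrong one
x_bad <- replace(x, 5, 9000)

c(mean = mean(x), median = median(x))
c(mean = mean(x_bad), median = median(x_bad))
```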
Robust statistical analysis packages in R: https://cran.r-project.org/web/views/Robust.html
For illustration, let's use the built-in rivers dataset.
Look at the five-number summary (plus the mean), as printed by summary(rivers).
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  135.0   310.0   425.0   591.2   680.0  3710.0
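The figures above can be reproduced with base R alone. As a sketch, fivenum gives Tukey's five-number summary without the mean, and the last line computes the gap between mean and median (mean_median_gap is an illustrative name).

```r
# Min, 1st quartile, median, mean, 3rd quartile, max
summary(rivers)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(rivers)
five

# A positive gap suggests a long right tail
mean_median_gap <- mean(rivers) - median(rivers)
mean_median_gap
```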
The mean is well above the median, which suggests that there are some outliers at the high end of the values. To verify, draw a boxplot.
> boxplot(rivers, ylab = "Length of river", main = "Boxplot of Lengths of Major North American Rivers")
The above boxplot shows the outliers.
Now draw a histogram to view the same outliers.
> hist(rivers, xlab = "Length of river", main = "Histogram of Lengths of Major North American Rivers")
You can list the outliers by using the boxplot.stats function.
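A minimal sketch of that call follows; the variable names stats and outliers are illustrative. boxplot.stats returns the values that fall beyond the whiskers in its out component.

```r
stats <- boxplot.stats(rivers)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
stats$stats

# Points beyond the whiskers, i.e. the ones the boxplot draws individually
outliers <- stats$out
sort(outliers)
length(outliers)
```

By default the whiskers extend to the most extreme data point within 1.5 times the hinge spread of the box, controlled by the coef argument.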
Measure of Central Tendency:
Mean: non-robust, because even a single outlier can shift its value.
Median: robust, because its value is resistant to outliers.
Trimmed mean: you can make the mean more robust by dropping a fraction of the most extreme values from each end before averaging.
> mean(rivers, trim = 0.05)
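As a sketch, the three measures above can be compared side by side on rivers (the name central is illustrative). With trim = 0.05, 5% of the observations are removed from each end before averaging.

```r
central <- c(
  mean         = mean(rivers),
  trimmed_mean = mean(rivers, trim = 0.05),
  median       = median(rivers)
)
round(central, 1)
```

For this right-skewed data the trimmed mean lands between the median and the plain mean.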
Measure of Dispersion:
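Standard Deviation: non-robust; it is the square root of the average squared deviation from the mean. Like the mean, it is affected by outliers.
Median Absolute Deviation: robust; it is the median of the absolute deviations from the median.
Range: non-robust; it is determined entirely by the minimum and maximum.
Interquartile Range: robust; it is the difference between the third and first quartiles.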
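The four dispersion measures above can be computed on rivers with base R functions, as in this sketch (the name dispersion is illustrative).

```r
dispersion <- c(
  sd    = sd(rivers),            # non-robust
  mad   = mad(rivers),           # robust
  range = diff(range(rivers)),   # non-robust: max minus min
  iqr   = IQR(rivers)            # robust: Q3 minus Q1
)
round(dispersion, 1)
```

Note that mad multiplies the raw median absolute deviation by 1.4826 by default, so that it is comparable to the standard deviation for normally distributed data; pass constant = 1 to get the unscaled value.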