R: Statistical Analysis and Visual Display
Introduction
Red Hen's Edge and Edge2 Search Engines and her command-line tools (peck and family) produce output in comma-separated files designed to be imported straight into R.
A syllabus for learning to use the statistical software package R on Red Hen data is available at https://docs.google.com/document/d/1tP3j3ftw_d_iqdpLNnl5M53AhaHFcdz0QpHR4BvprP4/edit
Resources
- R, the main free software statistics package
- littler -- for scripting R, see examples
- Wordfish -- extract political positions (written in R; used by Tobias Konitzer 2014-12)
- julia -- a fast new statistics language, see the D3 visualizations
- Open Datasets in .csv form that you can use for practice
- https://github.com/awesomedata/awesome-public-datasets
- https://www.kaggle.com/datasets?sortBy=updated&group=featured&search=tag%3A%27linguistics%27
- https://www.kaggle.com/mohansacharya/graduate-admissions
- https://data.cityofchicago.org/Transportation/Red-Light-Camera-Violations/spqx-js37/data can be downloaded as CSV by pressing the export button
- https://datawrapper.dwcdn.net/cZMp8/4/
- https://catalog.data.gov/dataset/age-adjusted-death-rates-for-the-top-10-leading-causes-of-death-united-states-2013
- https://data.cityofnewyork.us/City-Government/Citywide-Payroll-Data-Fiscal-Year-/k397-673e
https://catalog.data.gov/dataset/animal-care-and-control-adopted-animals
- https://catalog.data.gov/dataset?res_format=CSV
- https://data.cityofnewyork.us/Education/2014-2015-CoLocation-Reporting/phqd-xav8
- https://github.com/awesomedata/awesome-public-datasets/blob/master/Datasets/titanic.csv.zip
https://www.kaggle.com/uciml/iris
- https://data.cdc.gov/api/views/xkb8-kh2a/rows.csv?accessType=DOWNLOAD (provisional counts for drug overdose deaths based on a current flow of mortality)
- https://www.kaggle.com/residentmario/ramen-ratings/version/1
- https://catalog.data.gov/dataset/lottery-mega-millions-winning-numbers-beginning-2002
Collaboration resources
Packages and extensions
- tm -- text mining package, with lots of links
- Quick-and-dirty Sentiment Analysis in Ruby + R
- 2011-10-08 Visualizing GIS data with R and Open Street Map
- OpenCPU -- R web applications
- Jeroen Ooms
Development framework
- RStudio -- R development framework
- shiny -- building web applications using R
- shiny examples
- shiny on github
Tutorials
- An Introduction to R
- UCLA ATS Resource page for R
- Using R for Introductory Statistics
- http://www.statmethods.net/management/operators.html
- http://cran.r-project.org/doc/manuals/R-data.html
- http://www.cyclismo.org/tutorial/R/input.html
- http://cran.r-project.org/bin/macosx/RMacOSX-FAQ.html#Interactive-mode
- http://msenux.redwoods.edu/math/R/simple.php
- http://addictedtor.free.fr/graphiques/
- http://www.r-project.org
- http://www.johnmyleswhite.com/notebook/2009/12/18/using-complex-numbers-in-r/
- http://www.gardenersown.co.uk/Education/Lectures/R/correl.htm#cor_step
- http://www.statmethods.net/graphs/index.html
- http://www.harding.edu/fmccown/r/
- http://msenux.redwoods.edu/math/R/boxplot.php
Using R
Start, help, and end
R # start R in Linux (installed on Roma)
?keyword # get help on a function (guessing is ok)
?plot(type) # get help on a parameter within a function
q() # quit
Install packages
Installed by root on cartago (available to all)
> install.packages("ggplot2")
> install.packages("histogram")
> install.packages("tm") -- the text mining library
> install.packages("stm") -- Estimation of the Structural Topic Model (topic modeling)
Import files
?read.csv # help
Set working directory (can also be set in /etc/R/Rprofile.site)
> setwd("/home/csa/Pattern2.6")
Read a csv file
df <- read.csv(inFile, header=TRUE, comment.char="#", stringsAsFactors=FALSE)
> eb<-read.csv("/Users/steen/tna/csv/Erin_Burnett.csv",header=TRUE)
> ebpositive = eb$Positive
> ebsubjective = eb$Subjective
> ng<-read.csv("/Users/steen/tna/csv/Nancy_Grace.csv",header=TRUE)
> ngpositive = ng$Positive
> ngsubjective = ng$Subjective
> List3 <- read.table("file://tmp/list3", sep=" ", header=TRUE) # space-separated table
> heat <-read.csv("/Users/steen/tna/csv/2014-07-22_Heatmap-sentiment.csv",header=TRUE)
> SMT1Positive = heat$SMT_01_positivity
> library(ggplot2)
> library(scales)
> heatdates <- as.Date(eb$Date)
In little r (from Dirk Edelbuettel)
$ r -e'dat <- read.csv("/tmp/david.csv"); \
res <- data.frame(mn1=mean(dat[,10]),sd1=sd(dat[,10]),mn2=mean(dat[,11]),sd2=sd(dat[,11])); print(res)' -
Remove objects
> rm(days,mydates,sdate3)
select variables v1, v2, v3
myvars <- c("v1", "v2", "v3")
newdata <- mydata[myvars]
another method
myvars <- paste("v", 1:3, sep="")
newdata <- mydata[myvars]
select 1st and 5th through 10th variables
newdata <- mydata[c(1,5:10)]
Select a column
> heat <- read.csv("/Users/steen/tna/csv/2014-07-22_Heatmap-sentiment.csv", header = TRUE, as.is = TRUE)
> heat20040603 <- subset(heat,Date <= "2004-06-03") # Works if you use as.is = TRUE
> FaF <- subset(heat,Show = "Fox_and_Friends") # Does not work
> library(sqldf)
> FaF <- read.csv.sql(heat, sql = 'select * from file where Show = "Fox_and_Friends"') # Fails
Numerical ranges
newdata <- subset(mydata, age >= 20 | age < 10, select=c(ID, Weight))
Select rows by value range and range of columns by order
newdata <- subset(mydata, sex=="m" & age > 25, select=weight:income)
CNN <- subset(heat, Network=="CNN", select=Date:SMT_01_positivity)
cnnebofhm<- subset(hm, Network=="CNN" & Show=="Erin_Burnett_Out_Front")
Show data
Getting information on a dataset
Show objects
> ls()
Show variables / headers
> names(heat)
[1] "Date" "Hour" "Country"
[4] "Network" "Show" "Caption_wc"
[7] "SMT_01_wc" "SMT_01_positivity" "SMT_01_subjectivity"
[10] "SMT_02_wc" "SMT_02_positivity" "SMT_02_subjectivity"
Show third column name
> colnames(heat)[3]
Data structure (useful for spotting inconsistencies in the data)
> str(heat)
'data.frame': 230893 obs. of 12 variables:
$ Date : Date, format: "2004-06-01" "2004-06-02" ...
$ Hour : int 1300 1300 1300 1300 1300 1300 1300 1300 1300 1300 ...
$ Country : Factor w/ 13 levels "AF","BE","CA",..: 13 13 13 13 13 13 13 13 13 13 ...
$ Network : Factor w/ 39 levels "AlJazeera","BBC",..: 39 39 39 39 39 39 39 39 39 39 ...
$ Show : Factor w/ 1873 levels "1600_Pennsylvania_Avenue",..: 442 442 442 442 442 442 442 ...
$ Caption_wc : int 5336 6946 8908 5366 7538 6582 7410 6261 6778 4122 ...
$ SMT_01_wc : int 331 467 636 354 532 427 430 448 474 249 ...
$ SMT_01_positivity : num 24.4 10.5 49.5 34.3 35.2 ...
$ SMT_01_subjectivity: num 138 202 273 174 219 ...
$ SMT_02_wc : int 1019 1364 1778 1032 1529 1291 1384 1237 1318 733 ...
$ SMT_02_positivity : num 13 -26.38 8.38 -1.25 2 ...
$ SMT_02_subjectivity: num 185 266 358 222 276 ...
Print first column
> heat[1]
Print second and fourth column
> heat[c(2,4)]
Print first 10 rows
> head(heat, n=10)
Print last 10 rows of second and fourth column
> tail(heat[c(2,4)], n=10)
(today <- Sys.Date())
format(today, "%d %b %Y") # with month as a word
(tenweeks <- seq(today, length.out=10, by="1 week")) # next ten weeks
weekdays(today)
months(tenweeks)
Convert a date column to dates
heat$Date <- as.Date(heat$Date , "%Y-%m-%d")
Show first date
> head(heat[1], n=1)
Date
1 2004-06-01
Show last date
> tail(heat[1], n=1)
Date
230893 2014-07-20
Show days between two dates
mydates <- as.Date(c("2007-06-22", "2004-02-13"))
days <- mydates[1] - mydates[2]
Basic functions
Rename a column
> colnames(heat)[3] <- "Network"
For loops
for (x in c(1:10)) print(sqrt(x))
for (x in c(0:400)) print(x^2))
History
history()
savehistory(file = "name")
loadhistory(file = "name")
Mean, var, sd
On the fly from a file
dat <- read.csv("smt2csv-02.py-output.csv", header=FALSE)
res <- data.frame(mn1=mean(dat[,10]),sd1=sd(dat[,10]),mn2=mean(dat[,11]),sd2=sd(dat[,11]))
On the fly from stdin, using little r
cat smt2csv-02.py-output.csv | r -e'dat <- read.csv(file("stdin")); \
res <- data.frame(mn1=mean(dat[,10]),sd1=sd(dat[,10]),mn2=mean(dat[,11]),sd2=sd(dat[,11])); print(res)' -
Shortened version from 2014-07-26 update:
cat smt2csv-02.py-output.csv |\ r -d -e'\
res <- data.frame(mn1=mean(dat[,10]),sd1=sd(dat[,10]),mn2=mean(dat[,11]),sd2=sd(dat[,11])); print(res)' -
Individually
> mean(ebpositive)
[1] 0.03487137
> var(ebpositive)
[1] 0.1058562
> sd(ebpositive)
[1] 0.3253556
> mean(ebsubjective)
[1] 0.2791462
> var(ebsubjective)
[1] 0.0977887
> sd(ebsubjective)
[1] 0.3127119
> mean(ngpositive)
[1] 0.02013443
> var(ngpositive)
[1] 0.1135885
> sd(ngpositive)
[1] 0.3370289
> mean(ngsubjective)
[1] 0.2875757
> var(ngsubjective)
[1] 0.09846148
> sd(ngsubjective)
[1] 0.3137857
Boxplot
> boxplot(eb$Positive, ng$Positive, col="lightblue")
> boxplot(eb$Subjective, ng$Subjective, col="red")
Histogram
> library(histogram)
> histogram(eb$Positive)
Scripting
Rscript
Rscript is included in R and is the default method of scripting. You can run Rscripts from the commandline or from a bash script.
littler
Little r is an alternative and more flexible scripting system for R written by Dirk Edelbuettel. With r, you can do things like read piped output; you can also run r commands straight inside bash scripts.
Obsolete style -- run a script silently or verbosely
R --vanilla --slave <delay.R # or use --save to keep history
R --vanilla --verbose <faculty-teaching.R # get some debugging information
Vectors
Define a vector and display the contents
x <- c(0:401)
x
Generate a vector with cumulative values from a vector or a matrix
Cumulative <- cumsum(x)
Cumulative <- cumsum(WordCount$V2)
Find vector names in a matrix
names(List3)
Echo the content of a vector in a matrix
List3$V1
Pair vectors from two matrices
TwoColumns <- paste(Record$V1,Delay$V1)
Set a default matrix
attach(List3)
Plot
- Graphics examples (very helpful)
Plot vectors
plot(x,sample(x)) # plot the vector against a randomized version of itself
plot(x,sample(x)^20,xlab="x",ylab="y",main="This is my first Rtwork")
plot(y=Delay$V1,x=Record$V1)
Plot the data
plot(soc$N, (soc$N * soc$Students),
ylab="Cumulative enrollment over the last four quarters",
xlab="Number of classes taught",
col.axis = "sky blue",
col.lab = "grey45",
pch = 19, # plotting character -- 19 is a circle
cex = 1, # or cex 1:10/5 if you want increasing sizes
col = green, # If you want green circles for data
# type = "n", # If you don't want to plot points yet
xlim = c(0,7), # Set the x-axis scale
ylim = c(0,550)) # It may be possible to use
Add a title, using legend colors green and blue
mtext(c("Ladder faculty teaching in ", "Comm", " and ", "Soc"),
cex = 1.5, # scale text size
col = c(1,4,1,3), # 0 white, 1 black, 2 red, 3 green, 4 blue
at = c(2,4.4,5.27,5.9), # clumsy solution -- must be manually adjusted
line = 2)
Add a subtitle
mtext(c("2008-2009 undergraduate classes, excluding sections and fiat lux"), line = 0.7)
Add labeled data points (offset left and right)
text ( (soc$N - 0.3), (soc$N * soc$Students), labels = soc$Name, col = "green" )
text ( (com$N + 0.3), (com$N * com$Students), labels = com$Name, col = "blue" )
Add points to an existing plot
points(soc$N, (soc$N * soc$Students))
Get yesterday's date
day <- Sys.Date() - 1
Format the date
day <- format(day, "%d %B %Y")
Add the formatted date to the title
title <- bquote(bold(paste("Postprocessing times on ", .(day))))
Admittedly a terrible syntax for adding a variable to a string
var <- bquote(paste("string", .(var)))
Plot a vector without x-axis labels and add the title string and color
plot(Delay$V1, xaxt="n", xlab="Time of recording", ylab="Delay in minutes",
main=title, col="blue")
Add the times into the x-axis lables
axis(1, at=1:(length(Delay$V1)), labels=Record$V1, tick=FALSE, col="red")
Cumulative word count with area graph
plot(cumsum(WordCount$V2), xaxt="n", xlab="Dates", ylab="Words", main=title,
pch = 19, cex = -1 , col = "green", type="h")
Histogram
hist(V1,ylab="Frequency",xlab="Delay in minutes", main=title)
Write
x <- matrix(1:10,ncol=5)
write(t(x)) -- writes x to a file called data (default file name)
write(x, "", sep = "\t") -- writes x tab-delimited to the screen (not to file)
write(x, "myfile", sep = "\t") -- writes x tab-delimited to the file myfile
Cool R ideas
MatchIt
- Seltzer 2011 R tutorial (Propensity Score Matching with MatchIt)
maps
- We can use R for geotagging -- ask on the r-sig-geo maillist
- 2011-10-08 Visualizing GIS data with R and Open Street Map
- Geographic maps in R
tag cloud
access Amazon, NYT, Google Trends
flash output
- R graphics device for swf (flash) output -- still alpha, monitor test suite
audio search
- How shazam identifies a song from a noisy ten-second clip -- this may be a method to identify a video clip (or the video clip through an audio track)
R Documentation
Reference
Tutorials
- A great place to start, including Basic plots
- UCLA Statistics Collective Knowledge
- Simple charts
- How to plot multiple series (and label with dates)
- Learning R A systematic series of ggplot2 lessons (ggplot2 is installed on roma)
Blogs
- Using R blog with usage ideas
- Yihui's blog with animations and usage examples
- Revolution R blog with release news
- A script example
Installing R
Many R components have been packaged for Debian, starting with r-base-core, but packages can also be downloaded directly from within R itself, and install into /usr/local/lib/R/site-library. Cran download site USA (CA 2) is http://cran.stat.ucla.edu.
Debian packages
Types of R packages in Debian:
r-base*
r-block*
r-cran*
r-doc*
r-mathlib
r-other*
Recommended for installation:
r-recommended
animation
install.packages("animation")
ggplot2
install.packages("ggplot2")
- ggplot2, next-gen R graphics (improves on ggplot, which maybe improves on gplots)
- Learning R -- a series of ggplot2 lessons