r

R

http://www.noitulove.ch/2008/07/03/learning-r-part-i/

http://rattle.togaware.com Rattle: Gnome R Data Mining

http://www.statmethods.net/index.html

ROOT http://root.cern.ch/drupal/

> x=c(1,2,3)

> sum(x)

[1] 6

> mean(x)

[1] 2

> length(x)

[1] 3

> ls()

[1] "weight" "x"

> objects()

[1] "weight" "x"

> 1/x

[1] 1.0000000 0.5000000 0.3333333

> y=2*x

> y

[1] 2 4 6

> plot(x,y)

> x=1:25

> y=sqrt(x)

> plot(x,y)

# add line to current graph

> lines(x,x^2)

> lines(x,log(x))

abline

curve
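The two bare keywords above presumably refer to abline() and curve(), both of which can draw on top of an existing plot. A minimal sketch; the line positions, slope and colour are arbitrary values chosen for illustration:

```r
# Overlay reference lines and a smooth curve on an existing scatter plot
x <- 1:25
y <- sqrt(x)
plot(x, y)                         # scatter plot of sqrt(x)
abline(h = 3, lty = 2)             # horizontal dashed line at y = 3
abline(a = 1, b = 0.15)            # straight line with intercept 1 and slope 0.15
curve(log(x), from = 1, to = 25,   # add log(x) as a curve instead of opening a new plot
      add = TRUE, col = "blue")
```

abline() takes either h=/v= for horizontal and vertical lines or a=/b= for intercept and slope; curve() evaluates an expression in x over the given range.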

edit data

> data.entry(x) # Pops up spreadsheet to edit data

> x = de(x) # same spreadsheet, but changes are kept only through the assignment

> x = edit(x) # uses editor to edit x.

barplot

> x=c(1,2,2,3,3,3,4,4,4,4)

> barplot(table(x)) #barplot of frequencies

> barplot(table(x)/length(x))

> x=seq(0,4,by=.1) # create the x values

> plot(x,x^2,type="l") # type="l" to make line

> curve(x^2,0,4)

Normal Distribution

f(x) = 1/(sqrt(2 pi) sigma) e^-((x - mu)^2/(2 sigma^2))

dnorm(x, mean=0, sd=1, log = FALSE)

pnorm(q, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)

qnorm(p, mean=0, sd=1, lower.tail = TRUE, log.p = FALSE)

rnorm(n, mean=0, sd=1)

x,q vector of quantiles.

p vector of probabilities.

n number of observations. If length(n) > 1, the length is taken to be the number required.

mean vector of means.

sd vector of standard deviations.

log, log.p logical; if TRUE, probabilities p are given as log(p).

lower.tail logical; if TRUE (default), probabilities are P[X <= x], otherwise, P[X > x].
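To see how the four functions relate, the following sketch codes the density formula by hand and compares it with dnorm(), then round-trips a value through pnorm() and qnorm(). The function name normal_density, the seed and the chosen numbers are illustrative assumptions, not part of the R documentation quoted above:

```r
# Hand-coded normal density, following the formula f(x) given earlier
normal_density <- function(x, mu = 0, sigma = 1) {
  1 / (sqrt(2 * pi) * sigma) * exp(-(x - mu)^2 / (2 * sigma^2))
}

x <- seq(-4, 4, by = 0.5)
all.equal(normal_density(x, 1, 2), dnorm(x, mean = 1, sd = 2))  # TRUE

p     <- pnorm(1.96)                       # P[X <= 1.96], about 0.975
q     <- qnorm(p)                          # inverts pnorm and recovers 1.96
upper <- pnorm(1.96, lower.tail = FALSE)   # P[X > 1.96], i.e. 1 - p

set.seed(1)                                # arbitrary seed so the draw is reproducible
z <- rnorm(5, mean = 10, sd = 2)           # five random deviates
```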

x=seq(-4,4,0.1)

plot(x,dnorm(x),type="l")

or curve(dnorm(x),from=-4,to=4)

density and cumulative distribution on same graph

x=-10:10

plot(x, pnorm(x),type="l")

lines(x, dnorm(x),type="l")

> x=rnorm(100)

> hist(x,freq=F)

> curve(dnorm(x),add=T)

x=-10:10

plot(x, pnorm(x),type="l")

plot(x, dnorm(x),type="l")

Binomial distribution

> x=0:50

> dbinom(x,size=50,prob=0.33)

plot(x, dbinom(x,size=50,prob=0.33), type="h")

Functions are provided to evaluate the cumulative distribution function P(X <= x), the probability density function and the quantile function (given q, the smallest x such that P(X <= x) >= q), and to simulate from the distribution.

Prefix the name given here by 'd' for the density, 'p' for the CDF, 'q' for the quantile function and 'r' for simulation (random deviates). The first argument is x for dxxx, q for pxxx, p for qxxx and n for rxxx (except for rhyper and rwilcox, for which it is nn). The non-centrality parameter ncp is currently available in most, but not quite all, cases: see the on-line help for details.
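The d/p/q/r naming convention can be tried on the binomial distribution used earlier. This is only a sketch; the seed and the number of simulated draws are arbitrary choices:

```r
n <- 50; p <- 0.33
x <- 0:n

probs <- dbinom(x, size = n, prob = p)     # d: P(X = x)
cdf   <- pbinom(x, size = n, prob = p)     # p: P(X <= x)
all.equal(cumsum(probs), cdf)              # the CDF is the running sum of the probabilities

k <- qbinom(0.5, size = n, prob = p)       # q: smallest x with P(X <= x) >= 0.5

set.seed(7)                                # arbitrary seed for reproducibility
draws <- rbinom(1000, size = n, prob = p)  # r: simulated values
mean(draws)                                # should be close to n * p = 16.5
```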

The pxxx and qxxx functions all have logical arguments lower.tail and log.p and the dxxx ones have log. This allows, e.g., getting the cumulative (or “integrated”) hazard function, H(t) = - log(1 - F(t)), by

- pxxx(t, ..., lower.tail = FALSE, log.p = TRUE)

or more accurate log-likelihoods (by dxxx(..., log = TRUE)), directly.
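As a concrete check of the hazard identity, the exponential distribution has H(t) = rate * t, so both routes can be compared against a known answer. The rate and time points below are made-up values:

```r
rate <- 0.5                        # hypothetical rate parameter
tt   <- c(0.5, 1, 2, 10)           # hypothetical time points

# Cumulative hazard straight from the log of the upper tail
H <- -pexp(tt, rate = rate, lower.tail = FALSE, log.p = TRUE)

# The same quantity computed naively as -log(1 - F(t))
H_naive <- -log(1 - pexp(tt, rate = rate))

all.equal(H, rate * tt)            # TRUE for the exponential distribution

# Log-likelihood of a small sample, using log = TRUE instead of log(prod(...))
obs    <- c(0.3, 1.2, 2.5)         # made-up observations
loglik <- sum(dexp(obs, rate = rate, log = TRUE))
```

The naive form loses precision when F(t) is very close to 1, which is the case the lower.tail and log.p arguments are meant to handle.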

Reading from file

> HousePrice <- read.table("c:\\floor.txt")

> HousePrice

   Price Floor Area Rooms Age Cent.heat
01 52.00   111  830     5 6.2        no
02 54.75   128  710     5 7.5        no
03 57.50   101 1000     5 4.2        no
04 57.50   131  690     6 8.8        no
05 59.75    93  900     5 1.9       yes

> summary(HousePrice)

     Price           Floor            Area          Rooms          Age        Cent.heat
 Min.   :52.00   Min.   : 93.0   Min.   : 690   Min.   :5.0   Min.   :1.90   no :4
 1st Qu.:54.75   1st Qu.:101.0   1st Qu.: 710   1st Qu.:5.0   1st Qu.:4.20   yes:1
 Median :57.50   Median :111.0   Median : 830   Median :5.0   Median :6.20
 Mean   :56.30   Mean   :112.8   Mean   : 826   Mean   :5.2   Mean   :5.72
 3rd Qu.:57.50   3rd Qu.:128.0   3rd Qu.: 900   3rd Qu.:5.0   3rd Qu.:7.50
 Max.   :59.75   Max.   :131.0   Max.   :1000   Max.   :6.0   Max.   :8.80

> area=HousePrice$Area #access to individual column

> mean(area)

[1] 826

> sd(area)

[1] 130.1153

> price=HousePrice$Price

> cor(area,price)

[1] 0.2982011

Linear Regression

> l1 = lm(price ~ area)

> summary(l1)

Call:
lm(formula = price ~ area)

Residuals:
        1         2         3         4         5
-4.327377 -0.756054  0.009082  2.130833  2.943517

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 50.646559  10.550880   4.800   0.0172 *
area         0.006844   0.012649   0.541   0.6260
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.292 on 3 degrees of freedom
Multiple R-Squared: 0.08892, Adjusted R-squared: -0.2148
F-statistic: 0.2928 on 1 and 3 DF, p-value: 0.626

> l2 = lm(price ~ +I(sin(2*pi*area)) +I(cos(2*pi*area)))

> summary(l2)
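Because floor.txt itself is not available here, the sketch below rebuilds the Price and Area columns from the five rows printed above and refits the simple regression. The new area of 800 used for prediction is a made-up value:

```r
# Rebuild the two columns used in the regression from the printed table
HousePrice <- data.frame(
  Price = c(52.00, 54.75, 57.50, 57.50, 59.75),
  Area  = c(830, 710, 1000, 690, 900)
)

fit <- lm(Price ~ Area, data = HousePrice)
coef(fit)                                        # intercept and slope
resid(fit)                                       # residuals, as listed in summary()
predict(fit, newdata = data.frame(Area = 800))   # fitted price for a hypothetical area

plot(HousePrice$Area, HousePrice$Price)
abline(fit)                                      # draw the fitted line over the points
```

Passing data = to lm() keeps the column names in the model, which is what lets predict() accept new data.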

> library()

Packages in library 'C:/PROGRA~1/R/R-2.6.0pat/library':

base The R Base Package

boot Bootstrap R (S-Plus) Functions (Canty)

class Functions for Classification

cluster Cluster Analysis Extended Rousseeuw et al.

codetools Code Analysis Tools for R

datasets The R Datasets Package

foreign Read Data Stored by Minitab, S, SAS, SPSS, Stata, Systat, dBase, ...

graphics The R Graphics Package

grDevices The R Graphics Devices and Support for Colours and Fonts

grid The Grid Graphics Package

KernSmooth Functions for kernel smoothing for Wand & Jones (1995)

lattice Lattice Graphics

MASS Main Package of Venables and Ripley's MASS

methods Formal Methods and Classes

mgcv GAMs with GCV smoothness estimation and GAMMs by REML/PQL

nlme Linear and Nonlinear Mixed Effects Models

nnet Feed-forward Neural Networks and Multinomial Log-Linear Models

rcompgen Completion generator for R

rpart Recursive Partitioning

spatial Functions for Kriging and Point Pattern Analysis

splines Regression Spline Functions and Classes

stats The R Stats Package

stats4 Statistical Functions using S4 Classes

survival Survival analysis, including penalised likelihood.

tcltk Tcl/Tk Interface

tools Tools for Package Development

utils The R Utils Package

rle, filter, which

seq1=c(1,2,3,2)

seq2=sort(seq1)

rle(seq2)

Run Length Encoding

lengths: int [1:3] 1 2 1

values : num [1:3] 1 2 3

nums = c(12,9,8,14,7,16,3,2,9)

nums[nums>10]

[1] 12 14 16

> which(nums>10)

[1] 1 4 6
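A few related calls, as a sketch. The heading mentions "filter" without an example; it is assumed here to mean selecting elements by a condition, for which base R also offers Filter():

```r
seq1 <- c(1, 2, 3, 2)
r <- rle(seq1)                      # unsorted input: no adjacent repeats, so four runs of length 1
r$lengths
inverse.rle(rle(sort(seq1)))        # rebuilds the sorted vector 1 2 2 3 from its runs

nums <- c(12, 9, 8, 14, 7, 16, 3, 2, 9)
which.max(nums)                     # position of the largest value: 6
Filter(function(v) v > 10, nums)    # same result as nums[nums > 10]
```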

Matrix

> x = matrix(1:12,4,3)

> x

     [,1] [,2] [,3]
[1,]    1    5    9
[2,]    2    6   10
[3,]    3    7   11
[4,]    4    8   12

#mean of each row of a matrix

rowMeans(x)

[1] 5 6 7 8

#mean of each row of a matrix

apply(x,1,mean) #1 means row here

[1] 5 6 7 8

> x[,1] #first column

[1] 1 2 3 4

> x[,c(3,1)] #3rd and 1st columns

     [,1] [,2]
[1,]    9    1
[2,]   10    2
[3,]   11    3
[4,]   12    4

> x[2,] #second row

[1] 2 6 10

> x[10]

[1] 10

sum(x[1,]) # sum of first row

> apply(x, 1, sum) # by row

[1] 15 18 21 24

> apply(x, 2, sum) # by column

[1] 10 26 42
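For sums and means, base R also has dedicated functions that give the same numbers as the apply() calls. A short sketch on the same 4 x 3 matrix:

```r
x <- matrix(1:12, 4, 3)

rowSums(x)    # 15 18 21 24, same values as apply(x, 1, sum)
colSums(x)    # 10 26 42,    same values as apply(x, 2, sum)
colMeans(x)   # 2.5 6.5 10.5, same values as apply(x, 2, mean)
```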

Autoregressive models http://home.ubalt.edu/ntsbarsh/Business-stat/stat-data/Forecast.htm

The autoregressive model is one of a group of linear prediction formulas that attempt to predict an output of a system based on the previous outputs and inputs, such as:

Y(t) = b₁ + b₂Y(t-1) + b₃X(t-1) + e_t,

where X(t-1) and Y(t-1) are the actual value (inputs) and the forecast (outputs), respectively.

A model that depends only on the previous outputs of the system is called an autoregressive (AR) model, a model that depends only on the inputs to the system is called a moving average (MA) model, and a model based on both inputs and outputs is an autoregressive-moving-average (ARMA) model. By definition, the AR model has only poles while the MA model has only zeros. Deriving the AR model involves estimating its coefficients using the method of least squares.

Autoregressive processes, as their name implies, regress on themselves. If X(t) is the observation made at time t, then the autoregressive model of order p, AR(p), satisfies the equation:

X(t) = F₀ + F₁X(t-1) + F₂X(t-2) + F₃X(t-3) + . . . + F_pX(t-p) + e_t,

where e_t is a White-Noise series.
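A minimal sketch of the p = 1 case: simulate X(t) = F₀ + F₁X(t-1) + e_t and recover the coefficients by least squares, regressing each value on its predecessor. The values of F₀ and F₁, the series length and the seed are all assumptions made for illustration, and this is not the code from the mre.R file mentioned below:

```r
set.seed(42)                  # arbitrary seed for reproducibility
n   <- 1000                   # length of the simulated series
f0  <- 2                      # hypothetical constant F0
phi <- 0.7                    # hypothetical AR(1) coefficient F1

e <- rnorm(n)                 # white-noise series e_t
x <- numeric(n)
x[1] <- f0 / (1 - phi)        # start at the stationary mean
for (i in 2:n) {
  x[i] <- f0 + phi * x[i - 1] + e[i]
}

# Least-squares fit: regress X(t) on X(t-1)
fit <- lm(x[-1] ~ x[-n])
coef(fit)                     # estimates of F0 and F1
```

With a series this long the slope estimate should land near 0.7; base R's arima(x, order = c(1, 0, 0)) fits the same model by maximum likelihood.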

R code example for AR(1) from here http://jblevins.org/computing/r/mle/ local file mre.R

Plotting and sorting example:

tzsize <-read.table("c:\\michael\\tz_size.txt", header=TRUE)

sorted= tzsize[order(tzsize$tz),]

x = sorted$size

names(x) = sorted$tz

barplot(x)

mycolors=c("red","blue","green","brown")

barplot(x,col=mycolors)

> t=read.table("table_with_2_columns.txt", sep="|")

> plot(t$V1,t$V2/1000000000 , main=" title here")

Pairing barplot using as.matrix http://www.statmethods.net/graphs/bar.html

If the argument of barplot is a matrix, beside=TRUE draws grouped bars and beside=FALSE (the default) draws stacked bars.
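A small sketch of the difference, using a made-up 2 x 3 matrix of counts (the row and column names are hypothetical):

```r
# Each column is one group on the x axis; each row is one series within the group
counts <- matrix(c(3, 5, 2, 4, 6, 1), nrow = 2,
                 dimnames = list(c("a", "b"), c("g1", "g2", "g3")))

mids_grouped <- barplot(counts, beside = TRUE,  legend.text = TRUE)  # bars side by side
mids_stacked <- barplot(counts, beside = FALSE, legend.text = TRUE)  # bars stacked per column
```

barplot() returns the bar midpoints, a matrix when beside=TRUE and a vector otherwise, which is handy for placing labels with text().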

http://www.harding.edu/fmccown/r/

t <- read.table("tz.txt", header=TRUE)

barplot(as.matrix(rbind(t$X._size,t$X._count)), main="TimeZones", ylab="Total",
        beside=TRUE, col=rainbow(2), names.arg=t$tz)

legend("topleft", c("%Size","%Records"), cex=0.6, bty="n", fill=rainbow(2))

Stacked Bar Example

barplot(as.matrix(cbind(t$X._size,t$X._count)))

http://www.packtpub.com/article/customizing-graphics-creating-bar-chart-scatterplot-r

http://onertipaday.blogspot.com/2007/05/make-many-barplot-into-one-plot.html

http://learnr.wordpress.com/

http://www.statmethods.net/graphs/bar.html

http://www.r-tutor.com/ http://www.harding.edu/fmccown/r/

http://stotastic.com/wordpress/2010/04/case-shiller/

Lattice

library("lattice")

p <- barchart((1:10)^2 ~ 1:10, horizontal=FALSE, ylim=c(0,120),
              panel=function(...) {
                args <- list(...)
                panel.text(args$x, args$y+2, args$y)   # label each bar with its value
                panel.barchart(...)                    # draw the bars
              })

print(p)

MyData <- as.data.frame(Titanic)

library(lattice)

barchart(Freq ~ Survived | Age * Sex, groups = Class, data = MyData,
         auto.key = list(points = FALSE, rectangles = TRUE, space = "right",
                         title = "Class", border = TRUE),
         xlab = "Survived", ylim = c(0, 800))