When RStudio came out with R Markdown, I thought it was the solution I had been looking for: a way to seamlessly present my code and notes/reflections far less awkwardly than adding notes as comments in the code, with code development, evolving analysis, report and lab notebook all rolled into one centralized file. However, I learned the hard way that having all of that in one R Markdown (Rmd) file is not the way to go given the heavy processes and big data that I work with. Here is what I learned about organizing my code:
Use an R Notebook (not R Markdown, though they share the same .Rmd extension) as the core skeleton that calls separate R scripts. The R Notebook plays the unifying role: it is the one place where you can see which R scripts are called and where R-generated plots and other files are saved.
Never include the following code directly in your R notebook, but in a separate R script file that you call as needed:
the more stable parts of your data analysis (e.g. data import and clean-up), though these evolve too as the analyses progress
heavy processes (e.g. importing large data, heavy computation -- this includes correlation pair plots)
plots that are too detailed to see in the final html/pdf anyway unless saved on their own (like correlation pair plots with lots of factors)
Here is my optimal organization for data analyses:
proj1_cfg_gen.R: R code with paths, files, file-naming conventions, plotting colors/font sizes/output sizes/themes, and other values that I want to access from different code files -- this streamlines things and keeps all plots consistent
proj1_fcns.R: my own functions
proj1_01_behav.Rmd: R notebook from which I call my external R code files and which I use as my analysis notebook (code + notes); e.g. near the beginning of my notebook, I have:
rm(list=ls())
whereami <- 'server'
switch(whereami,
       "server" = {
         cfg.path.base <- '/netapp/vol_dat/proj1'
         cfg.dir.scripts <- 'code'
       },
       "local" = {
         cfg.path.base <- '/Users/me/proj1'
         cfg.dir.scripts <- 'code'
       }
)
cfg.path.scripts <- file.path(cfg.path.base, cfg.dir.scripts)
source( file.path(cfg.path.scripts,'proj1_cfg_gen.R') )
load_whatdat <- 'raw'
switch(load_whatdat,
       'rdat' = {
         load( file.path(cfg.path.dat.behav,'proj1_01_behav_basic.RData') )
       },
       'raw' = {
         source( file.path(cfg.path.scripts, 'proj1_01a_import_data.R') )
         source( file.path(cfg.path.scripts, 'proj1_01b_calc_metrics.R') )
         save.image( file.path(cfg.path.dat.behav,'proj1_01_behav_basic.RData') )
         source( file.path(cfg.path.scripts, 'proj1_01c_plot_corrpairs.R') )
       }
)
proj1_01a_import_data.R: import and clean data -- this is where I create all the data structures used in the R notebook (try to avoid creating new data structures in the R notebook; it is also easier to see where you created which data structure if you put the data structure names as headings in the R script)
tip: RStudio's handy "document outline" (the outline button to the right of the "Source" button in the coding pane) works like a TOC. Create headings using R Markdown syntax (# for a level-1 heading, ## for a level-2 heading) -- but to make sure the heading appears in the outline, end every heading line with four #, i.e. ####
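For example, the top of proj1_01a_import_data.R might look like this (the file name and data-structure names are just placeholders to illustrate the tip):
# Import raw data ####
dat.raw <- read.csv(file.path(cfg.path.dat.behav, "proj1_raw.csv"))   # placeholder file
## dat.all: merge in subject info ####
dat.all <- merge(dat.raw, sub_info, by = "ID")                        # placeholder data structures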
proj1_01b_calc_metrics.R: some heavy processing that won't change (here, something I need as soon as I load the data, so it is sourced right after the import)
proj1_01c_plot_corrpairs.R: my big data-exploration plots are often mega correlation pair plots to get a feel for the distributions of and relationships between my factors (my favorite for functionality and the type of customizations I like being able to control is GGally::ggpairs() -- see what ggpairs() is capable of and examples of different correlogram packages on RPubs). These take long to render and, with the number of factors I usually have, I learned the hard way that it's best to create them separately and save them to file instead of having them directly in the R notebook (see the sketch after this list).
tip: Correlograms are often things you create as you go along during data exploration, so it's an iterative process of adding new data structures (where you've filtered/selected certain data as you see necessary the more you analyze your data) -- but to keep everything clean, keep the data-structure creation in your initial data clean-up R code.
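A minimal sketch of that last point -- build the correlogram in its own script (e.g. proj1_01c_plot_corrpairs.R) and write it straight to file (the column names and output size are placeholders):
gg.corr <- GGally::ggpairs(metrics.df[, c("age", "BMI", "metricX")])   # placeholder columns
png(file.path(cfg.path.dat.behav, "proj1_01_corrpairs.png"),
    width = 30, height = 30, units = "cm", res = 300)
print(gg.corr)
dev.off()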
You've gotta love O'Reilly Publishers and their R Cookbook (full pdf of 1st edition :-o).
There is also an online Cookbook for R, but here are some things I personally find useful and keep reusing (and keep forgetting how to do them).
Load libraries and install if not found. Handy when you share your R files.
loadLibs <- function(x) {
for( i in x ){
# require returns TRUE invisibly if it was able to load package
if( ! require( i , character.only = TRUE ) ){
# If package was not able to be loaded then re-install
install.packages( i , dependencies = TRUE )
# Load package after installing
require( i , character.only = TRUE )
}
}
}
Usage:
libs2load = c('dplyr',
'tidyr',
'ggplot2'
)
loadLibs(libs2load)
If you have a personal folder where you want to install libraries (maybe you don't have permission to install new ones because you are on a server where you are not a superuser), add these lines to your ~/.Renviron file (create it if it does not exist):
R_LIBS=/somewhere/global/library/where/you/have/no/write/access
R_LIBS_USER=/path/to/your/personal/library
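After restarting R, you can check that both locations are picked up and, if needed, point an install at your personal library explicitly:
.libPaths()   # lists the library trees R will search
install.packages("dplyr", lib = Sys.getenv("R_LIBS_USER"))   # install into your personal library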
Import a folder of JSON files into one dataframe:
jsonfiles <- list.files(path=dir_data.epmLogs, pattern="*.json", full.names=TRUE)
epm.df <- dplyr::bind_rows(plyr::ldply(jsonfiles,
                                       function(x) {
                                         x.l <- jsonlite::fromJSON(x, flatten=TRUE)
                                         x.df <- as.data.frame(x.l)
                                       }))
R usually suppresses output to the console when assigning to a variable (<-). Rather than needing two lines -- one for the assignment and a second to see the content of the newly created dataframe:
new.df <- old.df[,c(1,2,4:7)]
new.df
Put () around the assignment:
(new.df <- old.df[,c(1,2,4:7)])
Select columns in a dataframe
By index --
e.g. only take columns 1, 2, 4, 5, 6, 7 from dataframe df1:
df1[,c(1,2,4:7)]
By column name:
df1[,c("name", "age", "group")]
To exclude columns, use a - and the column index, e.g.
Concatenate dataframes df1 and df2, using all but column 4 of df1:
rbind(df1[,-4], df2)
e.g. exclude (!) rows matching a list of factor levels, then drop the now-unused levels with droplevels() so that the factor in the new dataframe no longer includes those levels (note the double comma in the subsetting: the first comma ends the row index, the second is a placeholder for the column index -- leave it empty to keep all columns):
The loop way in base R:
exclLevs <- c("lev1" "lev2")
for ( alvl in exclLevs ) {
print(ss)
new.df <- old.df[ !old.df$factorX == alvl,,drop=TRUE]
}
Using the %in% operator instead of a loop (it returns a logical vector that you can use for indexing):
exclLevs <- c("lev1", "lev2")
new.df <- old.df[ !old.df$factorX %in% exclLevs, ]
new.df$factorX <- droplevels(new.df$factorX)
Rename a column using base R -- replace column name oldColname by newColname in dataframe df:
names(df)[names(df) == "oldColname"] <- "newColname"
Replace strings in column names matching a string, e.g. strip trailing dots in column names:
colnames(df) <- gsub('\\.$','',colnames(df))
Convert a factor to numeric: doing as.numeric(factor) returns the underlying level codes, not the original values. Two ways of doing it correctly -- the first is recommended and faster:
as.numeric(levels(df$x))[df$x]
or
as.numeric(as.character(df$x))
e.g. group values by ID (nest() creates a list-column containing all the rows belonging to each ID) and take the mean of all values above 0 in column err:
library(dplyr); library(tidyr); library(purrr)
data.df %>% select(ID, err) %>%
  group_by(ID) %>%
  nest(err.list = err) %>%
  mutate( err_mean = map_dbl(err.list, ~ mean(.x$err[.x$err > 0])) )
Backreference is possible with gsub() using \1, \2 etc.
e.g. Create a new column Grp in my dataframe which lops off everything after the underscore in the levels of GROUP:
dat1.df$Grp <- gsub("(.*)_.*","\\1",dat1.df$GROUP)
e.g. create a new column greeting by pasting together two existing columns:
dat1.df$greeting <- paste(dat1.df$title, dat1.df$LastName,sep='.')
By default, R orders factor levels alphabetically (A-Z) and takes the first level as the reference. To specify the reference level, call relevel() from the stats package with the ref argument:
dat.df$GROUP <- relevel( dat.df$GROUP, ref="PBO")
This applies all the summary functions to each variable before moving on to the next variable in the list:
tmp_demogr.v <- c("age", "wt_kg", "ht_m", "BMI" )
Ss.df %>%
group_by(sex) %>%
summarise(n = n(),
across(all_of(tmp_demogr.v),
list(mu = mean, sd = sd, med = median, min = min, max = max), na.rm = TRUE,
.names = "{col}_{fn}"))
Use the package hashmap
Hors <- c("E2", "P4", "TST", "IGF1", "DHT", "LH", "FSH")
units <- c("pg/mL", "ng/mL", "ng/mL", "ng/mL", "pg/mL", "mIU/mL", "mIU/mL")
Hunits <- hashmap::hashmap(Hors,units)
# Get units of hormone FSH
Hunits[['FSH']]
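If hashmap is not available (it has been archived on CRAN), a plain named vector gives you the same kind of lookup:
Hunits <- setNames(units, Hors)
Hunits[["FSH"]]   # "mIU/mL"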
library(purrr); library(tidyr); library(tibble)
list1 <- list(id=1, a = 2, b = c(3,4,5), c='stuff')
list2 <- list(id=2, a = 3, b = c(9,3,2), c='otherStuff')
mamalist <- list(list1, list2)
mamalist.tb <- mamalist %>% transpose() %>% as_tibble()
mamalist.tb <- mamalist.tb %>% unnest_longer(id) %>% unnest_longer(a) %>% unnest_longer(c)
# print the name of a function FUN (e.g. mean) as a string
deparse(substitute(FUN))
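For example, inside a hypothetical wrapper function:
summarize_with <- function(x, FUN) {
  fun_name <- deparse(substitute(FUN))   # e.g. "mean" or "median"
  cat("Applying", fun_name, "\n")
  FUN(x, na.rm = TRUE)
}
summarize_with(c(1, 2, NA, 4), mean)     # prints "Applying mean" and returns 2.33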
# print rows where there is an NaN in a dataframe df (also works with matrices)
df[!complete.cases(df),]
## compute the number of years between two dates, rounded to 3 significant digits
date.fmt <- "%Y-%m-%d" ## date format, here e.g. 2000-12-31
signif((as.Date(df$end_date, date.fmt) - as.Date(df$start_date, date.fmt))/365, 3)
describe() and describeBy() (from the psych package), e.g. descriptives by group:
describeBy(sub_info$Menses.D2_D, group=sub_info$GROUP, mat=TRUE)
thresh <- 4
outliers.age <-
dat[(dat$age < -thresh
| dat$age > thresh) # outliers are more than +/- thresh
& !is.na(dat$age), # also exclude NAs
c("ID","age")] # only output cols "ID" and "age"
Example: combine information about the outliers found above with info from another dataframe
# the () around the assignment also prints the result to the console
(infoComb.outliers.df <-
   merge(info1.df[ info1.df$ID %in% outliers.age$ID
                   & info1.df$day=='d1', ],   # any additional subsetting
         info2.df,
         by="ID"))                            # column to merge by
Splitting data by certain factors, e.g. here by condition (cond), group and sex:
library(dplyr)
library(tidyr)
library(purrr)
library(broom) # for tidy() compact formatting of stat fcns like t.test()
metrics.condXsexXgrp.ttest <- metrics.df %>%
  group_by(cond, group, sex) %>%
  summarise(res = list( tidy( t.test(metricX) ) )) %>%
  unnest(res) %>% arrange(cond, group, sex)
## alternatively:
metrics.df %>%
  nest(data = -c(cond, group, sex)) %>%   # creates a list-column named "data"
  mutate(t_test = map(data, ~ tidy( t.test(.x$metricX, mu = 0) ))) %>%   # ".x" is the data subset for each cond/group/sex combination
  unnest(t_test) %>% arrange(cond, group, sex)
Run a t-test when you only have summary statistics (means, SDs, ns) rather than the raw data:
# m1, m2: the sample means
# s1, s2: the sample standard deviations
# n1, n2: the sample sizes
# m0: the null value for the difference in means to be tested for. Default is 0.
# equal.variance: whether or not to assume equal variance. Default is FALSE.
# returns a list with stats
t.test2 <- function(m1,m2,s1,s2,n1,n2,m0=0,equal.variance=FALSE)
{
sigdig <- 3
if( equal.variance==FALSE )
{
se <- sqrt( (s1^2/n1) + (s2^2/n2) )
# welch-satterthwaite df
df <- ( (s1^2/n1 + s2^2/n2)^2 )/( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )
} else
{
# pooled standard deviation, scaled by the sample sizes
se <- sqrt( (1/n1 + 1/n2) * ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2) )
df <- n1+n2-2
}
t <- (m1-m2-m0)/se
stats <- c(format(round(m1-m2,digits=sigdig),nsmall=sigdig),
format(round(df,digits=sigdig),nsmall=sigdig),
format(round(se,digits=sigdig),nsmall=sigdig),
format(round(t,digits=sigdig),nsmall=sigdig),
format(round(2*pt(-abs(t),df),digits=sigdig),nsmall=sigdig))
names(stats) <- c("Difference of means", "df","Std Error", "t", "p-value")
return(stats)
}
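Usage, with made-up summary numbers:
t.test2(m1 = 5.2, m2 = 4.8, s1 = 1.1, s2 = 0.9, n1 = 30, n2 = 28)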
By default, R uses dummy contrast coding (all levels contrasted against a reference level, which if not set is whichever level comes first alphabetically).
R has 4 built-in contrasts:
dummy (each level compared to the reference level)
intercept = mean of the reference grp
e.g. for 4 levels: (R code contr.treatment(4)):
2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1
to change the baseline/reference level, say to the 3rd level:
contrasts(df$factor) <- contr.treatment(levels(df$factor), base = 3)
deviation (each level compared to the grand mean)
intercept = grand mean across all the levels of the factor
e.g. for 4 levels (R code: contr.sum(4)):
  [,1] [,2] [,3]
1    1    0    0
2    0    1    0
3    0    0    1
4   -1   -1   -1
Helmert (each level compared to the mean of the subsequent levels of the variable)
intercept = mean of ref level
e.g. for 4 levels (R code: contr.helmert(4)):
1vs2+3+4 2vs3+4 3vs4
1 3/4 0 0
2 -1/4 2/3 0
3 -1/4 -1/3 1/2
4 -1/4 -1/3 -1/2
orthogonal polynomial (trend analysis for ordinal variable in which levels are equally spaced; R code: contr.poly(nlevs))
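A minimal sketch of switching between these built-in codings, using a made-up 4-level factor dose:
df <- data.frame(dose = factor(rep(c("a", "b", "c", "d"), each = 5)),
                 y    = rnorm(20))
contrasts(df$dose)                       # default: dummy/treatment coding
contrasts(df$dose) <- contr.sum(4)       # deviation coding
contrasts(df$dose) <- contr.helmert(4)   # Helmert coding
contrasts(df$dose) <- contr.poly(4)      # orthogonal polynomial coding
summary(lm(y ~ dose, data = df))         # coefficients are interpreted per the chosen coding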
Other common contrast codings:
simple (like dummy coding, but the intercept differs)
intercept = grand mean
e.g. for 4 levels
c_dummy <- contr.treatment(4)
c_spec <- matrix(rep(1/4, 12), ncol=3)
(c_simple <- c_dummy - c_spec)
2 3 4
1 -0.25 -0.25 -0.25
2 0.75 -0.25 -0.25
3 -0.25 0.75 -0.25
4 -0.25 -0.25 0.75
contrasts(df$x) <- c_simple
lm(y ~ x, df) # lm will adopt defined contrast
Specify your own contrasts -- you can define up to (number of levels - 1) contrasts, i.e. the degrees of freedom of the factor.
For example, custom contrasts for a factor with 5 levels:
1st 3 levels vs. last 2 levels: -1/3, -1/3, -1/3, 1/2, 1/2
4th vs. 5th: 0, 0, 0, 1, -1
...
contrasts(df$factor) <- matrix(c(-1/3, -1/3, -1/3, 1/2, 1/2,
                                 0, 0, 0, 1, -1,
                                 0, 1, -1, 0, 0),
                               nrow = 5, ncol = 3)
df.lm <- lm(y ~ factor, data = df)
summary.aov(df.lm, split=list(factor=list("First3 vs. last2" = 1,
                                          "4th vs. 5th" = 2,
                                          "2nd vs. 3rd" = 3)))
Walked-through example on tds for a continuous factor, or an example on STHDA with categorical factors.
# from
# http://rrubyperlundich.blogspot.com/2015/07/1-label-outlier-in-ggplot2-boxplot.html
# add labels to outliers in a ggplot2 boxplot
# input: 1). p - ggplot boxplot obj (dataframe given to ggplot must contain x,y and labeling variable
# 2). labvar - (opt'l) string containing the name of the var containing the labels. Default is value itself
add.outlier <- function(p, labvar = as.character(p$mapping$y)){
  ## (written for older ggplot2 where aes mappings were plain symbols)
  df <- data.frame(y = with(p$data, eval(p$mapping$y)),
                   x = with(p$data, eval(p$mapping$x)))
  df.l <- split(df, df$x)
  mm <- Reduce(rbind, lapply(df.l, FUN = function(df){
    data.frame(y = df$y[df$y <= (quantile(df$y)[2] - 1.5 * IQR(df$y)) | df$y >= (quantile(df$y)[4] + 1.5 * IQR(df$y))],
               x = df$x[df$y <= (quantile(df$y)[2] - 1.5 * IQR(df$y)) | df$y >= (quantile(df$y)[4] + 1.5 * IQR(df$y))]
    )})
  )
  ## minimal completion: the original post looks up labvar for each outlier row;
  ## here each outlier is simply labeled with its y value
  p + ggplot2::geom_text(data = mm,
                         ggplot2::aes(x = x, y = y, label = y),
                         hjust = -0.3)
}
## save plots in a list
plot.l <- list()
i <- 0
for (m in loopThruMe.l) { ## loopThruMe.l is some list of things to loop through
i <- i+1
plot.l[[i]] <- somePlotFcn(someArg, m)
}
## arranges everything in one figure;
## no easy config for a common legend
do.call(gridExtra::grid.arrange,
c(plot.l, ncol=someNoCols, top="a title"))
## default nrow/ncol=1 for what is not specified and creates multi-figure;
## plots with specified # of cols and/or rows;
## easy option for common legend
do.call(ggpubr::ggarrange,
        c(plot.l,
          list(ncol=someNoCols, common.legend=TRUE)))
Only base R pays attention to the font sizes you specify... Check out the very informative page on saving figures in R from the University of Marburg's learning platform, which offers the following guidelines for creating your figures:
width: 8.30 cm (one-column images), 17.35 cm (two-column images)
maximum height: 23.35 cm (caption will not fit on the same page then)
minimum resolution: 300 ppi
compression: LZW
color mode: RGB (millions of colors), 8 bits per channel
background: white, not transparent
font type, size: Arial, Times or Symbol, 6 to 12 pt
lines: line width between 0.5 to 1.5 pt
white space: a 2 pt white space around each figure is recommended
file size: 10 MB max
fig.w.2col_cm <- 17.3
fig.w.1col_cm <- 8.30
fig.h.cm <- 23.35
fig.w.2col_in <- 6.8
fig.w.1col_in <- 3.2
## e.g. saving a ggplot2 object gg:
## ggsave(file.path(dir_analyses,"RT_hist.pdf"),
##        gg,
##        width = fig.w.2col_in *.8, height=6.5)
Also to streamline and facilitate consistent figures, I define variables for ggplot2 layers and themes which I use each time I generate a plot, e.g.
rgb_PT <- "pink3"
rgb_HC <- "darkslategray"
cfg.plot,pch <- c(19,1) # solid circle, unfilled circle
## define ggplot2 layers
cfg.plot,col.gg <- scale_color_manual(values = c(col_HC, col_PT))
cfg.plot.fill.gg <- scale_fill_manual(values = c(col_HC, col_PT))
cfg.plot.pch.gg <- scale_shape_manual(values = plot_pch)
theme_proj1.gg <- ggplot2::theme_bw() + ## now gray plot bkgd
ggplot2::theme(strip.text = element_text(size=rel(0.8)), # font size of facets
axis.text.x = element_text(size=rel(0.7)),
axis.text.y = element_text(size=rel(0.7)),
axis.title.x = element_text(size=rel(0.8)),
axis.title.y = element_text(size=rel(0.8)),
legend.title = element_text(size=rel(0.8)),
legend.text = element_text(size=rel(0.7)),
legend.key = element_blank(), # remove fill in legend
plot.title = element_text(hjust = 0.5, size=11))
It is sometimes hard to visualize what an array of colors will look like in your plots.
Resources:
R chart's color resources for some predefined color names and corresponding HEX code or convert between HEX and RGB.
ColorBrewer for interactive and more complex coloring (like for cartography)
Data Novia's guide to choosing color palettes in R for visualizing data
There are various ways to preview colors other than just making plots with them, but scales::show_col() makes light work of it:
col_3prs <- c("cornsilk", "navajowhite", "lightskyblue", "deepskyblue", "darkorange", "darkorange3") # choose one of the 657 colors in R with predefined color names
scales::show_col(col_3prs)
Show memory usage of your data objects:
showMemoryUse(decreasing=TRUE, limit=10)
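showMemoryUse() is not a base R function (versions of it circulate in blog posts); a minimal sketch of such a helper, built on object.size(), might look like this:
showMemoryUse <- function(decreasing = TRUE, limit = 10) {
  obj.names <- ls(envir = .GlobalEnv)
  sizes <- sapply(obj.names, function(x) object.size(get(x, envir = .GlobalEnv)))
  sizes <- sort(sizes, decreasing = decreasing)
  head(data.frame(object  = names(sizes),
                  size_MB = round(unclass(sizes) / 1024^2, 2),
                  row.names = NULL),
       limit)
}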
Things to watch out for.
If no delimiter is specified, paste() separates the concatenated strings by a space by default. Use paste0() (or sep="") to concatenate with no whitespace.
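For example:
paste("ID", 42)             # "ID 42"  (default sep is a space)
paste("ID", 42, sep = "")   # "ID42"
paste0("ID", 42)            # "ID42"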
When you merge only specific columns from each dataframe, make sure you include the column you merge by; otherwise merge() throws a very unhelpful error.
Make sure your factors have the data type factor (in particular any random-effects terms) -- otherwise ANOVA functions from certain packages will give you weird results.
Watch out for function masking between packages: dplyr and plyr, for example, have functions with the same names. If you do not know why things aren't working like they used to, it might have something to do with the order in which you loaded your packages (the last one loaded overrides functions defined in an earlier-loaded package). For the conflict between dplyr and plyr specifically, you usually want dplyr for the additional functions it provides (e.g. rename), so load plyr first and then dplyr so that dplyr doesn't break.
To avoid this problem in general, call functions from particular packages by specifying the namespace (PKG::FUNCTION):
iris %>% dplyr::group_by(Species) %>% dplyr::summarize(p = mean(Petal.Length))
Check the help page on the double-colon and triple-colon operators for the difference between PKG::FUNCTION and PKG:::FUNCTION.
Plot or variable not being displayed like it usually would be outside of a loop or function? You need to wrap the plot object/variable in print().
You need to use the operator := instead of =
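A common case where this comes up (assuming this refers to programmatically naming a new column with dplyr's tidy evaluation; the data and column name below are made up):
library(dplyr)
df <- data.frame(age = c(21, 34, 48))
newcol <- "age_z"                                         # column name held in a variable
df <- df %>% mutate(!!newcol := as.numeric(scale(age)))   # a plain = would not work here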
R Markdown in RStudio takes commenting your code to another level -- it makes code writing and reporting seamless (kind of) with code chunks and inline output.
However....
Do not create one big .Rmd file
When I first started using R Markdown, I wanted to keep all my analysis attempts in one comprehensive R Markdown report. I would save different images into different .RData and .Rhistory files, and try to save to my html/pdf in chunks as I went along rather than knitting the entire report at the end all at once -- but this never worked out (see below). In the end my code in the Rmd was too flat, and the heavy processes meant I would get memory errors before it would knit into a complete report (because inevitably the output of a section I had already run would get cleared, no matter what I did -- see below). So don't do it! Break it up. Modularize your code and processes.
Put R code that is not important to your report as functions in a separate .R file and source() it from your .Rmd file
Add a date and table of contents to your report by adding these lines to your YAML header:
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  pdf_document:
    highlight: tango
    keep_tex: true
    number_sections: true
    toc: true
  html_notebook:
    toc: true
    toc_float: true
    number_sections: true
Do you really need R Markdown? Some heavy processes should really just be R scripts instead, or at the most R Notebooks (see Tips for Heavy Processes below).
R Notebook might be better suited for
exploratory analyses and keeping track of all the things you have tried
R Markdown
creating reports
taking the exploratory analyses from R Notebook that you want to keep and creating final reports from those
R Notebook = R Markdown in....
the markup language (the only thing that indicates that it is one and not the other is the YAML header)
YAML header for R Notebook: output: html_notebook
YAML header for R Markdown -- one of these:
output: html_document
output: pdf_document
output: word_document
and many more options including HTML presentation slides, interactive dashboards
the file extension
R Notebook vs. R Markdown (differences)
sending code to console
R Notebook will send code lines to the console one at a time (allowing you to run separate chunks and stop when a line raises an error), whereas R Markdown sends all the code to the console (even though you can execute/test single chunks at a time).
R Notebook is "previewed", i.e. it generates an HTML view of your code and output without running any of the R code chunks.
R Markdown is "knitted" and sends all the code chunks to the console when it generates a report. Even though it generates intermediate reports in whatever file format you have specified (html, pdf, Word doc) whenever you save your Rmd, you cannot rely on this being a complete report because RStudio sometimes clears outputs (see Cleared Outputs below).
commenting
R Notebook allows multi-line comments with <!-- -->
R Markdown only allows single-line commenting by prepending each comment line with #
R Markdown is heavier than running scripts in a terminal, and it can crash (specifically, it will report that it needs to restart R, and in doing so it sometimes loses your outputs), so it is not suited to heavy processes. Advice on dealing with this can be found in this forum thread, but to summarize and integrate the collective suggestions from the R community:
For code with heavy processes, R Notebook is more suited than R Markdown.
Not knitting certain code chunks: add {r eval=FALSE } to your code chunk
Move the analysis-heavy part of the code out of the R Markdown document completely, into an R script (this also makes debugging easier). The script should save the results of the analysis to some file, e.g. .csv, .rds, .rda or a feather file. Then in the R Markdown you add a chunk with an if statement that checks whether the file exists (in which case it loads it) and otherwise sources the analysis script (see the sketch below). This way, knitting takes far less time.
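A sketch of such a chunk (the file and script names are placeholders):
results_file <- file.path("output", "heavy_results.rds")
if (file.exists(results_file)) {
  results <- readRDS(results_file)
} else {
  source("proj1_heavy_analysis.R")   # this script is expected to saveRDS() its results to results_file
  results <- readRDS(results_file)
}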
Put long-running processes in the setup chunk (test them in the console first) along with intermediate results that you want to keep in the final report. When you have decided which code chunks to keep with accompanying text, saveRDS() the object that took forever to compute to a public repository, either on Git or S3, comment out the code that generated it, and add something like:
results <- readRDS(path_to_rds)   # or readRDS(url(path_to_rds)) if the file lives remotely
That gives you everything you need to write/knit/write/knit without waiting forever for the code to run.
save.image() different sections of your R Markdown into different .RData's
NOTE: Setting the working directory within a code chunk in your .Rmd does not actually set the working directory outside of that code chunk; you need to run setwd() in the console manually.
use drake
designed for data science -- it helps you avoid the Sisyphean loop of waiting forever for code to run, hitting a problem, and rerunning from scratch (with drake, what has already been run will not be rerun)
uses plans to schedule work, detects dependency relationships between targets to be able to run targets in parallel
does require a different way of thinking -- get started on the drake reference website (with intro video and code, where to go for more help)
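A minimal drake plan might look like this (clean_up() and fit_model() stand in for your own functions):
library(drake)
plan <- drake_plan(
  raw    = read.csv(file_in("data/raw.csv")),
  clean  = clean_up(raw),      # your own cleaning function (placeholder)
  model  = fit_model(clean),   # your own model-fitting function (placeholder)
  report = rmarkdown::render(knitr_in("report.Rmd"),
                             output_file = file_out("report.html"),
                             quiet = TRUE)
)
make(plan)   # only targets whose dependencies changed get rebuilt on re-runs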
I have found that RStudio will periodically clear the output of your R Markdown output file, so even though I don't knit my entire Rmd and only hope to run code chunks one at a time (when you save your Rmd, it actually updates the output report file in the format you have specified), I get foiled. Inline output in your Rmd sometimes gets cleared when:
RStudio throws an error and wants to restart R.
you decide you want to globally replace the name of a variable.
you move code chunks around.
Another gotcha: scale() returns a matrix, not a vector. The trigger for the resulting error: you create a new column in a dataframe using
df$x_sc <- scale(df$x)
or
df <- df %>% mutate(x_sc = scale(x))
then fit a model with that column and use something like predict() or a function from the ggeffects or sjPlot package which calls predict().
When you do a str(df), you will see that df$x_sc is a matrix:
$ x_sc : num [1:126, 1] 0.5 1 ...
$ x : int 2.3 4 ...
Apparently, to quote this forum post:
It's a feature and it's been there forever. (It's even present in
another system not unlike R.)
Suppose you set
y <- matrix(1:3)
and construct
dfr <- data.frame(x=1:3, y)
Then you invoke the constructor function, data.frame, which by default
simplifies things like matrices to single columns, naming them as
necessary.
Now if you directly modify dfr by adding another component, like
dfr$yy <- y
You bypass the constructor function and its default simplifications, but
you do not bypass the structure tests. This is, in fact the simplest
way to put a matrix inside a data frame intact, but it must have the
same number of rows as has the data frame itself.
One work-around is to add the new column this way:
df['x_sc'] <- as.data.frame(scale(df$x))
The following, by the way, does not work:
df$x_sc <- as.data.frame(scale(df$x))
Among other causes more commonly discussed in forums, check that you do not have NULL objects in your plot list -- if so, remove them from the list:
plot.list [ lengths(plot.list) == 0] <- NULL
do.call(gridExtra::grid.arrange, c(plot.list, list(ncol = someNoCols)))
## alternatively:
do.call(ggpubr::ggarrange, c(plot.list,
list(ncol=someNoCols, common.legend=T)))
If kable() complains, one possibility is that you have a row of NAs -- filter these out before calling kable().
A list of R cheatsheets (e.g. for purrr/apply, data import, data transformation, RStudio, data visualization)
Get help from the RStudio Community
Looking for an R package that will do what you want? Search here on rseek.org