When RStudio came out with R Markdown, I thought it was the solution I had been looking for: a way to seamlessly present my code and notes/reflections far less awkwardly than adding notes as comments in the code, with code development, evolving analysis, report and lab notebook all rolled into one centralized file. However, I learned the hard way that having all of that in one R Markdown (Rmd) file is not the way to go given the heavy processes and big data that I work with. Here is what I learned about organizing my code:
Use an R Notebook (not R Markdown, though they share the same .Rmd extension) as the core skeleton that calls separate R scripts. The R Notebook plays the unifying role: it is the one place where you can see which R scripts are called and where R-generated plots and other files are saved.
Never include the following code directly in your R notebook, but in a separate R script file that you call as needed:
the more stable parts of your data analysis (e.g. data import and clean-up), though these evolve too as the analyses progress
heavy processes (e.g. importing large data, heavy computation -- this includes correlation pair plots)
plots that are too detailed to see in the final html/pdf anyway unless saved on their own (like correlation pair plots with lots of factors)
Here is my optimal organization for data analyses:
proj1_cfg_gen.R: R code with paths, files, file-naming conventions, plotting colors/font sizes/output sizes/themes, and other values that I want to access from different code files -- this streamlines things and keeps all plots consistent
proj1_fcns.R: my own functions
proj1_01_behav.Rmd: R notebook from which I call my external R code files and which I use as my analysis notebook (code + notes); e.g. near the beginning of my notebook, I have:
rm(list=ls())
whereami <- 'server'
switch(whereami,
       "server" = {
         cfg.path.base <- '/netapp/vol_dat/proj1'
         cfg.dir.scripts <- 'code'
       },
       "local" = {
         cfg.path.base <- '/Users/me/proj1'
         cfg.dir.scripts <- 'code'
       }
)
cfg.path.scripts <- file.path(cfg.path.base, cfg.dir.scripts)
source( file.path(cfg.path.scripts,'proj1_cfg_gen.R') )
load_whatdat <- 'raw'
switch(load_whatdat,
       'rdat' = {
         load( file.path(cfg.path.dat.behav,'proj1_01_behav_basic.RData') )
       },
       'raw' = {
         source( file.path(cfg.path.scripts, 'proj1_01a_import_data.R') )
         source( file.path(cfg.path.scripts, 'proj1_01b_calc_metrics.R') )
         save.image( file.path(cfg.path.dat.behav,'proj1_01_behav_basic.RData') )
         source( file.path(cfg.path.scripts, 'proj1_01c_plot_corrpairs.R') )
       }
)
proj1_01a_import_data.R: import and clean data -- this is where I create all the data structures used in the R notebook (try to avoid creating new data structures in the R notebook; it is also easier to see where you created which data structure if you put the data structure names as headings in the R script)
tip: RStudio's handy "document outline" (the outline button to the right of the "Source" button in the coding pane) works like a TOC. Create headings using R Markdown syntax (# for a level-1 heading, ## for a level-2 heading) -- but to make sure the heading appears in the outline, end every heading line with four #, i.e. ####
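For example, the top of proj1_01a_import_data.R might look like this (the file name and data-structure names are just placeholders to illustrate the tip):
# Import raw data ####
dat.raw <- read.csv(file.path(cfg.path.dat.behav, "proj1_raw.csv"))   # placeholder file
## dat.all: merge in subject info ####
dat.all <- merge(dat.raw, sub_info, by = "ID")                        # placeholder data structures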
proj1_01b_calc_metrics.R: some heavy processing that won't change (here, something I need as soon as I load the data, so it is sourced right after the import)
proj1_01c_plot_corrpairs.R: my big data-exploration plots are often mega correlation pair plots to get a feel for the distributions of and relationships between my factors (my favorite for functionality and the type of customizations I like being able to control is GGally::ggpairs() -- see what ggpairs() is capable of and examples of different correlogram packages on RPubs). These take long to render and, with the number of factors I usually have, I learned the hard way that it's best to create them separately and save them to file instead of having them directly in the R notebook (see the sketch after this list).
tip: Correlograms are often things you create as you go along during data exploration, so it's an iterative process of adding new data structures (where you've filtered/selected certain data as you see necessary the more you analyze your data) -- but to keep everything clean, keep the data-structure creation in your initial data clean-up R code.
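A minimal sketch of that last point -- build the correlogram in its own script (e.g. proj1_01c_plot_corrpairs.R) and write it straight to file (the column names and output size are placeholders):
gg.corr <- GGally::ggpairs(metrics.df[, c("age", "BMI", "metricX")])   # placeholder columns
png(file.path(cfg.path.dat.behav, "proj1_01_corrpairs.png"),
    width = 30, height = 30, units = "cm", res = 300)
print(gg.corr)
dev.off()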
You've gotta love O'Reilly Publishers and their R Cookbook (full pdf of 1st edition :-o).
There is also an online Cookbook for R, but here are some things I personally find useful and keep reusing (and keep forgetting how to do them).
Load libraries and install if not found. Handy when you share your R files.
loadLibs <- function(x) {
for( i in x ){
# require returns TRUE invisibly if it was able to load package
if( ! require( i , character.only = TRUE ) ){
# If package was not able to be loaded then re-install
install.packages( i , dependencies = TRUE )
# Load package after installing
require( i , character.only = TRUE )
}
}
}
Usage:
libs2load = c('dplyr',
'tidyr',
'ggplot2'
)
loadLibs(libs2load)
If you have a personal folder where you want to install libraries (maybe you don't have permission to install new ones because you are on a server where you are not a superuser), add these lines to your ~/.Renviron file (create it if it does not exist):
R_LIBS=/somewhere/global/library/where/you/have/no/write/access
R_LIBS_USER=/path/to/your/personal/library
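After restarting R, you can check that both locations are picked up and, if needed, point an install at your personal library explicitly:
.libPaths()   # lists the library trees R will search
install.packages("dplyr", lib = Sys.getenv("R_LIBS_USER"))   # install into your personal library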
Import a folder of JSON files into one dataframe:
jsonfiles <- list.files(path=dir_data.epmLogs, pattern="*.json", full.names=TRUE)
epm.df <- dplyr::bind_rows(plyr::ldply(jsonfiles,
                                       function(x) {
                                         x.l <- jsonlite::fromJSON(x, flatten=TRUE)
                                         x.df <- as.data.frame(x.l)
                                       }))
R usually suppresses output to the console when assigning to a variable (<-). Rather than needing two lines -- one for the assignment and a second to see the content of the newly created dataframe:
new.df <- old.df[,c(1,2,4:7)]
new.df
Put () around the assignment:
(new.df <- old.df[,c(1,2,4:7)])
Select columns in a dataframe
By index --
e.g. only take columns 1, 2, 4, 5, 6, 7 from dataframe df1:
df1[,c(1,2,4:7)]
By column name:
df1[,c("name", "age", "group")]
To exclude columns, use a - and the column index, e.g.
Concatenate dataframes df1 and df2, using all but column 4 of df1:
rbind(df1[,-4], df2)
e.g. exclude (!) rows matching a list of factor levels, then drop the now-unused levels with droplevels() so that the factor in the new dataframe no longer includes those levels (note the double comma in the subsetting: the first comma ends the row index, the second is a placeholder for the column index -- leave it empty to keep all columns):
The loop way in base R:
exclLevs <- c("lev1" "lev2")
for ( alvl in exclLevs ) {
print(ss)
new.df <- old.df[ !old.df$factorX == alvl,,drop=TRUE]
}
Using the %in% operator instead of a loop (it returns a logical vector that you can use for indexing):
exclLevs <- c("lev1", "lev2")
new.df <- old.df[ !old.df$factorX %in% exclLevs, ]
new.df$factorX <- droplevels(new.df$factorX)
Rename a column using base R -- replace column name oldColname by newColname in dataframe df:
names(df)[names(df) == "oldColname"] <- "newColname"
Replace strings in column names matching a string, e.g. strip trailing dots in column names:
colnames(df) <- gsub('\\.$','',colnames(df))
Convert a factor to numeric: doing as.numeric(factor) returns the underlying level codes, not the original values. Two ways of doing it correctly -- the first is recommended and faster:
as.numeric(levels(df$x))[df$x]
or
as.numeric(as.character(df$x))
e.g. group values by ID (nest() creates a list-column containing all the rows belonging to each ID) and take the mean of all values above 0 in column err:
library(dplyr); library(tidyr); library(purrr)
data.df %>% select(ID, err) %>%
  group_by(ID) %>%
  nest(err.list = err) %>%
  mutate( err_mean = map_dbl(err.list, ~ mean(.x$err[.x$err > 0])) )
Backreference is possible with gsub() using \1, \2 etc.
e.g. Create a new column Grp in my dataframe which lops off everything after the underscore in the levels of GROUP:
dat1.df$Grp <- gsub("(.*)_.*","\\1",dat1.df$GROUP)
e.g. create a new column greeting by pasting together two existing columns:
dat1.df$greeting <- paste(dat1.df$title, dat1.df$LastName,sep='.')
By default, R orders factor levels alphabetically (A-Z) and takes the first level as the reference. To specify the reference level, call relevel() from the stats package with the ref argument:
dat.df$GROUP <- relevel( dat.df$GROUP, ref="PBO")
This applies all the summary functions to each variable before moving on to the next variable in the list:
tmp_demogr.v <- c("age", "wt_kg", "ht_m", "BMI" )
Ss.df %>%
group_by(sex) %>%
summarise(n = n(),
across(all_of(tmp_demogr.v),
list(mu = mean, sd = sd, med = median, min = min, max = max), na.rm = TRUE,
.names = "{col}_{fn}"))
Use the package hashmap
Hors <- c("E2", "P4", "TST", "IGF1", "DHT", "LH", "FSH")
units <- c("pg/mL", "ng/mL", "ng/mL", "ng/mL", "pg/mL", "mIU/mL", "mIU/mL")
Hunits <- hashmap::hashmap(Hors,units)
# Get units of hormone FSH
Hunits[['FSH']]
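If hashmap is not available (it has been archived on CRAN), a plain named vector gives you the same kind of lookup:
Hunits <- setNames(units, Hors)
Hunits[["FSH"]]   # "mIU/mL"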
library(purrr); library(tidyr); library(tibble)
list1 <- list(id=1, a = 2, b = c(3,4,5), c='stuff')
list2 <- list(id=2, a = 3, b = c(9,3,2), c='otherStuff')
mamalist <- list(list1, list2)
mamalist.tb <- mamalist %>% transpose() %>% as_tibble()
mamalist.tb <- mamalist.tb %>% unnest_longer(id) %>% unnest_longer(a) %>% unnest_longer(c)
# print the name of a function FUN (e.g. mean) as a string
deparse(substitute(FUN))
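For example, inside a hypothetical wrapper function:
summarize_with <- function(x, FUN) {
  fun_name <- deparse(substitute(FUN))   # e.g. "mean" or "median"
  cat("Applying", fun_name, "\n")
  FUN(x, na.rm = TRUE)
}
summarize_with(c(1, 2, NA, 4), mean)     # prints "Applying mean" and returns 2.33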
# print rows where there is an NaN in a dataframe df (also works with matrices)
df[!complete.cases(df),]
## compute the number of years between two dates, rounded to 3 significant digits
date.fmt <- "%Y-%m-%d" ## date format, here e.g. 2000-12-31
signif((as.Date(df$end_date, date.fmt) - as.Date(df$start_date, date.fmt))/365, 3)
describe() and describeBy() (from the psych package), e.g. descriptives by group:
describeBy(sub_info$Menses.D2_D, group=sub_info$GROUP, mat=TRUE)
thresh <- 4
outliers.age <-
dat[(dat$age < -thresh
| dat$age > thresh) # outliers are more than +/- thresh
& !is.na(dat$age), # also exclude NAs
c("ID","age")] # only output cols "ID" and "age"
Example: combine information about the outliers found above with info from another dataframe
# the () around the assignment also prints the result to the console
(infoComb.outliers.df <-
   merge(info1.df[ info1.df$ID %in% outliers.age$ID
                   & info1.df$day=='d1', ],   # any additional subsetting
         info2.df,
         by="ID"))                            # column to merge by
Splitting data by certain factors, e.g. here by condition (cond), group and sex:
library(dplyr)
library(tidyr)
library(purrr)
library(broom) # for tidy() compact formatting of stat fcns like t.test()
metrics.condXsexXgrp.ttest <- metrics.df %>%
  group_by(cond, group, sex) %>%
  summarise(res = list( tidy( t.test(metricX) ) )) %>%
  unnest(res) %>% arrange(cond, group, sex)
## alternatively:
metrics.df %>%
  nest(data = -c(cond, group, sex)) %>%   # creates a list-column named "data"
  mutate(t_test = map(data, ~ tidy( t.test(.x$metricX, mu = 0) ))) %>%   # ".x" is the data subset for each cond/group/sex combination
  unnest(t_test) %>% arrange(cond, group, sex)
Run a t-test when you only have summary statistics (means, SDs, ns) rather than the raw data:
# m1, m2: the sample means
# s1, s2: the sample standard deviations
# n1, n2: the sample sizes
# m0: the null value for the difference in means to be tested for. Default is 0.
# equal.variance: whether or not to assume equal variance. Default is FALSE.
# returns a list with stats
t.test2 <- function(m1,m2,s1,s2,n1,n2,m0=0,equal.variance=FALSE)
{
sigdig <- 3
if( equal.variance==FALSE )
{
se <- sqrt( (s1^2/n1) + (s2^2/n2) )
# welch-satterthwaite df
df <- ( (s1^2/n1 + s2^2/n2)^2 )/( (s1^2/n1)^2/(n1-1) + (s2^2/n2)^2/(n2-1) )
} else
{
# pooled standard deviation, scaled by the sample sizes
se <- sqrt( (1/n1 + 1/n2) * ((n1-1)*s1^2 + (n2-1)*s2^2)/(n1+n2-2) )
df <- n1+n2-2
}
t <- (m1-m2-m0)/se
stats <- c(format(round(m1-m2,digits=sigdig),nsmall=sigdig),
format(round(df,digits=sigdig),nsmall=sigdig),
format(round(se,digits=sigdig),nsmall=sigdig),
format(round(t,digits=sigdig),nsmall=sigdig),
format(round(2*pt(-abs(t),df),digits=sigdig),nsmall=sigdig))
names(stats) <- c("Difference of means", "df","Std Error", "t", "p-value")
return(stats)
}
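Usage, with made-up summary numbers:
t.test2(m1 = 5.2, m2 = 4.8, s1 = 1.1, s2 = 0.9, n1 = 30, n2 = 28)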
By default, R uses dummy contrast coding (all levels contrasted against a reference level, which if not set is whichever level comes first alphabetically).
R has 4 built-in contrasts:
dummy (each level compared to the reference level)
intercept = mean of the reference grp
e.g. for 4 levels: (R code contr.treatment(4)):
2 3 4
1 0 0 0
2 1 0 0
3 0 1 0
4 0 0 1
to change the baseline/reference level, say to the 3rd level:
contrasts(df$factor) <- contr.treatment(levels(df$factor), base = 3)
deviation (each level compared to the grand mean)
intercept = grand mean across all the levels of the factor
e.g. for 4 levels (R code: contr.sum(4)):
  [,1] [,2] [,3]
1    1    0    0
2    0    1    0
3    0    0    1
4   -1   -1   -1
Helmert (each level compared to the mean of the subsequent levels of the variable)
intercept = mean of ref level
e.g. for 4 levels (R code: contr.helmert(4)):
1vs2+3+4 2vs3+4 3vs4
1 3/4 0 0
2 -1/4 2/3 0
3 -1/4 -1/3 1/2
4 -1/4 -1/3 -1/2
orthogonal polynomial (trend analysis for ordinal variable in which levels are equally spaced; R code: contr.poly(nlevs))
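A minimal sketch of switching between these built-in codings, using a made-up 4-level factor dose:
df <- data.frame(dose = factor(rep(c("a", "b", "c", "d"), each = 5)),
                 y    = rnorm(20))
contrasts(df$dose)                       # default: dummy/treatment coding
contrasts(df$dose) <- contr.sum(4)       # deviation coding
contrasts(df$dose) <- contr.helmert(4)   # Helmert coding
contrasts(df$dose) <- contr.poly(4)      # orthogonal polynomial coding
summary(lm(y ~ dose, data = df))         # coefficients are interpreted per the chosen coding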
Other common contrast codings:
simple (like dummy coding, but the intercept differs)
intercept = grand mean
e.g. for 4 levels
c_dummy <- contr.treatment(4)
c_spec <- matrix(rep(1/4, 12), ncol=3)
(c_simple <- c_dummy - c_spec)
2 3 4
1 -0.25 -0.25 -0.25
2 0.75 -0.25 -0.25
3 -0.25 0.75 -0.25
4 -0.25 -0.25 0.75
contrasts(df$x) <- c_simple
lm(y ~ x, df) # lm will adopt defined contrast
Specify your own contrasts -- you can define up to (number of levels - 1) contrasts, i.e. the degrees of freedom of the factor.
For example, custom contrasts for a factor with 5 levels:
1st 3 levels vs. last 2 levels: -1/3, -1/3, -1/3, 1/2, 1/2
4th vs. 5th: 0, 0, 0, 1, -1
...
contrasts(df$factor) <- matrix(c(-1/3, -1/3, -1/3, 1/2, 1/2,
                                 0, 0, 0, 1, -1,
                                 0, 1, -1, 0, 0),
                               nrow = 5, ncol = 3)
df.lm <- lm(y ~ factor, data = df)
summary.aov(df.lm, split=list(factor=list("First3 vs. last2" = 1,
                                          "4th vs. 5th" = 2,
                                          "2nd vs. 3rd" = 3)))
Walked-through example on tds for a continuous factor, or an example on STHDA with categorical factors.
# from
# http://rrubyperlundich.blogspot.com/2015/07/1-label-outlier-in-ggplot2-boxplot.html
# add labels to outliers in a ggplot2 boxplot
# input: 1). p - ggplot boxplot obj (dataframe given to ggplot must contain x,y and labeling variable
# 2). labvar - (opt'l) string containing the name of the var containing the labels. Default is value itself
add.outlier <- function(p, labvar = as.character(p$mapping$y)){
  ## (written for older ggplot2 where aes mappings were plain symbols)
  df <- data.frame(y = with(p$data, eval(p$mapping$y)),
                   x = with(p$data, eval(p$mapping$x)))
  df.l <- split(df, df$x)
  mm <- Reduce(rbind, lapply(df.l, FUN = function(df){
    data.frame(y = df$y[df$y <= (quantile(df$y)[2] - 1.5 * IQR(df$y)) | df$y >= (quantile(df$y)[4] + 1.5 * IQR(df$y))],
               x = df$x[df$y <= (quantile(df$y)[2] - 1.5 * IQR(df$y)) | df$y >= (quantile(df$y)[4] + 1.5 * IQR(df$y))]
    )})
  )
  ## minimal completion: the original post looks up labvar for each outlier row;
  ## here each outlier is simply labeled with its y value
  p + ggplot2::geom_text(data = mm,
                         ggplot2::aes(x = x, y = y, label = y),
                         hjust = -0.3)
}
## save plots in a list
plot.l <- list()
i <- 0
for (m in loopThruMe.l) { ## loopThruMe.l is some list of things to loop through
i <- i+1
plot.l[[i]] <- somePlotFcn(someArg, m)
}
## arranges everything in one figure;
## no easy config for a common legend
do.call(gridExtra::grid.arrange,
c(plot.l, ncol=someNoCols, top="a title"))
## default nrow/ncol=1 for what is not specified and creates multi-figure;
## plots with specified # of cols and/or rows;
## easy option for common legend
do.call(ggpubr::ggarrange,
        c(plot.l,
          list(ncol=someNoCols, common.legend=TRUE)))
Only base R pays attention to the font sizes you specify... Check out the very informative page on saving figures in R from the University of Marburg's learning platform, which offers the following guidelines for creating your figures:
width: 8.30 cm (one-column images), 17.35 cm (two-column images)
maximum height: 23.35 cm (caption will not fit on the same page then)
minimum resolution: 300 ppi
compression: LZW
color mode: RGB (millions of colors), 8 bits per channel
background: white, not transparent
font type, size: Arial, Times or Symbol, 6 to 12 pt
lines: line width between 0.5 to 1.5 pt
white space: a 2 pt white space around each figure is recommended
file size: 10 MB max
fig.w.2col_cm <- 17.3
fig.w.1col_cm <- 8.30
fig.h.cm <- 23.35
fig.w.2col_in <- 6.8
fig.w.1col_in <- 3.2
## e.g. saving a ggplot2 object gg:
## ggsave(file.path(dir_analyses,"RT_hist.pdf"),
##        gg,
##        width = fig.w.2col_in *.8, height=6.5)
Also to streamline and facilitate consistent figures, I define variables for ggplot2 layers and themes which I use each time I generate a plot, e.g.
rgb_PT <- "pink3"
rgb_HC <- "darkslategray"
cfg.plot,pch <- c(19,1) # solid circle, unfilled circle
## define ggplot2 layers
cfg.plot,col.gg <- scale_color_manual(values = c(col_HC, col_PT))
cfg.plot.fill.gg <- scale_fill_manual(values = c(col_HC, col_PT))
cfg.plot.pch.gg <- scale_shape_manual(values = plot_pch)
theme_proj1.gg <- ggplot2::theme_bw() + ## now gray plot bkgd
ggplot2::theme(strip.text = element_text(size=rel(0.8)), # font size of facets
axis.text.x = element_text(size=rel(0.7)),
axis.text.y = element_text(size=rel(0.7)),
axis.title.x = element_text(size=rel(0.8)),
axis.title.y = element_text(size=rel(0.8)),
legend.title = element_text(size=rel(0.8)),
legend.text = element_text(size=rel(0.7)),
legend.key = element_blank(), # remove fill in legend
plot.title = element_text(hjust = 0.5, size=11))
It is sometimes hard to visualize what an array of colors will look like in your plots.
Resources:
R chart's color resources for some predefined color names and corresponding HEX code or convert between HEX and RGB.
ColorBrewer for interactive and more complex coloring (like for cartography)
Data Novia's guide to choosing color palettes in R for visualizing data
There are various ways to preview colors other than just making plots with them, but scales::show_col() makes light work of it:
col_3prs <- c("cornsilk", "navajowhite", "lightskyblue", "deepskyblue", "darkorange", "darkorange3") # choose one of the 657 colors in R with predefined color names
scales::show_col(col_3prs)
Show memory usage of your data objects:
showMemoryUse(decreasing=TRUE, limit=10)
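showMemoryUse() is not a base R function (versions of it circulate in blog posts); a minimal sketch of such a helper, built on object.size(), might look like this:
showMemoryUse <- function(decreasing = TRUE, limit = 10) {
  obj.names <- ls(envir = .GlobalEnv)
  sizes <- sapply(obj.names, function(x) object.size(get(x, envir = .GlobalEnv)))
  sizes <- sort(sizes, decreasing = decreasing)
  head(data.frame(object  = names(sizes),
                  size_MB = round(unclass(sizes) / 1024^2, 2),
                  row.names = NULL),
       limit)
}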
Things to watch out for.
If no delimiter is specified, paste() separates the concatenated strings by a space by default. Use paste0() (or sep="") to concatenate with no whitespace.
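For example:
paste("ID", 42)             # "ID 42"  (default sep is a space)
paste("ID", 42, sep = "")   # "ID42"
paste0("ID", 42)            # "ID42"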
When you merge only specific columns from each dataframe, make sure you include the column you merge by; otherwise merge() throws a very unhelpful error.
Make sure your factors have the data type factor (in particular any random-effects terms) -- otherwise ANOVA functions from certain packages will give you weird results.
Watch out for function masking between packages: dplyr and plyr, for example, have functions with the same names. If you do not know why things aren't working like they used to, it might have something to do with the order in which you loaded your packages (the last one loaded overrides functions defined in an earlier-loaded package). For the conflict between dplyr and plyr specifically, you usually want dplyr for the additional functions it provides (e.g. rename), so load plyr first and then dplyr so that dplyr doesn't break.
To avoid this problem in general, call functions from particular packages by specifying the namespace (PKG::FUNCTION):
iris %>% dplyr::group_by(Species) %>% dplyr::summarize(p = mean(Petal.Length))
Check the help page on the double-colon and triple-colon operators for the difference between PKG::FUNCTION and PKG:::FUNCTION.
Plot or variable not being displayed like it usually would be outside of a loop or function? You need to wrap the plot object/variable in print().
You need to use the operator := instead of =
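A common case where this comes up (assuming this refers to programmatically naming a new column with dplyr's tidy evaluation; the data and column name below are made up):
library(dplyr)
df <- data.frame(age = c(21, 34, 48))
newcol <- "age_z"                                         # column name held in a variable
df <- df %>% mutate(!!newcol := as.numeric(scale(age)))   # a plain = would not work here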
R Markdown in RStudio takes commenting your code to another level -- it makes code writing and reporting seamless (kind of) with code chunks and inline output.
However....
Do not create one big .Rmd file
When I first started using R Markdown, I wanted to keep all my analysis attempts in one comprehensive R Markdown report. I would save different images into different .RData and .Rhistory files, and try to save to my html/pdf in chunks as I went along rather than knitting the entire report at the end all at once -- but this never worked out (see below). In the end my code in the Rmd was too flat, and the heavy processes meant I would get memory errors before it would knit into a complete report (because inevitably the output of a section I had already run would get cleared, no matter what I did -- see below). So don't do it! Break it up. Modularize your code and processes.
Put R code that is not important to your report as functions in a separate .R file and source() it from your .Rmd file
Add a date and table of contents to your report by adding these lines to your YAML header:
date: "`r format(Sys.time(), '%d %B %Y')`"
output:
  pdf_document:
    highlight: tango
    keep_tex: true
    number_sections: true
    toc: true
  html_notebook:
    toc: true
    toc_float: true
    number_sections: true
Do you really need R Markdown? Some heavy processes should really just be R scripts instead, or at the most R Notebooks (see Tips for Heavy Processes below).
R Notebook might be better suited for
exploratory analyses and keeping track of all the things you have tried
R Markdown
creating reports
taking the exploratory analyses from R Notebook that you want to keep and creating final reports from those
R Notebook = R Markdown in....
the markup language (the only thing that indicates that it is one and not the other is the YAML header)
YAML header for R Notebook: output: html_notebook
YAML header for R Markdown -- one of these:
output: html_document
output: pdf_document
output: word_document
and many more options including HTML presentation slides, interactive dashboards
the file extension
R Notebook vs. R Markdown (differences)
sending code to console
R Notebook will send code lines to the console one at a time (allowing you to run separate chunks and stop when a line raises an error), whereas R Markdown sends all the code to the console (even though you can execute/test single chunks at a time).
R Notebook is "previewed", i.e. it generates an HTML view of your code and output without running any of the R code chunks.
R Markdown is "knitted" and sends all the code chunks to the console when it generates a report. Even though it generates intermediate reports in whatever file format you have specified (html, pdf, Word doc) whenever you save your Rmd, you cannot rely on this being a complete report because RStudio sometimes clears outputs (see Cleared Outputs below).
commenting
R Notebook allows multi-line comments with <!-- -->
R Markdown only allows single-line commenting by prepending each comment line with #
R Markdown is heavier than running scripts in a terminal, and it can crash (specifically, it will report that it needs to restart R, and in doing so it sometimes loses your outputs), so it is not suited to heavy processes. Advice on dealing with this can be found in this forum thread, but to summarize and integrate the collective suggestions from the R community:
For code with heavy processes, R Notebook is more suited than R Markdown.
Not knitting certain code chunks: add {r eval=FALSE } to your code chunk
Move the analysis-heavy part of the code out of the R Markdown document completely, into an R script (this also makes debugging easier). The script should save the results of the analysis to some file, e.g. .csv, .rds, .rda or a feather file. Then in the R Markdown you add a chunk with an if statement that checks whether the file exists (in which case it loads it) and otherwise sources the analysis script (see the sketch below). This way, knitting takes far less time.
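A sketch of such a chunk (the file and script names are placeholders):
results_file <- file.path("output", "heavy_results.rds")
if (file.exists(results_file)) {
  results <- readRDS(results_file)
} else {
  source("proj1_heavy_analysis.R")   # this script is expected to saveRDS() its results to results_file
  results <- readRDS(results_file)
}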
Put long-running processes in the setup chunk (test them in the console first) along with intermediate results that you want to keep in the final report. When you have decided which code chunks to keep with accompanying text, saveRDS() the object that took forever to compute to a public repository, either on Git or S3, comment out the code that generated it, and add something like:
results <- readRDS(path_to_rds)   # or readRDS(url(path_to_rds)) if the file lives remotely
That gives you everything you need to write/knit/write/knit without waiting forever for the code to run.
save.image() different sections of your R Markdown into different .RData's
NOTE: Setting the working directory within a code chunk in your .Rmd does not actually set the working directory outside of that code chunk; you need to run setwd() in the console manually.
use drake
designed for data science -- it helps you avoid the Sisyphean loop of waiting forever for code to run, hitting a problem, and rerunning from scratch (with drake, what has already been run will not be rerun)
uses plans to schedule work, detects dependency relationships between targets to be able to run targets in parallel
does require a different way of thinking -- get started on the drake reference website (with intro video and code, where to go for more help)
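A minimal drake plan might look like this (clean_up() and fit_model() stand in for your own functions):
library(drake)
plan <- drake_plan(
  raw    = read.csv(file_in("data/raw.csv")),
  clean  = clean_up(raw),      # your own cleaning function (placeholder)
  model  = fit_model(clean),   # your own model-fitting function (placeholder)
  report = rmarkdown::render(knitr_in("report.Rmd"),
                             output_file = file_out("report.html"),
                             quiet = TRUE)
)
make(plan)   # only targets whose dependencies changed get rebuilt on re-runs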
I have found that RStudio will periodically clear the output of your R Markdown output file, so even though I don't knit my entire Rmd and only hope to run code chunks one at a time (when you save your Rmd, it actually updates the output report file in the format you have specified), I get foiled. Inline output in your Rmd sometimes gets cleared when:
RStudio throws an error and wants to restart R.
you decide you want to globally replace the name of a variable.
you move code chunks around.
Another gotcha: scale() returns a matrix, not a vector. The trigger for the resulting error: you create a new column in a dataframe using
df$x_sc <- scale(df$x)
or
df <- df %>% mutate(x_sc = scale(x))
then fit a model with that column and use something like predict() or a function from the ggeffects or sjPlot package which calls predict().
When you do a str(df), you will see that df$x_sc is a matrix:
$ x_sc : num [1:126, 1] 0.5 1 ...
$ x : int 2.3 4 ...
Apparently, to quote this forum post:
It's a feature and it's been there forever. (It's even present in
another system not unlike R.)
Suppose you set
y <- matrix(1:3)
and construct
dfr <- data.frame(x=1:3, y)
Then you invoke the constructor function, data.frame, which by default
simplifies things like matrices to single columns, naming them as
necessary.
Now if you directly modify dfr by adding another component, like
dfr$yy <- y
You bypass the constructor function and its default simplifications, but
you do not bypass the structure tests. This is, in fact the simplest
way to put a matrix inside a data frame intact, but it must have the
same number of rows as has the data frame itself.
One work-around is to add the new column this way:
df['x_sc'] <- as.data.frame(scale(df$x))
The following, by the way, does not work:
df$x_sc <- as.data.frame(scale(df$x))
Among other causes more commonly discussed in forums, check that you do not have NULL objects in your plot list -- if so, remove them from the list:
plot.list [ lengths(plot.list) == 0] <- NULL
do.call(gridExtra::grid.arrange, c(plot.list, list(ncol = someNoCols)))
## alternatively:
do.call(ggpubr::ggarrange, c(plot.list,
list(ncol=someNoCols, common.legend=T)))
If kable() complains, one possibility is that you have a row of NAs -- filter these out before calling kable().
A list of R cheatsheets (e.g. for purrr/apply, data import, data transformation, RStudio, data visualization)
Get help from the RStudio Community
Looking for an R package that will do what you want? Search here on rseek.org