Introduction to ggplot2

To reproduce the visualizations in this section, you need to install several R packages in a Google Colab notebook environment. The complete list of installed packages is given in this notebook to ensure reproducibility: https://colab.research.google.com/drive/1oWmDt341vjpZC21M4I3vK3QTGNdjUz_O?usp=sharing

First working example of using ggplot2 in R:

The first example plots two variables from the Breast Cancer dataset, mean radius and worst area, as a scatterplot. It introduces the ggplot2 package, which we use to create the visualization:

  • First, we read the Breast Cancer dataset with the read_csv function.

  • Then, we remove the spaces in the column names with the str_replace_all function to avoid errors.

  • Finally, the ggplot and geom_point functions draw the scatterplot.
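
The three steps above might be sketched as follows. This is a minimal illustration, not the notebook's exact code: the file name is not given in this section, so the sketch writes three made-up rows to a temporary CSV file as a stand-in, and it assumes spaces are replaced with dots, which matches column names such as mean.radius used later in this section.

```r
library(readr)    # read_csv
library(stringr)  # str_replace_all
library(ggplot2)  # ggplot, geom_point

# Stand-in for the real file: three made-up rows written to a temporary CSV.
# Replace csv_path with the path to the actual Breast Cancer CSV file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("mean radius,worst area",
             "12.0,450",
             "14.5,700",
             "20.1,1500"), csv_path)

# Step 1: read the dataset
BreastCancer <- read_csv(csv_path)

# Step 2: replace spaces in column names ("mean radius" -> "mean.radius")
names(BreastCancer) <- str_replace_all(names(BreastCancer), " ", ".")

# Step 3: scatterplot of mean radius versus worst area
p <- ggplot(BreastCancer, aes(x = mean.radius, y = worst.area)) +
  geom_point()
p
```

Renaming the columns first means they can be referenced inside aes() without backticks.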

The scatterplot looks like this:

Changing size, color, and transparency of points:

The geom_point function accepts the arguments color, which sets the color of the points, alpha, which sets their transparency, and size, which sets their size:
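
A minimal sketch of these three arguments is shown here. The data frame toy is simulated and only stands in for the two Breast Cancer columns; the specific color, alpha, and size values are illustrative choices, not the ones used in the notebook.

```r
library(ggplot2)

# Simulated stand-in for the two Breast Cancer columns (not real measurements)
set.seed(42)
toy <- data.frame(mean.radius = runif(100, min = 7, max = 25))
toy$worst.area <- 50 * toy$mean.radius + rnorm(100, sd = 60)

p <- ggplot(toy, aes(x = mean.radius, y = worst.area)) +
  geom_point(color = "cornflowerblue",  # color of every point
             alpha = 0.7,               # 0 = fully transparent, 1 = opaque
             size = 3)                  # point size
p
```

Because the three values are set outside aes(), they apply to every point rather than being mapped to a variable.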

The visualization is:

Add regression line and color the points with different attributes:

The plot above shows that, in the region where mean radius is less than 15, the relation between mean radius and worst area is almost linear. We therefore first filter the data to restrict ourselves to the region where mean radius < 15:

filteredBreastCancer <- filter(BreastCancer, mean.radius < 15)

We can then use the geom_smooth function with the "lm" method to fit a straight line to the data. Finally, we color the points by Class, which indicates whether each observation is benign or malignant:
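
The filtering, coloring, and fitting steps might be combined as in the sketch below. The data are simulated, and the Class labels "benign" and "malignant" are an assumption about how the classes are coded in the real dataset.

```r
library(dplyr)
library(ggplot2)

# Simulated stand-in data (not real measurements); Class labels are assumed
set.seed(42)
toy <- data.frame(
  mean.radius = runif(120, min = 7, max = 25),
  Class = rep(c("benign", "malignant"), each = 60)
)
toy$worst.area <- 50 * toy$mean.radius + rnorm(120, sd = 60)

# Keep only the region where the relation looks linear
filteredToy <- filter(toy, mean.radius < 15)

# Mapping color to Class in the top-level aes() makes geom_smooth
# fit one straight line per class
p <- ggplot(filteredToy, aes(x = mean.radius, y = worst.area, color = Class)) +
  geom_point(alpha = 0.7) +
  geom_smooth(method = "lm")
p
```

Placing color = Class in the top-level aes() is what produces a separate fitted line for each class; mapping it only inside geom_point would give a single line for all points.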

The final visualization plots mean radius versus worst area, distinguishes the two classes by color, and shows a separate fitted line for each class:

Bar chart showing the number of COVID-19 cases in South Carolina:

The next dataset, used for the bar chart, contains the number of COVID-19 cases in South Carolina at the county level:

dataSC <- read_csv("data-sc.csv")

We restrict the visualization to five counties, selected with filter:

filteredDataSC <- filter(dataSC, County == 'Greenville' | County == 'Pickens' | County == 'Lexington' | County == 'Aiken' | County == 'Oconee')

We calculate the percentage of cases in each filtered county as:

mutatedFilteredDataSC <- mutate(filteredDataSC, pct = Cases / sum(Cases), pctlabel = paste0(round(pct*100), "%"))

Finally, we place the percentage labels on top of the bars and sort the counties in descending order of their number of cases:
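
One way to draw such a chart is sketched below. The column names County and Cases follow the code above, but the case counts are made up so that the sketch runs on its own, and the styling choices are assumptions.

```r
library(dplyr)
library(ggplot2)

# Made-up case counts, used only so that the sketch is runnable
toy <- data.frame(
  County = c("Greenville", "Pickens", "Lexington", "Aiken", "Oconee"),
  Cases  = c(500, 100, 200, 150, 50)
)
toy <- mutate(toy,
              pct = Cases / sum(Cases),
              pctlabel = paste0(round(pct * 100), "%"))

p <- ggplot(toy, aes(x = reorder(County, -Cases), y = Cases)) +  # descending order
  geom_col(fill = "steelblue") +
  geom_text(aes(label = pctlabel), vjust = -0.25) +  # label just above each bar
  labs(x = "County", y = "Number of cases")
p
```

Here reorder(County, -Cases) sorts the bars from the largest to the smallest count, and a negative vjust lifts each label above its bar.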

The final visualization looks like this:

Pie chart showing the number of COVID-19 cases in South Carolina:

The bar chart above can be converted to a pie chart by mapping each county's share of the cases to an angle:
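
A common ggplot2 idiom for this is to stack the shares in a single bar and switch to polar coordinates, as in the sketch below. The case counts are again made up, and the notebook may compute the label positions differently.

```r
library(dplyr)
library(ggplot2)

# Made-up case counts, used only so that the sketch is runnable
toy <- data.frame(
  County = c("Greenville", "Pickens", "Lexington", "Aiken", "Oconee"),
  Cases  = c(500, 100, 200, 150, 50)
)
toy <- mutate(toy,
              pct = Cases / sum(Cases),
              pctlabel = paste0(round(pct * 100), "%"))

p <- ggplot(toy, aes(x = "", y = pct, fill = County)) +
  geom_col(width = 1, color = "white") +             # one stacked bar
  geom_text(aes(label = pctlabel),
            position = position_stack(vjust = 0.5)) + # label at slice center
  coord_polar(theta = "y") +                          # y value becomes the angle
  theme_void()
p
```

With coord_polar(theta = "y"), the height of each stacked segment is drawn as an angle, so each county's share becomes a slice of the circle.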

The final pie chart is:

Choropleth map of COVID-19 cases across the US and South Carolina:

The county-level dataset of COVID-19 cases across the US is stored in the us-covid-19.csv file:

dataUS <- read_csv("us-covid-19.csv")

The state_choropleth function from the choroplethr package expects a data frame with two columns: region, which holds the state names, and value, which in our example holds the number of COVID-19 cases. We therefore aggregate the data by state, summing the values over all counties in each state:

Finally, the map is plotted by using this command:
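
The aggregation and plotting steps described above might look like the sketch below. It is an illustration under stated assumptions: the rows of toyUS are made up, the column name cases is hypothetical (only county and state appear in this section), and the map is drawn only if the choroplethr package is installed.

```r
library(dplyr)

# Made-up rows standing in for us-covid-19.csv; the column name `cases`
# is an assumption, since this section only shows `county` and `state`.
toyUS <- data.frame(
  county = c("Greenville", "Pickens", "Fulton", "Wake"),
  state  = c("South Carolina", "South Carolina", "Georgia", "North Carolina"),
  cases  = c(500, 100, 300, 250)
)

# Sum the county values within each state and build the two columns
# that state_choropleth expects: region (lowercase state name) and value
stateData <- toyUS %>%
  group_by(state) %>%
  summarize(value = sum(cases)) %>%
  mutate(region = tolower(state)) %>%
  select(region, value)

# Draw the map only when choroplethr is available
if (requireNamespace("choroplethr", quietly = TRUE)) {
  library(choroplethr)
  print(state_choropleth(stateData,
                         title  = "COVID-19 cases by state",
                         legend = "Cases"))
}
```

With only a few states in the toy data, choroplethr is expected to warn about the missing regions; with the full dataset every state receives a value.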

The final choropleth visualization is:

Next, we load the county lookup table from the choroplethrMaps package and keep only the South Carolina counties:

data(county.regions,
     package = "choroplethrMaps")

region <- county.regions %>%
  filter(state.name == "south carolina")

The US data is then restricted to South Carolina, with county names converted to lowercase so that they match the lookup table, and joined with the county regions:

dataUS$county <- tolower(dataUS$county)

dataUSSC <- filter(dataUS, state == "South Carolina")

plotdata <- inner_join(dataUSSC,
                       region,
                       by = c("county" = "county.name"))

Now, we plot the number of cases of COVID-19 across different counties of South Carolina as:
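
A sketch of the county-level map is given below, assuming the county_choropleth function from choroplethr is used with its state_zoom argument. The case counts in toySC are made up, the column name cases is hypothetical, and the map is drawn only if choroplethr and choroplethrMaps are installed.

```r
library(dplyr)

# Made-up county counts; `cases` is an assumed column name
toySC <- data.frame(
  county = c("greenville", "pickens", "lexington", "aiken", "oconee"),
  cases  = c(500, 100, 200, 150, 50)
)

if (requireNamespace("choroplethr", quietly = TRUE) &&
    requireNamespace("choroplethrMaps", quietly = TRUE)) {
  library(choroplethr)

  # County lookup table restricted to South Carolina
  data(county.regions, package = "choroplethrMaps")
  region <- filter(county.regions, state.name == "south carolina")

  # The join adds the numeric `region` code; county_choropleth also
  # needs a column named `value`
  plotdata <- inner_join(toySC, region, by = c("county" = "county.name"))
  plotdata <- mutate(plotdata, value = cases)

  print(county_choropleth(plotdata,
                          state_zoom = "south carolina",
                          title  = "COVID-19 cases by county",
                          legend = "Cases"))
}
```

The state_zoom argument limits the map to one state, so counties outside South Carolina are not drawn.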

The final visualization looks like this:

Line plot for time-dependent data:

Line plots are well suited to time-dependent datasets. The example used here is the personal savings rate in the US from 1967 to 2015, plotted against time with the geom_line function:
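
A minimal sketch of such a line plot follows. It assumes the data are the economics dataset that ships with ggplot2, which contains the US personal savings rate (psavert) by month over the same period; the notebook may load the data differently.

```r
library(ggplot2)

# `economics` is bundled with ggplot2: `date` is the month and
# `psavert` is the personal savings rate in percent
p <- ggplot(economics, aes(x = date, y = psavert)) +
  geom_line() +
  labs(title = "Personal savings rate",
       x = "Date",
       y = "Personal savings rate (%)")
p
```

Because date is stored as a Date column, ggplot2 picks a date axis automatically.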

The final visualization of the line plot is:

Statistical models: correlation plot, linear and logistic regressions:

A correlation plot shows the pairwise correlations between variables. For the Breast Cancer dataset, the correlation matrix of the numeric columns is calculated as:

df <- dplyr::select_if(BreastCancer, is.numeric)

r <- cor(df, use="complete.obs")

round(r,2)

Note that the correlation matrix is symmetric. As a result, only its lower triangular portion is visualized:

ggcorrplot(r,
           hc.order = TRUE,
           type = "lower",
           lab = FALSE)

The next example applies linear regression to the Breast Cancer dataset to model texture error as a function of mean radius, mean texture, worst area, worst perimeter, mean perimeter, concavity error, and worst radius:

breastcancer_lm <- lm(texture.error ~ mean.radius + mean.texture + worst.area +
                        worst.perimeter + mean.perimeter + concavity.error +
                        worst.radius,
                      data = BreastCancer)

Finally, we can plot the fitted relation between texture error and a single predictor, for example mean radius. The plot indicates that texture error decreases as mean radius increases when all other predictors are held constant:
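
One way to draw such a conditional plot is with the visreg package, which this section also uses for the logistic model; that the notebook uses it here as well is an assumption. The sketch below fits a reduced model with two predictors on simulated data, so its numbers say nothing about the real dataset, and the plot is drawn only if visreg is installed.

```r
# Simulated stand-in data (not real measurements)
set.seed(1)
toy <- data.frame(mean.radius  = runif(80, min = 7, max = 25),
                  mean.texture = runif(80, min = 10, max = 35))
toy$texture.error <- 2 - 0.03 * toy$mean.radius +
  0.02 * toy$mean.texture + rnorm(80, sd = 0.3)

# Reduced model with two predictors, for illustration only
toy_lm <- lm(texture.error ~ mean.radius + mean.texture, data = toy)

# Slope for mean.radius with mean.texture held fixed
radius_slope <- coef(toy_lm)["mean.radius"]
radius_slope

# Conditional plot: fitted texture error versus mean radius
if (requireNamespace("visreg", quietly = TRUE)) {
  library(visreg)
  library(ggplot2)
  print(visreg(toy_lm, "mean.radius", gg = TRUE) +
          labs(x = "Mean Radius", y = "Texture Error"))
}
```

For the model in this section, the corresponding call would pass breastcancer_lm in place of toy_lm.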

Logistic regression is suitable for predicting a binary variable from continuous predictors. In the Breast Cancer dataset, the Class column, which marks each observation as benign or malignant, is such a binary variable. It is first converted to 0 and 1:

BreastCancer$Class <- factor(BreastCancer$Class)

BreastCancer$Class <- as.numeric(BreastCancer$Class)

BreastCancer <- mutate(BreastCancer, Class = Class - 1)
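
The conversion above relies on factor levels being sorted alphabetically by default. The short sketch below shows the effect on a toy vector; the labels "benign" and "malignant" are an assumption about how Class is coded in the dataset.

```r
# Toy vector with assumed class labels
Class <- c("benign", "malignant", "malignant", "benign")

ClassFactor <- factor(Class)
levels(ClassFactor)   # "benign" "malignant" (alphabetical order)

# as.numeric returns the level index (1 or 2); subtracting 1 gives 0 or 1
ClassNumeric <- as.numeric(ClassFactor) - 1
ClassNumeric          # 0 1 1 0
```

With these labels, benign is coded as 0 and malignant as 1, so the fitted probabilities refer to the malignant class.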


The logistic regression model is then fitted with the glm function:

breastcancer_glm <- glm(Class ~ mean.radius + mean.texture + worst.area +
                          worst.perimeter + mean.perimeter + concavity.error +
                          worst.radius,
                        family = "binomial",
                        data = BreastCancer)

The predicted probability of Class is then plotted against mean radius while all other predictors are held constant:

visreg(breastcancer_glm, "mean.radius",
       gg = TRUE,
       scale = "response") +
  labs(y = "Prob(Class)",
       x = "Mean Radius")

The receiver operating characteristic (ROC) curve, which summarizes how well the logistic regression model separates the two classes, is plotted with geom_roc from the plotROC package:

df <- data.frame(Class = BreastCancer$Class, Prob = breastcancer_glm$fitted.values)

ggplot(df, aes(d = Class, m = Prob)) + geom_roc()

3D scatterplot:

3D scatterplots are useful for showing three variables at once. The following example plots mean radius, mean area, and worst radius against each other, with the points colored by Class:

with(BreastCancer, {
  scatterplot3d(x = mean.radius,
                y = mean.area,
                z = worst.radius,
                # color by class: Class was recoded to 0/1 above, and color 0
                # is the background color, so shift the codes to 1 and 2
                color = Class + 1,
                # filled circles
                pch = 19,
                # lines to the horizontal plane
                type = "h",
                main = "3-D Scatterplot Breast Cancer Data",
                xlab = "Mean Radius",
                ylab = "Mean Area",
                zlab = "Worst Radius")
})

Scatterplot matrix:

A scatterplot matrix shows several variables plotted against each other in a single visualization. Here we choose mean radius, mean perimeter, worst area, and worst perimeter from the Breast Cancer dataset:

df <- select(BreastCancer, mean.radius, mean.perimeter, worst.area, worst.perimeter)
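
The ggpairs call in this section refers to two helper functions, my_scatter and my_density, that are not defined here. A plausible sketch of them is given below; the bodies are assumptions that follow the (data, mapping, ...) signature GGally expects for custom panel functions, and the notebook's own definitions may differ.

```r
library(ggplot2)

# Lower panels: scatterplot with a fitted straight line (assumed styling)
my_scatter <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_point(alpha = 0.5, color = "cornflowerblue") +
    geom_smooth(method = "lm", se = FALSE, ...)
}

# Diagonal panels: density curve of a single variable (assumed styling)
my_density <- function(data, mapping, ...) {
  ggplot(data = data, mapping = mapping) +
    geom_density(alpha = 0.5, fill = "cornflowerblue", ...)
}
```

Each helper receives the data and the aesthetic mapping for one panel and returns a ggplot object, which is what the lower and diag arguments of ggpairs expect.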

The scatterplot matrix, together with the fitted regression lines and the pairwise correlations, is then drawn in a single visualization with ggpairs from the GGally package:

ggpairs(df,
        lower = list(continuous = my_scatter),
        diag = list(continuous = my_density)) +
  labs(title = "Breast Cancer dataset: radius, perimeter, and area") +
  theme_bw()

The final notebook to reproduce these visualizations is provided here: https://colab.research.google.com/drive/1oWmDt341vjpZC21M4I3vK3QTGNdjUz_O?usp=sharing