Data preparation using the R programming language

How to initialize a blank R notebook:

To open a new notebook with an R runtime in Google Colab, use this link: https://colab.research.google.com/notebook#create=true&language=r

Loading R libraries:

To import a dataset into R and then manipulate it and extract information from it, we need to load three libraries:

  1. readr: reads CSV files into R data frames.

  2. stringr: helps manipulate the column names of the loaded data frame.

  3. dplyr: helps manipulate the dataset and summarize its information.
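
A minimal sketch of loading the three libraries is shown below. It assumes the packages are already installed in the runtime; if one is missing, the commented install.packages line can be run first.

```r
# Uncomment if a package is missing from the runtime:
# install.packages(c("readr", "stringr", "dplyr"))

library(readr)    # read_csv
library(stringr)  # str_replace_all
library(dplyr)    # select, filter, mutate, summarize

# Confirm that all three packages are attached to the session
print(c("package:readr", "package:stringr", "package:dplyr") %in% search())
```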

Importing data into the R notebook:

To make a file available to the R notebook, click Files, then Upload to session storage, and choose your file. In this section the file is Breast-Cancer-Data.csv.

Loading data into the R notebook with read_csv:

To read a CSV (comma-separated values) file into an R data frame, we use the read_csv function from the readr library.
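
The call can be sketched as follows. Because the columns of Breast-Cancer-Data.csv are not listed in this text, the example writes a tiny stand-in CSV with made-up column names and values (all hypothetical) so that it runs on its own. In the notebook, the path would simply be "Breast-Cancer-Data.csv".

```r
library(readr)

# Stand-in for the uploaded file: hypothetical columns and values,
# written to a temporary path so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "Patient ID,Tumor Size,Age",
  "1,2.5,45",
  "2,3.1,52",
  "3,,60"
), csv_path)

# In the notebook this would be: read_csv("Breast-Cancer-Data.csv")
df <- read_csv(csv_path, show_col_types = FALSE)

print(dim(df))    # number of rows and columns
print(names(df))  # column names; spaces are kept as they appear in the file
```

The empty field in the last row is read as NA, which matters later when taking means or extremes.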

Correcting column names to avoid spaces in R:

Column names that contain spaces are awkward to work with in R: read_csv keeps them exactly as they appear in the file, and every such name must then be wrapped in backticks whenever it is referenced. Replacing the spaces with dots avoids this, along with the errors caused by forgetting the backticks. The stringr library helps find the spaces and replace them with dots.
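
One way to do the replacement is with str_replace_all applied to the vector of column names. The sketch below uses a small hypothetical data frame, since the actual column names of the dataset are not shown here.

```r
library(stringr)

# Hypothetical data frame whose column names contain spaces;
# check.names = FALSE stops data.frame() from altering them itself.
df <- data.frame(
  `Patient ID` = 1:3,
  `Tumor Size` = c(2.5, 3.1, 4.0),
  check.names = FALSE
)

# fixed(" ") matches a literal space rather than a regular expression
names(df) <- str_replace_all(names(df), fixed(" "), ".")

print(names(df))
```

Note that str_replace (without _all) would change only the first space in each name, so a name with two or more spaces would still contain one.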

Select, filter, and mutate the dataset:

  • Select: the select function from the dplyr package picks specific columns by name.

  • Filter: the filter function keeps only the rows whose column values satisfy one or more conditions.

  • Mutate: the mutate function creates new variables computed from existing columns.
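
The three functions above can be sketched on a small made-up table. The column names and values (Patient.ID, Age, Tumor.Size, Diagnosis) are hypothetical and are not taken from the breast cancer dataset.

```r
library(dplyr)

# Hypothetical data for illustration only
df <- tibble(
  Patient.ID = 1:4,
  Age        = c(45, 52, 60, 38),
  Tumor.Size = c(2.5, 3.1, 4.0, 1.2),
  Diagnosis  = c("M", "B", "M", "B")
)

# select: keep only the named columns
selected <- select(df, Patient.ID, Tumor.Size)

# filter: keep rows meeting every condition (comma-separated conditions are combined with AND)
filtered <- filter(df, Diagnosis == "M", Age > 50)

# mutate: add a new column computed from existing ones
mutated <- mutate(df, Size.Per.Age = Tumor.Size / Age)

print(selected)
print(filtered)
print(mutated)
```

Each function returns a new data frame and leaves df unchanged, so the result has to be assigned to a name to be kept.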

Summarizing the dataset by aggregating columns:

To aggregate a dataset, for example by taking the mean of a column, we use the summarize function. Passing na.rm = TRUE to functions such as mean, max, and min makes them ignore missing values; without it, a single missing value makes the result NA.
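
A minimal sketch of summarize with and without na.rm = TRUE, using a hypothetical Tumor.Size column that contains one missing value:

```r
library(dplyr)

# Hypothetical column with one missing value
df <- tibble(Tumor.Size = c(2.5, NA, 4.0, 1.5))

# na.rm = TRUE drops the NA before each aggregate is computed
overall <- summarize(
  df,
  mean_size = mean(Tumor.Size, na.rm = TRUE),
  max_size  = max(Tumor.Size, na.rm = TRUE),
  min_size  = min(Tumor.Size, na.rm = TRUE)
)

# Without na.rm = TRUE the missing value propagates and the mean is NA
with_na <- summarize(df, mean_size = mean(Tumor.Size))

print(overall)
print(with_na)
```

Here the mean is taken over the three non-missing values only, so the divisor is 3 rather than 4.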

The R notebook to reproduce these results is shared here: https://colab.research.google.com/drive/1bwSeTn4FN6892DIawn31G-V2YIa9zZBO?usp=sharing