Here are 100 clear, practical lines on R (the language of data) for your systematic study and reference:
R is an open-source programming language primarily for statistical computing and graphics.
It was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, in the early 1990s.
R is widely used in data science, bioinformatics, and statistical research.
It is part of the GNU project, and its syntax is similar to the S language.
R is interpreted, not compiled, making it easy to test and iterate on code interactively.
The base R installation includes data manipulation, calculation, and graphical display functions.
R uses vectors as its basic data structure, allowing vectorized operations.
It has data structures including vectors, lists, data frames, and matrices.
Data frames in R are similar to spreadsheets and are used heavily for tabular data.
Factors are used in R to handle categorical data efficiently.
The assignment operator in R is <- but = can also be used.
Comments in R are added using the # symbol.
R has built-in datasets for practice, like mtcars, iris, and airquality.
Functions in R are defined using the function keyword.
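A minimal sketch of these basics: assignment with <-, a vector, vectorized arithmetic, and a user-defined function (scale_to_unit is a made-up name for illustration):

```r
# Assignment with <-, c() combines values into a vector
x <- c(1, 2, 3, 4, 5)
x * 2  # vectorized: multiplies every element

# A user-defined function with a default argument
scale_to_unit <- function(v, center = TRUE) {
  if (center) v <- v - mean(v)    # optionally center on the mean
  v / max(abs(v))                 # rescale into [-1, 1]
}
scale_to_unit(x)
```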
R can be used for descriptive statistics, hypothesis testing, and regression analysis.
Packages extend R’s capabilities, and CRAN hosts thousands of these packages.
Popular packages include dplyr, ggplot2, tidyr, readr, and caret.
You can install packages using install.packages("package_name").
Load a package using library(package_name).
dplyr is used for data manipulation with functions like filter(), select(), and mutate().
ggplot2 is the go-to package for advanced data visualizations in R.
R supports advanced plotting, including scatterplots, bar plots, histograms, and boxplots.
The pipe operator %>% from magrittr allows chaining of functions; since R 4.1, base R also provides a native pipe, |>.
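For instance, a typical dplyr pipeline on the built-in mtcars data (this sketch assumes the dplyr package is installed; the kml column name is invented here for illustration):

```r
library(dplyr)

small_cars <- mtcars %>%
  filter(cyl == 4) %>%                 # keep only 4-cylinder cars
  select(mpg, wt) %>%                  # keep two columns
  mutate(kml = mpg * 0.425144)         # add a km-per-litre column
head(small_cars)
```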
Data can be read into R using read.csv(), read.table(), or the readr package for faster I/O.
You can write data out using write.csv() or write.table().
Missing data in R is represented by NA, and handled with functions like is.na() and na.omit().
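A quick sketch of the common NA-handling idioms:

```r
v <- c(4, NA, 7, NA, 10)
is.na(v)               # logical vector flagging the missing values
sum(is.na(v))          # count of NAs: 2
na.omit(v)             # drops the NAs
mean(v, na.rm = TRUE)  # many functions take na.rm to skip NAs
```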
R supports conditional statements using if, else, and else if.
Loops in R include for, while, and repeat.
However, vectorized operations and apply functions are preferred over loops for efficiency.
apply(), lapply(), sapply(), and tapply() are powerful for applying functions across data structures.
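One small example per member of the family, using base R data only:

```r
m <- matrix(1:6, nrow = 2)            # a 2 x 3 matrix, filled by column
apply(m, 1, sum)                      # apply over rows: row sums 9 and 12
lapply(1:3, function(i) i^2)          # returns a list of squares
sapply(1:3, function(i) i^2)          # same, simplified to a vector
tapply(mtcars$mpg, mtcars$cyl, mean)  # mean mpg within each cylinder group
```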
R is case-sensitive, so x and X are treated as different variables.
You can get help using ?function_name or help(function_name).
RStudio is the most popular IDE for R, providing a clean environment and advanced features.
summary() gives descriptive statistics of data objects.
str() shows the structure of an object, which is useful for exploration.
head() and tail() display the first and last few rows of data, respectively.
R can handle various file formats, including CSV, Excel, and JSON.
The readxl package allows reading Excel files into R easily.
R can connect to SQL databases for querying large datasets using DBI together with a driver package such as RMySQL, RMariaDB, or RPostgres.
R supports statistical models, including linear regression with lm().
Logistic regression can be performed using glm() with family = binomial.
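Both models in miniature, fit on mtcars (the choice of predictors here is illustrative, not prescriptive):

```r
# Linear regression: mpg as a function of weight
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)    # coefficients, R-squared, p-values
coef(fit)       # intercept and (negative) slope

# Logistic regression: transmission type (am is 0/1) from weight and power
logit_fit <- glm(am ~ wt + hp, data = mtcars, family = binomial)
head(predict(logit_fit, type = "response"))  # fitted probabilities
```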
R can produce publication-quality plots with customization options in ggplot2.
Faceting in ggplot2 enables multi-panel plots for subgroup analysis.
tidyr is used to tidy messy data, making it suitable for analysis.
gather() and spread() in tidyr reshape data between long and wide formats; in newer tidyr code they are superseded by pivot_longer() and pivot_wider().
Data cleaning is often performed using dplyr and tidyr together.
R supports string manipulation using the stringr package.
Regular expressions can be used for pattern matching in R with grep(), grepl(), and sub().
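A short sketch of the base-R pattern-matching functions:

```r
fruits <- c("apple", "banana", "cherry", "apricot")
grep("^ap", fruits)     # indices of matches: 1 and 4
grepl("^ap", fruits)    # the same matches as a logical vector
sub("a", "A", fruits)   # replace the first "a" in each string
gsub("a", "A", fruits)  # replace every "a"
</imports>
```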
Date and time manipulation is handled well with the lubridate package.
You can visualize correlation using corrplot or ggcorrplot.
R supports creating reproducible reports using R Markdown.
R Markdown can render outputs as HTML, PDF, and Word documents.
R supports building interactive dashboards with shiny.
Shiny enables the creation of web applications for live data analysis in R.
R has machine learning packages like caret, randomForest, and xgboost.
caret provides a unified interface for training and tuning machine learning models.
R allows cross-validation of models using trainControl() in caret.
Decision trees can be built using rpart in R.
Clustering techniques like k-means and hierarchical clustering are easily implemented in R.
R can perform Principal Component Analysis (PCA) using prcomp().
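A minimal PCA on the four numeric iris columns (scaling to unit variance, the usual default choice when variables are on different scales):

```r
pca <- prcomp(iris[, 1:4], scale. = TRUE)  # center and scale, then rotate
summary(pca)          # proportion of variance explained per component
head(pca$x[, 1:2])    # scores on the first two principal components
```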
R can interface with Python using the reticulate package.
It can also call C, C++, and Fortran code for performance-critical tasks, for example via the Rcpp package.
R can generate random numbers from various distributions for simulations.
It includes probability distribution functions like dnorm(), pnorm(), qnorm(), and rnorm().
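The d/p/q/r prefixes stand for density, cumulative probability, quantile, and random draws; for the normal distribution:

```r
dnorm(0)       # standard normal density at 0 (about 0.399)
pnorm(1.96)    # P(Z <= 1.96), about 0.975
qnorm(0.975)   # the inverse: about 1.96
set.seed(42)   # make random draws reproducible
rnorm(3, mean = 10, sd = 2)  # three draws from N(10, 2^2)
```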
You can customize plots with themes, colors, and layers in ggplot2.
The plotly package adds interactivity to R plots.
R is used extensively in academic research and by statisticians.
Many government agencies and financial institutions use R for data analysis.
You can create custom functions in R to automate repetitive tasks.
Error handling in R can be done using tryCatch().
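A small sketch of tryCatch() wrapping an unsafe call (safe_log is an invented helper name):

```r
safe_log <- function(x) {
  tryCatch(
    log(x),
    warning = function(w) NA_real_,  # e.g. log of a negative number
    error   = function(e) NA_real_   # e.g. log of a character string
  )
}
safe_log(10)    # the usual result, log(10)
safe_log(-1)    # NA instead of a NaN-with-warning
safe_log("a")   # NA instead of an error
```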
Memory management in R is automatic, but large datasets can require careful handling.
R can be slow with large datasets; packages like data.table improve performance.
data.table is a high-performance version of data frames with faster operations.
R supports functional programming concepts with purrr.
purrr provides tools like map(), map_df(), and map_dbl() for iteration.
R allows unit testing using the testthat package.
Package development in R can be done with devtools.
R’s community is active, with support on Stack Overflow and R-bloggers.
R offers strong visualization capabilities for exploratory data analysis (EDA).
You can calculate correlation with cor() and covariance with cov().
The broom package converts model outputs into tidy data frames for reporting.
R can automate report generation with parameterized reports in R Markdown.
Version control for R projects is typically handled with Git.
R projects can be structured using renv for dependency management.
R is highly extensible, allowing you to write your own packages.
You can deploy Shiny apps to the web using shinyapps.io.
R’s graphic system includes base graphics and grid graphics, with ggplot2 using the latter.
You can create animations in R using the gganimate package.
R can scrape data from the web using rvest and httr.
You can connect to APIs and parse JSON using jsonlite.
R can be used for geospatial analysis with packages like sf and leaflet.
Time series analysis in R is done using the base ts class and packages such as forecast and xts.
ARIMA models can be built with the forecast package.
Seasonal decomposition can be performed using stl() or decompose().
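Both decomposition routines on the built-in monthly AirPassengers series (logging before stl() is a common choice here because the seasonal swings grow with the trend):

```r
# Classical decomposition into trend, seasonal, and random components
plot(decompose(AirPassengers))

# LOESS-based decomposition; "periodic" fixes the seasonal pattern
dec <- stl(log(AirPassengers), s.window = "periodic")
head(dec$time.series)  # seasonal, trend, remainder columns
```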
R is widely taught in universities for data analysis and statistics courses.
R has strong capabilities for financial modeling and risk analysis.
You can export plots from R in multiple formats: PNG, PDF, SVG, and JPEG.
R supports the integration of LaTeX in reports for mathematical notations.
The flexibility and simplicity of R make it ideal for prototyping data analysis workflows.
R is a powerful tool in a data scientist’s toolkit, empowering users to turn data into actionable insights.
Here is a clear, practical list of 10 small projects to code in R for your portfolio, demonstrating data wrangling, analysis, visualization, and applied modeling:
1. Exploratory Data Analysis (EDA) Report
Dataset: mtcars, iris, or Kaggle’s Titanic dataset.
Tasks: Clean the data, visualize distributions, plot correlations, and generate insights.
Skills Demonstrated: Data cleaning, dplyr, ggplot2, basic reporting.
2. Interactive Shiny Dashboard
Create a Shiny app to visualize COVID-19 trends, sales data, or weather data.
Features: User inputs for filters (date ranges, regions), dynamic plots, summary tables.
Skills Demonstrated: Shiny, reactive programming, interactive visualization.
3. Time Series Forecasting
Use historical stock prices or temperature data to forecast future values.
Apply ARIMA or ETS models, and visualize forecasts with confidence intervals.
Skills Demonstrated: forecast, ts, tseries, time series decomposition.
4. Web Scraping and Sentiment Analysis
Scrape Amazon product reviews or Twitter hashtags using rvest or rtweet.
Perform sentiment analysis and visualize positive vs. negative sentiment trends.
Skills Demonstrated: rvest, tidytext, data cleaning, text mining.
5. Customer Segmentation with Clustering
Use a dataset of customer purchasing behavior to segment customers into groups.
Visualize clusters using PCA or 2D scatter plots with cluster labels.
Skills Demonstrated: Clustering, unsupervised learning, data visualization.
6. Interactive Map Visualization
Plot Airbnb listings or crime data on an interactive map using leaflet.
Add filters (price range, property type, crime type) to the interactive map.
Skills Demonstrated: Geospatial analysis, interactive visualization.
7. Predictive Modeling with Regression
Predict house prices (linear regression) or Titanic survival (logistic regression).
Evaluate model performance using accuracy, RMSE, and confusion matrices.
Skills Demonstrated: Model building, evaluation, interpretation.
8. Automated Report Pipeline
Create a reusable R Markdown report pipeline that generates PDF or HTML reports.
Example: A monthly sales report with automated graphs and summaries.
Skills Demonstrated: Reproducible reporting, rmarkdown, automation.
9. Data Cleaning Showcase
Take a messy dataset (with NAs, inconsistent formatting, incorrect types) and clean it thoroughly.
Create a before-and-after summary with visuals of the data improvements.
Skills Demonstrated: dplyr, tidyr, data cleaning best practices.
10. Correlation Analysis and Heatmap
Take a dataset (e.g., health indicators) and generate a correlation heatmap.
Analyze which features are most correlated with the target variable.
Skills Demonstrated: corrplot, EDA, feature selection insights.
Here is clean, beginner-friendly R code for Exploratory Data Analysis (EDA) on the iris dataset for your portfolio, using the tidyverse for a modern workflow:
# Load required libraries (tidyverse already includes ggplot2, dplyr, and tidyr)
library(tidyverse)
library(GGally)
library(reshape2)  # provides melt(), used for the correlation heatmap below

# Load the dataset
data(iris)

# 1. View structure and first few rows
str(iris)
head(iris)

# 2. Summary statistics
summary(iris)

# 3. Check for missing values
colSums(is.na(iris))

# 4. Univariate Analysis: Histograms
iris %>%
  gather(key = "Variable", value = "Value", -Species) %>%
  ggplot(aes(x = Value, fill = Species)) +
  geom_histogram(bins = 30, alpha = 0.6, position = "identity") +
  facet_wrap(~Variable, scales = "free", ncol = 2) +
  theme_minimal() +
  labs(title = "Histograms of Numeric Variables", x = "", y = "Count")

# 5. Boxplots to check distributions by Species
iris %>%
  gather(key = "Variable", value = "Value", -Species) %>%
  ggplot(aes(x = Species, y = Value, fill = Species)) +
  geom_boxplot(alpha = 0.7) +
  facet_wrap(~Variable, scales = "free", ncol = 2) +
  theme_minimal() +
  labs(title = "Boxplots of Numeric Variables by Species", x = "", y = "")

# 6. Pair plot for correlation and scatter relationships
ggpairs(iris, aes(color = Species, alpha = 0.5),
        title = "Pairwise Scatterplot Matrix")

# 7. Correlation heatmap for numeric variables
cor_matrix <- cor(iris %>% select(-Species))
ggplot(melt(cor_matrix), aes(Var1, Var2, fill = value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "blue", high = "red", mid = "white",
                       midpoint = 0, limit = c(-1, 1), space = "Lab",
                       name = "Correlation") +
  theme_minimal() +
  labs(title = "Correlation Heatmap", x = "", y = "") +
  theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1))

# 8. Grouped summary statistics
iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd),
                   .names = "{col}_{fn}")) %>%
  print()

# 9. Simple bar chart: Species count
iris %>%
  count(Species) %>%
  ggplot(aes(x = Species, y = n, fill = Species)) +
  geom_col() +
  theme_minimal() +
  labs(title = "Count of Each Species", y = "Count")

# 10. Save cleaned summary to CSV for reference
summary_df <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)))
write.csv(summary_df, "iris_summary.csv", row.names = FALSE)
✅ Uses built-in iris dataset (safe, accessible, no download required).
✅ Covers loading, structure check, missing value analysis.
✅ Generates histograms, boxplots, pair plots, and correlation heatmap.
✅ Shows grouped summaries by species for insight extraction.
✅ Saves a summary CSV for reporting or dashboard integration.