Data Frames in R
A data frame is one of the most commonly used data structures for data analysis in R. It is a table-like structure, similar to a spreadsheet or SQL table, that stores data in rows and columns. Unlike a matrix, whose elements must all share one type, a data frame can hold a different type of data in each column, which makes it well suited to datasets with mixed data types.
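A data frame can be created using the data.frame() function, where each argument represents a column in the data frame. Each column can be a vector, factor, or other data structure, and all columns must have the same length (number of rows). data.frame() recycles a shorter vector only if its length divides the number of rows evenly, and raises an error otherwise.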
Example:
R
# Create a data frame with different types of data
my_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(95.5, 85.0, 90.5),
  Married = c(TRUE, FALSE, TRUE)
)
# Print the data frame
print(my_data)
The resulting data frame looks like this:
     Name Age Score Married
1   Alice  25  95.5    TRUE
2     Bob  30  85.0   FALSE
3 Charlie  35  90.5    TRUE
Important points about data frames:
Data frames can store different types of variables (e.g., numeric, character, logical).
All columns must have the same number of rows.
Each column is internally a vector (or factor), and the data frame itself is a list of these equal-length columns.
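The points above can be checked directly in R. The following is a minimal sketch, not part of the original example: it builds a separate copy of the data under the hypothetical name demo_data, and sets stringsAsFactors = FALSE explicitly (an assumption made here so that the Name column is character on R versions before 4.0 as well).

```r
# Hypothetical copy of the example data, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(95.5, 85.0, 90.5),
  Married = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Each column keeps its own type
col_types <- sapply(demo_data, class)
print(col_types)

# A data frame is a list of columns under the hood
print(is.list(demo_data))

# Every column has as many elements as the data frame has rows
print(lengths(demo_data))
```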
You can access elements in a data frame using several methods, including indexing by row/column numbers, column names, or both.
Accessing by Column Name:
R
# Access a single column by name
print(my_data$Name) # Prints the 'Name' column
Accessing by Column Index:
R
# Access a column by index
print(my_data[, 2]) # Prints the 2nd column (Age)
Accessing Specific Row and Column:
R
# Access an element from a specific row and column
print(my_data[2, 3]) # Accesses the 2nd row and 3rd column (Score)
Accessing Multiple Rows and Columns:
R
# Access specific rows and columns using a range
print(my_data[1:2, c(1, 3)]) # Rows 1 and 2, columns 'Name' and 'Score'
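The access methods above differ in what they return: some give back a plain vector, others a smaller data frame. As a rough sketch (using a hypothetical copy of the data named demo_data), the difference between [[ ]], single brackets, and the drop argument looks like this:

```r
# Hypothetical copy of the example data, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(95.5, 85.0, 90.5),
  Married = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# [[ ]] extracts a single column as a plain vector, like $
age_vector <- demo_data[["Age"]]
print(class(age_vector))

# Single brackets with one column name keep the data frame structure
age_frame <- demo_data["Age"]
print(class(age_frame))

# drop = FALSE stops [, j] from simplifying a single column to a vector
age_frame_2 <- demo_data[, "Age", drop = FALSE]
print(dim(age_frame_2))
```

Keeping the data frame structure is usually the safer choice when the result is passed to code that expects column names.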
Adding Columns:
You can add a new column to a data frame by simply assigning a new vector to a column name.
R
# Add a new column to the data frame
my_data$Height <- c(5.5, 6.0, 5.8)
print(my_data)
Modifying Existing Columns:
You can modify the values in an existing column by assigning new values.
R
# Modify the 'Age' column
my_data$Age <- c(26, 31, 36)
print(my_data)
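Columns do not have to be retyped in full to be changed. The sketch below, which works on a hypothetical copy named demo_data, shows a vectorized update, a conditional update of a single value, and a column derived from an existing one. The pass threshold of 90 and the Passed column are assumptions made purely for illustration.

```r
# Hypothetical copy of the example data, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(95.5, 85.0, 90.5),
  stringsAsFactors = FALSE
)

# Vectorized update: add one year to every age
demo_data$Age <- demo_data$Age + 1

# Conditional update: change the score only in the row where Name is "Bob"
demo_data$Score[demo_data$Name == "Bob"] <- 87.0

# Derived column: TRUE where the score reaches the (assumed) threshold of 90
demo_data$Passed <- demo_data$Score >= 90

print(demo_data)
```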
Adding Rows:
To add rows, you can use the rbind() function (row bind). The new row must have the same column names as the existing data frame.
R
# Add a new row to the data frame
new_row <- data.frame(Name = "David", Age = 40, Score = 88.5, Married = FALSE, Height = 5.9)
my_data <- rbind(my_data, new_row)
print(my_data)
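For data frames, rbind() matches columns by name rather than by position, and it stops with an error if the names differ. The following sketch illustrates both cases on a small hypothetical data frame (demo_data, new_row, and bad_row are names invented for this example).

```r
# Small hypothetical data frame, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob"),
  Age = c(25, 30),
  stringsAsFactors = FALSE
)

# Columns are listed in a different order than in demo_data
new_row <- data.frame(Age = 40, Name = "David", stringsAsFactors = FALSE)
combined <- rbind(demo_data, new_row)
print(combined)

# A row whose column names differ is rejected with an error
bad_row <- data.frame(Name = "Eve", Years = 28, stringsAsFactors = FALSE)
result <- tryCatch(
  rbind(demo_data, bad_row),
  error = function(e) "rbind failed: column names do not match"
)
print(result)
```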
You can use several functions to explore the structure and content of a data frame:
head(): Displays the first few rows of the data frame.
R
head(my_data) # Shows up to the first 6 rows by default
tail(): Displays the last few rows of the data frame.
R
tail(my_data) # Shows up to the last 6 rows by default
str(): Displays the structure of the data frame, including the types of columns and the first few values.
R
str(my_data)
summary(): Provides a summary of each column (e.g., min, max, mean for numeric columns).
R
summary(my_data)
dim(): Returns the dimensions of the data frame (number of rows and columns).
R
dim(my_data) # Returns a vector: c(number of rows, number of columns)
nrow() and ncol(): Return the number of rows and columns in the data frame, respectively.
R
nrow(my_data) # Number of rows
ncol(my_data) # Number of columns
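The exploration functions above accept a few useful arguments. As a small sketch on a hypothetical copy named demo_data, head() and tail() take an n argument to control how many rows are shown, and names() lists the column names:

```r
# Hypothetical copy of the example data, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(95.5, 85.0, 90.5),
  Married = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# n controls how many rows head() and tail() return
first_two <- head(demo_data, n = 2)
last_one <- tail(demo_data, n = 1)
print(first_two)
print(last_one)

# Column names and dimensions
print(names(demo_data))
print(dim(demo_data))
```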
Subsetting Rows and Columns:
You can subset a data frame based on conditions.
Subset by Condition:
R
# Subset rows where Age is greater than 30
subset_data <- my_data[my_data$Age > 30, ]
print(subset_data)
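Condition-based subsetting behaves differently when the column contains missing values. The sketch below uses a hypothetical data frame (demo_data, with one Age deliberately set to NA) to show the issue and two common ways around it, which() and subset().

```r
# Hypothetical data with a missing age, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, NA, 35),
  Married = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# An NA in the logical index produces an extra row filled with NA
with_na <- demo_data[demo_data$Age > 30, ]
print(with_na)

# which() keeps only the positions where the condition is TRUE
clean <- demo_data[which(demo_data$Age > 30), ]
print(clean)

# subset() also drops rows where the condition is NA,
# and conditions can be combined with & (and) or | (or)
clean_married <- subset(demo_data, Age > 30 & Married)
print(clean_married)
```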
Subset by Column Name:
R
# Subset specific columns (Name and Score)
subset_data <- my_data[, c("Name", "Score")]
print(subset_data)
Sorting:
You can sort the data frame by one or more columns using the order() function.
R
# Sort by 'Age' in ascending order
sorted_data <- my_data[order(my_data$Age), ]
print(sorted_data)
# Sort by multiple columns (e.g., by 'Age' then by 'Score')
sorted_data <- my_data[order(my_data$Age, my_data$Score), ]
print(sorted_data)
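The examples above sort in ascending order. A minimal sketch of descending and mixed-direction sorting, again on a hypothetical copy named demo_data, could look like this:

```r
# Hypothetical copy of the example data, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(95.5, 85.0, 90.5),
  Married = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Descending order by Age
by_age_desc <- demo_data[order(demo_data$Age, decreasing = TRUE), ]
print(by_age_desc)

# Mixed directions: Married ascending (FALSE before TRUE),
# then Score descending (negating a numeric column reverses its order)
mixed <- demo_data[order(demo_data$Married, -demo_data$Score), ]
print(mixed)
```

Negation only works for numeric columns, so decreasing = TRUE is the more general option when every key is sorted in the same direction.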
Filtering Using dplyr:
The dplyr package is commonly used to filter and manipulate data frames in a more intuitive way.
Installing and loading dplyr:
R
install.packages("dplyr") # Only needed once per R installation
library(dplyr)
Filtering rows:
R
filtered_data <- my_data %>% filter(Age > 30)
print(filtered_data)
Selecting columns:
R
selected_columns <- my_data %>% select(Name, Score)
print(selected_columns)
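The dplyr verbs shown above can be chained into a single pipeline. The following sketch assumes dplyr is already installed and works on a hypothetical copy named demo_data. It filters rows, sorts them with arrange() and desc(), and then keeps two columns.

```r
library(dplyr)

# Hypothetical copy of the example data, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(95.5, 85.0, 90.5),
  stringsAsFactors = FALSE
)

# Keep people aged 30 or more, sort by Score (highest first),
# then keep only the Name and Score columns
result <- demo_data %>%
  filter(Age >= 30) %>%
  arrange(desc(Score)) %>%
  select(Name, Score)

print(result)
```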
Converting a Data Frame to a Matrix:
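You can convert a data frame to a matrix using the as.matrix() function. Because a matrix holds a single data type, columns of mixed types are coerced to a common type (typically character), so it is usually best to select columns of one type (for example, all numeric) first.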
R
my_matrix <- as.matrix(my_data[, c("Age", "Score")])
print(my_matrix)
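The effect of mixed column types on as.matrix() can be seen by comparing the two cases. This is a small sketch on a hypothetical copy named demo_data:

```r
# Hypothetical copy of the example data, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(95.5, 85.0, 90.5),
  stringsAsFactors = FALSE
)

# Numeric columns only: the result is a numeric matrix
numeric_matrix <- as.matrix(demo_data[, c("Age", "Score")])
print(typeof(numeric_matrix))

# Mixed columns: every value is coerced to character
mixed_matrix <- as.matrix(demo_data)
print(typeof(mixed_matrix))
```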
Converting a Data Frame to a List:
You can convert a data frame to a list using the as.list() function.
R
my_list <- as.list(my_data)
print(my_list)
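After conversion, each column becomes a named element of the list. As a minimal sketch (demo_data and demo_list are hypothetical names), the elements can be accessed by name and the list can be turned back into a data frame with as.data.frame():

```r
# Hypothetical copy of the example data, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  stringsAsFactors = FALSE
)

# Each column becomes one named list element
demo_list <- as.list(demo_data)
print(names(demo_list))
print(demo_list$Age)

# Convert the list back into a data frame
restored <- as.data.frame(demo_list, stringsAsFactors = FALSE)
print(is.data.frame(restored))
```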
Common Uses of Data Frames:
Data Import: Data frames are often used to store data imported from external sources such as CSV files, Excel sheets, or databases.
R
my_data <- read.csv("data.csv")
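Since the file data.csv above is only a placeholder, the following sketch shows a round trip that can be run as is: it writes a hypothetical data frame (demo_data) to a temporary CSV file with write.csv() and reads it back with read.csv(). The temporary path comes from tempfile(), so no file in the working directory is touched.

```r
# Hypothetical copy of the example data, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(95.5, 85.0, 90.5),
  Married = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Write to a temporary file; row.names = FALSE avoids an extra index column
csv_path <- tempfile(fileext = ".csv")
write.csv(demo_data, csv_path, row.names = FALSE)

# Read the file back into a data frame and inspect the column types
imported <- read.csv(csv_path, stringsAsFactors = FALSE)
str(imported)

# Remove the temporary file
unlink(csv_path)
```

Column types are inferred again on import, so it is worth checking them with str() after reading a file.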
Data Manipulation: Data frames are essential for cleaning and transforming data in tasks like filtering, grouping, summarizing, and merging datasets.
Analysis: Data frames are ideal for statistical analysis and machine learning tasks, where rows typically represent observations, and columns represent variables.
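Grouping, summarizing, and merging, mentioned above, can be done in base R with aggregate() and merge(). The following is only a sketch: demo_data is a hypothetical copy of the example data, and the cities data frame, including its City values, is invented for illustration.

```r
# Hypothetical copy of the example data, so this block runs on its own
demo_data <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(25, 30, 35),
  Score = c(95.5, 85.0, 90.5),
  Married = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Group by Married and compute the mean Score of each group
avg_score <- aggregate(Score ~ Married, data = demo_data, FUN = mean)
print(avg_score)

# Invented lookup table to demonstrate merging on a shared column
cities <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  City = c("Paris", "Lima", "Oslo"),
  stringsAsFactors = FALSE
)

# Join the two data frames on the Name column
merged <- merge(demo_data, cities, by = "Name")
print(merged)
```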
Summary:
Data Frames: Table-like structures with columns of different types (numeric, character, logical, etc.).
Access: Use $, [], or [[ ]] to access data.
Operations: Subsetting, sorting, and filtering can be done with base R functions or packages like dplyr.
Flexibility: Data frames are highly versatile and used in almost all data analysis tasks in R.