pandas (Gemini)
Pandas is an open-source data manipulation and analysis library for Python. It provides two core data structures, Series and DataFrame, designed for working with labeled and tabular data.
1. Installation
If you don't have pandas installed, use pip:
Bash
pip install pandas
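To confirm that the installation worked, one quick check is to import the library and print its version. This is only a sanity check; the version string you see depends on your environment.

```python
import pandas as pd

# Printing the version confirms that the import succeeds
print(pd.__version__)
```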
2. Core Data Structures: Series & DataFrame
Series
A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). It's like a single column of a spreadsheet or a SQL table.
Python
import pandas as pd
# Creating a Series
s = pd.Series([10, 20, 30, 40, 50])
print("Series 's':\n", s)
# Series with a custom index
s_indexed = pd.Series([100, 200, 300], index=['a', 'b', 'c'])
print("\nSeries 's_indexed':\n", s_indexed)
# Accessing elements
print("\nElement at index 0:", s[0])
print("Element at label 'b':", s_indexed['b'])
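The labels are what distinguish a Series from a plain array. As a minimal sketch (the names `left` and `right` and their values are illustrative, not part of the examples above), arithmetic between two Series matches elements by label rather than by position:

```python
import pandas as pd

# Two Series whose labels only partially overlap (illustrative values)
left = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
right = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Addition aligns on labels; a label present in only one Series yields NaN
total = left + right
print(total)
```

Labels 'b' and 'c' appear in both Series and are summed, while 'a' and 'd' have no partner and come out as NaN.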
DataFrame
A DataFrame is a two-dimensional labeled data structure with columns of potentially different types. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects that share an index.
Python
# Creating a DataFrame from a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}
df = pd.DataFrame(data)
print("\nDataFrame 'df':\n", df)
# Creating a DataFrame from a list of lists (with columns and index)
df_list = pd.DataFrame(
    [[1, 'Apple'], [2, 'Banana'], [3, 'Orange']],
    columns=['ID', 'Fruit'],
    index=['a', 'b', 'c']
)
print("\nDataFrame 'df_list':\n", df_list)
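The "dictionary of Series" view can be made concrete. In this sketch (the names `ages`, `cities`, and `people` and their values are hypothetical), a DataFrame is built from two Series with different labels, and pandas aligns them on the union of their indexes:

```python
import pandas as pd

# Two Series with partially overlapping labels (illustrative data)
ages = pd.Series([24, 27], index=['alice', 'bob'])
cities = pd.Series(['New York', 'Chicago'], index=['alice', 'charlie'])

# Each Series becomes a column; rows are the union of the labels,
# and any label missing from a Series is filled with NaN
people = pd.DataFrame({'Age': ages, 'City': cities})
print(people)
```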
3. Basic DataFrame Operations
Viewing Data
Python
# Display the first 2 rows (head() returns 5 rows by default)
print("\nFirst 2 rows of 'df':\n", df.head(2))
# Display the last 3 rows
print("\nLast 3 rows of 'df':\n", df.tail(3))
# Column dtypes, non-null counts, and memory usage
print("\nInfo on 'df':")
df.info()
# Summary statistics for the numeric columns
print("\nDescription of 'df':\n", df.describe())
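A few attributes are also handy for a first look at a DataFrame. The sketch below uses a small stand-in frame named `people` (hypothetical, not the `df` above) so that it runs on its own:

```python
import pandas as pd

# Small stand-in DataFrame (illustrative data)
people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})

# (number of rows, number of columns)
print(people.shape)
# Data type of each column
print(people.dtypes)
# Column labels as a plain Python list
print(list(people.columns))
```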
Selection (Column & Row)
Python
# Select a single column (returns a Series)
names = df['Name']
print("\n'Name' column:\n", names)
# Select multiple columns (returns a DataFrame)
name_age = df[['Name', 'Age']]
print("\n'Name' and 'Age' columns:\n", name_age)
# Select rows by label (loc)
row_alice = df.loc[0] # Selects row with index label 0
print("\nRow for Alice (using loc):\n", row_alice)
# Select rows by integer position (iloc)
row_bob = df.iloc[1] # Selects the second row (0-indexed)
print("\nRow for Bob (using iloc):\n", row_bob)
# Select specific rows and columns
subset = df.loc[1:3, ['Name', 'City']]
print("\nSubset (rows 1-3, Name & City):\n", subset)
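One detail worth keeping in mind is that slicing behaves differently in loc and iloc. A minimal sketch, assuming a default integer index and a stand-in frame named `df_demo` (hypothetical):

```python
import pandas as pd

# Stand-in DataFrame with the default index 0..3 (illustrative data)
df_demo = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
})

# loc slices by label and includes BOTH endpoints: labels 1, 2, 3
by_label = df_demo.loc[1:3]

# iloc slices by position and excludes the end, like a Python list: positions 1, 2
by_position = df_demo.iloc[1:3]

print(len(by_label), len(by_position))
```

The same `1:3` slice therefore returns three rows with loc but two rows with iloc.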
Filtering Data
Python
# Filter rows where Age > 25
older_than_25 = df[df['Age'] > 25]
print("\nPeople older than 25:\n", older_than_25)
# Multiple conditions
ny_or_la = df[(df['City'] == 'New York') | (df['City'] == 'Los Angeles')]
print("\nPeople from New York or Los Angeles:\n", ny_or_la)
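When a column is compared against several values, `isin` is usually easier to read than chaining `|`. Conditions can also be combined with `&` (and), with each comparison wrapped in parentheses. A sketch with a stand-in frame named `people` (hypothetical):

```python
import pandas as pd

# Stand-in DataFrame (illustrative data)
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston'],
})

# Membership test against a list of values
in_ny_or_la = people['City'].isin(['New York', 'Los Angeles'])
coastal = people[in_ny_or_la]

# Combine two boolean masks with & (element-wise "and")
older_coastal = people[(people['Age'] > 25) & in_ny_or_la]

print(coastal)
print(older_coastal)
```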
Adding/Modifying/Deleting Columns
Python
# Add a new column
df['Gender'] = ['F', 'M', 'M', 'M']
print("\nDataFrame after adding 'Gender':\n", df)
# Modify an existing column
df['Age'] = df['Age'] + 1
print("\nDataFrame after incrementing 'Age':\n", df)
# Delete a column
df_no_city = df.drop('City', axis=1) # axis=1 for columns
print("\nDataFrame after dropping 'City':\n", df_no_city)
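Note that `drop` returns a new DataFrame and leaves the original unchanged. The sketch below illustrates this with a stand-in frame named `people`, and also shows `assign`, which adds a column without modifying the original; the `IsAdult` column and the threshold of 18 are made up for the example:

```python
import pandas as pd

# Stand-in DataFrame (illustrative data)
people = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})

# drop(columns=...) is equivalent to drop(..., axis=1) and returns a copy
trimmed = people.drop(columns=['Age'])

# assign() returns a new DataFrame with the extra column
with_flag = people.assign(IsAdult=people['Age'] >= 18)

print(list(people.columns))     # original is untouched
print(list(trimmed.columns))
print(list(with_flag.columns))
```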
4. Handling Missing Data
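Pandas uses NaN (Not a Number) to represent missing values in floating-point columns. None, NaT (for datetimes), and pd.NA (for nullable dtypes) are also treated as missing.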
Python
import numpy as np
df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12]
})
print("\nDataFrame with missing values:\n", df_missing)
# Check for missing values
print("\nMissing values:\n", df_missing.isnull())
# Drop rows with any missing values
df_cleaned_rows = df_missing.dropna()
print("\nDataFrame after dropping rows with NaNs:\n", df_cleaned_rows)
# Fill missing values
df_filled = df_missing.fillna(0)
print("\nDataFrame after filling NaNs with 0:\n", df_filled)
# Fill each column's missing values with that column's own mean
df_filled_mean = df_missing.fillna(df_missing.mean())
print("\nDataFrame after filling NaNs with column means:\n", df_filled_mean)
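In practice it often helps to count the missing values per column first, then choose a strategy column by column. The following is a sketch only; the frame `df_gaps` and the fill choices (0 for 'A', the median for 'B') are arbitrary assumptions for illustration:

```python
import numpy as np
import pandas as pd

# Stand-in DataFrame with gaps (illustrative data)
df_gaps = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12],
})

# Number of missing values in each column
missing_per_column = df_gaps.isna().sum()
print(missing_per_column)

# A dict passed to fillna applies a different fill value per column
filled = df_gaps.fillna({'A': 0, 'B': df_gaps['B'].median()})

# Drop only the rows where column 'A' is missing
rows_with_a = df_gaps.dropna(subset=['A'])
```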
5. Grouping and Aggregation
Python
# Group by 'Gender' and calculate the mean age
mean_age_by_gender = df.groupby('Gender')['Age'].mean()
print("\nMean age by gender:\n", mean_age_by_gender)
# Count people per city (value_counts() sorts by frequency, descending)
city_counts = df['City'].value_counts()
print("\nPeople count by city:\n", city_counts)
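A group can also be summarized with several aggregations at once by passing a list of function names to `agg`. A minimal sketch with a stand-in frame named `people` (hypothetical data):

```python
import pandas as pd

# Stand-in DataFrame (illustrative data)
people = pd.DataFrame({
    'Gender': ['F', 'M', 'M', 'M'],
    'Age': [25, 28, 23, 33],
})

# One row per group, one column per aggregation
summary = people.groupby('Gender')['Age'].agg(['mean', 'min', 'max', 'count'])
print(summary)
```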
6. Reading/Writing Data
Pandas supports various file formats.
Python
# To CSV
df.to_csv('my_data.csv', index=False) # index=False to prevent writing DataFrame index
# From CSV
# df_from_csv = pd.read_csv('my_data.csv')
# print("\nDataFrame read from CSV:\n", df_from_csv)
# To Excel (writing .xlsx files requires an engine such as openpyxl)
# pip install openpyxl
# df.to_excel('my_data.xlsx', index=False)
# From Excel
# df_from_excel = pd.read_excel('my_data.xlsx')
# print("\nDataFrame read from Excel:\n", df_from_excel)
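To try a write-then-read round trip without touching the file system, `to_csv` and `read_csv` also accept file-like objects. This sketch uses an in-memory `io.StringIO` buffer; the names `original`, `buffer`, and `restored` are illustrative:

```python
import io

import pandas as pd

# Stand-in DataFrame (illustrative data)
original = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [24, 27]})

# Write CSV text into an in-memory buffer instead of a file
buffer = io.StringIO()
original.to_csv(buffer, index=False)

# Rewind the buffer and read the CSV back into a new DataFrame
buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```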
7. Important Functions for Data Exploration
Python
print("\nUnique values in 'City':", df['City'].unique())
print("Number of unique values in 'City':", df['City'].nunique())
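Beyond listing the unique values, `value_counts` shows how often each one occurs, and `normalize=True` turns the counts into proportions. A short sketch with a made-up Series named `cities`:

```python
import pandas as pd

# Stand-in Series with a repeated value (illustrative data)
cities = pd.Series(['New York', 'Chicago', 'New York', 'Houston'])

# Absolute frequency of each value
counts = cities.value_counts()

# Relative frequency (proportions that sum to 1)
shares = cities.value_counts(normalize=True)

print(counts)
print(shares)
```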