Pandas

pandas (Gemini)

pandas is an open-source data manipulation and analysis library for Python. Its two core data structures, Series and DataFrame, are built for working with labeled, tabular data.

1. Installation

If you don't have pandas installed, use pip:

Bash

pip install pandas

2. Core Data Structures: Series & DataFrame

Series

A Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floats, Python objects, etc.). It's like a single column of a spreadsheet or a SQL table.

Python

import pandas as pd

# Creating a Series

s = pd.Series([10, 20, 30, 40, 50])

print("Series 's':\n", s)

# Series with a custom index

s_indexed = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

print("\nSeries 's_indexed':\n", s_indexed)

# Accessing elements by index label (s uses the default integer labels 0 to 4)

print("\nElement at index 0:", s[0])

print("Element at label 'b':", s_indexed['b'])
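As a small illustrative sketch of how labels drive Series behavior, the example below contrasts label-based access (.loc) with position-based access (.iloc) and shows that arithmetic between two Series matches values by label rather than by position. The second Series, other, is made up for this illustration.

```python
import pandas as pd

s_indexed = pd.Series([100, 200, 300], index=['a', 'b', 'c'])

# .loc looks up by index label, .iloc by integer position
by_label = s_indexed.loc['b']
by_position = s_indexed.iloc[1]
print(by_label, by_position)

# Arithmetic aligns on labels: 'a' is added to 'a', regardless of order
other = pd.Series([1, 2, 3], index=['c', 'b', 'a'])  # hypothetical data
total = s_indexed + other
print(total)
```

Using .loc and .iloc explicitly avoids ambiguity when the index labels are themselves integers.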

DataFrame

A DataFrame is a two-dimensional labeled data structure whose columns can each have a different type. You can think of it as a spreadsheet, a SQL table, or a dictionary of Series objects that share one index.

Python

# Creating a DataFrame from a dictionary

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
}

df = pd.DataFrame(data)

print("\nDataFrame 'df':\n", df)

# Creating a DataFrame from a list of lists (with columns and index)

df_list = pd.DataFrame(
    [[1, 'Apple'], [2, 'Banana'], [3, 'Orange']],
    columns=['ID', 'Fruit'],
    index=['a', 'b', 'c']
)

print("\nDataFrame 'df_list':\n", df_list)
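To make the "dictionary of Series" description concrete, here is a minimal sketch that builds a DataFrame from two Series sharing an index and then pulls one column back out as a Series. The variable names and the index labels 'x' and 'y' are chosen only for this illustration.

```python
import pandas as pd

# Two Series that share the same index labels (hypothetical data)
names = pd.Series(['Alice', 'Bob'], index=['x', 'y'])
ages = pd.Series([24, 27], index=['x', 'y'])

# Each dictionary key becomes a column; the shared index becomes the row index
people = pd.DataFrame({'Name': names, 'Age': ages})
print(people)

# Selecting a column returns a Series again, with its own dtype
age_column = people['Age']
print(type(age_column).__name__)
print(people.dtypes)
```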

3. Basic DataFrame Operations

Viewing Data

Python

# Display the first 2 rows (head() shows 5 rows by default)

print("\nFirst 2 rows of 'df':\n", df.head(2))

# Display the last 3 rows

print("\nLast 3 rows of 'df':\n", df.tail(3))

# Show column dtypes, non-null counts, and memory usage

print("\nInfo on 'df':")

df.info()

# Get summary statistics for the numeric columns

print("\nDescription of 'df':\n", df.describe())
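A few attributes are often checked alongside head() and describe(). The sketch below rebuilds the same example data so it runs on its own, then prints the shape, column names, and dtypes. It also shows that describe() covers only the numeric columns by default.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column labels
print(df.dtypes)         # dtype of each column

# By default describe() summarizes numeric columns only ('Age' here)
summary = df.describe()
print(summary)
```

Passing include='all' to describe() also reports on the non-numeric columns.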

Selection (Column & Row)

Python

# Select a single column (returns a Series)

names = df['Name']

print("\n'Name' column:\n", names)

# Select multiple columns (returns a DataFrame)

name_age = df[['Name', 'Age']]

print("\n'Name' and 'Age' columns:\n", name_age)

# Select rows by label (loc)

row_alice = df.loc[0] # Selects row with index label 0

print("\nRow for Alice (using loc):\n", row_alice)

# Select rows by integer position (iloc)

row_bob = df.iloc[1] # Selects the second row (0-indexed)

print("\nRow for Bob (using iloc):\n", row_bob)

# Select specific rows and columns (loc slices include both endpoints, so 1:3 selects labels 1, 2, and 3)

subset = df.loc[1:3, ['Name', 'City']]

print("\nSubset (rows 1-3, Name & City):\n", subset)
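One detail that is easy to miss: a slice with loc includes its end label, while a slice with iloc excludes its end position, as ordinary Python slicing does. A minimal sketch on the same example data:

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# Label-based slice: labels 1, 2, and 3 are all included
by_label = df.loc[1:3, ['Name', 'City']]

# Position-based slice: positions 1 and 2 only (3 is excluded)
by_position = df.iloc[1:3, [0, 2]]

print(len(by_label), len(by_position))
```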

Filtering Data

Python

# Filter rows where Age > 25

older_than_25 = df[df['Age'] > 25]

print("\nPeople older than 25:\n", older_than_25)

# Multiple conditions

ny_or_la = df[(df['City'] == 'New York') | (df['City'] == 'Los Angeles')]

print("\nPeople from New York or Los Angeles:\n", ny_or_la)
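As a possible alternative to chaining == comparisons with |, isin() tests membership in a list of values. The sketch below also combines two conditions with &. Each condition needs its own parentheses because & and | bind more tightly than comparison operators.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# Membership test: equivalent to (City == 'New York') | (City == 'Los Angeles')
ny_or_la = df[df['City'].isin(['New York', 'Los Angeles'])]

# Both conditions must hold (logical AND)
older_in_la = df[(df['Age'] > 25) & (df['City'] == 'Los Angeles')]

print(ny_or_la)
print(older_in_la)
```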

Adding/Modifying/Deleting Columns

Python

# Add a new column

df['Gender'] = ['F', 'M', 'M', 'M']

print("\nDataFrame after adding 'Gender':\n", df)

# Modify an existing column

df['Age'] = df['Age'] + 1

print("\nDataFrame after incrementing 'Age':\n", df)

# Delete a column

df_no_city = df.drop('City', axis=1) # axis=1 for columns

print("\nDataFrame after dropping 'City':\n", df_no_city)
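Note that drop() returns a new DataFrame and leaves the original untouched, whereas assigning to df['column'] changes df in place. The sketch below demonstrates this, and also uses assign(), which adds a column without modifying the original. The IsAdult column is invented for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'City': ['New York', 'Los Angeles', 'Chicago', 'Houston']
})

# drop() returns a copy; columns='City' is equivalent to drop('City', axis=1)
df_no_city = df.drop(columns='City')
print('City' in df.columns, 'City' in df_no_city.columns)

# assign() also returns a new DataFrame (IsAdult is a hypothetical column)
with_flag = df.assign(IsAdult=df['Age'] >= 18)
print(with_flag)
```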

4. Handling Missing Data

pandas primarily uses NaN (Not a Number) to represent missing numeric values. None and NaT (for datetimes) are also treated as missing.

Python

import numpy as np

df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12]
})

print("\nDataFrame with missing values:\n", df_missing)

# Check for missing values

print("\nMissing values:\n", df_missing.isnull())

# Drop rows with any missing values

df_cleaned_rows = df_missing.dropna()

print("\nDataFrame after dropping rows with NaNs:\n", df_cleaned_rows)

# Fill missing values

df_filled = df_missing.fillna(0)

print("\nDataFrame after filling NaNs with 0:\n", df_filled)

# Fill each column's missing values with that column's own mean

df_filled_mean = df_missing.fillna(df_missing.mean())

print("\nDataFrame after filling NaNs with column means:\n", df_filled_mean)
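Two related patterns are sketched below under the same example data: counting missing values per column, and passing a dictionary to fillna() so each column gets its own fill value. The choice of the mean for 'A' and 0 for 'B' is arbitrary and only for illustration.

```python
import numpy as np
import pandas as pd

df_missing = pd.DataFrame({
    'A': [1, 2, np.nan, 4],
    'B': [5, np.nan, 7, 8],
    'C': [9, 10, 11, 12]
})

# isna() gives a boolean mask; summing it counts missing values per column
missing_per_column = df_missing.isna().sum()
print(missing_per_column)

# A dict maps each column name to its own fill value (illustrative choices)
filled = df_missing.fillna({'A': df_missing['A'].mean(), 'B': 0})
print(filled)
```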

5. Grouping and Aggregation

Python

# Group by 'Gender' and calculate the mean age

mean_age_by_gender = df.groupby('Gender')['Age'].mean()

print("\nMean age by gender:\n", mean_age_by_gender)

# Count people per city (value_counts() is a shortcut similar to groupby('City').size())

city_counts = df['City'].value_counts()

print("\nPeople count by city:\n", city_counts)
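A grouped column can be summarized with several aggregations at once through agg(). The sketch below rebuilds a small version of the example data with the original ages (before any increment) so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'David'],
    'Age': [24, 27, 22, 32],
    'Gender': ['F', 'M', 'M', 'M']
})

# One row per group, one column per aggregation
stats = df.groupby('Gender')['Age'].agg(['mean', 'min', 'max', 'count'])
print(stats)
```

The result is a DataFrame indexed by the group key, so stats.loc['M', 'mean'] retrieves a single value.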

6. Reading/Writing Data

pandas reads and writes many file formats, including CSV, Excel, JSON, and Parquet. The examples below cover CSV and Excel.

Python

# To CSV

df.to_csv('my_data.csv', index=False) # index=False to prevent writing DataFrame index

# From CSV

# df_from_csv = pd.read_csv('my_data.csv')

# print("\nDataFrame read from CSV:\n", df_from_csv)

# To Excel (writing .xlsx files requires an engine such as openpyxl or xlsxwriter)

# pip install openpyxl

# df.to_excel('my_data.xlsx', index=False)

# From Excel

# df_from_excel = pd.read_excel('my_data.xlsx')

# print("\nDataFrame read from Excel:\n", df_from_excel)
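To try the CSV round trip without touching the file system, one option is to keep the CSV text in memory. In this sketch, to_csv() is called without a path so it returns a string, and io.StringIO lets read_csv() read that string as if it were a file. The small DataFrame is made up for the example.

```python
import io

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob'],
    'Age': [24, 27]
})

# With no path argument, to_csv() returns the CSV content as a string
csv_text = df.to_csv(index=False)
print(csv_text)

# Wrap the string in a file-like object and read it back
df_from_csv = pd.read_csv(io.StringIO(csv_text))
print(df_from_csv)
```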

7. Important Functions for Data Exploration

Python

print("\nUnique values in 'City':", df['City'].unique())

print("Number of unique values in 'City':", df['City'].nunique())
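value_counts() complements unique() and nunique() by reporting how often each value occurs. A minimal sketch, using a made-up Series with a repeated city so the counts differ:

```python
import pandas as pd

cities = pd.Series(['New York', 'Chicago', 'New York', 'Houston'])  # hypothetical data

unique_cities = cities.unique()   # distinct values, in order of first appearance
n_unique = cities.nunique()       # number of distinct values
counts = cities.value_counts()    # occurrences per value, most frequent first

print(list(unique_cities))
print(n_unique)
print(counts)
```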
