Pandas is a fast, powerful, and flexible open-source library for data manipulation and analysis in Python. It provides two main data structures:
Series (1-dimensional)
DataFrame (2-dimensional, like a spreadsheet)
Think of it as Python's answer to Excel, but programmable and far more powerful for handling large datasets.
# Install pandas (if you haven't already)
# In your terminal: pip install pandas
# Import the library - convention uses 'pd' as alias
import pandas as pd
# Also useful to import numpy for numerical operations
import numpy as np
A Pandas DataFrame is a two-dimensional, table-like structure in Python where data is arranged in rows and columns. It is one of the most commonly used tools for organizing, analyzing and manipulating data, and its columns can hold different types of data such as numbers, text and dates. The main parts of a DataFrame are:
Data: Actual values in the table.
Rows (index): Labels that identify each row.
Columns: Labels that define each data category.
We’ll look at the key components of a DataFrame and see how to work with it to make data analysis easier and more efficient.
# From a dictionary (most common for small datasets)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print(df)
We can create a DataFrame from a dictionary where the keys are column names and the values are lists or arrays.
All arrays/lists must have the same length.
If an index is provided, it must match the length of the arrays.
If no index is provided, Pandas will use a default range index (0, 1, 2, …).
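These index rules can be sketched quickly; the string labels below are purely illustrative:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
}

# No index given: Pandas assigns the default range index 0, 1, 2, ...
df_default = pd.DataFrame(data)
print(df_default.index.tolist())   # [0, 1, 2, 3]

# Explicit index: must have the same length as the data lists
df_labeled = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(df_labeled.index.tolist())   # ['a', 'b', 'c', 'd']
```

Passing an index of the wrong length (e.g. only three labels here) raises a ValueError.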
You can run this in your own environment:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print("Full DataFrame")
print(df)
# View the first few rows
print("\nFirst 2 rows")
print(df.head(2)) # First 2 rows
# View only select columns
print("\nSelect columns")
print(df[['Name', 'City']])
# View the last few rows
print("\nLast 2 rows")
print(df.tail(2)) # Last 2 rows
# Get DataFrame info
print("\nData types and memory")
df.info()  # info() prints its report directly and returns None, so no print() needed
print("\nShow rows & columns")
print(df.shape) # (rows, columns) - like (4, 4)
print("\nColumn names")
print(df.columns) # Column names
print("\nStats for numerical columns")
print(df.describe())  # Statistical summary for numerical columns
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels (its index).
# Creating a Series
ages = pd.Series([25, 30, 35, 28], name='Age')
print(ages)
# Accessing DataFrame columns (each column is a Series)
names_series = df['Name']
print(names_series)
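A Series can also be given explicit index labels at creation, which is what "data labels" refers to above; a short sketch:

```python
import pandas as pd

# A Series with explicit string labels as its index
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')

# Values can then be looked up by label instead of position
print(ages['Bob'])          # 30
print(ages.index.tolist())  # ['Alice', 'Bob', 'Charlie']
```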
Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. It allows us to access subsets of data such as:
Selecting all rows and some columns.
Selecting some rows and all columns.
Selecting a specific subset of rows and columns.
Indexing can also be known as Subset Selection.
The indexing operator [] is the basic way to select data in Pandas. We can use this operator to access columns from a DataFrame. This method allows us to retrieve one or more columns. The .loc and .iloc indexers also use the indexing operator to make selections.
In order to select a single column, we simply put the name of the column between the brackets, e.g. df['Name'].
The .loc method is used to select data by label. This means it uses the row and column labels to access specific data points. .loc[] is versatile because it can select both rows and columns simultaneously based on labels.
In order to select a single row using .loc[], we pass a single row label to .loc[].
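A small sketch of label-based selection, using an illustrative DataFrame whose index holds string labels:

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'London', 'Tokyo']},
    index=['Alice', 'Bob', 'Charlie']  # row labels
)

# Select a single row by its label (returns a Series)
row = df.loc['Bob']
print(row['Age'])  # 30

# Select a row and a column simultaneously, both by label
age = df.loc['Charlie', 'Age']
print(age)         # 35
```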
The .iloc[] indexer allows us to select data based on integer position. Unlike .loc[] (which uses labels), .iloc[] requires us to specify row and column positions as integers (0-based indexing). In order to select a single row using .iloc[], we pass a single integer to .iloc[].
# Selecting a single column (returns a Series)
names = df['Name']
# Selecting multiple columns (returns a DataFrame)
subset = df[['Name', 'Age']]
# Selecting rows by index
first_row = df.iloc[0] # By position
row_0_to_2 = df.iloc[0:3] # Rows 0, 1, 2
# Selecting rows by label (if index has custom labels)
# df.loc['label_name'] - selects by label, as described in the .loc section above
# Filtering rows with conditions
young_people = df[df['Age'] < 30]
high_earners = df[df['Salary'] > 55000]
# Multiple conditions (use & for AND, | for OR)
young_high_earners = df[(df['Age'] < 30) & (df['Salary'] > 55000)]
# Adding a new column
df['Bonus'] = df['Salary'] * 0.10 # 10% bonus
# Modifying a column
df['Salary'] = df['Salary'] * 1.05 # Give everyone a 5% raise
# Creating calculated columns
df['Total_Comp'] = df['Salary'] + df['Bonus']
# Sorting
df_sorted = df.sort_values('Salary', ascending=False)
# Grouping and aggregation
city_stats = df.groupby('City')['Salary'].mean()
print(city_stats)
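groupby is not limited to a single statistic; agg() computes several aggregations per group in one call. A self-contained sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['London', 'London', 'Tokyo', 'Tokyo'],
    'Salary': [60000, 50000, 70000, 80000],
})

# Several aggregations per group at once: one row per city,
# one column per statistic
stats = df.groupby('City')['Salary'].agg(['mean', 'min', 'max'])
print(stats)
```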
Seaborn ships with sample datasets, including one for the Titanic. Install seaborn and run the following code to see the result:
# Reading from CSV (one of the most common operations)
# df = pd.read_csv('your_data.csv')
# For practice, let's use a built-in dataset
# Install seaborn first: pip install seaborn
import seaborn as sns
# Load the Titanic dataset
titanic = sns.load_dataset('titanic')
# Explore it
print(titanic.head())
titanic.info()  # info() prints directly and returns None, so no print() needed
print(titanic.describe())
# Simple analysis
survival_by_class = titanic.groupby('class')['survived'].mean()
print("\nSurvival rate by class:")
print(survival_by_class)
# Filtering
first_class_passengers = titanic[titanic['class'] == 'First']
Missing Data can occur when no information is available for one or more items or for an entire row/column. In Pandas missing data is represented as NaN (Not a Number). Missing data can be problematic in real-world datasets where data is incomplete. Pandas provides several methods to handle such missing data effectively:
To check for missing values (NaN) we can use two useful functions:
isnull(): It returns True for NaN (missing) values and False otherwise.
notnull(): It returns the opposite, True for non-missing values and False for NaN values.
For the titanic dataset, run the following to see the result:
# Check for missing values
print(titanic.isnull().sum())
# Basic handling
titanic_clean = titanic.dropna() # Remove rows with any missing values
titanic_filled = titanic.fillna(0) # Fill missing values with 0
titanic_filled_mean = titanic.fillna(titanic.mean(numeric_only=True))  # Fill numeric columns with their mean
import pandas as pd
import numpy as np
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(data)  # avoid naming the variable 'dict', which shadows the built-in
print(df.isnull())
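notnull() gives the complementary mask, which is handy for counting how many values are actually present; a self-contained sketch:

```python
import pandas as pd
import numpy as np

data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan]}
df = pd.DataFrame(data)

# notnull() is the inverse of isnull(): True where a value is present
mask = df.notnull()
print(mask)

# Summing the mask counts non-missing values per column
counts = df.notnull().sum()
print(counts)  # 3 non-missing values in each column
```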
In order to fill null values in a dataset, we use the fillna(), replace() and interpolate() functions; each replaces NaN values with a value of its own. fillna() and replace() substitute a fixed value, while interpolate() estimates the missing values using an interpolation technique rather than hard-coding a value.
import pandas as pd
import numpy as np
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(data)
print(df.fillna(0))
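The other two functions can be sketched on the same data. replace() swaps NaN for a chosen value much like fillna(), while interpolate() (with its default linear method) estimates each gap from its neighbours:

```python
import pandas as pd
import numpy as np

data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(data)

# replace() substitutes a fixed value for NaN, like fillna()
filled = df.replace(np.nan, -1)
print(filled)

# interpolate() estimates missing values linearly along each column:
# the gap in 'First Score' becomes 92.5 (midway between 90 and 95);
# a trailing NaN is filled with the last valid value, while a
# leading NaN stays NaN under the default forward direction
interp = df.interpolate()
print(interp)
```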
Source: https://www.geeksforgeeks.org/pandas/python-pandas-dataframe/