Pandas is a fast, powerful, and flexible open-source library for data manipulation and analysis in Python. It provides two main data structures:
Series (1-dimensional)
DataFrame (2-dimensional, like a spreadsheet)
Think of it as Python's answer to Excel, but programmable and far more powerful for handling large datasets.
# Install pandas (if you haven't already)
# In your terminal: pip install pandas
# Import the library - convention uses 'pd' as alias
import pandas as pd
# Also useful to import numpy for numerical operations
import numpy as np
A Pandas DataFrame is a two-dimensional, table-like structure in Python where data is arranged in rows and columns. It is one of the most commonly used tools for organizing, analyzing and manipulating data, and its columns can hold different types of data such as numbers, text and dates. The main parts of a DataFrame are:
Data: Actual values in the table.
Rows (index): Labels that identify each row.
Columns: Labels that define each data category.
We’ll look at the key components of a DataFrame and see how to work with it to make data analysis easier and more efficient.
# From a dictionary (most common for small datasets)
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print(df)
We can create a DataFrame from a dictionary where the keys are column names and the values are lists or arrays.
All arrays/lists must have the same length.
If an index is provided, it must match the length of the arrays.
If no index is provided, Pandas will use a default range index (0, 1, 2, …).
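These index rules can be sketched quickly; the string labels below are purely illustrative:

```python
import pandas as pd

data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
}

# No index given: Pandas assigns the default range index 0, 1, 2, ...
df_default = pd.DataFrame(data)
print(df_default.index.tolist())   # [0, 1, 2, 3]

# Explicit index: must have the same length as the data lists
df_labeled = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])
print(df_labeled.index.tolist())   # ['a', 'b', 'c', 'd']
```

Passing an index of the wrong length (e.g. only three labels here) raises a ValueError.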
You can run this in your own environment:
import pandas as pd
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'Age': [25, 30, 35, 28],
    'City': ['New York', 'London', 'Tokyo', 'Sydney'],
    'Salary': [50000, 60000, 70000, 55000]
}
df = pd.DataFrame(data)
print("Full DataFrame")
print(df)
# View the first few rows
print("\nFirst 2 rows")
print(df.head(2)) # First 2 rows
# View only select columns
print("\nSelect columns")
print(df[['Name', 'City']])
# View the last few rows
print("\nLast 2 rows")
print(df.tail(2)) # Last 2 rows
# Get DataFrame info
print("\nData types and memory")
df.info()  # info() prints its report directly and returns None, so no print() needed
print("\nShow rows & columns")
print(df.shape) # (rows, columns) - like (4, 4)
print("\nColumn names")
print(df.columns) # Column names
print("\nStats for numerical columns")
print(df.describe())  # Statistical summary for numerical columns
A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels (its index).
# Creating a Series
ages = pd.Series([25, 30, 35, 28], name='Age')
print(ages)
# Accessing DataFrame columns (each column is a Series)
names_series = df['Name']
print(names_series)
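A Series can also be given explicit index labels at creation, which is what "data labels" refers to above; a short sketch:

```python
import pandas as pd

# A Series with explicit string labels as its index
ages = pd.Series([25, 30, 35], index=['Alice', 'Bob', 'Charlie'], name='Age')

# Values can then be looked up by label instead of position
print(ages['Bob'])          # 30
print(ages.index.tolist())  # ['Alice', 'Bob', 'Charlie']
```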
Indexing in pandas means simply selecting particular rows and columns of data from a DataFrame. It allows us to access subsets of data such as:
Selecting all rows and some columns.
Selecting some rows and all columns.
Selecting a specific subset of rows and columns.
Indexing can also be known as Subset Selection.
The indexing operator [] is the basic way to select data in Pandas. We can use this operator to access columns from a DataFrame. This method allows us to retrieve one or more columns. The .loc and .iloc indexers also use the indexing operator to make selections.
In order to select a single column, we simply put the name of the column between the brackets, e.g. df['Name'].
The .loc method is used to select data by label. This means it uses the row and column labels to access specific data points. .loc[] is versatile because it can select both rows and columns simultaneously based on labels.
In order to select a single row using .loc[], we pass a single row label to .loc[].
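A small sketch of label-based selection, using an illustrative DataFrame whose index holds string labels:

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'London', 'Tokyo']},
    index=['Alice', 'Bob', 'Charlie']  # row labels
)

# Select a single row by its label (returns a Series)
row = df.loc['Bob']
print(row['Age'])  # 30

# Select a row and a column simultaneously, both by label
age = df.loc['Charlie', 'Age']
print(age)         # 35
```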
The .iloc[] indexer allows us to select data based on integer position. Unlike .loc[] (which uses labels), .iloc[] requires us to specify row and column positions as integers (0-based indexing). In order to select a single row using .iloc[], we pass a single integer to .iloc[].
# Selecting a single column (returns a Series)
names = df['Name']
# Selecting multiple columns (returns a DataFrame)
subset = df[['Name', 'Age']]
# Selecting rows by index
first_row = df.iloc[0] # By position
row_0_to_2 = df.iloc[0:3] # Rows 0, 1, 2
# Selecting rows by label (if index has custom labels)
# df.loc['label_name'] - selects by label, as described in the .loc section above
# Filtering rows with conditions
young_people = df[df['Age'] < 30]
high_earners = df[df['Salary'] > 55000]
# Multiple conditions (use & for AND, | for OR)
young_high_earners = df[(df['Age'] < 30) & (df['Salary'] > 55000)]
# Adding a new column
df['Bonus'] = df['Salary'] * 0.10 # 10% bonus
# Modifying a column
df['Salary'] = df['Salary'] * 1.05 # Give everyone a 5% raise
# Creating calculated columns
df['Total_Comp'] = df['Salary'] + df['Bonus']
# Sorting
df_sorted = df.sort_values('Salary', ascending=False)
# Grouping and aggregation
city_stats = df.groupby('City')['Salary'].mean()
print(city_stats)
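groupby is not limited to a single statistic; agg() computes several aggregations per group in one call. A self-contained sketch with illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['London', 'London', 'Tokyo', 'Tokyo'],
    'Salary': [60000, 50000, 70000, 80000],
})

# Several aggregations per group at once: one row per city,
# one column per statistic
stats = df.groupby('City')['Salary'].agg(['mean', 'min', 'max'])
print(stats)
```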
Seaborn ships with sample datasets, including one for the Titanic. Install seaborn and run the following code to see the result:
# Reading from CSV (one of the most common operations)
# df = pd.read_csv('your_data.csv')
# For practice, let's use a built-in dataset
# Install seaborn first: pip install seaborn
import seaborn as sns
# Load the Titanic dataset
titanic = sns.load_dataset('titanic')
# Explore it
print(titanic.head())
titanic.info()  # info() prints directly and returns None, so no print() needed
print(titanic.describe())
# Simple analysis
survival_by_class = titanic.groupby('class')['survived'].mean()
print("\nSurvival rate by class:")
print(survival_by_class)
# Filtering
first_class_passengers = titanic[titanic['class'] == 'First']
Missing Data can occur when no information is available for one or more items or for an entire row/column. In Pandas missing data is represented as NaN (Not a Number). Missing data can be problematic in real-world datasets where data is incomplete. Pandas provides several methods to handle such missing data effectively:
To check for missing values (NaN) we can use two useful functions:
isnull(): It returns True for NaN (missing) values and False otherwise.
notnull(): It returns the opposite, True for non-missing values and False for NaN values.
For the titanic dataset, run the following to see the result:
# Check for missing values
print(titanic.isnull().sum())
# Basic handling
titanic_clean = titanic.dropna() # Remove rows with any missing values
titanic_filled = titanic.fillna(0) # Fill missing values with 0
titanic_filled_mean = titanic.fillna(titanic.mean(numeric_only=True))  # Fill numeric columns with their mean
import pandas as pd
import numpy as np
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(data)  # avoid naming the variable 'dict', which shadows the built-in
print(df.isnull())
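notnull() gives the complementary mask, which is handy for counting how many values are actually present; a self-contained sketch:

```python
import pandas as pd
import numpy as np

data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan]}
df = pd.DataFrame(data)

# notnull() is the inverse of isnull(): True where a value is present
mask = df.notnull()
print(mask)

# Summing the mask counts non-missing values per column
counts = df.notnull().sum()
print(counts)  # 3 non-missing values in each column
```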
In order to fill null values in a dataset, we use the fillna(), replace() and interpolate() functions; each replaces NaN values with a value of its own. fillna() and replace() substitute a fixed value, while interpolate() estimates the missing values using an interpolation technique rather than hard-coding a value.
import pandas as pd
import numpy as np
data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(data)
print(df.fillna(0))
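The other two functions can be sketched on the same data. replace() swaps NaN for a chosen value much like fillna(), while interpolate() (with its default linear method) estimates each gap from its neighbours:

```python
import pandas as pd
import numpy as np

data = {'First Score': [100, 90, np.nan, 95],
        'Second Score': [30, 45, 56, np.nan],
        'Third Score': [np.nan, 40, 80, 98]}
df = pd.DataFrame(data)

# replace() substitutes a fixed value for NaN, like fillna()
filled = df.replace(np.nan, -1)
print(filled)

# interpolate() estimates missing values linearly along each column:
# the gap in 'First Score' becomes 92.5 (midway between 90 and 95);
# a trailing NaN is filled with the last valid value, while a
# leading NaN stays NaN under the default forward direction
interp = df.interpolate()
print(interp)
```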
Source: https://www.geeksforgeeks.org/pandas/python-pandas-dataframe/