Pandas is a Python library used for working with relational or labeled data, and it provides a variety of data structures for manipulating such data and time series. The library is built on top of the NumPy library. It is generally imported as:
import pandas as pd
Here, pd is an alias for Pandas. It is not necessary to import the library under an alias; it simply means less code to write every time a method or property is called. Pandas provides two primary data structures for manipulating data:
Series
DataFrame
A Pandas Series is a one-dimensional labelled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. A Series is essentially a single column of an Excel sheet. Labels need not be unique but must be of a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.
A Series can be created with the Series() constructor, either by loading data from existing storage such as a SQL database, a CSV file, or an Excel file, or from in-memory data structures such as lists and dictionaries.
import pandas as pd
import numpy as np
# Creating empty series
ser = pd.Series()
print(ser)
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)
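As mentioned above, a Series can also be built from a dictionary, in which case the dictionary keys become the index labels. A minimal sketch (the subject names and marks below are made up purely for illustration):
import pandas as pd
# dictionary keys become the index labels, values become the data
marks = {'maths': 87, 'science': 92, 'english': 78}
ser = pd.Series(marks)
print(ser)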
DataFrame:
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); in other words, the data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
It can be created using the DataFrame() constructor and, just like a Series, it can also be created from different file types and data structures.
import pandas as pd
# Calling DataFrame constructor
df = pd.DataFrame()
print(df)
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
df
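A DataFrame can also be built from a dictionary of lists, where each key becomes a column label and each list becomes a column; a small sketch with made-up values:
import pandas as pd
# dictionary of lists: keys become the column names
data = {'Name': ['Tom', 'Nick', 'Juli'],
        'Age': [20, 21, 19]}
df = pd.DataFrame(data)
print(df)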
We can also create a DataFrame from a CSV file using the read_csv() function.
Note: The examples below use the Iris dataset, saved locally as Iris.csv.
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# Printing top 5 rows
df.head()
The Pandas DataFrame.filter() function is used to subset rows or columns of a DataFrame according to labels in the specified index. Note that this routine does not filter a DataFrame on its contents; the filter is applied only to the labels of the index (for a DataFrame, the column labels by default).
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# selecting columns by label with filter()
df.filter(["Species", "SepalLengthCm", "SepalWidthCm"]).head()
To sort a data frame in Pandas, the sort_values() function is used. It can sort the data frame in ascending or descending order.
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# sorting by the SepalLengthCm column
df.sort_values(by=['SepalLengthCm'])
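sort_values() sorts in ascending order by default; passing ascending=False reverses the order, and a list of columns sorts by several keys at once. A short sketch using the same file (PetalLengthCm is assumed to be one of its columns):
import pandas as pd
df = pd.read_csv("Iris.csv")
# sort by sepal length in descending order,
# breaking ties by petal length in ascending order
df.sort_values(by=['SepalLengthCm', 'PetalLengthCm'],
               ascending=[False, True]).head()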
Groupby is a fairly simple concept: we create groups from categories and apply a function to each group. In real data science projects you will be dealing with large amounts of data and trying things over and over, so for efficiency we use the groupby concept. Groupby mainly refers to a process involving one or more of the following steps:
Splitting: It is a process in which we split the data into groups by applying some conditions on the dataset.
Applying: It is a process in which we apply a function to each group independently.
Combining: It is a process in which we combine the results into a data structure after the function has been applied to each group.
The following steps illustrate the groupby process on an example with a Team column and a Weight column (a small code sketch follows the list):
1. Group the unique values from the Team column
2. Now there’s a bucket for each group
3. Toss the other data into the buckets
4. Apply a function on the weight column of each bucket.
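That Team/Weight example can be sketched as follows; the data here is invented purely to illustrate the split-apply-combine flow:
import pandas as pd
# made-up data matching the steps above
df = pd.DataFrame({'Team': ['A', 'B', 'A', 'B', 'A'],
                   'Weight': [60, 72, 65, 80, 70]})
# split into a bucket per Team, then sum the Weight column of each bucket
print(df.groupby('Team')['Weight'].sum())
The example below applies the same idea to an employee dataset, grouping on the Name column.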
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
'Gaurav', 'Anuj', 'Princi', 'Abhi'],
'Age': [27, 24, 22, 32,
33, 36, 27, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
'B.Tech', 'B.com', 'Msc', 'MA']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
print("Original Dataframe")
display(df)
# applying groupby() function to
# group the data on Name value.
gk = df.groupby('Name')
# Let's print the first entries
# in all the groups formed.
print("After Creating Groups")
gk.first()
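Besides first(), the grouped object also provides get_group() to pull out a single group by its key; a short sketch continuing the example above:
# retrieving the rows that belong to a single group
print(gk.get_group('Jai'))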
After splitting the data into groups, we apply a function to each group. To do that, we perform one of the following operations:
Aggregation: It is a process in which we compute a summary statistic (or statistics) about each group, for example, computing group sums or means.
Transformation: It is a process in which we perform some group-specific computations and return a like-indexed object, for example, filling NAs within groups with a value derived from each group.
Filtration: It is a process in which we discard some groups according to a group-wise computation that evaluates to True or False, for example, filtering out data based on the group sum or mean.
Aggregation is a process in which we compute a summary statistic about each group. An aggregating function returns a single aggregated value for each group. After splitting the data into groups with groupby(), several aggregation operations can be performed on the grouped data.
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
'Gaurav', 'Anuj', 'Princi', 'Abhi'],
'Age': [27, 24, 22, 32,
33, 36, 27, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
'B.Tech', 'B.com', 'Msc', 'MA']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
# performing aggregation on each group with the
# aggregate method ('sum' adds the numeric columns
# and concatenates the string columns per group)
grp1 = df.groupby('Name')
grp1.aggregate('sum')
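The other two operations named above, transformation and filtration, can be sketched on the same grouped data; transform() returns a like-indexed result and the group-wise filter() drops whole groups (not to be confused with the label-based DataFrame.filter() shown earlier). A minimal sketch continuing from the block above:
# transformation: replace each Age with the mean Age of its group
print(grp1['Age'].transform('mean'))
# filtration: keep only the groups whose total Age exceeds 40
print(df.groupby('Name').filter(lambda g: g['Age'].sum() > 40))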
To concatenate DataFrames, we use the concat() function. It does all the heavy lifting of performing concatenation operations along an axis of Pandas objects, while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2)
display(df, df1)
# combining the two dataframes column-wise (axis=1)
res = pd.concat([df, df1], axis=1)
res
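By default concat() stacks objects along axis=0 (one below the other), and the join parameter controls whether the union ('outer', the default) or the intersection ('inner') of the other axis's labels is kept. A short sketch reusing df and df1 from above:
# stacking the two frames one below the other, keeping
# only the column they share ('key')
res2 = pd.concat([df, df1], axis=0, join='inner', ignore_index=True)
res2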
When we need to combine very large DataFrames, joins serve as a powerful way to perform these operations swiftly. Joins can only be done on two DataFrames at a time, denoted as left and right tables. The key is the common column that the two DataFrames will be joined on. It’s a good practice to use keys that have unique values throughout the column to avoid unintended duplication of row values. Pandas provide a single function, merge(), as the entry point for all standard database join operations between DataFrame objects.
There are four basic ways to handle the join (inner, left, right, and outer), depending on which rows must retain their data.
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2)
display(df, df1)
# using .merge() function
res = pd.merge(df, df1, on='key')
res
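The how parameter of merge() selects among the four join types mentioned above. A sketch continuing with the same df and df1 (note that because both frames contain exactly the keys K0-K3, all four join types give the same rows here; the difference shows up when the key columns differ):
# outer join: keep rows for keys present in either frame
res_outer = pd.merge(df, df1, on='key', how='outer')
res_outer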
To join DataFrames, we use the .join() function, which combines the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1,index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4'])
display(df, df1)
# joining the dataframes on their index (a left join by default)
res = df.join(df1)
res
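Like merge(), join() accepts a how parameter; the default is a left join, which is why the K4 row from df1 is dropped above, while how='outer' keeps the union of both indexes. A short follow-up sketch:
# outer join keeps index labels from both frames (K0 to K4),
# filling the missing values with NaN
res_outer = df.join(df1, how='outer')
res_outer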