Pandas is a Python library used for working with relational or labeled data, and it provides a variety of data structures for manipulating such data and time series. The library is built on top of the NumPy library. It is generally imported as:
import pandas as pd
Here, pd is an alias for Pandas. It is not necessary to import the library under an alias; it simply means less code to write every time a method or property is called. Pandas provides two primary data structures for manipulating data:
Series
DataFrame
A Pandas Series is a one-dimensional labelled array capable of holding data of any type (integers, strings, floats, Python objects, etc.). The axis labels are collectively called the index. A Series is essentially a single column of an Excel sheet. Labels need not be unique but must be of a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index.
A Series can be created with the Series() constructor, either by loading data from existing storage such as a SQL database, a CSV file, or an Excel file, or from in-memory data structures such as lists and dictionaries.
import pandas as pd
import numpy as np
# Creating empty series
ser = pd.Series()
print(ser)
# simple array
data = np.array(['g', 'e', 'e', 'k', 's'])
ser = pd.Series(data)
print(ser)
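As mentioned above, a Series can also be built from a dictionary, in which case the dictionary keys become the index labels. A minimal sketch (the subject names and marks below are made up purely for illustration):
import pandas as pd
# dictionary keys become the index labels, values become the data
marks = {'maths': 87, 'science': 92, 'english': 78}
ser = pd.Series(marks)
print(ser)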
DataFrame:
A Pandas DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns); in other words, the data is aligned in a tabular fashion in rows and columns. A DataFrame consists of three principal components: the data, the rows, and the columns.
It can be created using the DataFrame() constructor and, just like a Series, it can also be created from different file types and data structures.
import pandas as pd
# Calling DataFrame constructor
df = pd.DataFrame()
print(df)
# list of strings
lst = ['Geeks', 'For', 'Geeks', 'is',
'portal', 'for', 'Geeks']
# Calling DataFrame constructor on list
df = pd.DataFrame(lst)
df
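A DataFrame can also be built from a dictionary of lists, where each key becomes a column label and each list becomes a column; a small sketch with made-up values:
import pandas as pd
# dictionary of lists: keys become the column names
data = {'Name': ['Tom', 'Nick', 'Juli'],
        'Age': [20, 21, 19]}
df = pd.DataFrame(data)
print(df)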
We can also create a DataFrame from a CSV file using the read_csv() function.
Note: The examples below use the Iris dataset, saved locally as Iris.csv.
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# Printing top 5 rows
df.head()
The Pandas DataFrame.filter() function is used to subset rows or columns of a DataFrame according to labels in the specified index. Note that this routine does not filter a DataFrame on its contents; the filter is applied only to the labels of the index (for a DataFrame, the column labels by default).
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# selecting columns by label with filter()
df.filter(["Species", "SepalLengthCm", "SepalWidthCm"]).head()
To sort a data frame in Pandas, the sort_values() function is used. It can sort the data frame in ascending or descending order.
import pandas as pd
# Reading the CSV file
df = pd.read_csv("Iris.csv")
# sorting by the SepalLengthCm column
df.sort_values(by=['SepalLengthCm'])
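sort_values() sorts in ascending order by default; passing ascending=False reverses the order, and a list of columns sorts by several keys at once. A short sketch using the same file (PetalLengthCm is assumed to be one of its columns):
import pandas as pd
df = pd.read_csv("Iris.csv")
# sort by sepal length in descending order,
# breaking ties by petal length in ascending order
df.sort_values(by=['SepalLengthCm', 'PetalLengthCm'],
               ascending=[False, True]).head()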
Groupby is a fairly simple concept: we create groups from categories and apply a function to each group. In real data science projects you will be dealing with large amounts of data and trying things over and over, so for efficiency we use the groupby concept. Groupby mainly refers to a process involving one or more of the following steps:
Splitting: It is a process in which we split the data into groups by applying some conditions on the dataset.
Applying: It is a process in which we apply a function to each group independently.
Combining: It is a process in which we combine the results into a data structure after the function has been applied to each group.
The following steps illustrate the groupby process on an example with a Team column and a Weight column (a small code sketch follows the list):
1. Group the unique values from the Team column
2. Now there’s a bucket for each group
3. Toss the other data into the buckets
4. Apply a function on the weight column of each bucket.
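That Team/Weight example can be sketched as follows; the data here is invented purely to illustrate the split-apply-combine flow:
import pandas as pd
# made-up data matching the steps above
df = pd.DataFrame({'Team': ['A', 'B', 'A', 'B', 'A'],
                   'Weight': [60, 72, 65, 80, 70]})
# split into a bucket per Team, then sum the Weight column of each bucket
print(df.groupby('Team')['Weight'].sum())
The example below applies the same idea to an employee dataset, grouping on the Name column.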
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
'Gaurav', 'Anuj', 'Princi', 'Abhi'],
'Age': [27, 24, 22, 32,
33, 36, 27, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
'B.Tech', 'B.com', 'Msc', 'MA']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
print("Original Dataframe")
display(df)
# applying groupby() function to
# group the data on Name value.
gk = df.groupby('Name')
# Let's print the first entries
# in all the groups formed.
print("After Creating Groups")
gk.first()
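Besides first(), the grouped object also provides get_group() to pull out a single group by its key; a short sketch continuing the example above:
# retrieving the rows that belong to a single group
print(gk.get_group('Jai'))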
After splitting the data into groups, we apply a function to each group. To do that, we perform one of the following operations:
Aggregation: It is a process in which we compute a summary statistic (or statistics) about each group, for example, computing group sums or means.
Transformation: It is a process in which we perform some group-specific computations and return a like-indexed object, for example, filling NAs within groups with a value derived from each group.
Filtration: It is a process in which we discard some groups according to a group-wise computation that evaluates to True or False, for example, filtering out data based on the group sum or mean.
Aggregation is a process in which we compute a summary statistic about each group. An aggregating function returns a single aggregated value for each group. After splitting the data into groups with groupby(), several aggregation operations can be performed on the grouped data.
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name': ['Jai', 'Anuj', 'Jai', 'Princi',
'Gaurav', 'Anuj', 'Princi', 'Abhi'],
'Age': [27, 24, 22, 32,
33, 36, 27, 32],
'Address': ['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj',
'Jaunpur', 'Kanpur', 'Allahabad', 'Aligarh'],
'Qualification': ['Msc', 'MA', 'MCA', 'Phd',
'B.Tech', 'B.com', 'Msc', 'MA']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
# performing aggregation on each group with the
# aggregate method ('sum' adds the numeric columns
# and concatenates the string columns per group)
grp1 = df.groupby('Name')
grp1.aggregate('sum')
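The other two operations named above, transformation and filtration, can be sketched on the same grouped data; transform() returns a like-indexed result and the group-wise filter() drops whole groups (not to be confused with the label-based DataFrame.filter() shown earlier). A minimal sketch continuing from the block above:
# transformation: replace each Age with the mean Age of its group
print(grp1['Age'].transform('mean'))
# filtration: keep only the groups whose total Age exceeds 40
print(df.groupby('Name').filter(lambda g: g['Age'].sum() > 40))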
To concatenate DataFrames, we use the concat() function. It does all the heavy lifting of performing concatenation operations along an axis of Pandas objects, while performing optional set logic (union or intersection) of the indexes (if any) on the other axes.
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2)
display(df, df1)
# combining the two dataframes column-wise (axis=1)
res = pd.concat([df, df1], axis=1)
res
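By default concat() stacks objects along axis=0 (one below the other), and the join parameter controls whether the union ('outer', the default) or the intersection ('inner') of the other axis's labels is kept. A short sketch reusing df and df1 from above:
# stacking the two frames one below the other, keeping
# only the column they share ('key')
res2 = pd.concat([df, df1], axis=0, join='inner', ignore_index=True)
res2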
When we need to combine very large DataFrames, joins serve as a powerful way to perform these operations swiftly. Joins can only be done on two DataFrames at a time, denoted as left and right tables. The key is the common column that the two DataFrames will be joined on. It’s a good practice to use keys that have unique values throughout the column to avoid unintended duplication of row values. Pandas provide a single function, merge(), as the entry point for all standard database join operations between DataFrame objects.
There are four basic ways to handle the join (inner, left, right, and outer), depending on which rows must retain their data.
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data1 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32],}
# Define a dictionary containing employee data
data2 = {'key': ['K0', 'K1', 'K2', 'K3'],
'Address':['Nagpur', 'Kanpur', 'Allahabad', 'Kannuaj'],
'Qualification':['Btech', 'B.A', 'Bcom', 'B.hons']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1)
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2)
display(df, df1)
# using .merge() function
res = pd.merge(df, df1, on='key')
res
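The how parameter of merge() selects among the four join types mentioned above. A sketch continuing with the same df and df1 (note that because both frames contain exactly the keys K0-K3, all four join types give the same rows here; the difference shows up when the key columns differ):
# outer join: keep rows for keys present in either frame
res_outer = pd.merge(df, df1, on='key', how='outer')
res_outer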
To join DataFrames, we use the .join() function, which combines the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
# importing pandas module
import pandas as pd
# Define a dictionary containing employee data
data1 = {'Name':['Jai', 'Princi', 'Gaurav', 'Anuj'],
'Age':[27, 24, 22, 32]}
# Define a dictionary containing employee data
data2 = {'Address':['Allahabad', 'Kannuaj', 'Allahabad', 'Kannuaj'],
'Qualification':['MCA', 'Phd', 'Bcom', 'B.hons']}
# Convert the dictionary into DataFrame
df = pd.DataFrame(data1,index=['K0', 'K1', 'K2', 'K3'])
# Convert the dictionary into DataFrame
df1 = pd.DataFrame(data2, index=['K0', 'K2', 'K3', 'K4'])
display(df, df1)
# joining the dataframes on their index (a left join by default)
res = df.join(df1)
res
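Like merge(), join() accepts a how parameter; the default is a left join, which is why the K4 row from df1 is dropped above, while how='outer' keeps the union of both indexes. A short follow-up sketch:
# outer join keeps index labels from both frames (K0 to K4),
# filling the missing values with NaN
res_outer = df.join(df1, how='outer')
res_outer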