Pandas has different ways to group by and aggregate.
1. Call an aggregation function directly after groupby.
The sum() aggregation is applied to all columns that can be summed.
Calling reset_index() afterwards turns the group key back into a regular column instead of leaving it as the index.
import pandas as pd
import numpy as np
sample = np.random.randint(0, 5, size=(20, 3))
df = pd.DataFrame(sample, columns=['category', 'column1', 'column2'])
df_sum = df.groupby('category').sum().reset_index()
df_sum.head()
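If only some of the columns should be summed, one small variation (a sketch using the same df) is to select them before aggregating:
df.groupby('category')[['column1']].sum().reset_index()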
2. Specify the columns and aggregations explicitly.
This allows multiple columns and multiple aggregation functions to be specified in one operation.
df_agg1 = df.groupby('category').agg({'column1': 'sum', 'column2': 'max'})
df_agg2 = df.groupby('category').agg({'column1': ['sum', 'count']})
df_agg1.head()
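As a side note, passing a list of functions for a column (as with df_agg2) produces MultiIndex column labels. A sketch of an alternative, named aggregation (assuming pandas 0.25 or later; df_agg3 is just an illustrative name), which gives flat, explicit column names:
df_agg3 = df.groupby('category').agg(
    column1_sum=('column1', 'sum'),
    column1_count=('column1', 'count'))
df_agg3.head()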
3. Custom aggregation on one column (series)
The custom function receives the column as a Series for each group and returns a single aggregated value.
def custom_func(col_series):
    return col_series.sum()
df_custom = df.groupby('category').agg({'column1': custom_func})
df_custom.head()
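The custom function does not have to be a named def; a lambda works the same way. A minimal sketch computing the range (max minus min) of column1 per group:
df.groupby('category').agg({'column1': lambda s: s.max() - s.min()})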
4. Custom aggregation on a DataFrame
apply() passes the whole DataFrame of each group to the function.
Returning a pd.Series gives the aggregated result a name: the Series index becomes the column name.
def custom_func2(df):
    result = df['column1'].sum()
    return pd.Series(result, index=['sum'])
df_custom = df.groupby('category').apply(custom_func2)
df_custom.head()
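Since this particular function only touches one column, the same result can likely be produced without apply by aggregating the column directly and naming the output (a minimal sketch):
df.groupby('category')['column1'].sum().to_frame('sum')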
5. Transform: calculate a value for each row based on its group.
This does not change the granularity of the data; the result has one value per original row.
The example below normalizes each row's value within its group (subtract the group mean, divide by the group standard deviation).
df['group_normalized_value'] = df.groupby('category')['column1'].transform(lambda x: (x - x.mean())/x.std())
df.head()
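transform() also accepts the name of a built-in aggregation. A sketch that attaches each group's mean to every row ('group_mean' is just an illustrative column name):
df['group_mean'] = df.groupby('category')['column1'].transform('mean')
df.head()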
6. Row number for rows in each group
There are a few ways, but a common and easy one is cumcount(), which returns the cumulative count of rows within each group, starting at 0.
df.groupby('category').cumcount()
cumcount() numbers rows in the order they appear in the frame, so if the numbering should follow a particular order within each group, sort by that column first. Sorting by the group key also keeps each group's rows together in the output, which is easier to read:
df.sort_values('category').groupby('category').cumcount()
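To keep the row number as a column, or to make the numbering follow a particular order within each group, one possible sketch ('row_num' is just an illustrative column name; ordering here by column1) is:
df['row_num'] = df.sort_values(['category', 'column1']).groupby('category').cumcount()
df.head()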
More on custom functions in a groupby.
For example, calculate the weighted sum of each group:
df = pd.DataFrame({'category':['a','a','b','b'],
'value':[5,10,100,30],
'weight':[.2,.8, .4, .6]})
def weighted_sum(df):
    result = sum(df['value'] * df['weight'])
    return pd.Series(result, index=['weighted sum'])  # this assigns a name to the aggregated column; returning just the scalar would also work, but the column would be unnamed
df.groupby('category').apply(weighted_sum)
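The same weighted sum can usually be computed without a custom function by grouping the pre-multiplied Series directly (a sketch; rename() just gives the resulting Series a name):
(df['value'] * df['weight']).groupby(df['category']).sum().rename('weighted sum')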
To return multiple aggregated columns:
df = pd.DataFrame({'category':['a','a','b','b'],
'value':[5,10,100,30],
'weight':[.2,.8, .4, .6]})
def weighted_sum(df):
    result = {}
    result['weighted sum'] = sum(df['value'] * df['weight'])
    result['min'] = df['value'].min()
    return pd.Series(result, index=['weighted sum', 'min'])  # return multiple results as a dict; the keys become the column names (the index argument just fixes the column order)
df.groupby('category').apply(weighted_sum)
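An apply-free sketch of the same result, assuming named aggregation is available (pandas 0.25+): precompute the weighted values with assign() ('wv' is just an illustrative helper column name), then aggregate. Note the output columns here are named weighted_sum and value_min rather than 'weighted sum' and 'min':
df.assign(wv=df['value'] * df['weight']).groupby('category').agg(
    weighted_sum=('wv', 'sum'),
    value_min=('value', 'min'))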