Pandas has different ways to group by and aggregate.
1. Call an aggregation function directly after groupby.
The sum() aggregation is applied to all columns that can be summed.
Calling reset_index() afterwards turns the group key back into a regular column instead of leaving it as the index.
import pandas as pd
import numpy as np
sample = np.random.randint(0, 5, size=(20, 3))
df = pd.DataFrame(sample, columns=['category', 'column1', 'column2'])
df_sum = df.groupby('category').sum().reset_index()
df_sum.head()
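If only some of the columns should be summed, one small variation (a sketch using the same df) is to select them before aggregating:
df.groupby('category')[['column1']].sum().reset_index()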
2. Specify the columns and aggregations explicitly.
This allows multiple columns and multiple aggregation functions to be specified in one operation.
df_agg1 = df.groupby('category').agg({'column1': 'sum', 'column2': 'max'})
df_agg2 = df.groupby('category').agg({'column1': ['sum', 'count']})
df_agg1.head()
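As a side note, passing a list of functions for a column (as with df_agg2) produces MultiIndex column labels. A sketch of an alternative, named aggregation (assuming pandas 0.25 or later; df_agg3 is just an illustrative name), which gives flat, explicit column names:
df_agg3 = df.groupby('category').agg(
    column1_sum=('column1', 'sum'),
    column1_count=('column1', 'count'))
df_agg3.head()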
3. Custom aggregation on one column (series)
The custom function receives the column as a Series for each group and returns a single aggregated value.
def custom_func(col_series):
    return col_series.sum()
df_custom = df.groupby('category').agg({'column1': custom_func})
df_custom.head()
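The custom function does not have to be a named def; a lambda works the same way. A minimal sketch computing the range (max minus min) of column1 per group:
df.groupby('category').agg({'column1': lambda s: s.max() - s.min()})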
4. Custom aggregation on a DataFrame
apply() passes the whole DataFrame of each group to the function.
Returning a pd.Series gives the aggregated result a name: the Series index becomes the column name.
def custom_func2(df):
    result = df['column1'].sum()
    return pd.Series(result, index=['sum'])
df_custom = df.groupby('category').apply(custom_func2)
df_custom.head()
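Since this particular function only touches one column, the same result can likely be produced without apply by aggregating the column directly and naming the output (a minimal sketch):
df.groupby('category')['column1'].sum().to_frame('sum')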
5. Transform: calculate a value for each row based on its group.
This does not change the granularity of the data; the result has one value per original row.
The example below normalizes each row's value within its group (subtract the group mean, divide by the group standard deviation).
df['group_normalized_value'] = df.groupby('category')['column1'].transform(lambda x: (x - x.mean())/x.std())
df.head()
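transform() also accepts the name of a built-in aggregation. A sketch that attaches each group's mean to every row ('group_mean' is just an illustrative column name):
df['group_mean'] = df.groupby('category')['column1'].transform('mean')
df.head()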
6. Row number for rows in each group
There are a few ways, but a common and easy one is cumcount(), which returns the cumulative count of rows within each group, starting at 0.
df.groupby('category').cumcount()
cumcount() numbers rows in the order they appear in the frame, so if the numbering should follow a particular order within each group, sort by that column first. Sorting by the group key also keeps each group's rows together in the output, which is easier to read:
df.sort_values('category').groupby('category').cumcount()
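To keep the row number as a column, or to make the numbering follow a particular order within each group, one possible sketch ('row_num' is just an illustrative column name; ordering here by column1) is:
df['row_num'] = df.sort_values(['category', 'column1']).groupby('category').cumcount()
df.head()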
More on custom functions in a groupby.
For example, calculate the weighted sum of each group:
df = pd.DataFrame({'category':['a','a','b','b'],
'value':[5,10,100,30],
'weight':[.2,.8, .4, .6]})
def weighted_sum(df):
    result = sum(df['value'] * df['weight'])
    return pd.Series(result, index=['weighted sum'])  # this assigns a name to the aggregated column; returning just the scalar would also work, but the column would be unnamed
df.groupby('category').apply(weighted_sum)
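The same weighted sum can usually be computed without a custom function by grouping the pre-multiplied Series directly (a sketch; rename() just gives the resulting Series a name):
(df['value'] * df['weight']).groupby(df['category']).sum().rename('weighted sum')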
To return multiple aggregated columns:
df = pd.DataFrame({'category':['a','a','b','b'],
'value':[5,10,100,30],
'weight':[.2,.8, .4, .6]})
def weighted_sum(df):
    result = {}
    result['weighted sum'] = sum(df['value'] * df['weight'])
    result['min'] = df['value'].min()
    return pd.Series(result, index=['weighted sum', 'min'])  # return multiple results as a dict; the keys become the column names (the index argument just fixes the column order)
df.groupby('category').apply(weighted_sum)
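An apply-free sketch of the same result, assuming named aggregation is available (pandas 0.25+): precompute the weighted values with assign() ('wv' is just an illustrative helper column name), then aggregate. Note the output columns here are named weighted_sum and value_min rather than 'weighted sum' and 'min':
df.assign(wv=df['value'] * df['weight']).groupby('category').agg(
    weighted_sum=('wv', 'sum'),
    value_min=('value', 'min'))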