Goal: update the 'desc' column of a data frame for the rows whose id occurs more than 3 times.
import pandas as pd
df = pd.DataFrame({
"id": [1, 1, 1, 2, 2, 2, 2],
"type": ["m", "n", "o", "m", "m", "n", "n"],
"desc": ["aa", "bb", "cc", "dd", "ee", "ff", "gg"]
})
df.groupby('id').transform(len)
A groupby followed by transform computes the count for each group and copies that count to every row of the group.
'''
   type  desc
0     3     3
1     3     3
2     3     3
3     4     4
4     4     4
5     4     4
6     4     4
'''
We only need one column of counts, so select a column before calling transform; the selection turns the grouped object into a series. The result below has no column header, only a series named 'type'.
df.groupby('id')['type'].transform(len)
'''
0 3
1 3
2 3
3 4
4 4
5 4
6 4
Name: type, dtype: int64
'''
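Because the transformed series keeps the index of the original data frame, it can be attached as a column to inspect the counts next to the data. A minimal sketch; the column name `n_rows` is only an illustrative choice and is not part of the original code.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2],
    "type": ["m", "n", "o", "m", "m", "n", "n"],
    "desc": ["aa", "bb", "cc", "dd", "ee", "ff", "gg"],
})

# One count per row, aligned with df's index
counts = df.groupby("id")["type"].transform(len)

# assign returns a new data frame; df itself is left unchanged
# ("n_rows" is a hypothetical column name used for illustration)
with_counts = df.assign(n_rows=counts)
print(with_counts)
```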
Why use transform? There is another option:
df.groupby('id').count()
'''
    type  desc
id
1      3     3
2      4     4
'''
This aggregates and keeps only one row per group, so the result does not line up with the rows of the data frame and can't be used to filter it later on.
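The difference in shape can be checked directly. In this sketch the aggregated counts are indexed by group key, while the transformed counts share the data frame's index. Mapping the aggregated counts back through the 'id' column is one possible alternative to transform, shown here only for comparison.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 2, 2, 2],
    "type": ["m", "n", "o", "m", "m", "n", "n"],
    "desc": ["aa", "bb", "cc", "dd", "ee", "ff", "gg"],
})

per_group = df.groupby("id")["type"].count()       # one entry per id
per_row = df.groupby("id")["type"].transform(len)  # one entry per row

print(len(per_group), len(per_row))  # 2 7

# Broadcasting the aggregated counts by hand gives the same values
mapped = df["id"].map(per_group)
print(mapped.tolist() == per_row.tolist())  # True
```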
So let's keep using transform(len). However, it can fail when the group-by column contains a None value.
df = pd.DataFrame({
"id": [1, 1, None, 2, 2, 2, 2],
"type": ["m", "n", "o", "m", "m", "n", "n"],
"desc": ["aa", "bb", "cc", "dd", "ee", "ff", "gg"]
})
df.groupby('id')['type'].transform(len)
#ValueError: Length mismatch: Expected axis has 6 elements, new values have 7 elements
We have to use another function, 'size'. Here transform accepts the function name as a string.
df.groupby('id')['type'].transform('size')
'''
0 2.0
1 2.0
2 NaN
3 4.0
4 4.0
5 4.0
6 4.0
Name: type, dtype: float64
'''
The row whose id is NaN is kept, with NaN as its count, so the result still has one entry per row. Good.
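If the rows with a missing id should instead form a group of their own, groupby accepts dropna=False. This is only a sketch and assumes a recent pandas release; the parameter does not exist in older versions, and results on early releases that have it may differ.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, None, 2, 2, 2, 2],
    "type": ["m", "n", "o", "m", "m", "n", "n"],
    "desc": ["aa", "bb", "cc", "dd", "ee", "ff", "gg"],
})

# Treat NaN as its own group key instead of dropping it
sizes = df.groupby("id", dropna=False)["type"].transform("size")
print(sizes.tolist())
```

With this option the NaN row gets a real count (a group of one) rather than NaN, which matters if missing ids should be able to pass a threshold.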
Now we want to select those rows with a count larger than 3:
df.groupby('id')['type'].transform('size').gt(3)
'''
0 False
1 False
2 False
3 True
4 True
5 True
6 True
Name: type, dtype: bool
'''
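Row 2 comes out as False because any ordering comparison against NaN evaluates to False, so rows with a missing id are never selected by this mask. A small sketch with a hand-built series whose values mirror the counts above:

```python
import pandas as pd

counts = pd.Series([2.0, 2.0, float("nan"), 4.0])

mask = counts.gt(3)
print(mask.tolist())          # [False, False, False, True]

# Inverting the mask selects the NaN row, so ~mask differs from counts.le(3)
print((~mask).tolist())       # [True, True, True, False]
print(counts.le(3).tolist())  # [True, True, False, False]
```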
Then select rows from the data frame based on the condition:
condition = df.groupby('id')['type'].transform('size').gt(3)
df[condition]
'''
    id type desc
3  2.0    m   dd
4  2.0    m   ee
5  2.0    n   ff
6  2.0    n   gg
'''
So only the rows with count larger than 3 are returned.
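For selection alone, groupby(...).filter is another way to reach the same rows. This is a sketch for comparison; the boolean mask remains the more useful form when rows need to be updated rather than just selected.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, None, 2, 2, 2, 2],
    "type": ["m", "n", "o", "m", "m", "n", "n"],
    "desc": ["aa", "bb", "cc", "dd", "ee", "ff", "gg"],
})

# Keep only the groups with more than 3 rows; the row with a NaN id
# belongs to no group and is dropped by default
selected = df.groupby("id").filter(lambda g: len(g) > 3)
print(selected.index.tolist())  # [3, 4, 5, 6]
```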
Now we want to update the description column of those rows as below, but it will fail because it's working on a copy of the data frame.
condition = df.groupby('id')['type'].transform('size').gt(3)
df[condition]['desc'] = 'updated'
#SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
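A quick check that the chained assignment leaves the original data frame untouched. This sketch silences the warning only to keep the output short; which warning class is raised depends on the pandas version.

```python
import warnings

import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, None, 2, 2, 2, 2],
    "type": ["m", "n", "o", "m", "m", "n", "n"],
    "desc": ["aa", "bb", "cc", "dd", "ee", "ff", "gg"],
})

condition = df.groupby("id")["type"].transform("size").gt(3)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # df[condition] builds a new data frame, so the assignment
    # lands on that temporary object and is then discarded
    df[condition]["desc"] = "updated"

print(df["desc"].tolist())  # still the original values
```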
We need to use .loc to do an in-place update.
condition = df.groupby('id')['type'].transform('size').gt(3)
df.loc[condition, 'desc'] = 'updated'
df
'''
    id type     desc
0  1.0    m       aa
1  1.0    n       bb
2  NaN    o       cc
3  2.0    m  updated
4  2.0    m  updated
5  2.0    n  updated
6  2.0    n  updated
'''
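The steps above can be wrapped into a small helper. This is a sketch under stated assumptions: the function name update_large_groups and its parameters are hypothetical, and it returns a modified copy instead of changing the input, which is a design choice made here rather than something the walkthrough requires.

```python
import pandas as pd


def update_large_groups(df, key, target, min_count, value):
    """Return a copy of df with `target` set to `value` for rows whose
    `key` group has more than `min_count` rows (hypothetical helper)."""
    out = df.copy()
    # 'size' counts rows per group and tolerates NaN keys (those rows get NaN)
    condition = out.groupby(key)[target].transform("size").gt(min_count)
    out.loc[condition, target] = value
    return out


df = pd.DataFrame({
    "id": [1, 1, None, 2, 2, 2, 2],
    "type": ["m", "n", "o", "m", "m", "n", "n"],
    "desc": ["aa", "bb", "cc", "dd", "ee", "ff", "gg"],
})

result = update_large_groups(df, key="id", target="desc", min_count=3, value="updated")
print(result["desc"].tolist())
```

Working on a copy keeps the original data frame available for comparison; dropping the copy() call would turn the helper into an in-place update.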