create pandas data frame from np array
df1 = pd.DataFrame(data=X1, index=ids1, columns = column_names1)
df2 = pd.DataFrame(data=X1, index=ids2, columns = column_names2)
df = pd.concat([df1, df2], axis=1)
set data type
df['col1'] = df['col1'].astype('datetime64')
group by and sort
df.groupby('col1')['col1'].count().sort_values(ascending=True)
df.groupby('postcode')['age'].mean().rename('average age') #this renames the data frame, useless
df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
'height': [9.1, 6.0, 9.5, 34.0],
'weight': [7.9, 7.5, 9.9, 198.0]})
df.groupby('kind').agg(min_height=('height', 'min'), max_weight=('weight', 'max')) #this renames the aggregated column
if aggregate on the same column, can also:
df.groupby('kind')['height'].agg(min_height=np.min, max_height=np.max)
or
df.groupby('kind')['height'].agg([('min height', 'min'), ('max height', 'max')])
select from a data frame
df2 =df.loc[df['Score']< 0.11]
Difference betwen iloc and loc
loc accesses dataframe rows by label (i.e. the index value)
iloc accesses dataframe rows by the integer sequence number.
e.g. create a dataframe with 2 rows:
df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['col1','col2'])
col1 col2
0 a b
1 c d
df.loc[0] #returns the first row
col1 a
col2 b
df.iloc[0] #returns the same first row, because the row order and index (i.e. 0, 1) order are the same
However if we sort the dataframe differently
df = df.sort_values('col1', ascending = False)
col1 col2
1 c d
0 a b
The index order and row order are different now.
df.loc[0] #returns the second row whose index is 0
col1 a
col2 b
df.iloc[0] #returns the the first row whose index is 1
col1 c
col2 d
Interpolation
Pandas can interpolate missing values in the dataframe. The 'interpolate' method can use different methods, e.g .linear, nearest neighbor, backfill, forwardfill, to replace NaN values. With default parameters, note that the linear method is used, i.e. NaN values between two existing values are calculated linearly, and also forwardfill for trailing NaN values. Leading NaN values are left as is because it defaults to forward fill. If set to backfill, then the leading NaN values instead of trailing NaN values will be filled.
s = pd.Series([np.nan, 1, np.nan, 3, np.nan])
s.interpolate()
0 NaN
1 1.0
2 2.0
3 3.0
4 3.0
The NaN between 1 and 3 is calculated as 2. The last NaN is forwardfilled with 3. The first NaN is left NaN.
Join two dataframes
The pandas 'join' joins two dataframes on the index! Note it is the index.
The 'on' parameter can only select a non-index column for the caller, i.e. the left dataframe of the join. The right dataframe will still use index!
The pandas 'merge' however is more of a database join.
The 'on' parameter applies to both dataframes so the on column must exists in both dataframes. There are also left_on and right_on parameters to specify the join column of each dataframe if the two column names are different.
forward fill leading null values incrementally
e.g. the series is [nan, nan,3,4,5, nan, nan, nan]
and we want to forward fill the nan values, but with 6, 7, and 8 incrementally.
Similarly if there are trailing nan values and we want to backfill incremtntall dropping.
import pandas as pd
import numpy as np
df = pd.DataFrame([np.nan, np.nan, 8, 7, np.nan, np.nan], columns=['A'])
df['group'] = df['A'].notnull().cumsum()
df['A'] = df['A'].ffill() - df.groupby('group').cumcount()
df = df[::-1]
df['group'] = df['A'].notnull().cumsum()
df['A'] = df['A'].ffill() + df.groupby('group').cumcount()
df = df[::-1]['A']