pandas

create pandas data frame from np array

df1 = pd.DataFrame(data=X1, index=ids1, columns = column_names1)

df2 = pd.DataFrame(data=X1, index=ids2, columns = column_names2)

df = pd.concat([df1, df2], axis=1)

set data type

df['col1'] = df['col1'].astype('datetime64')

group by and sort

df.groupby('col1')['col1'].count().sort_values(ascending=True)

df.groupby('postcode')['age'].mean().rename('average age') #this renames the data frame, useless

df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],

'height': [9.1, 6.0, 9.5, 34.0],

'weight': [7.9, 7.5, 9.9, 198.0]})

df.groupby('kind').agg(min_height=('height', 'min'), max_weight=('weight', 'max')) #this renames the aggregated column

if aggregate on the same column, can also:

df.groupby('kind')['height'].agg(min_height=np.min, max_height=np.max)

df.groupby('kind')['height'].agg([('min height', 'min'), ('max height', 'max')])

select from a data frame

df2 =df.loc[df['Score']< 0.11]

Difference betwen iloc and loc

loc accesses dataframe rows by label (i.e. the index value)

iloc accesses dataframe rows by the integer sequence number.

e.g. create a dataframe with 2 rows:

df = pd.DataFrame([['a', 'b'], ['c', 'd']], columns=['col1','col2'])

col1 col2

0 a b

1 c d

df.loc[0] #returns the first row

col1 a

col2 b

df.iloc[0] #returns the same first row, because the row order and index (i.e. 0, 1) order are the same

However if we sort the dataframe differently

df = df.sort_values('col1', ascending = False)

col1 col2

1 c d

0 a b

The index order and row order are different now.

df.loc[0] #returns the second row whose index is 0

col1 a

col2 b

df.iloc[0] #returns the the first row whose index is 1

col1 c

col2 d

Interpolation

Pandas can interpolate missing values in the dataframe. The 'interpolate' method can use different methods, e.g .linear, nearest neighbor, backfill, forwardfill, to replace NaN values. With default parameters, note that the linear method is used, i.e. NaN values between two existing values are calculated linearly, and also forwardfill for trailing NaN values. Leading NaN values are left as is because it defaults to forward fill. If set to backfill, then the leading NaN values instead of trailing NaN values will be filled.

s = pd.Series([np.nan, 1, np.nan, 3, np.nan])

s.interpolate()

0 NaN

1 1.0

2 2.0

3 3.0

4 3.0

The NaN between 1 and 3 is calculated as 2. The last NaN is forwardfilled with 3. The first NaN is left NaN.

Join two dataframes

The pandas 'join' joins two dataframes on the index! Note it is the index.

The 'on' parameter can only select a non-index column for the caller, i.e. the left dataframe of the join. The right dataframe will still use index!

The pandas 'merge' however is more of a database join.

The 'on' parameter applies to both dataframes so the on column must exists in both dataframes. There are also left_on and right_on parameters to specify the join column of each dataframe if the two column names are different.

forward fill leading null values incrementally

e.g. the series is [nan, nan,3,4,5, nan, nan, nan]

and we want to forward fill the nan values, but with 6, 7, and 8 incrementally.

Similarly if there are trailing nan values and we want to backfill incremtntall dropping.

import pandas as pd

import numpy as np

df = pd.DataFrame([np.nan, np.nan, 8, 7, np.nan, np.nan], columns=['A'])

df['group'] = df['A'].notnull().cumsum()

df['A'] = df['A'].ffill() - df.groupby('group').cumcount()

df = df[::-1]

df['group'] = df['A'].notnull().cumsum()

df['A'] = df['A'].ffill() + df.groupby('group').cumcount()

df = df[::-1]['A']

Page updated

Google Sites

Report abuse