Pandas

Install Pandas

conda install pandas

To use pandas, first import it:

import pandas as pd

PANDAS SERIES

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.

The key to using a Series is understanding its index. Pandas uses these index names or numbers to allow fast lookups of information (it works like a hash table or dictionary).

You can convert a list, NumPy array, or dictionary to a Series:

labels = ['a','b','c']

my_list = [10,20,30]

pd.Series(data=my_list,index=labels)

Now the data can be referred to through the index ('a', 'b' or 'c', which return 10, 20 and 30 respectively). If no index is passed, a default numeric index (0, 1, 2, ...) is used.
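A dictionary works the same way; its keys become the index (a minimal sketch):

d = {'a':10,'b':20,'c':30}

pd.Series(d) --> index 'a','b','c' with values 10, 20, 30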

Similarly, a NumPy array can be converted into a Series:

pd.Series(arr) or pd.Series(arr,labels)

Even functions can be stored in a Series (they are just Python objects):

pd.Series([sum,print,len])

Using Index

ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])

USA        1
Germany    2
USSR       3
Japan      4
dtype: int64

ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])

USA        1
Germany    2
Italy      5
Japan      4
dtype: int64

ser1 + ser2 --> Adds two Series - values are aligned and added by index; labels present in only one Series produce NaN

Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
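If the NaN results are not desired, the Series.add method takes a fill_value argument that treats a label missing from one Series as that value (a sketch):

ser1.add(ser2,fill_value=0) --> USSR becomes 3.0 and Italy 5.0 instead of NaN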

DATA FRAMES

DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. DataFrame columns are just Series.

from numpy.random import randn

df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())

df[['W','Z']] --> Will return 'W' and 'Z' columns

Create New Column

df['new'] = df['W'] + df['Y'] --> This will add new column

Removing Column

df.drop('new',axis=1) --> When removing a column, axis must be specified as 1; axis=0 refers to rows and axis=1 to columns.

Remove Row

df.drop('E',axis=0)

inplace=True must be passed to commit the change. Without the inplace option, drop only returns a new DataFrame and the original is not modified.

e.g. df.drop('new',axis=1,inplace=True)

Selecting Rows

df.loc['A'] --> Will return row as series object

W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Name: A, dtype: float64

Or select based off of position instead of label

df.iloc[2] --> will return third row as series

df.loc[['A','B'],['W','Y']] --> will return only the rows and columns passed to loc (here rows 'A','B' and columns 'W','Y')

Conditional Selection

An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:

df

df>0 --> Will return a DataFrame of booleans

df[df>0] --> Returns only the values greater than 0; everything else becomes NaN

df[df['W']>0] --> Similarly, a column can be used as the condition; returns only rows where 'W' > 0

df[df['W']>0][['Y','X']] --> Select only the 'Y' and 'X' from the result

For multiple conditions you can use | and & with parenthesis:

df[(df['W']>0) & (df['Y'] > 1)]
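For an OR condition, use | the same way (a quick sketch):

df[(df['W']>0) | (df['Y'] > 1)] --> rows where either condition holds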

Reset Index --> calling the reset_index method resets the index to the default numeric one and moves the previous index into a column

df.reset_index()

newind = 'CA NY WY OR CO'.split() --> create a list of states

df['States'] = newind --> Add new column 'States'

df.set_index('States',inplace=True) --> Note: without inplace, df is not changed

MultiIndex

outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)

MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]], labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])

import numpy as np

df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])

df.loc['G1'].loc[1] --> select group 'G1', then inner index 1

df.index.names = ['Group','Num'] --> name the index levels

df.xs('G1') --> cross-section: all rows under outer level 'G1'

df.xs(('G1',1)) --> returns the single row ('G1',1) as a Series

df.xs(1,level='Num') --> returns all rows where the 'Num' level equals 1

MISSING DATA

df = pd.DataFrame({'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C':[1,2,3]})

df.dropna() --> Drops any row containing NA values

df.dropna(axis=1) --> Drops any column containing NA values

df.dropna(thresh=2) --> Keeps only rows with at least 2 non-NA values

NA fields can be filled using the fillna method:

df.fillna(value='FILL VALUE')

df['A'].fillna(value=df['A'].mean()) --> Fill NA in 'A' with the mean of column 'A'
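To fill every column with its own mean in one call, the frame's column means can be passed to fillna (a sketch; works here because all columns are numeric):

df.fillna(df.mean()) --> fills each column's NA with that column's mean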

GROUPBY

The groupby method allows you to group rows of data together and call aggregate functions.

data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
        'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
        'Sales':[200,120,340,124,243,350]}

df = pd.DataFrame(data)

Now you can use the .groupby() method to group rows together based on a column name. For instance, let's group based on Company. This will create a DataFrameGroupBy object:

df.groupby('Company') returns a DataFrameGroupBy object from which we can derive aggregates such as mean, max, etc.

by_comp = df.groupby("Company")

by_comp.describe() --> provides summary statistics (count, mean, std, quartiles, etc.) for each company

by_comp.describe().transpose() --> transposes the summary so each company appears as a column

df.groupby('Company').mean() --> Mean value per company (recent pandas versions may require numeric_only=True)
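Other aggregates work the same way on the grouped object (a sketch; as noted above, numeric aggregates may need numeric_only=True on recent pandas):

by_comp.count() --> number of rows per company

by_comp.max() --> per-column maximum per company

by_comp.std(numeric_only=True) --> standard deviation of the numeric columns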

MERGING JOINING AND CONCATENATING

There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:

Concatenation
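The df1, df2 and df3 used below are assumed to be DataFrames sharing the same columns; an illustrative sketch:

df1 = pd.DataFrame({'A':['A0','A1'],'B':['B0','B1']},index=[0,1])

df2 = pd.DataFrame({'A':['A2','A3'],'B':['B2','B3']},index=[2,3])

df3 = pd.DataFrame({'A':['A4','A5'],'B':['B4','B5']},index=[4,5])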

pd.concat([df1,df2,df3]) --> Concatenation along rows (axis=0, the default): DataFrames are stacked vertically

pd.concat([df1,df2,df3],axis=1) --> Concatenation along columns: DataFrames are placed side by side

Merging

The merge function allows you to merge DataFrames together using similar logic to merging SQL tables:

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})

right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

left

right

pd.merge(left,right,how='inner',on=['key1','key2']) --> inner join on both key columns

pd.merge(left, right, on=['key1', 'key2']) --> same result; how='inner' is the default

pd.merge(left, right, how='right', on=['key1', 'key2']) --> right join: keep all key combinations from right
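An outer join keeps every key combination from both sides, filling gaps with NaN (a sketch):

pd.merge(left, right, how='outer', on=['key1', 'key2'])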

Joining

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])

right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

left.join(right) --> joins on the index; keeps left's index (a left join by default)

Similarly, right.join(left) can be used.
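join accepts the same how argument as merge (a sketch):

left.join(right, how='outer') --> keeps the union of both indexes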

OPERATIONS

df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})

df['col2'].unique() ---> array([444, 555, 666])

df['col2'].nunique() ---> 3 (count of unique values)

df['col2'].value_counts() --> counts the number of occurrences of each value

444    2
555    1
666    1
Name: col2, dtype: int64

Select from DataFrame using criteria from multiple columns

newdf = df[(df['col1']>2) & (df['col2']==444)]

APPLYING FUNCTIONS

def times2(x):
    return x*2

df['col1'].apply(times2)
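apply works just as well with lambdas and built-ins (a quick sketch):

df['col1'].apply(lambda x: x*2) --> same result as times2 above

df['col3'].apply(len) --> length of each string in 'col3'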

df['col1'].sum() ---> 10 (sums the values)

df.columns --> Gets column names

df.index --> returns index

Sorting and Ordering

df.sort_values(by='col2') #inplace=False by default
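Pass ascending=False for a descending sort (a sketch):

df.sort_values(by='col2',ascending=False)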

PivotTable

Link: http://pbpython.com/pandas-pivot-table-explained.html

data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}

df = pd.DataFrame(data)

df.pivot_table(values='D',index=['A', 'B'],columns=['C'])
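An aggregation function and a fill value for missing combinations can also be passed (a sketch using the aggfunc and fill_value parameters):

df.pivot_table(values='D',index=['A','B'],columns=['C'],aggfunc='sum',fill_value=0)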

DATA INPUT AND OUTPUT

pandas can read a variety of file types using its pd.read_ methods

CSV

df = pd.read_csv('example')

df.to_csv('example',index=False)

EXCEL

Pandas can read and write Excel files. Keep in mind that this only imports data, not formulas or images; a file containing images or macros may cause the read_excel method to crash.

pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1') --> (older pandas versions called this argument sheetname)

df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')

HTML

For reading HTML you may need to install html5lib, lxml, and BeautifulSoup4:

conda install lxml

conda install html5lib

conda install BeautifulSoup4

conda install sqlalchemy (needed for the SQL section below)

Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:

df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')

df[0] --> returns the first table on the page as a DataFrame (read_html returns a list of DataFrames)

SQL

The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python’s standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.

If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility, but this is deprecated and will be removed in a future version). This mode requires a Python database adapter which respects the Python DB-API.

The key functions are:

    • read_sql_table(table_name, con[, schema, ...])

        • Read SQL database table into a DataFrame.

    • read_sql_query(sql, con[, index_col, ...])

        • Read SQL query into a DataFrame.

    • read_sql(sql, con[, index_col, ...])

        • Read SQL query or database table into a DataFrame.

    • DataFrame.to_sql(name, con[, flavor, ...])

      • Write records stored in a DataFrame to a SQL database

from sqlalchemy import create_engine --> Load SQL engine

engine = create_engine('sqlite:///:memory:') --> Create a temporary SQLite database running in memory

df.to_sql('my_table', engine) --> Write data to sql (engine is connection)

sql_df = pd.read_sql('my_table',con=engine) --> Read from SQL
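Arbitrary SQL can also be run with read_sql_query (a sketch against the same engine):

pd.read_sql_query('SELECT * FROM my_table', con=engine)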