Pandas
Install Pandas
conda install pandas
To use pandas, first import it:
import pandas as pd
PANDAS SERIES
A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). What differentiates a Series from a NumPy array is that a Series can have axis labels, meaning it can be indexed by a label instead of just a numeric position. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
The key to using a Series is understanding its index. Pandas uses these index names or numbers to allow fast lookups of information (it works like a hash table or dictionary).
You can convert a list, NumPy array, or dictionary to a Series:
labels = ['a','b','c']
my_list = [10,20,30]
pd.Series(data=my_list,index=labels)
Now the data can be referred to by index ('a', 'b' or 'c', returning 10, 20 and 30 respectively). If no index is passed, a default numeric index is used.
Similarly, a NumPy array can be converted into a Series:
pd.Series(nparr) or pd.Series(nparr,labels)
Even functions (they are just Python objects) can be stored in a Series:
pd.Series([sum,print,len])
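The dictionary case mentioned above can be sketched as follows (a minimal example; the dict keys become the index labels):

```python
import pandas as pd

# Keys become the index, values become the data.
d = {'a': 10, 'b': 20, 'c': 30}
ser = pd.Series(d)

print(ser['b'])  # label-based lookup -> 20
```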
Using Index
ser1 = pd.Series([1,2,3,4],index = ['USA', 'Germany','USSR', 'Japan'])
USA        1
Germany    2
USSR       3
Japan      4
dtype: int64
ser2 = pd.Series([1,2,5,4],index = ['USA', 'Germany','Italy', 'Japan'])
USA      1
Germany  2
Italy    5
Japan    4
dtype: int64
ser1 + ser2 --> Adds two Series. The values are aligned and added based on matching index labels; labels present in only one Series yield NaN.
Germany    4.0
Italy      NaN
Japan      8.0
USA        2.0
USSR       NaN
dtype: float64
DATA FRAMES
DataFrames are the workhorse of pandas and are directly inspired by the R programming language. We can think of a DataFrame as a bunch of Series objects put together to share the same index. DataFrame Columns are just Series
from numpy.random import randn
df = pd.DataFrame(randn(5,4),index='A B C D E'.split(),columns='W X Y Z'.split())
df[['W','Z']] --> Will return 'W' and 'Z' columns
Create New Column
df['new'] = df['W'] + df['Y'] --> This adds a new column 'new'
Removing Column
df.drop('new',axis=1) --> When removing, the axis must be specified: axis=0 refers to rows and axis=1 to columns.
Remove Row
df.drop('E',axis=0)
The inplace=True option must be passed to commit the changes. Without inplace, the DataFrame itself is not modified.
i.e df.drop('new',axis=1,inplace=True)
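A minimal sketch of the copy-vs-inplace behaviour described above:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 4),
                  index='A B C D E'.split(),
                  columns='W X Y Z'.split())
df['new'] = df['W'] + df['Y']

dropped = df.drop('new', axis=1)      # returns a NEW DataFrame without 'new'
print('new' in df.columns)            # True  -- the original is unchanged
print('new' in dropped.columns)       # False

df.drop('new', axis=1, inplace=True)  # modifies df itself
print('new' in df.columns)            # False
```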
Selecting Rows
df.loc['A'] --> Returns the row as a Series object
W    2.706850
X    0.628133
Y    0.907969
Z    0.503826
Or select based off of position instead of label
df.iloc[2] --> returns the third row as a Series
df.loc[['A','B'],['W','Y']] --> returns the subset of rows and columns specified in the two lists
Conditional Selection
An important feature of pandas is conditional selection using bracket notation, very similar to NumPy:
df
df>0 --> Returns a DataFrame of booleans, one per field
df[df>0] --> Returns only the values greater than 0; other cells become NaN
df[df['W']>0] --> Similarly, a column condition can be used to filter rows
df[df['W']>0][['Y','X']] --> Select only columns 'Y' and 'X' from the result
For multiple conditions you can use | and & with parenthesis:
df[(df['W']>0) & (df['Y'] > 1)]
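The conditional selection above can be sketched with a small hypothetical frame (note the parentheses around each condition, and & / | rather than Python's `and` / `or`):

```python
import pandas as pd

df = pd.DataFrame({'W': [1, -2, 3], 'Y': [0.5, 2.0, 1.5]})

pos = df[df['W'] > 0]                     # rows where W is positive
both = df[(df['W'] > 0) & (df['Y'] > 1)]  # both conditions must hold

print(len(pos))          # 2
print(list(both.index))  # [2]
```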
Reset Index --> calling the reset_index method resets the index to a default numeric one and moves the previous index into a column
df.reset_index()
newind = 'CA NY WY OR CO'.split() --> create array of states
df['States'] = newind --> Add new column 'States'
df.set_index('States',inplace=True) --> Note without 'inplace' - no change in df
MultiIndex
outside = ['G1','G1','G1','G2','G2','G2']
inside = [1,2,3,1,2,3]
hier_index = list(zip(outside,inside))
hier_index = pd.MultiIndex.from_tuples(hier_index)
MultiIndex(levels=[['G1', 'G2'], [1, 2, 3]], labels=[[0, 0, 0, 1, 1, 1], [0, 1, 2, 0, 1, 2]])
df = pd.DataFrame(np.random.randn(6,2),index=hier_index,columns=['A','B'])
df.loc['G1'].loc[1]
df.index.names = ['Group','Num']
df.xs('G1') --> returns the cross-section for group 'G1'
df.xs(('G1',1)) --> returns the single row for ('G1', 1) as a Series
df.xs(1,level='Num') --> returns the rows where the 'Num' index level equals 1, across all groups
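Putting the MultiIndex pieces together as one runnable sketch (same construction as above):

```python
import numpy as np
import pandas as pd

outside = ['G1', 'G1', 'G1', 'G2', 'G2', 'G2']
inside = [1, 2, 3, 1, 2, 3]
hier_index = pd.MultiIndex.from_tuples(list(zip(outside, inside)))

df = pd.DataFrame(np.random.randn(6, 2), index=hier_index, columns=['A', 'B'])
df.index.names = ['Group', 'Num']

g1 = df.xs('G1')              # all three rows of group G1
row = df.xs(('G1', 1))        # single cross-section -> a Series with index ['A', 'B']
num1 = df.xs(1, level='Num')  # rows where Num == 1, across both groups
```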
MISSING DATA
df = pd.DataFrame({'A':[1,2,np.nan], 'B':[5,np.nan,np.nan], 'C':[1,2,3]})
df.dropna() --> Drops rows containing any NA value
df.dropna(axis=1) --> Drops columns containing any NA value
df.dropna(thresh=2) --> Keeps only rows with at least 2 non-NA values
NA fields can be filled using fillna method
df.fillna(value='FILL VALUE')
df['A'].fillna(value=df['A'].mean()) --> Fill NA in column 'A' with that column's mean
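A runnable sketch of the missing-data methods above, using the same example frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, np.nan],
                   'B': [5, np.nan, np.nan],
                   'C': [1, 2, 3]})

print(len(df.dropna()))                    # only row 0 has no NaN -> 1
print(df.dropna(axis=1).columns.tolist())  # only 'C' is NaN-free -> ['C']
print(len(df.dropna(thresh=2)))            # rows 0 and 1 have >= 2 non-NaN values -> 2

filled = df['A'].fillna(value=df['A'].mean())  # mean of [1, 2] is 1.5
print(filled[2])                           # 1.5
```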
GROUPBY
The groupby method allows you to group rows of data together and call aggregate functions.
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
'Person':['Sam','Charlie','Amy','Vanessa','Carl','Sarah'],
'Sales':[200,120,340,124,243,350]}
df = pd.DataFrame(data)
Now you can use the .groupby() method to group rows together based off of a column name. For instance let's group based off of Company. This will create a DataFrameGroupBy object:
df.groupby('Company') returns a DataFrameGroupBy object from which we can derive mean, max, etc.
by_comp = df.groupby("Company")
by_comp.describe() --> provides summary statistics (count, mean, std, min, quartiles, max) per group
by_comp.describe().transpose() --> transposes the summary so groups become columns
df.groupby('Company').mean() --> Mean value per group (on recent pandas versions, select the numeric column first or pass numeric_only=True, since non-numeric columns like 'Person' cannot be averaged)
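The groupby steps above as one runnable sketch; selecting the 'Sales' column before aggregating sidesteps the non-numeric 'Person' column:

```python
import pandas as pd

data = {'Company': ['GOOG', 'GOOG', 'MSFT', 'MSFT', 'FB', 'FB'],
        'Person': ['Sam', 'Charlie', 'Amy', 'Vanessa', 'Carl', 'Sarah'],
        'Sales': [200, 120, 340, 124, 243, 350]}
df = pd.DataFrame(data)

by_comp = df.groupby('Company')
means = by_comp['Sales'].mean()       # aggregate only the numeric column

print(means['GOOG'])                  # (200 + 120) / 2 -> 160.0
print(by_comp['Sales'].max()['FB'])   # 350
```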
MERGING JOINING AND CONCATENATING
There are 3 main ways of combining DataFrames together: Merging, Joining and Concatenating. Concatenation basically glues together DataFrames. Keep in mind that dimensions should match along the axis you are concatenating on. You can use pd.concat and pass in a list of DataFrames to concatenate together:
Concatenation
pd.concat([df1,df2,df3]) --> Concatenation along rows (axis=0, the default)
pd.concat([df1,df2,df3],axis=1) --> Concatenation along columns
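A small sketch of both concat directions, with two hypothetical two-row frames; note the NaN gaps when stacking side by side with non-overlapping indexes:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
df2 = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[2, 3])

rows = pd.concat([df1, df2])          # stack vertically (axis=0, default)
cols = pd.concat([df1, df2], axis=1)  # stack side by side; indexes are aligned

print(rows.shape)  # (4, 2)
print(cols.shape)  # (4, 4) -- non-overlapping indexes produce NaN gaps
```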
Merging
The merge function allows you to merge DataFrames together using logic similar to merging SQL tables.
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
'key2': ['K0', 'K1', 'K0', 'K1'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
'key2': ['K0', 'K0', 'K0', 'K0'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
left
right
pd.merge(left,right,how='inner',on=['key1','key2']) --> inner join on both key columns
pd.merge(left, right, on=['key1', 'key2']) --> same result; how defaults to 'inner'
pd.merge(left, right, how='right', on=['key1', 'key2']) --> right join keeps every key pair from the right frame
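The merge examples above can be run end-to-end; the row counts follow from how the key pairs overlap between the two frames:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

inner = pd.merge(left, right, how='inner', on=['key1', 'key2'])
outer = pd.merge(left, right, how='outer', on=['key1', 'key2'])

print(len(inner))  # 3 -- (K0,K0) matches once, (K1,K0) matches twice
print(len(outer))  # 6 -- every key pair from either frame is kept
```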
Joining
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
'B': ['B0', 'B1', 'B2']},
index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
'D': ['D0', 'D2', 'D3']},
index=['K0', 'K2', 'K3'])
left.join(right)
Similarly, right.join(left) can be used; the result then keeps right's index.
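A runnable sketch of join using the frames above; join is a left join on the index by default, and how='outer' keeps the union of both indexes:

```python
import pandas as pd

left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])

joined = left.join(right)              # left join: keeps left's index
outer = left.join(right, how='outer')  # union of both indexes

print(joined.index.tolist())  # ['K0', 'K1', 'K2']
print(outer.index.tolist())   # ['K0', 'K1', 'K2', 'K3']
# 'K1' has no match in right, so its C and D values are NaN
```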
OPERATIONS
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[444,555,666,444],'col3':['abc','def','ghi','xyz']})
df['col2'].unique() ---> array([444, 555, 666])
df['col2'].nunique() ---> 3 (count of the number of unique values)
df['col2'].value_counts() --> counts the number of occurrences of each value:
444    2
555    1
666    1
Select from DataFrame using criteria from multiple columns
newdf = df[(df['col1']>2) & (df['col2']==444)]
APPLYING FUNCTIONS
def times2(x): return x*2
df['col1'].apply(times2)
df['col1'].sum() ---> 10 (sums the values)
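The apply example above as a runnable sketch (a lambda works equally well):

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 3, 4],
                   'col2': [444, 555, 666, 444]})

def times2(x):
    return x * 2

print(df['col1'].apply(times2).tolist())         # [2, 4, 6, 8]
print(df['col1'].apply(lambda x: x * 2).tolist())  # same result with a lambda
print(df['col1'].sum())                          # 10
```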
df.columns --> Gets column names
df.index --> returns index
Sorting and Ordering
df.sort_values(by='col2') #inplace=False by default
PivotTable
Link: http://pbpython.com/pandas-pivot-table-explained.html
data = {'A':['foo','foo','foo','bar','bar','bar'],
        'B':['one','one','two','two','one','one'],
        'C':['x','y','x','y','x','y'],
        'D':[1,3,2,5,4,1]}
df = pd.DataFrame(data)
df.pivot_table(values='D',index=['A', 'B'],columns=['C']) --> rows become the (A, B) MultiIndex, columns come from C, and the D values are aggregated (mean by default)
DATA INPUT AND OUTPUT
pandas can read a variety of file types using its pd.read_* methods
CSV
df = pd.read_csv('example')
df.to_csv('example',index=False)
EXCEL
Pandas can read and write Excel files. Keep in mind this only imports data, not formulas or images; having images or macros may cause the read_excel method to crash.
pd.read_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
df.to_excel('Excel_Sample.xlsx',sheet_name='Sheet1')
HTML
For reading HTML, you may need to install html5lib, lxml, and BeautifulSoup4
conda install lxml
conda install html5lib
conda install beautifulsoup4
conda install sqlalchemy (for sql)
Pandas read_html function will read tables off of a webpage and return a list of DataFrame objects:
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
df[0] --> returns the first table on the page as a DataFrame
SQL
The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for MySQL. For SQLite this is included in Python’s standard library by default. You can find an overview of supported drivers for each SQL dialect in the SQLAlchemy docs.
If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility, but this is deprecated and will be removed in a future version). This mode requires a Python database adapter which respects the Python DB-API.
The key functions are:
read_sql_table(table_name, con[, schema, ...])
Read SQL database table into a DataFrame.
read_sql_query(sql, con[, index_col, ...])
Read SQL query into a DataFrame.
read_sql(sql, con[, index_col, ...])
Read SQL query or database table into a DataFrame.
DataFrame.to_sql(name, con[, flavor, ...])
Write records stored in a DataFrame to a SQL database
from sqlalchemy import create_engine --> import the SQL engine factory
engine = create_engine('sqlite:///:memory:') --> create a temporary in-memory SQLite database
df.to_sql('my_table', engine) --> write data to SQL (the engine acts as the connection)
sql_df = pd.read_sql('my_table',con=engine) --> read the table back from SQL
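The SQL round trip above as one runnable sketch, assuming SQLAlchemy is installed (the table name 'my_table' is arbitrary):

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite:///:memory:')  # transient in-memory SQLite DB

df = pd.DataFrame({'col1': [1, 2, 3], 'col2': ['a', 'b', 'c']})
df.to_sql('my_table', engine, index=False)    # index=False skips writing the index column

sql_df = pd.read_sql('my_table', con=engine)  # read the whole table back
print(sql_df.shape)  # (3, 2)
```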