
3. Pandas



Introduction


Python can be used to manipulate, clean, and query data using the Pandas data toolkit.
  • Pandas was created in 2008 by Wes McKinney
  • Open source, New BSD License
  • Over 100 different contributors

A comprehensive book on the library is Learning the Pandas Library by Matt Harrison.

The field of data science is rapidly changing. There are new toolkits and methods being created every day, and it can be tough to stay on top of it all. Marco Rodriguez and Tim Golden maintain a wonderful blog aggregator site called Planet Python. You can visit the webpage at planetpython.org.
  • Excellent blog aggregator for python related news
  • Significant number of data science and python tutorials are posted
Kyle Polich runs an excellent podcast called Data Skeptic. While it isn't Python based per se, it's well produced and has a wonderful mixture of interviews with experts in the field as well as short educational lessons. Much of the work he describes is specific to machine learning methods.
Covers data science in general, including:
  • Mini education lessons
  • Interviews
  • Trends
  • Shared community projects

Performance


Pandas is convenient, and when you need raw performance you can drop down to the underlying NumPy array. Given a series s and an integer position i:

d = s.values
%timeit d[i]
1000000 loops, best of 3: 998 ns per loop

You get essentially NumPy-level performance, since pandas stores series data in NumPy arrays underneath.
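
To make the snippet above self-contained, here is a minimal sketch; the series s and the position i are assumptions (any series and any valid integer position will do):

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, 10000))  # an arbitrary series
i = 500                                           # an arbitrary integer position

d = s.values   # the underlying NumPy array
d[i]           # element access at NumPy speed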

Series Data Structure


The Series is one of the core data structures in pandas.
You can think of it as a cross between a list and a dictionary. The items are all stored in an order, and there are labels with which you can retrieve them.
An easy way to visualize this is two columns of data. The first is the special index, a lot like dictionary keys, while the second is your actual data. It's important to note that the data column has a label of its own and can be retrieved using the .name attribute. This is different from dictionaries and is useful when it comes to merging multiple columns of data.



You can create a series by passing in a list of values. When you do this, Pandas automatically assigns an index starting with zero and sets the name of the series to None.


Import the pandas library as pd and have a look at the Series object.

The documentation indicates that you can pass in some data, an index, and a name. The data can be anything that's array-like, such as a list.

import pandas as pd
pd.Series?


Init signature: pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Docstring:     
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be any hashable type. The object
supports both integer- and label-based indexing and provides a host of
methods for performing operations involving the index. Statistical
methods from ndarray have been overridden to automatically exclude
missing data (currently represented as NaN)

Operations between Series (+, -, /, *, **) align values based on their
associated index values-- they need not be the same length. The result
index will be the sorted union of the two indexes.

Parameters
----------
data : array-like, dict, or scalar value
    Contains data stored in Series
index : array-like or Index (1d)
    Values must be unique and hashable, same length as data. Index
    object (or other iterable of same length as data) Will default to
    RangeIndex(len(data)) if not provided. If both a dict and index
    sequence are used, the index will override the keys found in the
    dict.
dtype : numpy.dtype or None
    If None, dtype will be inferred
copy : boolean, default False
    Copy input data
File:           /opt/conda/lib/python3.6/site-packages/pandas/core/series.py
Type:           type

Pandas automatically identifies the type of the data being held in the list. In this case we passed in a list of strings, so pandas set the type to object.

animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)
0    Tiger
1     Bear
2    Moose
dtype: object

If we pass in a list of whole numbers, we see that pandas sets the type to int64. Underneath, pandas stores series values in a typed array using the NumPy library. This offers a significant speed-up when processing data compared with traditional Python lists.

numbers = [1, 2, 3]
pd.Series(numbers)
0    1
1    2
2    3
dtype: int64

There are some other typing details that exist for performance and are important to know. The most important is how NumPy, and thus pandas, handles missing data. In Python, we have the None type to indicate a lack of data. But what do we do if we want a typed list, as we have in the series object?

Underneath, pandas does some type conversion. If we create a list of strings where one element is None, pandas inserts it as a None and uses the type object for the underlying array.

animals = ['Tiger', 'Bear', None]
pd.Series(animals)
0    Tiger
1     Bear
2     None
dtype: object

If we create a list of numbers, integers or floats, and put in the None type, pandas automatically converts this to a special floating point value designated NaN, which stands for "not a number".

numbers = [1, 2, None]
pd.Series(numbers)
0    1.0
1    2.0
2    NaN
dtype: float64

For those who might not have done scientific computing in Python before, this is a pretty important point: NaN is not None, and when we try the equality test, the result is False.

import numpy as np
np.nan == None
False

It turns out that you actually can't do an equality test of NaN to itself; the answer is always False. You need to use special functions to test for the presence of "not a number", such as the NumPy library's isnan function.

Keep in mind that when you see NaN, its meaning is similar to None, but it's a numeric value and is treated differently for efficiency reasons.

np.nan == np.nan
False

np.isnan(np.nan)
True
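
Pandas exposes the same test at the series level. A small sketch:

pd.isnull(np.nan)
True

pd.Series([1, 2, None]).isnull()
0    False
1    False
2     True
dtype: bool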



Creating a Series



Often you have labeled data that you want to manipulate. A series can be created from dictionary data. If you do this, the index is automatically assigned to the keys of the dictionary you provide, rather than to incrementing integers.

When we create the series, we see that, since it was string data, pandas set the data type of the series to object. The list of countries becomes the values of the series, and the index values are set from the keys of our dictionary.

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s
Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

Once the series has been created, we can get the index object using the index attribute.

s.index
Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')

You can also separate the index creation from the data by passing the index in explicitly as a list.

s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
s
India      Tiger
America     Bear
Canada     Moose
dtype: object

So what happens if the list of values in your index object is not aligned with the keys in your dictionary when creating the series?

Pandas overrides the automatic index creation in favor of only, and all of, the index values you provided.

It will ignore all keys from your dictionary which are not in your index, and it will add None or NaN values for any index value you provide which is not in your dictionary's key list.

In this example, we pass in a dictionary of four items but only two are preserved in the series object because of the index list. We see that Hockey has been added, but since it's not among the dictionary keys, it has no value associated with it.

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
s
Golf      Scotland
Sumo         Japan
Hockey         NaN
dtype: object



Querying the Series


A pandas Series can be queried either by the index position or by the index label. As we saw, if you don't give an index to the series, the position and the label are effectively the same values.

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s
Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

To query by numeric location, starting at zero, use the iloc attribute.

s.iloc[3]
'South Korea'

To query by the index label, you can use the loc attribute.

s.loc['Golf']
'Scotland'

Keep in mind that iloc and loc are not methods, they are attributes, so you don't use parentheses to query them, but square brackets instead, which we'll call the indexing operator. (In Python, this calls the __getitem__ or __setitem__ method, depending on the context of use.)

Pandas tries to make our code a bit more readable and provides a sort of smart syntax using the indexing operator directly on the series itself. If you pass in an integer parameter, the operator behaves as if you want to query via the iloc attribute. If you pass in an object, it queries as if you wanted to use the label-based loc attribute.

If we wanted to see the fourth country, we would pass the integer 3 to the indexing operator, which behaves like iloc.

s[3]
'South Korea'

If we wanted to see which country has golf as its national sport, we would pass the label 'Golf', which behaves like loc.

s['Golf']
'Scotland'

So what happens if your index is a list of integers? This is a bit complicated, and Pandas can't determine automatically whether you're intending to query by index position or index label. So you need to be careful when using the indexing operator on the series itself. And the safer option is to be more explicit and use the iloc or loc attributes directly.


sports = {99: 'Bhutan',
          100: 'Scotland',
          101: 'Japan',
          102: 'South Korea'}
s = pd.Series(sports)

s[99]
'Bhutan'

s.iloc[0]
'Bhutan'
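
Being explicit removes the ambiguity; loc gives the same label-based access directly:

s.loc[99]
'Bhutan'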

s[0] #This won't call s.iloc[0] as one might expect, it generates an error instead

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-22-a5f43d492595> in <module>()
----> 1 s[0] #This won't call s.iloc[0] as one might expect, it generates an error instead

/opt/conda/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
    601         key = com._apply_if_callable(key, self)
    602         try:
--> 603             result = self.index.get_value(self, key)
    604 
    605             if not is_scalar(result):

/opt/conda/lib/python3.6/site-packages/pandas/indexes/base.py in get_value(self, series, key)
   2167         try:
   2168             return self._engine.get_value(s, k,
-> 2169                                           tz=getattr(series.dtype, 'tz', None))
   2170         except KeyError as e1:
   2171             if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:3557)()

pandas/index.pyx in pandas.index.IndexEngine.get_value (pandas/index.c:3240)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8564)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.Int64HashTable.get_item (pandas/hashtable.c:8508)()

KeyError: 0



Okay, so now we know how to get data out of the series. Let's talk about working with the data. A common task is to want to consider all of the values inside of a series and want to do some sort of operation. This could be trying to find a certain number, summarizing data or transforming the data in some way.

A typical programmatic approach to this would be to iterate over all the items in the series and invoke the operation one is interested in. For instance, we could create a series of floating point values; let's think of these as prices for different products. We could write a little routine which iterates over all of the items in the series and adds them together to get a total.

s = pd.Series([100.00, 120.00, 101.00, 3.00])
s
0    100.0
1    120.0
2    101.0
3      3.0
dtype: float64

total = 0
for item in s:
    total+=item
print(total)
324.0

This works, but it's slow. Modern computers can do many tasks simultaneously, especially, but not only, tasks involving mathematics.
Pandas and the underlying NumPy libraries support a method of computation called vectorization.
Vectorization works with most of the functions in the NumPy library, including the sum function.

Here's how we would write the code using the NumPy sum method. First we import the numpy module, and then we call np.sum and pass in an iterable item; in this case, our pandas series.

import numpy as np

total = np.sum(s)
print(total)
324.0


Now both of these methods create the same value, but is one actually faster? The Jupyter Notebook has a magic function which can help. 

First, let's create a big series of random numbers. You'll see this used a lot when demonstrating techniques with Pandas. 

#this creates a big series of random numbers
s = pd.Series(np.random.randint(0,1000,10000))
s.head()
0    611
1    643
2    639
3    465
4    935
dtype: int64

Note that I've just used the head method, which reduces the amount of data printed out by the series to the first five elements. We can verify that the length of the series is correct using the len function.

total = 0
for item in s:
    total+=item
print(total)
427354

len(s)
10000

Magic functions begin with a percentage sign. If we type this sign and then hit the Tab key, we see a list of the available magic functions. You can write your own magic functions too.
We're actually going to use what's called a cell magic function. These start with two percentage signs and apply to the code in the current Jupyter cell. The function we're going to use is called timeit, and as you may have guessed from the name, it runs our code a few times to determine, on average, how long it takes.

Let's run timeit with our original iterative code. You can give timeit the number of loops you would like to run; by default it uses 1,000 loops. Here I'll ask timeit to use 100 runs.

%%timeit -n 100
summary = 0
for item in s:
    summary+=item
100 loops, best of 3: 2.48 ms per loop

Not bad. Timeit ran this code and it doesn't seem like it takes very long at all. Now let's try with vectorization.

%%timeit -n 100
summary = np.sum(s)
100 loops, best of 3: 150 µs per loop

Wow! This is a pretty shocking difference in the speed and demonstrates why data scientists need to be aware of parallel computing features and start thinking in functional programming terms.

A related feature in pandas and NumPy is called broadcasting. With broadcasting, you can apply an operation to every value in the series, changing the series.

For instance, if we wanted to increase every value by 2, we could do so quickly using the += operator directly on the series object. Here I'll just use the head method to print out the top five rows of the series.

s+=2 #adds two to each item in s using broadcasting
s.head()
0    613
1    645
2    641
3    467
4    937
dtype: int64

The procedural way of doing this would be to iterate through all of the items in the series and increase the values directly. A quick aside here. Pandas does support iterating through a series much like a dictionary, allowing you to unpack values easily. But if you find yourself iterating through a series, you should question whether you're doing things in the best possible way.

Here's how we would do this using the series set_value method. (Note that set_value was deprecated in later versions of pandas and eventually removed; a modern alternative is sketched after the output below.)

for label, value in s.iteritems():
    s.set_value(label, value+2)
s.head()
0    615
1    647
2    643
3    469
4    939
dtype: int64
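
A minimal sketch of the modern equivalent, using the .at accessor for label-based scalar assignment (iteration is still slow; this only replaces the removed set_value call):

for label, value in s.items():
    s.at[label] = value + 2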

Let's try and time the two approaches.

%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.iteritems():
    s.loc[label]= value+2
10 loops, best of 3: 1.49 s per loop


%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s+=2
10 loops, best of 3: 487 µs per loop


Amazing. Not only is it significantly faster, but it's more concise and maybe even easier to read too. 

The typical mathematical operations you would expect are vectorized, and the NumPy documentation outlines what it takes to create vectorized functions of your own. One last note on using the indexing operators to access series data. 
The .loc attribute lets you not only modify data in place, but also add new data as well. If the value you pass in as the index doesn't exist, then a new entry is added. 
And keep in mind, indices can have mixed types. While it's important to be aware of the typing going on underneath, Pandas will automatically change the underlying NumPy types as appropriate.


Here's an example using a series of a few numbers. We could add a new value, maybe an animal, just by assigning through the .loc indexing operator. We see that mixed types for data values or index labels are no problem for pandas.

s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears'
s
0             1
1             2
2             3
Animal    Bears
dtype: object

Up until now we've only seen examples of a series where the index values were unique. Here's an example where the index values are not unique, and this makes series conceptually different from, say, a relational database table.

Revisiting the issue of countries and their national sports, it turns out that many countries seem to like the game cricket. Going back to our original series on sports, it's possible to create a new series object with multiple entries for cricket, and then use append to bring these together.

There are a couple of important considerations when using append. 
First, pandas is going to take your series and try to infer the best data types to use. In this example, everything is a string, so there are no problems here.
Second, the append method doesn't actually change the underlying series. It instead returns a new series which is made up of the two appended together. We can see this by going back and printing the original series of values and seeing that they haven't changed.

This is actually a significant issue for new Pandas users who are used to objects being changed in place. So watch out for it, not just with append but with other Pandas functions as well.

original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
                                      'Barbados',
                                      'Pakistan',
                                      'England'], 
                                   index=['Cricket',
                                          'Cricket',
                                          'Cricket',
                                          'Cricket'])
all_countries = original_sports.append(cricket_loving_countries)

original_sports
Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

cricket_loving_countries
Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object

all_countries
Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
Cricket        Australia
Cricket         Barbados
Cricket         Pakistan
Cricket          England
dtype: object

Finally, we see that when we query the appended series for those who have cricket as their national sport, we don't get a single value, but a series itself. This is actually very common, and if you have a relational database background, this is very similar to every table query resulting in a return set which itself is a table.

all_countries.loc['Cricket']
Cricket    Australia
Cricket     Barbados
Cricket     Pakistan
Cricket      England
dtype: object
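
One caveat worth knowing: Series.append was deprecated and later removed in newer versions of pandas. The equivalent with pd.concat, as a sketch:

all_countries = pd.concat([original_sports, cricket_loving_countries])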


We focused on one of the primary data types of the pandas library, the series. There are many more methods associated with the series object that we haven't talked about.

But with these basics down, we'll move on to the pandas two-dimensional data structure, the DataFrame. The DataFrame is very similar to the series object, but includes multiple columns of data, and is the structure you'll spend the majority of your time working with when cleaning and aggregating data.


The DataFrame Data Structure


The DataFrame data structure is the heart of the pandas library. It's the primary object you'll be working with in data analysis and cleaning tasks.

The DataFrame is conceptually a two-dimensional series object, where there's an index and multiple columns of content, with each column having a label. In fact, the distinction between a column and a row is really only a conceptual distinction. And you can think of the DataFrame itself as simply a two-axes labeled array.


Creating the DataFrame



You can create a DataFrame in many different ways, some of which you might expect. For instance, you can use a group of series, where each series represents a row of data. Or you could use a group of dictionaries, where each dictionary represents a row of data (a dictionary-based sketch follows the table below).

Let's take a look at an example. I'm going to create three purchase order records as series objects for a sort of fictional store.

Then we'll feed this into the DataFrame as the first argument and set index values indicating at which store each purchase was made. You'll see that when we print out a DataFrame, Jupyter renders it as a table.

import pandas as pd
purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.head()
         Cost Item Purchased   Name
Store 1  22.5       Dog Food  Chris
Store 1   2.5   Kitty Litter  Kevyn
Store 2   5.0      Bird Seed  Vinod
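
As mentioned above, a list of dictionaries works too. A minimal sketch producing the same data (column order may differ depending on the pandas version):

df = pd.DataFrame([{'Name': 'Chris', 'Item Purchased': 'Dog Food', 'Cost': 22.50},
                   {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
                   {'Name': 'Vinod', 'Item Purchased': 'Bird Seed', 'Cost': 5.00}],
                  index=['Store 1', 'Store 1', 'Store 2'])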


Similar to the series, we can extract data using the iloc and loc attributes. Because the DataFrame is two-dimensional, passing a single value to the loc indexing operator will return a series if there's only one row to return.

In this example, if we wanted to select data associated with Store 2, we would just query the loc attribute with one parameter.

df.loc['Store 2']
Cost                      5
Item Purchased    Bird Seed
Name                  Vinod
Name: Store 2, dtype: object

You'll note that the name of the returned series is the row index value, 'Store 2', while its index is made up of the column names.

We can check the data type of the return value using the Python type function. It's important to remember that the indices and column names along either axis, horizontal or vertical, could be non-unique.

type(df.loc['Store 2'])
pandas.core.series.Series

For instance, in this example we see two purchase records for Store 1 as different rows. If we use a single value with the DataFrame loc attribute, multiple rows of the DataFrame will be returned, not as a new series, but as a new DataFrame.

For instance, if we query for Store 1 records, we see that Chris and Kevyn both shop at the same pet supply store.

df.loc['Store 1']

         Cost Item Purchased   Name
Store 1  22.5       Dog Food  Chris
Store 1   2.5   Kitty Litter  Kevyn


One of the powers of the pandas DataFrame is that you can quickly select data based on multiple axes. For instance, if you wanted to just list the costs for Store 1, you would supply two parameters to .loc, one being the row index and the other being the column name. If we're only interested in Store 1 costs, we could write this as df.loc['Store 1', 'Cost'].


df.loc['Store 1', 'Cost']
Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64

What if we just wanted to do column selection and get a list of all of the costs? Well, there are a couple of options. First, you can get a transpose of the DataFrame using the capital T attribute, which swaps all of the columns and rows. This essentially turns your column names into indices, and we can then use the .loc attribute. This works, but it's pretty ugly.

df.T
                 Store 1       Store 1    Store 2
Cost                22.5           2.5          5
Item Purchased  Dog Food  Kitty Litter  Bird Seed
Name               Chris         Kevyn      Vinod


df.T.loc['Cost']
Store 1    22.5
Store 1     2.5
Store 2       5
Name: Cost, dtype: object

Since iloc and loc are used for row selection, the pandas developers reserved the indexing operator directly on the DataFrame for column selection. In a pandas DataFrame, columns always have a name, so this selection is always label based; it's not as confusing as it was when using the square bracket operator on the series objects. For those familiar with relational databases, this operator is analogous to column projection.

df['Cost']
Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64


Finally, since the result of using the indexing operator is either a DataFrame or a series, you can chain operations together. For instance, we could have rewritten the query for all Store 1 costs as df.loc['Store 1']['Cost'].


df.loc['Store 1']['Cost']
Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64

This looks pretty reasonable and gets us the result we wanted. But chaining can come with some costs and is best avoided if you can use another approach. In particular, chaining tends to cause pandas to return a copy of the DataFrame instead of a view on the DataFrame. For selecting data, this is not a big deal, though it might be slower than necessary. If you are changing data, though, this is an important distinction and can be a source of error.
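
To make the danger concrete, here is a minimal sketch (operating on a copy so the running example is untouched); the chained form may write to a temporary object and trigger a SettingWithCopyWarning:

tmp = df.copy()
tmp.loc['Store 1']['Cost'] = 99.0   # chained: may modify a temporary copy, not tmp
tmp.loc['Store 1', 'Cost'] = 99.0   # single .loc call: unambiguous, modifies tmp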


Here's another method. As we saw, .loc does row selection, and it can take two parameters, the row index and the list of column names. .loc also supports slicing. If we wanted to select all rows, we can use a colon to indicate a full slice from beginning to end, and then add the column name as the second parameter as a string. In fact, if we wanted to include multiple columns, we could do so in a list, and pandas will bring back only the columns we have asked for.

Here's an example, where we ask for all of the name and cost values for all stores using the .loc operator.

df.loc[:,['Name', 'Cost']]

          Name  Cost
Store 1  Chris  22.5
Store 1  Kevyn   2.5
Store 2  Vinod   5.0


So that's selecting and projecting data from a DataFrame based on row and column labels. The key concepts to remember are that the rows and columns are really just for our benefit. Underneath this is just a two axes labeled array, and transposing the columns is easy.


Also, consider the issue of chaining carefully, and try to avoid it; it can cause unpredictable results, where your intent was to obtain a view of the data but instead pandas returns to you a copy. In the pandas world, friends don't let friends chain calls. So if you see it, point it out, and share a less ambiguous solution.



Dropping Data


It's easy to delete data in series and DataFrames, and we can use the drop function to do so.
This function takes a single parameter, which is the index or row label to drop. This is another tricky place for new users.
The drop function doesn't change the DataFrame by default; instead, it returns a copy of the DataFrame with the given rows removed. We can see that our original DataFrame is still intact. Let's make a copy with the copy method and do a drop on it instead. This is a very typical pattern in pandas, where in-place changes to a DataFrame are only done if need be, usually on changes involving indices, so it's important to be aware of.


df.drop('Store 1')
         Cost Item Purchased   Name
Store 2   5.0      Bird Seed  Vinod

df
         Cost Item Purchased   Name
Store 1  22.5       Dog Food  Chris
Store 1   2.5   Kitty Litter  Kevyn
Store 2   5.0      Bird Seed  Vinod

copy_df = df.copy()
copy_df = copy_df.drop('Store 1')
copy_df
         Cost Item Purchased   Name
Store 2   5.0      Bird Seed  Vinod

Drop has two interesting optional parameters. The first is called inplace, and if it's set to True, the DataFrame will be updated in place instead of a copy being returned. The second parameter is the axis which should be dropped. By default this value is 0, indicating the row axis, but you can change it to 1 if you want to drop a column.

copy_df = df.copy()
copy_df.drop('Store 1',inplace=True)
copy_df
         Cost Item Purchased   Name Location
Store 2   5.0      Bird Seed  Vinod     None

copy_df = df.copy()
copy_df.drop('Cost',inplace=True,axis=1)
copy_df
        Item Purchased   Name Location
Store 1       Dog Food  Chris     None
Store 1   Kitty Litter  Kevyn     None
Store 2      Bird Seed  Vinod     None

There is a second way to drop a column, however: directly through the indexing operator, using the del keyword.

This way of dropping data, however, takes immediate effect on the DataFrame and does not return a view.

del copy_df['Name']
copy_df
         Cost Item Purchased
Store 2   5.0      Bird Seed




Adding a Column


Finally, adding a new column to the DataFrame is as easy as assigning it to some value.
For instance, if we wanted to add a new Location column with a default value of None, we could do so by using the assignment operator after the square brackets. This broadcasts the default value to the new column immediately.

df['Location'] = None
df
         Cost Item Purchased   Name Location
Store 1  22.5       Dog Food  Chris     None
Store 1   2.5   Kitty Litter  Kevyn     None
Store 2   5.0      Bird Seed  Vinod     None


DataFrame Indexing and Loading



The common workflow is to read your data into a DataFrame, then reduce the DataFrame to the particular columns or rows that you're interested in working with. As you've seen, the pandas toolkit tries to give you views on a DataFrame. This is much faster than copying data and much more memory efficient too.

But it does mean that if you're manipulating the data, you have to be aware that any changes to the DataFrame you're working on may have an impact on the base DataFrame you used originally.

We can create a series based on just the cost category using the square brackets. Then we can increase the cost in this series using broadcasting. Now if we look at our original DataFrame, we see those costs have risen as well. This is an important consideration to watch out for. If you want to explicitly use a copy, then you should consider calling the copy method on the DataFrame first.

costs = df['Cost']
costs

Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64


costs+=2
costs

Store 1    24.5
Store 1     4.5
Store 2     7.0
Name: Cost, dtype: float64


df

         Cost Item Purchased   Name Location
Store 1  24.5       Dog Food  Chris     None
Store 1   4.5   Kitty Litter  Kevyn     None
Store 2   7.0      Bird Seed  Vinod     None


A common workflow is to read the dataset in, usually from some external file. We saw previously how you can do this using Python, and lists, and dictionaries.

Pandas has built-in support for delimited files such as CSV files as well as a variety of other data formats including relational databases, Excel, and HTML tables.

Here is a CSV file called olympics.csv, with data from Wikipedia that contains a summary list of the medals various countries have won at the Olympics.

!cat olympics.csv

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15
,№ Summer,01 !,02 !,03 !,Total,№ Winter,01 !,02 !,03 !,Total,№ Games,01 !,02 !,03 !,Combined total
Afghanistan (AFG),13,0,0,2,2,0,0,0,0,0,13,0,0,2,2
Algeria (ALG),12,5,2,8,15,3,0,0,0,0,15,5,2,8,15
Argentina (ARG),23,18,24,28,70,18,0,0,0,0,41,18,24,28,70
Armenia (ARM),5,1,2,9,12,6,0,0,0,0,11,1,2,9,12
Australasia (ANZ) [ANZ],2,3,4,5,12,0,0,0,0,0,2,3,4,5,12
Australia (AUS) [AUS] [Z],25,139,152,177,468,18,5,3,4,12,43,144,155,181,480
Austria (AUT),26,18,33,35,86,22,59,78,81,218,48,77,111,116,304
Azerbaijan (AZE),5,6,5,15,26,5,0,0,0,0,10,6,5,15,26
Bahamas (BAH),15,5,2,5,12,0,0,0,0,0,15,5,2,5,12
Bahrain (BRN),8,0,0,1,1,0,0,0,0,0,8,0,0,1,1
Barbados (BAR) [BAR],11,0,0,1,1,0,0,0,0,0,11,0,0,1,1
Belarus (BLR),5,12,24,39,75,6,6,4,5,15,11,18,28,44,90
Belgium (BEL),25,37,52,53,142,20,1,1,3,5,45,38,53,56,147
Bermuda (BER),17,0,0,1,1,7,0,0,0,0,24,0,0,1,1
Bohemia (BOH) [BOH] [Z],3,0,1,3,4,0,0,0,0,0,3,0,1,3,4
Botswana (BOT),9,0,1,0,1,0,0,0,0,0,9,0,1,0,1
Brazil (BRA),21,23,30,55,108,7,0,0,0,0,28,23,30,55,108
British West Indies (BWI) [BWI],1,0,0,2,2,0,0,0,0,0,1,0,0,2,2
Bulgaria (BUL) [H],19,51,85,78,214,19,1,2,3,6,38,52,87,81,220
Burundi (BDI),5,1,0,0,1,0,0,0,0,0,5,1,0,0,1
Cameroon (CMR),13,3,1,1,5,1,0,0,0,0,14,3,1,1,5
Canada (CAN),25,59,99,121,279,22,62,56,52,170,47,121,155,173,449
Chile (CHI) [I],22,2,7,4,13,16,0,0,0,0,38,2,7,4,13
China (CHN) [CHN],9,201,146,126,473,10,12,22,19,53,19,213,168,145,526
Colombia (COL),18,2,6,11,19,1,0,0,0,0,19,2,6,11,19
Costa Rica (CRC),14,1,1,2,4,6,0,0,0,0,20,1,1,2,4
Ivory Coast (CIV) [CIV],12,0,1,0,1,0,0,0,0,0,12,0,1,0,1
Croatia (CRO),6,6,7,10,23,7,4,6,1,11,13,10,13,11,34
Cuba (CUB) [Z],19,72,67,70,209,0,0,0,0,0,19,72,67,70,209
Cyprus (CYP),9,0,1,0,1,10,0,0,0,0,19,0,1,0,1
Czech Republic (CZE) [CZE],5,14,15,15,44,6,7,9,8,24,11,21,24,23,68
Czechoslovakia (TCH) [TCH],16,49,49,45,143,16,2,8,15,25,32,51,57,60,168
Denmark (DEN) [Z],26,43,68,68,179,13,0,1,0,1,39,43,69,68,180
Djibouti (DJI) [B],7,0,0,1,1,0,0,0,0,0,7,0,0,1,1
Dominican Republic (DOM),13,3,2,1,6,0,0,0,0,0,13,3,2,1,6
Ecuador (ECU),13,1,1,0,2,0,0,0,0,0,13,1,1,0,2
Egypt (EGY) [EGY] [Z],21,7,9,10,26,1,0,0,0,0,22,7,9,10,26
Eritrea (ERI),4,0,0,1,1,0,0,0,0,0,4,0,0,1,1
Estonia (EST),11,9,9,15,33,9,4,2,1,7,20,13,11,16,40
Ethiopia (ETH),12,21,7,17,45,2,0,0,0,0,14,21,7,17,45
Finland (FIN),24,101,84,117,302,22,42,62,57,161,46,143,146,174,463
France (FRA) [O] [P] [Z],27,202,223,246,671,22,31,31,47,109,49,233,254,293,780
Gabon (GAB),9,0,1,0,1,0,0,0,0,0,9,0,1,0,1
Georgia (GEO),5,6,5,14,25,6,0,0,0,0,11,6,5,14,25
Germany (GER) [GER] [Z],15,174,182,217,573,11,78,78,53,209,26,252,260,270,782
United Team of Germany (EUA) [EUA],3,28,54,36,118,3,8,6,5,19,6,36,60,41,137
East Germany (GDR) [GDR],5,153,129,127,409,6,39,36,35,110,11,192,165,162,519
West Germany (FRG) [FRG],5,56,67,81,204,6,11,15,13,39,11,67,82,94,243
Ghana (GHA) [GHA],13,0,1,3,4,1,0,0,0,0,14,0,1,3,4
Great Britain (GBR) [GBR] [Z],27,236,272,272,780,22,10,4,12,26,49,246,276,284,806
Greece (GRE) [Z],27,30,42,39,111,18,0,0,0,0,45,30,42,39,111
Grenada (GRN),8,1,0,0,1,0,0,0,0,0,8,1,0,0,1
Guatemala (GUA),13,0,1,0,1,1,0,0,0,0,14,0,1,0,1
Guyana (GUY) [GUY],16,0,0,1,1,0,0,0,0,0,16,0,0,1,1
Haiti (HAI) [J],14,0,1,1,2,0,0,0,0,0,14,0,1,1,2
Hong Kong (HKG) [HKG],15,1,1,1,3,4,0,0,0,0,19,1,1,1,3
Hungary (HUN),25,167,144,165,476,22,0,2,4,6,47,167,146,169,482
Iceland (ISL),19,0,2,2,4,17,0,0,0,0,36,0,2,2,4
India (IND) [F],23,9,6,11,26,9,0,0,0,0,32,9,6,11,26
Indonesia (INA),14,6,10,11,27,0,0,0,0,0,14,6,10,11,27
Iran (IRI) [K],15,15,20,25,60,10,0,0,0,0,25,15,20,25,60
Iraq (IRQ),13,0,0,1,1,0,0,0,0,0,13,0,0,1,1
Ireland (IRL),20,9,8,12,29,6,0,0,0,0,26,9,8,12,29
Israel (ISR),15,1,1,5,7,6,0,0,0,0,21,1,1,5,7
Italy (ITA) [M] [S],26,198,166,185,549,22,37,34,43,114,48,235,200,228,663
Jamaica (JAM) [JAM],16,17,30,20,67,7,0,0,0,0,23,17,30,20,67
Japan (JPN),21,130,126,142,398,20,10,17,18,45,41,140,143,160,443
Kazakhstan (KAZ),5,16,17,19,52,6,1,3,3,7,11,17,20,22,59
Kenya (KEN),13,25,32,29,86,3,0,0,0,0,16,25,32,29,86
North Korea (PRK),9,14,12,21,47,8,0,1,1,2,17,14,13,22,49
South Korea (KOR),16,81,82,80,243,17,26,17,10,53,33,107,99,90,296
Kuwait (KUW),12,0,0,2,2,0,0,0,0,0,12,0,0,2,2
Kyrgyzstan (KGZ),5,0,1,2,3,6,0,0,0,0,11,0,1,2,3
Latvia (LAT),10,3,11,5,19,10,0,4,3,7,20,3,15,8,26
Lebanon (LIB),16,0,2,2,4,16,0,0,0,0,32,0,2,2,4
Liechtenstein (LIE),16,0,0,0,0,18,2,2,5,9,34,2,2,5,9
Lithuania (LTU),8,6,5,10,21,8,0,0,0,0,16,6,5,10,21
Luxembourg (LUX) [O],22,1,1,0,2,8,0,2,0,2,30,1,3,0,4
Macedonia (MKD),5,0,0,1,1,5,0,0,0,0,10,0,0,1,1
Malaysia (MAS) [MAS],12,0,3,3,6,0,0,0,0,0,12,0,3,3,6
Mauritius (MRI),8,0,0,1,1,0,0,0,0,0,8,0,0,1,1
Mexico (MEX),22,13,21,28,62,8,0,0,0,0,30,13,21,28,62
Moldova (MDA),5,0,2,5,7,6,0,0,0,0,11,0,2,5,7
Mongolia (MGL),12,2,9,13,24,13,0,0,0,0,25,2,9,13,24
Montenegro (MNE),2,0,1,0,1,2,0,0,0,0,4,0,1,0,1
Morocco (MAR),13,6,5,11,22,6,0,0,0,0,19,6,5,11,22
Mozambique (MOZ),9,1,0,1,2,0,0,0,0,0,9,1,0,1,2
Namibia (NAM),6,0,4,0,4,0,0,0,0,0,6,0,4,0,4
Netherlands (NED) [Z],25,77,85,104,266,20,37,38,35,110,45,114,123,139,376
Netherlands Antilles (AHO) [AHO] [I],13,0,1,0,1,2,0,0,0,0,15,0,1,0,1
New Zealand (NZL) [NZL],22,42,18,39,99,15,0,1,0,1,37,42,19,39,100
Niger (NIG),11,0,0,1,1,0,0,0,0,0,11,0,0,1,1
Nigeria (NGR),15,3,8,12,23,0,0,0,0,0,15,3,8,12,23
Norway (NOR) [Q],24,56,49,43,148,22,118,111,100,329,46,174,160,143,477
Pakistan (PAK),16,3,3,4,10,2,0,0,0,0,18,3,3,4,10
Panama (PAN),16,1,0,2,3,0,0,0,0,0,16,1,0,2,3
Paraguay (PAR),11,0,1,0,1,1,0,0,0,0,12,0,1,0,1
Peru (PER) [L],17,1,3,0,4,2,0,0,0,0,19,1,3,0,4
Philippines (PHI),20,0,2,7,9,4,0,0,0,0,24,0,2,7,9
Poland (POL),20,64,82,125,271,22,6,7,7,20,42,70,89,132,291
Portugal (POR),23,4,8,11,23,7,0,0,0,0,30,4,8,11,23
Puerto Rico (PUR),17,0,2,6,8,6,0,0,0,0,23,0,2,6,8
Qatar (QAT),8,0,0,4,4,0,0,0,0,0,8,0,0,4,4
Romania (ROU),20,88,94,119,301,20,0,0,1,1,40,88,94,120,302
Russia (RUS) [RUS],5,132,121,142,395,6,49,40,35,124,11,181,161,177,519
Russian Empire (RU1) [RU1],3,1,4,3,8,0,0,0,0,0,3,1,4,3,8
Soviet Union (URS) [URS],9,395,319,296,1010,9,78,57,59,194,18,473,376,355,1204
Unified Team (EUN) [EUN],1,45,38,29,112,1,9,6,8,23,2,54,44,37,135
Saudi Arabia (KSA),10,0,1,2,3,0,0,0,0,0,10,0,1,2,3
Senegal (SEN),13,0,1,0,1,5,0,0,0,0,18,0,1,0,1
Serbia (SRB) [SRB],3,1,2,4,7,2,0,0,0,0,5,1,2,4,7
Serbia and Montenegro (SCG) [SCG],3,2,4,3,9,3,0,0,0,0,6,2,4,3,9
Singapore (SIN),15,0,2,2,4,0,0,0,0,0,15,0,2,2,4
Slovakia (SVK) [SVK],5,7,9,8,24,6,2,2,1,5,11,9,11,9,29
Slovenia (SLO),6,4,6,9,19,7,2,4,9,15,13,6,10,18,34
South Africa (RSA),18,23,26,27,76,6,0,0,0,0,24,23,26,27,76
Spain (ESP) [Z],22,37,59,35,131,19,1,0,1,2,41,38,59,36,133
Sri Lanka (SRI) [SRI],16,0,2,0,2,0,0,0,0,0,16,0,2,0,2
Sudan (SUD),11,0,1,0,1,0,0,0,0,0,11,0,1,0,1
Suriname (SUR) [E],11,1,0,1,2,0,0,0,0,0,11,1,0,1,2
Sweden (SWE) [Z],26,143,164,176,483,22,50,40,54,144,48,193,204,230,627
Switzerland (SUI),27,47,73,65,185,22,50,40,48,138,49,97,113,113,323
Syria (SYR),12,1,1,1,3,0,0,0,0,0,12,1,1,1,3
Chinese Taipei (TPE) [TPE] [TPE2],13,2,7,12,21,11,0,0,0,0,24,2,7,12,21
Tajikistan (TJK),5,0,1,2,3,4,0,0,0,0,9,0,1,2,3
Tanzania (TAN) [TAN],12,0,2,0,2,0,0,0,0,0,12,0,2,0,2
Thailand (THA),15,7,6,11,24,3,0,0,0,0,18,7,6,11,24
Togo (TOG),9,0,0,1,1,1,0,0,0,0,10,0,0,1,1
Tonga (TGA),8,0,1,0,1,1,0,0,0,0,9,0,1,0,1
Trinidad and Tobago (TRI) [TRI],16,2,5,11,18,3,0,0,0,0,19,2,5,11,18
Tunisia (TUN),13,3,3,4,10,0,0,0,0,0,13,3,3,4,10
Turkey (TUR),21,39,25,24,88,16,0,0,0,0,37,39,25,24,88
Uganda (UGA),14,2,3,2,7,0,0,0,0,0,14,2,3,2,7
Ukraine (UKR),5,33,27,55,115,6,2,1,4,7,11,35,28,59,122
United Arab Emirates (UAE),8,1,0,0,1,0,0,0,0,0,8,1,0,0,1
United States (USA) [P] [Q] [R] [Z],26,976,757,666,2399,22,96,102,84,282,48,1072,859,750,2681
Uruguay (URU),20,2,2,6,10,1,0,0,0,0,21,2,2,6,10
Uzbekistan (UZB),5,5,5,10,20,6,1,0,0,1,11,6,5,10,21
Venezuela (VEN),17,2,2,8,12,4,0,0,0,0,21,2,2,8,12
Vietnam (VIE),14,0,2,0,2,0,0,0,0,0,14,0,2,0,2
Virgin Islands (ISV),11,0,1,0,1,7,0,0,0,0,18,0,1,0,1
Yugoslavia (YUG) [YUG],16,26,29,28,83,14,0,3,1,4,30,26,32,29,87
Independent Olympic Participants (IOP) [IOP],1,0,1,2,3,0,0,0,0,0,1,0,1,2,3
Zambia (ZAM) [ZAM],12,0,1,1,2,0,0,0,0,0,12,0,1,1,2
Zimbabwe (ZIM) [ZIM],12,3,4,1,8,1,0,0,0,0,13,3,4,1,8
Mixed team (ZZX) [ZZX],3,8,5,4,17,0,0,0,0,0,3,8,5,4,17
Totals,27,4809,4775,5130,14714,22,959,958,948,2865,49,5768,5733,6078,17579


We read this into a DataFrame by calling the read_csv function. When we look at the DataFrame, we see that the first cell has NaN in it, since it's an empty value, and that the rows have been automatically indexed for us.

It seems pretty clear that the first row of data in the DataFrame is what we really want to see as the column names. It also seems like the first column in the data is the country name, which we would like to make the index.

df = pd.read_csv('olympics.csv')
df.head()


                    0         1     2     3     4      5         6     7     8     9     10       11    12    13    14              15
0                 NaN  № Summer  01 !  02 !  03 !  Total  № Winter  01 !  02 !  03 !  Total  № Games  01 !  02 !  03 !  Combined total
1  Afghanistan (AFG)        13     0     0     2      2         0     0     0     0      0       13     0     0     2               2
2      Algeria (ALG)        12     5     2     8     15         3     0     0     0      0       15     5     2     8              15
3    Argentina (ARG)        23    18    24    28     70        18     0     0     0      0       41    18    24    28              70
4      Armenia (ARM)         5     1     2     9     12         6     0     0     0      0       11     1     2     9              12


read_csv has a number of parameters that we can use to indicate to pandas how rows and columns should be labeled.

We can use the index_col parameter to indicate which column should be the index, and we can also use the header parameter to indicate which row from the data file should be used as the header.

Let's re-import that data, set the index column to be 0 (the first column), and have the column headers read from the second row of data. We do this with the skiprows parameter, telling pandas to ignore the first row, which was made up of numeric column names.

Now, this data came from the all-time Olympic Games medal table on Wikipedia. If we head to the page, we see that instead of writing out gold, silver, and bronze in the column headings, the page uses little icons with a one, a two, and a three in them. In our CSV file these were represented with the strings 01 !, 02 !, and so on. We also see that the column names are repeated, which really isn't good practice. Pandas recognizes this and appends .1 and .2 to make the names unique.

This labeling isn't really as clear as it could be, so we should clean up the data file. We could of course do this just by editing the CSV file directly, but we can also rename the columns using the pandas rename method.

df = pd.read_csv('olympics.csv', index_col = 0, skiprows=1)
df.head()


                         № Summer  01 !  02 !  03 !  Total  № Winter  01 !.1  02 !.1  03 !.1  Total.1  № Games  01 !.2  02 !.2  03 !.2  Combined total
Afghanistan (AFG)              13     0     0     2      2         0       0       0       0        0       13       0       0       2               2
Algeria (ALG)                  12     5     2     8     15         3       0       0       0        0       15       5       2       8              15
Argentina (ARG)                23    18    24    28     70        18       0       0       0        0       41      18      24      28              70
Armenia (ARM)                   5     1     2     9     12         6       0       0       0        0       11       1       2       9              12
Australasia (ANZ) [ANZ]         2     3     4     5     12         0       0       0       0        0        2       3       4       5              12

df.columns

Index(['№ Summer', '01 !', '02 !', '03 !', 'Total', '№ Winter', '01 !.1',
       '02 !.1', '03 !.1', 'Total.1', '№ Games', '01 !.2', '02 !.2', '03 !.2',
       'Combined total'],
      dtype='object')

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold' + col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver' + col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze' + col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#' + col[1:]}, inplace=True) 

df.head()


                         # Summer  Gold  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total
Afghanistan (AFG)              13     0       0       2      2         0       0         0         0        0       13       0         0         2               2
Algeria (ALG)                  12     5       2       8     15         3       0         0         0        0       15       5         2         8              15
Argentina (ARG)                23    18      24      28     70        18       0         0         0        0       41      18        24        28              70
Armenia (ARM)                   5     1       2       9     12         6       0         0         0        0       11       1         2         9              12
Australasia (ANZ) [ANZ]         2     3       4       5     12         0       0         0         0        0        2       3         4         5              12
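
As an aside, the same renaming can be done without an explicit loop. A sketch using vectorized string replacement on the columns Index (assumes a reasonably recent pandas where str.replace accepts regex=False):

df.columns = (df.columns.str.replace('01 !', 'Gold', regex=False)
                        .str.replace('02 !', 'Silver', regex=False)
                        .str.replace('03 !', 'Bronze', regex=False)
                        .str.replace('№', '#', regex=False))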




Querying a DataFrame


A Boolean mask is an array which can be of one dimension like a series, or two dimensions like a DataFrame, where each of the values in the array is either True or False. This array is essentially overlaid on top of the data structure that we're querying. Any cell aligned with a True value will be admitted into our final result, and any cell aligned with a False value will not.

Boolean masking is powerful conceptually and is the cornerstone of efficient NumPy and pandas querying.

To build a Boolean mask for this query, we project the Gold column using the indexing operator and apply the greater-than operator with a comparison value of zero. This is essentially broadcasting a comparison operator, greater than, with the results being returned as a Boolean series.

The resultant series is indexed where the value of each cell is either true or false depending on whether a country has won at least one gold medal, and the index is the country name.

df['Gold'] > 0

Afghanistan (AFG)                               False
Algeria (ALG)                                    True
Argentina (ARG)                                  True
Armenia (ARM)                                    True
Australasia (ANZ) [ANZ]                          True


Overlay that mask on the data frame. We can do this using the where function. The where function takes a Boolean mask as a condition, applies it to the data frame or series, and returns a new data frame or series of the same shape.

Only data from countries that met the condition are retained. All of the countries which did not meet the condition have NaN data instead. This is okay. Most statistical functions built into the data frame object ignore values of NaN.

only_gold = df.where(df['Gold'] > 0)
only_gold.head()


                         # Summer  Gold  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total
Afghanistan (AFG)             NaN   NaN     NaN     NaN    NaN       NaN     NaN       NaN       NaN      NaN      NaN     NaN       NaN       NaN             NaN
Algeria (ALG)                12.0   5.0     2.0     8.0   15.0       3.0     0.0       0.0       0.0      0.0     15.0     5.0       2.0       8.0            15.0
Argentina (ARG)              23.0  18.0    24.0    28.0   70.0      18.0     0.0       0.0       0.0      0.0     41.0    18.0      24.0      28.0            70.0
Armenia (ARM)                 5.0   1.0     2.0     9.0   12.0       6.0     0.0       0.0       0.0      0.0     11.0     1.0       2.0       9.0            12.0
Australasia (ANZ) [ANZ]       2.0   3.0     4.0     5.0   12.0       0.0     0.0       0.0       0.0      0.0      2.0     3.0       4.0       5.0            12.0

If we call count on the only_gold DataFrame's Gold column, we see that there are 100 countries which have had gold medals awarded at the summer games, while if we call count on the original DataFrame's Gold column, we see that there are 147 countries total.

only_gold['Gold'].count()
100

df['Gold'].count()
147

Often we want to drop the rows which have no data. To do this, we can use the dropna function. You can optionally provide dropna the axis it should be considering. Remember that the axis is just an indicator for the columns or rows, and that the default is zero, which means rows.

only_gold = only_gold.dropna()
only_gold.head()

                           # Summer   Gold  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total
Algeria (ALG)                  12.0    5.0     2.0     8.0   15.0       3.0     0.0       0.0       0.0      0.0     15.0     5.0       2.0       8.0            15.0
Argentina (ARG)                23.0   18.0    24.0    28.0   70.0      18.0     0.0       0.0       0.0      0.0     41.0    18.0      24.0      28.0            70.0
Armenia (ARM)                   5.0    1.0     2.0     9.0   12.0       6.0     0.0       0.0       0.0      0.0     11.0     1.0       2.0       9.0            12.0
Australasia (ANZ) [ANZ]         2.0    3.0     4.0     5.0   12.0       0.0     0.0       0.0       0.0      0.0      2.0     3.0       4.0       5.0            12.0
Australia (AUS) [AUS] [Z]      25.0  139.0   152.0   177.0  468.0      18.0     5.0       3.0       4.0     12.0     43.0   144.0     155.0     181.0           480.0

Alternatively, we can apply the mask directly with the indexing operator. There are no NaNs when you query the DataFrame in this manner; pandas automatically filters out the rows with NaN values.

only_gold = df[df['Gold'] > 0]
only_gold.head()

                           # Summer  Gold  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total
Algeria (ALG)                    12     5       2       8     15         3       0         0         0        0       15       5         2         8              15
Argentina (ARG)                  23    18      24      28     70        18       0         0         0        0       41      18        24        28              70
Armenia (ARM)                     5     1       2       9     12         6       0         0         0        0       11       1         2         9              12
Australasia (ANZ) [ANZ]           2     3       4       5     12         0       0         0         0        0        2       3         4         5              12
Australia (AUS) [AUS] [Z]        25   139     152     177    468        18       5         3         4       12       43     144       155       181             480

You can chain together a bunch of and/or statements in order to create more complex queries, and the result is a single Boolean mask.

len(df[(df['Gold'] > 0) | (df['Gold.1'] > 0)])
101

Countries that have won a gold in the winter Olympics but never in the summer Olympics:

df[(df['Gold.1'] > 0) & (df['Gold'] == 0)]

                     # Summer  Gold  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total
Liechtenstein (LIE)        16     0       0       0      0        18       2         2         5        9       34       2         2         5               9


Extremely important, and often an issue for new users, is to remember that each Boolean mask needs to be encased in parentheses because of operator precedence. This can cause no end of frustration if you're not used to it, so be careful.

Indexing DataFrames



Both series and DataFrames can have indices applied to them. The index is essentially a row level label, and we know that rows correspond to axis zero. In our Olympics data, we indexed the data frame by the name of the country. 

Indices can either be inferred, such as when we create a new series without an index, in which case we get numeric values, or they can be set explicitly, like when we use the dictionary object to create the series, or when we loaded data from the CSV file and specified the header. 

Another option for setting an index is to use the set_index function. This function takes a list of columns and promotes those columns to an index. Set index is a destructive process, it doesn't keep the current index. If you want to keep the current index, you need to manually create a new column and copy into it values from the index attribute.

Let's go back to our Olympics DataFrame.

df.head()

                         # Summer  Gold  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total
Afghanistan (AFG)              13     0       0       2      2         0       0         0         0        0       13       0         0         2               2
Algeria (ALG)                  12     5       2       8     15         3       0         0         0        0       15       5         2         8              15
Argentina (ARG)                23    18      24      28     70        18       0         0         0        0       41      18        24        28              70
Armenia (ARM)                   5     1       2       9     12         6       0         0         0        0       11       1         2         9              12
Australasia (ANZ) [ANZ]         2     3       4       5     12         0       0         0         0        0        2       3         4         5              12


Let's say that we don't want to index the DataFrame by countries, but instead want to index by the number of gold medals that were won at summer games. First we need to preserve the country information in a new column. We can do this using the indexing operator with the string of the column label. Then we can use set_index to make the summer gold medal wins the index.

When we create a new index from an existing column, it appears that a new first row has been added with empty values. This isn't quite what's happening; we know this in part because an empty value would actually be rendered either as None or as NaN if the data type of the column is numeric. What has actually happened is that the index now has a name (the original column name), and the Jupyter notebook is rendering this name in the output.

df['country'] = df.index
df = df.set_index('Gold')
df.head()
      # Summer  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total                  country
Gold
0           13       0       2      2         0       0         0         0        0       13       0         0         2               2        Afghanistan (AFG)
5           12       2       8     15         3       0         0         0        0       15       5         2         8              15            Algeria (ALG)
18          23      24      28     70        18       0         0         0        0       41      18        24        28              70          Argentina (ARG)
1            5       2       9     12         6       0         0         0        0       11       1         2         9              12            Armenia (ARM)
3            2       4       5     12         0       0         0         0        0        2       3         4         5              12  Australasia (ANZ) [ANZ]


We can get rid of the index completely by calling the function reset_index. This promotes the index into a column and creates a default numbered index.

df = df.reset_index()
df.head()
   Gold  # Summer  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total                  country
0     0        13       0       2      2         0       0         0         0        0       13       0         0         2               2        Afghanistan (AFG)
1     5        12       2       8     15         3       0         0         0        0       15       5         2         8              15            Algeria (ALG)
2    18        23      24      28     70        18       0         0         0        0       41      18        24        28              70          Argentina (ARG)
3     1         5       2       9     12         6       0         0         0        0       11       1         2         9              12            Armenia (ARM)
4     3         2       4       5     12         0       0         0         0        0        2       3         4         5              12  Australasia (ANZ) [ANZ]

One nice feature of pandas is that it has the option to do multi-level indexing. This is similar to composite keys in relational database systems. To create a multi-level index, we simply call set index and give it a list of columns that we're interested in promoting to an index.

It's a great example of how different kinds of data sets might be formatted when you're trying to clean them.

For instance, in this data set there are two summarized levels: one that contains summary data for each state, and one that contains summary data for each county.

df = pd.read_csv('census.csv')
df.head()
<data>
5 rows × 100 columns

I often find that I want to see a list of all the unique values in a given column. In this DataFrame, we can see the possible values for SUMLEV by using the unique function. This is similar to the SQL DISTINCT operator.

df['SUMLEV'].unique()
array([40, 50])

Here we run unique on the SUMLEV column of our current DataFrame and see that there are only two different values, 40 and 50. Let's get rid of all of the rows that are summaries at the state level and just keep the county data.

df=df[df['SUMLEV'] == 50]
df.head()
<data>
5 rows × 100 columns

Also, while this data set is interesting for a number of different reasons, let's reduce the data we're going to look at to just the total population estimates and the total number of births. We can do this by creating a list of column names that we want to keep, then projecting those columns and assigning the resulting DataFrame to our df variable.

columns_to_keep = ['STNAME',
                   'CTYNAME',
                   'BIRTHS2010',
                   'BIRTHS2011',
                   'BIRTHS2012',
                   'BIRTHS2013',
                   'BIRTHS2014',
                   'BIRTHS2015',
                   'POPESTIMATE2010',
                   'POPESTIMATE2011',
                   'POPESTIMATE2012',
                   'POPESTIMATE2013',
                   'POPESTIMATE2014',
                   'POPESTIMATE2015']
df = df[columns_to_keep]
df.head()

    STNAME         CTYNAME  BIRTHS2010  BIRTHS2011  BIRTHS2012  BIRTHS2013  BIRTHS2014  BIRTHS2015  POPESTIMATE2010  ...
1  Alabama  Autauga County         151         636         615         574         623         600            54660  ...
2  Alabama  Baldwin County         517        2187        2092        2160        2186        2240           183193  ...
3  Alabama  Barbour County          70         335         300         283         260         269            27341  ...
4  Alabama     Bibb County          44         266         245         259         247         253            22861  ...
5  Alabama   Blount County         183         744         710         646         618         603            57373  ...


The US Census data breaks down estimates of population data by state and county. We can load the data and set the index to be a combination of the state and county values.

df = df.set_index(['STNAME', 'CTYNAME'])
df.head()
                        BIRTHS2010  BIRTHS2011  BIRTHS2012  BIRTHS2013  BIRTHS2014  BIRTHS2015  POPESTIMATE2010  POPESTIMATE2011  ...
STNAME  CTYNAME
Alabama Autauga County         151         636         615         574         623         600            54660            55253  ...
        Baldwin County         517        2187        2092        2160        2186        2240           183193           186659  ...
        Barbour County          70         335         300         283         260         269            27341            27226  ...
        Bibb County             44         266         245         259         247         253            22861            22733  ...
        Blount County          183         744         710         646         618         603            57373            57711  ...


An immediate question which comes up is how we can query this DataFrame.

When you use a MultiIndex, you must provide the arguments in order by the level you wish to query.

df.loc['Michigan', 'Washtenaw County']
BIRTHS2010            977
BIRTHS2011           3826
BIRTHS2012           3780
BIRTHS2013           3662
BIRTHS2014           3683
BIRTHS2015           3709
POPESTIMATE2010    345563
POPESTIMATE2011    349048
POPESTIMATE2012    351213
POPESTIMATE2013    354289
POPESTIMATE2014    357029
POPESTIMATE2015    358880
Name: (Michigan, Washtenaw County), dtype: int64

df.loc[ [('Michigan', 'Washtenaw County'),
         ('Michigan', 'Wayne County')] ]
                          BIRTHS2010  BIRTHS2011  BIRTHS2012  BIRTHS2013  BIRTHS2014  BIRTHS2015  POPESTIMATE2010  ...
STNAME   CTYNAME
Michigan Washtenaw County        977        3826        3780        3662        3683        3709           345563  ...
         Wayne County           5918       23819       23270       23377       23607       23586          1815199  ...

Of course hierarchical labeling isn't just for rows. For example, you can transpose this matrix and now have hierarchical column labels.
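
For instance, after transposing, a (state, county) pair becomes a column label, and we can select it by tuple. A small sketch:

df.T[('Michigan', 'Washtenaw County')]   # the Washtenaw County stats as a column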


Add an entry


Reindex the purchase records so the data is indexed first by store and then by person, and add a new entry.

import pandas as pd

purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})

df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])


# Your answer here
df['Location'] = df.index
df = df.set_index(['Location', 'Name'])

df = df.append(pd.Series(data={'Cost': 3.00, 'Item Purchased': 'Kitty Food'}, name=('Store 2', 'Kevyn')))

# Alternate approach
#
# purchase_4 = pd.Series({'Item Purchased': 'Kitty Food','Cost': 3.00})
# cols = ['Cost','Item Purchased']
# df.loc[('Store 2', 'Kevyn'), cols] = purchase_4.values

df


                 Cost Item Purchased
Location Name
Store 1  Chris   22.5       Dog Food
         Kevyn    2.5   Kitty Litter
Store 2  Vinod    5.0      Bird Seed
         Kevyn    3.0     Kitty Food



Missing Values



We've seen a preview of how Pandas handles missing values using the None type and NumPy NaN values. Missing values are pretty common in data cleaning activities.

First, the built-in loading from delimited files provides control over missing values in a few ways. The most germane of these is the na_values list, which indicates other strings that should be treated as missing values. Some of my sociologist colleagues, for instance, regularly use the value 99 in binary categories to indicate that there's no value.

You can also use the na_filter option to turn off missing-value filtering entirely, for instance if white space is an actual value of interest. In practice, this is pretty rare.
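
A sketch of both options (the file name survey.csv and the 99 sentinel are hypothetical here):

df = pd.read_csv('survey.csv', na_values=['99'])   # also treat the string 99 as missing
df = pd.read_csv('survey.csv', na_filter=False)    # disable missing-value detection entirely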

In addition to rules controlling how missing values might be loaded, it's sometimes useful to consider missing values as actually having information.

I often deal with logs from online learning systems. In particular, I've done a couple of projects looking at video use in lecture capture systems. In these systems it's common for the player to have a heartbeat functionality, where playback statistics are sent to the server every so often, maybe every 30 seconds.

These heartbeats can get big as they can carry the whole state of the playback system, such as where the video play head is, what the video size is, which video is being rendered to the screen, how loud the volume is, etc.

If we load the data file log.txt, we can see an example of what this might look like. In this data the first column is a timestamp in the Unix epoch format. The next column is the user name followed by a web page they're visiting and the video that they're playing.

Each row of the DataFrame has a playback position. And we can see that as the playback position increases by one, the time stamp increases by about 30 seconds.

Except for user Bob. It turns out that Bob has paused his playback so as time increases the playback position doesn't change. Note too how difficult it is for us to try and derive this knowledge from the data, because it's not sorted by time stamp as one might expect. This is actually not uncommon on systems which have a high degree of parallelism.

There are a lot of missing values in the paused and volume columns. It's not efficient to send this information across the network if it hasn't changed. So this particular system just inserts null values into the database if there's no changes.


One of the handy functions that Pandas has for working with missing values is the filling function, fillna. This function takes a number of parameters. For instance, you could pass in a single value, called a scalar, to change all of the missing data to that one value.

Two common fill methods are ffill and bfill. ffill is for forward filling: it updates a NaN value in a particular cell with the value from the previous row. It's important to note that your data needs to be sorted in order for this to have the effect you might want. Data that comes from traditional database management systems usually has no order guarantee, just like this data. So be careful.

df = pd.read_csv('log.csv')
df

         time    user       video  playback position  paused  volume
0  1469974424  cheryl  intro.html                  5   False    10.0
1  1469974454  cheryl  intro.html                  6     NaN     NaN
2  1469974544  cheryl  intro.html                  9     NaN     NaN
3  1469974574  cheryl  intro.html                 10     NaN     NaN
4  1469977514     bob  intro.html                  1     NaN     NaN

df.fillna?

In Pandas we can sort either by index or by values. Here we'll just promote the time stamp to an index then sort on the index.

df = df.set_index('time')
df = df.sort_index()
df
              user          video  playback position  paused  volume
time
1469974424  cheryl     intro.html                  5   False    10.0
1469974424     sue  advanced.html                 23   False    10.0
1469974454  cheryl     intro.html                  6     NaN     NaN
1469974454     sue  advanced.html                 24     NaN     NaN
1469974484  cheryl     intro.html                  7     NaN     NaN

If we look closely at the output though we'll notice that the index isn't really unique. Two users seem to be able to use the system at the same time. Again, a very common case.

Let's reset the index, and use some multi-level indexing instead, and promote the user name to a second level of the index to deal with that issue.

df = df.reset_index()
df = df.set_index(['time', 'user'])
df

                           video  playback position  paused  volume
time       user
1469974424 cheryl     intro.html                  5   False    10.0
           sue     advanced.html                 23   False    10.0
1469974454 cheryl     intro.html                  6     NaN     NaN
           sue     advanced.html                 24     NaN     NaN
1469974484 cheryl     intro.html                  7     NaN     NaN

Now that we have the data indexed and sorted appropriately, we can fill the missing data using ffill.

df = df.fillna(method='ffill')
df.head()

                           video  playback position  paused  volume
time       user
1469974424 cheryl     intro.html                  5   False    10.0
           sue     advanced.html                 23   False    10.0
1469974454 cheryl     intro.html                  6   False    10.0
           sue     advanced.html                 24   False    10.0
1469974484 cheryl     intro.html                  7   False    10.0
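One caveat: in recent Pandas releases the method argument to fillna is deprecated in favor of dedicated methods, so the same forward fill would be written as:

df = df.ffill()  # equivalent to df.fillna(method='ffill') in older versions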

It's sometimes useful to use forward filling, sometimes backward filling, and sometimes just a single value. More recently, the Pandas team added the ability to fill missing values with a Series that is the same length as your DataFrame; values are matched up by index, which makes it easy to derive missing values when you have the underlying data to do so.

For instance, if you're dealing with receipts and you have a column for final price and a column for discount but are missing information from the original price column, you can fill this automatically using fillna.
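A minimal sketch of that idea, with hypothetical column names (final_price, discount, and original_price are illustrative, not from the lecture data):

import pandas as pd

receipts = pd.DataFrame({'final_price': [9.0, 18.0, 4.5],
                         'discount': [1.0, 2.0, 0.5],
                         'original_price': [10.0, None, None]})

# fillna accepts a Series; values are matched to the frame row by row via the index
receipts['original_price'] = receipts['original_price'].fillna(
    receipts['final_price'] + receipts['discount'])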

When you use statistical functions on DataFrames, these functions typically ignore missing values. For instance if you try and calculate the mean value of a DataFrame, the underlying NumPy function will ignore missing values. This is usually what you want but you should be aware that values are being excluded.
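For example:

import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0])
s.mean()              # 2.0 -- the NaN is excluded by default
s.mean(skipna=False)  # nan -- force missing values to propagate into the result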


Misc


The following code loads the olympics dataset (olympics.csv), which was derived from the Wikipedia entry on All Time Olympic Games Medals, and does some basic data cleaning.

The columns are organized as # of Summer games, Summer medals, # of Winter games, Winter medals, total # of games, and total # of medals. Use this dataset to answer the questions below.


import pandas as pd

df = pd.read_csv('olympics.csv', index_col=0, skiprows=1)

# Rename the numbered medal columns to Gold/Silver/Bronze,
# and the '№' columns to '#'
for col in df.columns:
    if col[:2] == '01':
        df.rename(columns={col: 'Gold' + col[4:]}, inplace=True)
    if col[:2] == '02':
        df.rename(columns={col: 'Silver' + col[4:]}, inplace=True)
    if col[:2] == '03':
        df.rename(columns={col: 'Bronze' + col[4:]}, inplace=True)
    if col[:1] == '№':
        df.rename(columns={col: '#' + col[1:]}, inplace=True)

names_ids = df.index.str.split(r'\s\(')  # split the index by ' (' (raw string avoids escape warnings)

df.index = names_ids.str[0]  # the [0] element is the country name (new index)
df['ID'] = names_ids.str[1].str[:3]  # the [1] element is the abbreviation or ID (take its first 3 characters)

df = df.drop('Totals')
df.head()


             # Summer  Gold  Silver  Bronze  Total  # Winter  Gold.1  Silver.1  Bronze.1  Total.1  # Games  Gold.2  Silver.2  Bronze.2  Combined total   ID
Afghanistan        13     0       0       2      2         0       0         0         0        0       13       0         0         2               2  AFG
Algeria            12     5       2       8     15         3       0         0         0        0       15       5         2         8              15  ALG
Argentina          23    18      24      28     70        18       0         0         0        0       41      18        24        28              70  ARG
Armenia             5     1       2       9     12         6       0         0         0        0       11       1         2         9              12  ARM
Australasia         2     3       4       5     12         0       0         0         0        0        2       3         4         5              12  ANZ

### Question 0 

(Example) What is the first country in df? *This function should return a Series.*

def answer_zero():
    # This function returns the row for Afghanistan, which is a Series object. The assignment
    # question description will tell you the general format the autograder is expecting
    return df.iloc[0]

# You can examine what your function returns by calling it in the cell. If you have questions
# about the assignment formats, check out the discussion forums for any FAQs
answer_zero()


### Question 1 

Which country has won the most gold medals in summer games? *This function should return a single string value.*


def answer_one():
    return (df.sort_values(by='Gold', ascending=False).iloc[0].name)
answer_one()

'United States'
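An equivalent and more direct expression uses idxmax, which returns the index label of the largest value:

def answer_one():
    return df['Gold'].idxmax()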

### Question 2 
Which country had the biggest difference between their summer and winter gold medal counts? *This function should return a single string value.*

def answer_two():
    df['diff_two'] = abs(df['Gold'] - df['Gold.1'])
    ans_two = df.sort_values(by='diff_two', ascending=False).iloc[0].name
    df.drop('diff_two', inplace=True, axis=1)
    return ans_two

answer_two()

'United States'

### Question 3 
Which country has the biggest difference between their summer gold medal counts and winter gold medal counts relative to their total gold medal count? 
(Summer Gold - Winter Gold) / Total Gold
Only include countries that have won at least 1 gold in both summer and winter. 
*This function should return a single string value.*

def answer_three():
    # Copy the filtered frame so adding a column doesn't raise SettingWithCopyWarning
    only_gold = df[(df['Gold'] > 0) & (df['Gold.1'] > 0)].copy()
    only_gold['diff_three'] = abs(only_gold['Gold'] - only_gold['Gold.1']) / only_gold['Gold.2']
    return only_gold.sort_values(by='diff_three', ascending=False).iloc[0].name

answer_three()
'Bulgaria'

### Question 4 
Write a function that creates a Series called "Points" which is a weighted value where each gold medal (`Gold.2`) counts for 3 points, silver medals (`Silver.2`) for 2 points, and bronze medals (`Bronze.2`) for 1 point. 
The function should return only the column (a Series object) which you created, with the country names as indices. 
*This function should return a Series named `Points` of length 146*

def answer_four():
    points = df['Gold.2'] * 3 + df['Silver.2'] * 2 + df['Bronze.2'] * 1
    points.name = 'Points'  # the question asks for a Series named 'Points'
    return points

answer_four()
Afghanistan                            2
Algeria                               27
Argentina                            130
Armenia                               16
Australasia                           22


### Question 5 
Which state has the most counties in it? (hint: consider the sumlevel key carefully! You'll need this for future questions too...) 
*This function should return a single string value.*

census_df = pd.read_csv('census.csv')
census_df.head()


   SUMLEV  REGION  DIVISION  STATE  COUNTY   STNAME         CTYNAME  CENSUS2010POP  ESTIMATESBASE2010  ...
0      40       3         6      1       0  Alabama         Alabama        4779736            4780127  ...
1      50       3         6      1       1  Alabama  Autauga County          54571              54571  ...
2      50       3         6      1       3  Alabama  Baldwin County         182265             182265  ...
3      50       3         6      1       5  Alabama  Barbour County          27457              27457  ...
4      50       3         6      1       7  Alabama     Bibb County          22915              22919  ...


def answer_five():
    copy_df = census_df.copy()
    unique_states = copy_df['STNAME'].unique()
    copy_df = copy_df.set_index('STNAME')

    # Note: each state's rows include one state-level record (SUMLEV 40) alongside
    # its county rows, which inflates every count by one but doesn't change the winner
    max_count = 0
    max_state = None
    for st in unique_states:
        if len(copy_df.loc[st]) > max_count:
            max_count = len(copy_df.loc[st])
            max_state = st

    return max_state

answer_five()
'Texas'
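A more compact version, honoring the SUMLEV hint explicitly, can use value_counts:

def answer_five():
    counties = census_df[census_df['SUMLEV'] == 50]  # county-level rows only
    return counties['STNAME'].value_counts().idxmax()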


### Question 6 
**Only looking at the three most populous counties for each state**, what are the three most populous states (in order of highest population to lowest population)? Use `CENSUS2010POP`. *This function should return a list of string values.*

def answer_six():
    columns_to_keep = ['SUMLEV', 'STNAME', 'CTYNAME', 'CENSUS2010POP']
    six_df = census_df[columns_to_keep]

    six_df = six_df[six_df['SUMLEV'] == 50]  # county-level records only

    unique_states = six_df['STNAME'].unique()
    copy_df = six_df.set_index('STNAME')

    # Sum the three most populous counties for each state
    state_populations = {}
    for st in unique_states:
        state_df = copy_df.loc[st]
        if isinstance(state_df, pd.DataFrame):
            top_counties = state_df.sort_values(by='CENSUS2010POP', ascending=False)
            state_populations[st] = top_counties['CENSUS2010POP'].head(3).sum()
        else:
            # A state with a single county comes back as a Series
            state_populations[st] = state_df.loc['CENSUS2010POP']

    population = pd.Series(state_populations, name='CENSUS2010POP')
    return list(population.nlargest(3).index)

answer_six()
['California', 'Texas', 'Illinois']
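groupby removes the explicit loop entirely; a sketch over the same census_df:

def answer_six():
    counties = census_df[census_df['SUMLEV'] == 50]
    # For each state, sum its three largest county populations, then take the top three states
    top3 = (counties.groupby('STNAME')['CENSUS2010POP']
                    .apply(lambda s: s.nlargest(3).sum()))
    return list(top3.nlargest(3).index)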


### Question 7 
Which county has had the largest absolute change in population within the period 2010-2015? (Hint: population values are stored in columns POPESTIMATE2010 through POPESTIMATE2015, you need to consider all six columns.) 

e.g. If County Population in the 5 year period is 100, 120, 80, 105, 100, 130, then its largest change in the period would be |130-80| = 50. *This function should return a single string value.*

def answer_seven():
    pop_columns = ['POPESTIMATE2010', 'POPESTIMATE2011', 'POPESTIMATE2012',
                   'POPESTIMATE2013', 'POPESTIMATE2014', 'POPESTIMATE2015']
    seven_df = census_df[census_df['SUMLEV'] == 50].copy()  # county rows only; copy avoids SettingWithCopyWarning

    # The largest absolute change is the max estimate minus the min estimate across the six years
    seven_df['population_diff'] = seven_df[pop_columns].max(axis=1) - seven_df[pop_columns].min(axis=1)
    return seven_df.sort_values(by='population_diff', ascending=False).iloc[0]['CTYNAME']

answer_seven()
'Harris County'
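An alternative sketch without the helper column, assuming the six estimate columns are contiguous in the file so a .loc column slice works:

def answer_seven():
    counties = census_df[census_df['SUMLEV'] == 50]
    pop = counties.loc[:, 'POPESTIMATE2010':'POPESTIMATE2015']
    # idxmax returns the row label with the largest max-min spread
    return counties.loc[(pop.max(axis=1) - pop.min(axis=1)).idxmax(), 'CTYNAME']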


### Question 8 
In this data file, the United States is broken up into four regions using the "REGION" column. Create a query that finds the counties that belong to regions 1 or 2, whose name starts with 'Washington', and whose POPESTIMATE2015 was greater than their POPESTIMATE2014.

*This function should return a 5x2 DataFrame with the columns = ['STNAME', 'CTYNAME'] and the same index ID as the census_df (sorted ascending by index).*

def answer_eight():
    copy_df = census_df.copy()

    copy_df = copy_df[(copy_df['REGION'] == 1) | (copy_df['REGION'] == 2)]
    copy_df = copy_df[copy_df['CTYNAME'].str.startswith('Washington')]  # 'starts with', not 'contains'
    copy_df = copy_df[copy_df['POPESTIMATE2015'] > copy_df['POPESTIMATE2014']]
    return copy_df[['STNAME', 'CTYNAME']]

answer_eight()

            STNAME            CTYNAME
896           Iowa  Washington County
1419     Minnesota  Washington County
2345  Pennsylvania  Washington County
2355  Rhode Island  Washington County
3163     Wisconsin  Washington County
