Working with Text Data

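So far, we have let pandas decide the types of our data. When it detects numbers, it stores them as a numeric type (int64 for whole numbers, float64 for decimals or columns with missing values), while other values, including text, are stored as 'object'. In this lesson we will see how to define the data types in advance and how to work with text data.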

Text Data Types

It is recommended to define the data types before reading the data. By passing the dtype option to read_csv() or read_table(), we can assign the appropriate type to each column.
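As a minimal sketch of the dtype option, the example below reads a small CSV twice, once with inferred types and once with explicit ones. The inline CSV text and the column names (name, age, zip_code) are made up for illustration; a real lesson file would be read from disk instead.

```python
import io

import pandas as pd

# Hypothetical inline CSV standing in for a real file on disk.
csv_text = "name,age,zip_code\nAlice,30,01234\nBob,25,98765\n"

# Without dtype, zip_code is inferred as an integer and loses its leading zero.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, each column gets the type we ask for.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"name": "string", "age": "int64", "zip_code": "string"},
)

print(inferred.dtypes)
print(typed.dtypes)
print(typed["zip_code"].tolist())
```

Reading zip_code as text keeps the leading zero, which is one common reason to set the type up front rather than convert afterwards.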

String Methods

A Series (a single column) and an Index can process strings through the .str accessor, without having to loop over each element. Some basic string methods include:

str.lower() to convert into lowercase
str.upper() to convert into uppercase
str.len() to get the length of each string
str.strip() to remove leading and trailing whitespace
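The four methods listed above can be tried on a small Series. This is only an illustrative sketch; the sample names are invented.

```python
import pandas as pd

# Invented sample data with inconsistent case and stray whitespace.
names = pd.Series(["  Alice ", "BOB", "charlie  "])

stripped = names.str.strip()    # remove leading/trailing whitespace
lower = stripped.str.lower()    # convert to lowercase
upper = stripped.str.upper()    # convert to uppercase
lengths = stripped.str.len()    # length of each string

# The same accessor is available on an Index.
labels = pd.Index([" a ", "B "]).str.strip().str.lower()

print(lower.tolist())
print(upper.tolist())
print(lengths.tolist())
print(list(labels))
```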

Splitting and Replacing Strings

The str.split() method splits each string into a list. Pass the option expand=True to get a DataFrame with one column per piece instead.
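A short sketch of both forms of str.split(), using invented full names separated by a single space:

```python
import pandas as pd

# Invented sample data: "first last" separated by one space.
full_names = pd.Series(["Alice Smith", "Bob Jones"])

# Default: each element becomes a list of pieces.
as_lists = full_names.str.split(" ")

# expand=True: each piece goes into its own column (labelled 0, 1, ...).
as_frame = full_names.str.split(" ", expand=True)

print(as_lists.tolist())
print(as_frame)
```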

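The str.replace() method replaces a pattern with another string. Whether the pattern is treated as a regular expression depends on the regex option: older pandas versions defaulted to regex=True, while pandas 2.0 and later default to regex=False, so it is safest to pass regex explicitly.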
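The sketch below passes regex explicitly in both calls, so the result does not depend on which default the installed pandas version uses. The price strings are invented sample data.

```python
import pandas as pd

# Invented sample data.
prices = pd.Series(["$1,200", "$350", "$4,075"])

# Literal replacement: "$" is taken as a plain character, not a regex anchor.
no_dollar = prices.str.replace("$", "", regex=False)

# Regex replacement: drop every character that is not a digit.
digits_only = prices.str.replace(r"[^0-9]", "", regex=True)

print(no_dollar.tolist())
print(digits_only.tolist())
```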

Indexing with .str

The [] notation on .str indexes each string directly by position. If you index past the end of a string, the result for that element will be NaN.
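For illustration, here is positional indexing and slicing on a few invented codes of different lengths:

```python
import pandas as pd

# Invented sample data with strings of different lengths.
codes = pd.Series(["AB123", "CD", "E"])

first_char = codes.str[0]    # first character of each string
third_char = codes.str[2]    # missing (NaN) where the string is too short
prefix = codes.str[:2]       # slicing never fails; short strings are returned whole

print(first_char.tolist())
print(third_char.isna().tolist())
print(prefix.tolist())
```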

Extracting substrings

The str.extract() method accepts a regular expression with at least one capture group.

Extracting a regular expression with more than one group returns a DataFrame with one column per group.
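A small sketch of str.extract() with two capture groups. The sample IDs and the group names (letter, digit) are made up for illustration; named groups are optional but give the resulting columns readable labels.

```python
import pandas as pd

# Invented sample data; the last entry does not match the pattern.
ids = pd.Series(["a1", "b2", "c3", "zz"])

# Two named capture groups -> a DataFrame with one column per group.
parts = ids.str.extract(r"(?P<letter>[a-z])(?P<digit>\d)")

# One capture group with expand=False -> a Series instead of a DataFrame.
digit_only = ids.str.extract(r"(\d)", expand=False)

print(parts)
print(digit_only)
```

Rows that do not match the pattern are filled with missing values rather than raising an error.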

Creating indicator variables

A column is easy to handle when each cell contains a single categorical value, but it becomes difficult when one cell contains several of these values.

In our subjects.csv, each student has multiple subjects. We will convert the subject names into dummy variables using str.get_dummies(), then use concat to join them to the rest of the data.
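The contents of subjects.csv are not shown here, so the sketch below builds a small stand-in DataFrame instead. The column names, the student names, and the comma separator are all assumptions; adjust sep to whatever delimiter the real file uses.

```python
import pandas as pd

# Hypothetical stand-in for subjects.csv: one cell may hold several subjects.
students = pd.DataFrame(
    {
        "name": ["Ann", "Ben", "Cara"],
        "subjects": ["math,art", "art", "math,music"],
    }
)

# One 0/1 column per distinct subject (assumed comma-separated).
dummies = students["subjects"].str.get_dummies(sep=",")

# Join the indicator columns back onto the student names, side by side.
result = pd.concat([students[["name"]], dummies], axis=1)

print(result)
print(dummies.sum())  # number of students per subject
```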

Exercise 5.4

You have received a request from a kindergarten to make a list of the kids' favorite fruits. The kindergarten teacher only gives you a txt file. Convert the txt file into a DataFrame consisting of name, age, and sex (use M/F), followed by the fruit list in dummy-variable form. At the end, use sum() to see how many students like each kind of fruit.
