pandas is a powerful data analysis library with a rich API that usually offers several ways to perform the same data manipulation task. Some of these approaches are better than others, and pandas users often learn suboptimal coding practices that become their default workflows. This post highlights four common pandas anti-patterns and outlines a complementary set of techniques that you should use instead.

For illustrative examples of good and bad pandas patterns, I'm using this Netflix dataset from Kaggle, which characterises almost 6,000 Netflix shows and movies with respect to 15 features spanning various data types.

Most pandas practitioners first learn data processing with pandas by sequentially mutating DataFrames as a series of distinct, line-by-line operations. There are a few reasons why excessive mutation of pandas DataFrames can cause problems:

Chaining transforms a DataFrame through a multi-step procedure in a single expression. Because each method operates directly on the output of the previous one, there are no intermediate variables to leave stale or overwrite by mistake, which reduces the risk of bugs. The code is also more readable, with each line cleanly representing a distinct operation (note: many Python code formatters will destroy this structure; wrap your pandas code blocks with '# fmt: off' and '# fmt: on' to prevent this). Chaining will feel natural for R users familiar with the magrittr %>% operator.
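As a minimal sketch of the difference, the snippet below performs the same four steps twice: once by sequential mutation and once as a chain. It uses a tiny hand-made DataFrame rather than the real Netflix file, and the column names (title, type, release_year, runtime) are assumptions for illustration.

```python
import pandas as pd

# Toy stand-in for the Netflix data; the column names are assumed.
df = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "type": ["MOVIE", "SHOW", "MOVIE", "MOVIE"],
        "release_year": [1999, 2015, 2020, 2021],
        "runtime": [95, 45, 120, None],
    }
)

# Sequential mutation: every step rebinds or mutates an intermediate object.
movies = df[df["type"] == "MOVIE"]
movies = movies.dropna(subset=["runtime"])
movies = movies.copy()  # avoid modifying a slice of df in the next step
movies["runtime_hours"] = movies["runtime"] / 60
movies = movies.sort_values("runtime_hours", ascending=False)

# Chained equivalent: one expression, one line per operation.
# fmt: off
movies_chained = (
    df
    .loc[lambda d: d["type"] == "MOVIE"]
    .dropna(subset=["runtime"])
    .assign(runtime_hours=lambda d: d["runtime"] / 60)
    .sort_values("runtime_hours", ascending=False)
)
# fmt: on
```

The lambdas inside .loc and .assign receive the DataFrame as it exists at that point in the chain, which is what makes it possible to refer to columns or filters created by earlier steps.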

Occasionally, you'll need to perform complex data manipulation processes that can't be cleanly implemented using off-the-shelf chaining methods. This is where pandas' .pipe can be used to abstract away complex DataFrame transformations into separately defined functions.
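A short sketch of how .pipe slots user-defined functions into a chain. The helper functions add_decade and keep_top_n_per_group are hypothetical names invented for this example, as are the toy data and its column names.

```python
import pandas as pd


def add_decade(df: pd.DataFrame, year_col: str = "release_year") -> pd.DataFrame:
    """Return a copy of df with a 'decade' column derived from year_col."""
    return df.assign(decade=(df[year_col] // 10) * 10)


def keep_top_n_per_group(
    df: pd.DataFrame, group_col: str, sort_col: str, n: int = 1
) -> pd.DataFrame:
    """Keep the n rows with the highest sort_col value within each group."""
    return (
        df
        .sort_values(sort_col, ascending=False)
        .groupby(group_col)
        .head(n)
    )


# Toy data; the column names are assumed for illustration.
shows = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D"],
        "release_year": [1994, 1999, 2012, 2018],
        "imdb_score": [7.1, 8.3, 6.5, 9.0],
    }
)

best_per_decade = (
    shows
    .pipe(add_decade)
    .pipe(keep_top_n_per_group, group_col="decade", sort_col="imdb_score", n=1)
    .sort_values("decade")
    .reset_index(drop=True)
)
```

.pipe passes the DataFrame as the first argument of the function and forwards any extra keyword arguments, so each helper can be written and tested as an ordinary function that takes a DataFrame and returns a DataFrame.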

The use of for loops in pandas is a code smell that should always be eliminated. This includes pandas' built-in generator methods DataFrame.iterrows() and DataFrame.itertuples(). There are two reasons to avoid looping in pandas:

Once pandas practitioners learn about .apply, they often end up applying it everywhere. This isn't always a problem, as the .apply approach produces coherent code and performs adequately with modestly-sized datasets.
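To make the comparison concrete, here is a sketch that computes the same label three ways: with an explicit loop, with a row-wise .apply, and with vectorised column operations. The vectorised form is the alternative I'm assuming here; the data, column names and the "feature" rule are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy data; the column names and the labelling rule are assumptions.
df = pd.DataFrame(
    {
        "type": ["MOVIE", "SHOW", "MOVIE", "SHOW"],
        "runtime": [95, 45, 60, 30],
    }
)

# Anti-pattern 1: an explicit Python loop over rows.
looped = []
for _, row in df.iterrows():
    is_long_movie = row["type"] == "MOVIE" and row["runtime"] >= 90
    looped.append("feature" if is_long_movie else "other")

# Anti-pattern 2: the same logic via row-wise .apply (still a loop internally).
applied = df.apply(
    lambda row: "feature"
    if row["type"] == "MOVIE" and row["runtime"] >= 90
    else "other",
    axis=1,
)

# Vectorised alternative: whole-column comparisons combined with np.where.
is_feature = (df["type"] == "MOVIE") & (df["runtime"] >= 90)
vectorised = pd.Series(np.where(is_feature, "feature", "other"), index=df.index)
```

All three produce the same labels. On four rows the difference is invisible, but the vectorised version avoids calling a Python function once per row, which is where loops and row-wise .apply tend to lose time on larger data.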

Optimising the data types for each column in a pandas DataFrame will improve performance and memory usage. When working with large datasets, significant gains can be made by shrinking the default float64 and int64 data types to smaller equivalents, such as float16 and int8, for columns where this doesn't result in data loss.
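One way to shrink numeric columns without choosing the target type by hand is pd.to_numeric with its downcast argument, sketched below on synthetic data (the column names are assumptions). Note that downcast="float" stops at float32; going down to float16 requires an explicit .astype and a check that the precision loss is acceptable.

```python
import numpy as np
import pandas as pd

# Synthetic data; 'seasons' and 'score' are assumed column names.
df = pd.DataFrame(
    {
        "seasons": (np.arange(1000) % 20).astype("int64"),
        "score": np.linspace(0, 10, 1000),  # float64 by default
    }
)

bytes_before = df.memory_usage(deep=True).sum()

smaller = df.assign(
    # Picks the smallest signed integer type that can hold every value.
    seasons=lambda d: pd.to_numeric(d["seasons"], downcast="integer"),
    # Picks the smallest float type, but never goes below float32.
    score=lambda d: pd.to_numeric(d["score"], downcast="float"),
)

bytes_after = smaller.memory_usage(deep=True).sum()
```

Comparing bytes_before with bytes_after, or calling smaller.dtypes, shows what the downcast achieved on your own data.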

However, the most egregious data type mismatch worth eliminating from your pandas code is using strings instead of categoricals. Converting a low-cardinality column of categorical data from its default object type to a category type often achieves memory usage improvements of 100x and computation speed-ups of 10x. The code sample below demonstrates how this conversion can be performed within a chained workflow.
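A minimal sketch of the conversion inside a chain, using synthetic data. The age_certification column name and its three values are assumptions, and the exact savings depend on the data and the pandas version, so the snippet measures them rather than asserting a ratio.

```python
import pandas as pd

# Synthetic data: 3,000 rows but only three distinct certification values.
df = pd.DataFrame(
    {
        "title": [f"title_{i}" for i in range(3000)],
        "age_certification": ["TV-MA", "PG-13", "R"] * 1000,
    }
)

converted = (
    df
    .assign(
        age_certification=lambda d: d["age_certification"].astype("category")
    )
)

# deep=True counts the string payloads, not just the pointers to them.
string_bytes = df["age_certification"].memory_usage(deep=True)
category_bytes = converted["age_certification"].memory_usage(deep=True)
```

A categorical column stores each distinct value once plus a small integer code per row, which is why the benefit grows with the number of rows and shrinks as cardinality rises.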

In this article, I've shown four pandas anti-patterns, and alternative approaches you should adopt instead. The code sample below illustrates how these best practices can be combined into a coherent workflow. This particular example shows how we can calculate the mean adjusted score of the shows, depending on the prevalence rank of the first production country.
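The sketch below shows one way such a workflow could look on toy data. Several details are assumptions made for illustration: the column names, the format of production_countries, the helper add_country_rank, and in particular the definition of "adjusted score", which is taken here to be the mean of two rating columns.

```python
import pandas as pd


def add_country_rank(
    df: pd.DataFrame, country_col: str = "first_country"
) -> pd.DataFrame:
    """Add 'country_rank', where 1 marks the most prevalent country."""
    n_titles = df.groupby(country_col, observed=True)["imdb_score"].transform("size")
    ranks = n_titles.rank(method="dense", ascending=False).astype(int)
    return df.assign(country_rank=ranks)


# Toy stand-in for the Netflix data; all column names are assumed.
netflix = pd.DataFrame(
    {
        "type": ["SHOW", "SHOW", "SHOW", "MOVIE", "SHOW", "SHOW"],
        "production_countries": [
            "['US']", "['US', 'GB']", "['GB']", "['US']", "['US']", "['JP']",
        ],
        "imdb_score": [8.0, 6.0, 7.0, 9.0, 7.0, 5.0],
        "tmdb_score": [6.0, 8.0, 7.0, 9.0, 7.0, 7.0],
    }
)

# fmt: off
mean_score_by_rank = (
    netflix
    .loc[lambda d: d["type"] == "SHOW"]
    .assign(
        # First two-letter country code, stored as a categorical.
        first_country=lambda d: (
            d["production_countries"]
            .str.extract(r"([A-Z]{2})", expand=False)
            .astype("category")
        ),
        # Hypothetical definition of the adjusted score (vectorised).
        adjusted_score=lambda d: d[["imdb_score", "tmdb_score"]].mean(axis=1),
    )
    .pipe(add_country_rank)
    .groupby("country_rank")["adjusted_score"]
    .mean()
)
# fmt: on
```

The chain filters, derives columns with vectorised operations, converts the low-cardinality column to a categorical, and delegates the one awkward step to a function via .pipe, all without mutating netflix.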

Adopting these practices allows complex data transformations and processing to be conducted in a single chained statement. The code is performant, readable, and simple to maintain and extend. If you're not already coding pandas in this way, I recommend giving it a try!

Starting from pandas 1.0, an experimental pd.NA value (singleton) is available to represent scalar missing values. At this moment, it is used in the nullable integer, boolean and dedicated string data types as the missing value indicator.
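A brief sketch of how pd.NA shows up in those three dtypes, using made-up values. It assumes a pandas version of 1.0 or later.

```python
import pandas as pd

# Nullable integer, boolean and string Series, each with one missing entry.
seasons = pd.Series([1, 2, None], dtype="Int64")
flags = pd.Series([True, None, False], dtype="boolean")
names = pd.Series(["a", None, "c"], dtype="string")

# The missing entry in each Series is the same pd.NA singleton.
missing_values = [seasons.iloc[2], flags.iloc[1], names.iloc[1]]

# pd.NA propagates through arithmetic...
na_plus_one = pd.NA + 1
# ...but follows three-valued logic for boolean operations:
# "unknown OR True" is True regardless of the unknown value.
na_or_true = pd.NA | True

# Reductions skip missing values by default.
total_seasons = seasons.sum()
```

Because the integer column keeps its Int64 dtype, the missing value does not force a silent conversion to float, which is what happens with NaN in a plain int64 column.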

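Note the capital 'F' in Float32 and Float64, which distinguishes the nullable pandas float dtypes from np.float32 and np.float64. Note also that string refers to the pandas StringDtype (introduced in pandas 1.0), not str or object. Likewise, Int64 (pd.Int64Dtype, introduced in pandas 0.24) is the nullable integer dtype, written with a capital 'I', and is not the same as np.int64.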
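The naming distinction is easiest to see by requesting the dtypes by name. This sketch uses made-up data and assumes a pandas version recent enough to include the nullable float dtypes (1.2 or later).

```python
import pandas as pd

# Made-up data; missing values force the default dtypes to float64 / object.
raw = pd.DataFrame(
    {
        "seasons": [1.0, 2.0, None],
        "score": [7.5, None, 8.1],
        "title": ["a", "b", None],
    }
)

# Capitalised names select the nullable pandas dtypes,
# lowercase "int64" / "float64" would select the NumPy ones.
typed = raw.astype(
    {
        "seasons": "Int64",   # nullable integer, not np.int64
        "score": "Float64",   # nullable float, not np.float64
        "title": "string",    # pandas StringDtype, not str or object
    }
)
```

Inspecting typed.dtypes shows Int64, Float64 and string, and each missing entry is now represented by pd.NA rather than NaN or None.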

