dates histogram

#advanced

Time flies, whether you plot it or not

Histograms are a simple, well-known way to get a quick overview of data.

When the values have a temporal ordering, however, things are not so straightforward. Here I'll show you a way to quickly create histograms from dates, grouped at different scales (months, days, and others).

The steps can be summarized as:

    1. Constructing a pandas dataframe;
    2. Applying aggregation routines (one line of code!);
    3. Plotting the aggregated result.

Let's assume we have a dataframe with 4 columns. Here we print the last rows:
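As a stand-in for the real dataset, here is a minimal sketch of such a dataframe. The column names (`date`, `city`, `product`, `amount`) and all the values are made up for illustration; only the date strings in 'YYYY-MM-DD' form matter for what follows.

```python
import pandas as pd

# Made-up sample data: the column names and values are placeholders,
# not the dataset used in the original notebook.
df = pd.DataFrame({
    "date": ["2019-01-15", "2019-01-28", "2019-02-10",
             "2019-02-10", "2019-02-21", "2019-03-02"],
    "city": ["Lisbon", "Porto", "Lisbon", "Faro", "Porto", "Lisbon"],
    "product": ["A", "B", "A", "C", "B", "A"],
    "amount": [10, 25, 7, 12, 30, 18],
})

# Print the last rows of the dataframe.
print(df.tail(3))
```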

Sometimes we'll need to adjust the decoding of the date column, in case its values are stored as bytes rather than as 'utf-8' strings. How to detect it? If we see b'2019-02-10', then the column was not properly decoded when our dataframe was generated.

The decoding can be done simply by:
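A minimal sketch of that step, assuming the dates live in a column named `date` (a hypothetical name) and are encoded as UTF-8 bytes. It uses pandas' `Series.str.decode`, which may or may not be the exact call used in the notebook.

```python
import pandas as pd

# Hypothetical dataframe whose date column holds bytes instead of strings.
raw = pd.DataFrame({"date": [b"2019-02-10", b"2019-02-11", b"2019-03-01"]})

# Detection: the first value is a bytes object, shown as b'2019-02-10'.
needs_decoding = isinstance(raw["date"].iloc[0], bytes)

# Decode every value in the column from bytes to a regular string.
if needs_decoding:
    raw["date"] = raw["date"].str.decode("utf-8")

print(raw["date"].tolist())
```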

Then, we use 3 pandas methods:

  • groupby(): groups rows by the argument given to this method. Here we can group by year, month, day, and other units contained in our format.
  • count(): gets the number of occurrences per group. There are also other aggregations (such as sum() or mean()), in case you want other statistics.
  • plot(): a shortcut to plot directly from pandas, which uses matplotlib internally.

These 3 methods are used in the block of code below, generating colorful histograms out of the box.
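Here is a sketch of how the three calls might fit together, assuming the dates are 'YYYY-MM-DD' strings in a column named `date`. Slicing the string is one possible way to build the grouping key without converting to datetime; the helper names `count_by_prefix` and `plot_counts`, and the sample data, are hypothetical.

```python
import pandas as pd

# Made-up sample data with dates stored as plain 'YYYY-MM-DD' strings.
df = pd.DataFrame({
    "date": ["2019-01-15", "2019-01-28", "2019-02-10",
             "2019-02-10", "2019-02-21", "2019-03-02"],
    "amount": [10, 25, 7, 12, 30, 18],
})

def count_by_prefix(frame, n_chars):
    """Count rows per date prefix: 4 chars = year, 7 = month, 10 = day."""
    key = frame["date"].str[:n_chars].rename("period")
    return frame.groupby(key)["date"].count()

per_year = count_by_prefix(df, 4)
per_month = count_by_prefix(df, 7)
per_day = count_by_prefix(df, 10)

def plot_counts(counts, title):
    """Draw the counts as a bar chart (pandas calls matplotlib, which must be installed)."""
    return counts.plot(kind="bar", title=title)

print(per_month)
```

Calling `plot_counts(per_month, "Rows per month")` in a notebook cell draws the chart; swapping in `per_year` or `per_day` changes the scale.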

And here we have our histograms, in various flavors for different tastes:

As you can see, this was pretty simple, with no need to reformat the data or convert it to datetime. Pandas does the heavy lifting for us, letting us quickly create simple visualizations with a few lines of code. A real time saver, isn't it?

Hope you found it useful!

You can find the complete Jupyter notebook on my GitHub.