Search this site
Embedded Files
AIMD GPDS Courses
  • Home
  • Courses
  • Contact
AIMD GPDS Courses
  • Home
  • Courses
  • Contact
  • More
    • Home
    • Courses
    • Contact

日本語  ❯

Lesson 2    ❮    Lesson List    ❮    Top Page

2.1  Filtering with Boolean

2.2  Grouping Object

❯  2.3  Aggregation

2.4  Concatenate, Join, Merge

⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺
EXPECTED COMPLETION TIME
❲▹❳  Video   9m 36s
☷  Interactive readings   5m

We will showcase the flexibility of the groupby DataFrame method by answering the following queries:

  1. Finding the number of canceled flights for every airline per weekday

  2. For each origin and destination, finding the total number of flights, the number and percentage of canceled flights, and the average and variance of the airtime.

Using Aggregation Function on GroupBy object

Read in the flights dataset, and we define:

  •  grouping columns : AIRLINE

  • aggregating columns : ARR_DELAY

  • aggregating functions : mean

Place the grouping column in the groupby method and then call the agg method with a dictionary pairing the aggregating column with its aggregating function

Aggregating Multiple Columns

Now, read in the flights dataset, and answer the first question by defining

  • grouping columns: AIRLINE, WEEKDAY

  • aggregating column: CANCELLED

  • aggregating function:
    sum

Aggregating with Multiple Functions

You can answer the second query by defining:

  • grouping columns: ORG_AIR, DEST_AIR

  • aggregating column: CANCELLED, AIR_TIME

  • aggregating function:

    • CANCELLED:
      sum, mean

    • AIR_TIME:
      mean, var

To do this easier, we use a dictionary in the agg method to map specific aggregating columns to specific aggregating functions.

Aggregating with User-defined Function

Next questions:

  1. What is the total of long-haul flight (more than 3 hours) for each airline?

  2. What is the square root of the total of the delayed flight  for each airline?

While pandas don't provide the exact aggregation function for that, we can define the function ourselves those function.

For no. 1,

  • grouping column: AIRLINE

  • aggregating column: AIR_TIME

  • aggregating function:

    • CANCELLED:
      sum but only for values with >180

For no. 2,

  • grouping column: AIRLINE

  • aggregating column: AIR_TIME

  • aggregating function:

    • ARR_DELAY:
      len but squared

©2023. All rights reserved.  Samy Baladram,
Graduate Program in Data Science - GSIS - Tohoku University
Google Sites
Report abuse
Page details
Page updated
Google Sites
Report abuse