Lesson 2 ❮ Lesson List ❮ Top Page
❯ 2.3 Aggregation
⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺⎺
EXPECTED COMPLETION TIME
❲▹❳ Video 9m 36s
☷ Interactive readings 5m
We will showcase the flexibility of the groupby DataFrame method by answering the following queries:
Finding the number of canceled flights for every airline per weekday
For each origin and destination, finding the total number of flights, the number and percentage of canceled flights, and the average and variance of the airtime.
Read in the flights dataset, and we define:
grouping columns : AIRLINE
aggregating columns : ARR_DELAY
aggregating functions : mean
Place the grouping column in the groupby method and then call the agg method with a dictionary pairing the aggregating column with its aggregating function
Now, read in the flights dataset, and answer the first question by defining
grouping columns: AIRLINE, WEEKDAY
aggregating column: CANCELLED
aggregating function:
sum
You can answer the second query by defining:
grouping columns: ORG_AIR, DEST_AIR
aggregating column: CANCELLED, AIR_TIME
aggregating function:
CANCELLED:
sum, mean
AIR_TIME:
mean, var
To do this easier, we use a dictionary in the agg method to map specific aggregating columns to specific aggregating functions.
Next questions:
What is the total of long-haul flight (more than 3 hours) for each airline?
What is the square root of the total of the delayed flight for each airline?
While pandas don't provide the exact aggregation function for that, we can define the function ourselves those function.
For no. 1,
grouping column: AIRLINE
aggregating column: AIR_TIME
aggregating function:
CANCELLED:
sum but only for values with >180
For no. 2,
grouping column: AIRLINE
aggregating column: AIR_TIME
aggregating function:
ARR_DELAY:
len but squared