# 06 Grouping and Aggregation in R

For illustration we are going we ggplot2::diamonds dataset.

Load diamonds data and look at the data structure

``> diamonds = ggplot2::diamonds``
``> str(diamonds)``

Find average carat (weight) for each cut

``> aggregate(carat ~ cut, diamonds, mean)``
``  diamonds\$cut diamonds\$carat``
``1         Fair      1.0461366``
``2         Good      0.8491847``
``3    Very Good      0.8063814``
``4      Premium      0.8919549``
``5        Ideal      0.7028370``

Find average of carat and mean price for each cut.

``> aggregate(cbind(carat, price) ~ cut, diamonds, mean)``
``        cut     carat    price``
``1      Fair 1.0461366 4358.758``
``2      Good 0.8491847 3928.864``
``3 Very Good 0.8063814 3981.760``
``4   Premium 0.8919549 4584.258``
``5     Ideal 0.7028370 3457.542``

You can also find mean of two columns that are grouped by 2 columns.

``> aggregate(cbind(carat, price) ~ cut + color, diamonds, mean)``
``         cut color     carat    price``
``1       Fair     D 0.9201227 4291.061``
``2       Good     D 0.7445166 3405.382``
``3  Very Good     D 0.6964243 3470.467``
``4    Premium     D 0.7215471 3631.293``
``5      Ideal     D 0.5657657 2629.095``
``6       Fair     E 0.8566071 3682.312``
``7       Good     E 0.7451340 3423.644``
``...``

For doing multiple aggregation at the same time plyr package can useful

``> require(plyr)``
``> ddply(diamonds, .(cut, color), function(w) c(count = nrow(w), avgPrice = mean(w\$price), maxCarat = max(w\$carat)))``
``         cut color count avgPrice maxCarat``
``1       Fair     D   163 4291.061     3.40``
``2       Fair     E   224 3682.312     2.04``
``3       Fair     F   312 3827.003     2.58``
``4       Fair     G   314 4239.255     2.60``
``5       Fair     H   303 5135.683     4.13``
``6       Fair     I   175 4685.446     3.02``
``7       Fair     J   119 4975.655     5.01``
``8       Good     D   662 3405.382     2.04``
``9       Good     E   933 3423.644     3.00``
``...``

If you are already familiar with sql statement, sqldf package will be useful

``require(sqldf)``
``> sqldf("select cut, color, count(*), avg(price) avgPrice, max(carat) maxCarat from diamonds group by cut, color")``
``         cut color count(*) avgPrice maxCarat``
``1       Fair     D      163 4291.061     3.40``
``2       Fair     E      224 3682.312     2.04``
``3       Fair     F      312 3827.003     2.58``
``4       Fair     G      314 4239.255     2.60``
``5       Fair     H      303 5135.683     4.13``
``6       Fair     I      175 4685.446     3.02``
``...``

Order diamonds dataset based on price

``> diamonds = diamonds[order(diamonds\$price, decreasing = TRUE),]``