We have some properly formatted data in the data frame "samples".
We know that Funk and its ilk are the most heavily sampled genre.
Next, we condition the data a bit and explore it further with some quick plotting.
After our simple pie chart, it's time to investigate further with the qplot() function, which lets us make simple plots that show some clear results. First, running the summary() function from Part 1 on our data frame "samples" shows that the year of original publishing is stored as a factor and includes unknowns as "?".
> summary(samples)
Sampled.Genre         Sampled.Artist
Soul,Funk,R&B :3761   Honey Drippers : 117
Rock/Pop      :1068   Collins, Lyn   : 114
Jazz          : 898   Bliss, Melvin  :  94
Old School Rap: 173   Clinton, George:  82
Reggae        :  90   James, Bob     :  82
Soundtracks   :  88   J. B.'s, The   :  69
(Other)       : 388   (Other)        :5908

Sampled.Song                        Sampled.Album
Innocent 'till Proven Guilty: 118   single          : 671
Think (About It)            : 112   ?               : 258
Birth                       :  95   Think (About It): 114
Atomic Dog                  :  80   Computer Games  :  82
Nautilus                    :  62   One             :  82
Easier to Love              :  61   Food for Thought:  69
(Other)                     :5938   (Other)         :5190

Sampled.Label      Sampled.Publishing.Date   Sampling.Artist
?          : 277   1973   : 647              Beastie Boys    : 102
Capitol    : 227   1974   : 514              De la Soul      :  99
Warner Bros: 220   1972   : 486              Public Enemy    :  78
Atlantic   : 210   1970   : 403              Eric B and Rakim:  68
People     : 207   1971   : 397              Ice Cube        :  66
Columbia   : 151   1969   : 392              Big Daddy Kane  :  63
(Other)    :5174   (Other):3627              (Other)         :5990

Sampling.Song
?                :  45
In/Flux          :  10
Move the Crowd   :  10
The Number Song  :  10
Jackin' for Beats:   9
Shake Your Rump  :   9
(Other)          :6373
This limits the ways we can display and analyze the data. R has a Date class for calendar dates, but it is awkward to use here because we have only years, not full calendar dates. Converting the years to numeric values is sufficient, provided we first eliminate the entries with unknown years. We can drop the unknown entries and save the resulting subset with the subset() function:
> definite.dates <- subset(samples,Sampled.Publishing.Date!="?")
Then we can change the class of the publishing date from factor to numeric. To preserve the actual years, we first convert to the character class and then to numeric; calling as.numeric() directly on a factor would return its internal level codes instead of the years:
> definite.dates$Sampled.Publishing.Date <- as.numeric(as.character(definite.dates$Sampled.Publishing.Date))
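A small made-up factor shows why the detour through as.character() matters. The vector `years` below is a hypothetical stand-in, not taken from "samples"; it is only a sketch of the behaviour:

```r
# Hypothetical toy data, not from the samples data frame
years <- factor(c("1973", "1969", "?", "1973"))

# Drop the unknown entries first, as we did with subset()
known <- years[years != "?"]

# Direct conversion: returns the internal integer codes of the factor levels
codes <- as.numeric(known)

# Via character: the labels themselves are parsed as numbers
parsed <- as.numeric(as.character(known))

print(codes)   # small integers, not years
print(parsed)  # the actual years
```

The direct conversion runs without any warning, which is what makes this mistake easy to miss.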
Now to the graphing!
Since the two simplest, most discrete variables currently available are our newly numeric publishing year and the genre, we can begin with a scatterplot of the two. Because the data is dense, using the "jitter" geom and lowering the alpha makes the overlapping data points a little more readable:
> qplot(x=Sampled.Publishing.Date, y=Sampled.Genre, data=definite.dates, color=Sampled.Genre, geom ="jitter", alpha = I(1/2))
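qplot() is a convenience wrapper, and recent ggplot2 releases mark it as deprecated in favour of ggplot(). For readers on a newer version, the same kind of plot can be sketched with the full syntax. The data frame `toy` below is a hypothetical stand-in for definite.dates, using the same column names:

```r
library(ggplot2)

# Hypothetical stand-in for definite.dates (same column names, made-up rows)
toy <- data.frame(
  Sampled.Publishing.Date = c(1969, 1972, 1973, 1973, 1974, 1979),
  Sampled.Genre = factor(c("Jazz", "Soul,Funk,R&B", "Soul,Funk,R&B",
                           "Rock/Pop", "Jazz", "Old School Rap"))
)

# Jittered, half-transparent points: year on x, genre on y, coloured by genre
p <- ggplot(toy, aes(x = Sampled.Publishing.Date, y = Sampled.Genre,
                     colour = Sampled.Genre)) +
  geom_jitter(alpha = 0.5)

print(p)
```

Setting alpha inside geom_jitter() rather than inside aes() plays the same role as the I(1/2) wrapper in the qplot() call: it fixes the transparency as a constant instead of mapping it to a variable.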
We can see right away the predominance of Soul, Funk, and R&B, and the years of original publishing match the period of those genres' popularity and large-scale release. Sampled Rap also seems to be strongly associated with a few years of large releases after 'Rapper's Delight' in 1979, and sampled Comedy with Eddie Murphy's and Richard Pryor's big mid-1970s hits.
We can inspect these figures in a more global context by creating a histogram, which lets us compare the genres with each other directly, as well as with the whole population:
> qplot(Sampled.Publishing.Date, data=definite.dates, color=Sampled.Genre, fill=Sampled.Genre, geom ="histogram",binwidth=2)
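To read off the numbers behind a histogram like this, the same binning can be approximated in base R with cut() and table(). This is only a sketch: `toy` is a hypothetical stand-in for definite.dates, and the bin edges chosen here are an assumption, so they may not line up exactly with the ones ggplot2 picks for binwidth=2:

```r
# Hypothetical stand-in for definite.dates (same column names, made-up rows)
toy <- data.frame(
  Sampled.Publishing.Date = c(1969, 1970, 1972, 1973, 1973, 1974, 1979),
  Sampled.Genre = c("Jazz", "Soul,Funk,R&B", "Soul,Funk,R&B",
                    "Soul,Funk,R&B", "Rock/Pop", "Jazz", "Old School Rap")
)

# Two-year bins, closed on the left: [1968,1970), [1970,1972), ...
edges <- seq(1968, 1980, by = 2)
bins <- cut(toy$Sampled.Publishing.Date, breaks = edges,
            right = FALSE, dig.lab = 4)

# Rows are bins, columns are genres
counts <- table(bins, toy$Sampled.Genre)
print(counts)

# Total per bin, i.e. the height of each stacked bar
print(rowSums(counts))
```

Each row of `counts` corresponds to one stacked bar, and each column to one colour within it.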
The late 1970s peak is even more apparent in this graph, and you can see that a large number of Jazz samples contributes heavily to it.
Unfortunately, due to the nature of this data, it is difficult to portray much in the way of global information: a lot of it, such as songs and artists, is extremely granular. Figuring out how to summarize and analyze that data would be an important next step. In addition, if more data could be located about the sampling artists and songs, such as their publication dates and popularity, we could start drawing meaningful conclusions from this data set.
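One possible starting point for taming the granular columns, offered only as a sketch, is to keep the most frequent values and lump the rest into a single "(Other)" group, much as summary() already does in its output. The vector `artists` and the cutoff `top_n` below are hypothetical:

```r
# Hypothetical artist column standing in for samples$Sampled.Artist
artists <- c("Honey Drippers", "Collins, Lyn", "Honey Drippers",
             "Bliss, Melvin", "Honey Drippers", "Collins, Lyn",
             "James, Bob")

# Count how often each artist appears, most frequent first
freq <- sort(table(artists), decreasing = TRUE)

# Keep the top_n most frequent names and lump everything else together
top_n <- 2
keep <- names(freq)[seq_len(top_n)]
lumped <- ifelse(artists %in% keep, artists, "(Other)")

print(table(lumped))
```

A lumped column like this has few enough levels to be used as a colour or facet in the plots above.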