Part 1- Data Collection, Pie Charts

Data Selection and Collection:

The first step in this project is to find a sampling database, of which there are several. I selected the-breaks.com: it has some of the most diverse samples across genres, and it also has a simple HTML browsing display. That makes scraping the site with Perl and the HTML::TokeParser and LWP::Simple modules relatively easy. The Perl script follows; it can also be found as an attachment at the bottom of the page, along with the dataset and R script.

# We use HTML::TokeParser to browse by tag, and LWP::Simple to pass in the
# pages simply as text. This is important because of weird difficulties with
# HTML entities, as seen later.
use HTML::TokeParser;
use LWP::Simple;

# Open the external text file to save all this info to. Later, once we have a
# better picture of what we will be using/saving, we can build directly to a
# MySQL database.
open(OUT, '>', 'sampledatabase.txt') or die "Cannot open sampledatabase.txt: $!";

# These binmode commands set the text encoding to UTF-8, as occasional UTF-8
# characters are found in the source text.
binmode DATA, ":utf8";
binmode OUT, ":utf8";

# Print the column headers as a single tab-delimited line.
print OUT "Sampled Genre\tSampled Artist\tSampled Song\tSampled Album\tSampled Label"
    . "\tSampled Publishing Date\tSampling Artist\tSampling Song\n";

# Step through each genre and each genre's pages (A through Z).
foreach $genre (1 .. 20) {
    foreach $page ('A' .. 'Z') {
        # A polite sleep so that the site doesn't hate me for constantly
        # pulling pages and database searches.
        sleep(1);

        # Set and read the current URL; skip the page if the request fails.
        $URL = "http://the-breaks.com/perl/full.pl?genre=$genre&page=$page";
        $rawpage = get($URL);
        next unless defined $rawpage;

        # Replace the HTML entities in the current page. There should have been
        # several cleaner ways of doing this in Perl, but I couldn't get any of
        # them to actually work, so regex to the rescue!
        $rawpage =~ s/&nbsp;/ /g;
        $rawpage =~ s/&\#146;/\'/g;
        $rawpage =~ s/(&\#147;|&\#148;)/\"/g;

        # Initialize the parser.
        my $tokeparser = HTML::TokeParser->new(\$rawpage);

        # Find the sampled artist's name. Bold text containing digits is
        # skipped, since the page numbers at the end of the document are also
        # in bold tags; the loop exits when no bold tags remain.
        while (my $ogartist = $tokeparser->get_tag('b')) {
            my $artist = $tokeparser->get_text();
            next if ($artist =~ /\d+/);

            # From all possible starting tags, find either the album info, the
            # original song, or the song that sampled it. Then use regex to
            # parse out each possibility and save it into its respective
            # variable.
            while (my $songs = $tokeparser->get_tag('i', 'br', '/center')) {
                my $rawsong = $tokeparser->get_trimmed_text('br');
                last if ($rawsong =~ /^(By Letter.*|\d+)?$/);

                # Album line: album title, then the label and year in parentheses.
                if ($rawsong =~ /(.+?): \((.+) ([\d\?]+)\)$/) {
                    ($album, $label, $publishdate) = ($1, $2, $3);
                }

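                # Original (sampled) song: an asterisk, then the quoted title.
                elsif ($rawsong =~ /^\* "(.+)"$/) {
                    $ogsong = $1;
                }
                # Sampling song: the artist's name, then 's and the quoted title.
                elsif ($rawsong =~ /^(.+?)\'s \"(.+?)\"$/) {
                    ($newartist, $newsong) = ($1, $2);
                    # For each sampling song, print a line with all of the
                    # sampling and sampled songs' information.
                    print OUT "$genre\t$artist\t$ogsong\t$album\t$label\t$publishdate\t$newartist\t$newsong\n";
                }
            }
        }
    }
}

close(OUT);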

Running the script leaves us with a neat tab-delimited text file, and we can begin some simple data exploration in R, specifically with ggplot2, a plotting system based on the grammar of graphics. First, we load ggplot2, then read in the data and store it in a data frame. This is an easy task, as the file was written to be parsed by the read.delim() command. (NOTE: we also need to disable the quote parameter, since some entries contain quote characters that would otherwise be treated as string delimiters.)

> library(ggplot2)

> samples = read.delim("sampledatabase.txt", quote = "")
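As a small illustration of what quote = "" does, the sketch below reads two made-up tab-delimited rows (hypothetical values, not taken from the dataset) through the text argument of read.delim(). One field contains a lone double quote. The stringsAsFactors = TRUE argument is an assumption for readers on R 4.0 or later, where character columns are no longer converted to factors by default, so the summary() output further down would otherwise look different.

```r
# Two hypothetical rows; the first song title contains a lone double quote.
txt <- "Sampled Artist\tSampled Song\nExample Artist\tSay \"Yes\nOther Artist\tExample Song\n"

# With quoting disabled, every tab-separated field is taken literally.
demo <- read.delim(text = txt, quote = "", stringsAsFactors = TRUE)

nrow(demo)                          # number of data rows that were read
names(demo)                         # spaces in the headers become dots
as.character(demo$Sampled.Song[1])  # the quote character is kept as data
```

The header conversion shown by names() is why the columns appear as Sampled.Genre, Sampled.Artist and so on in the rest of this page.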

We can check the table's headers and data with a quick summary() call:

> summary(samples)

 Sampled.Genre    Sampled.Artist         Sampled.Song
 Min.   : 1.000   Honey Drippers : 117   Innocent 'till Proven Guilty: 118
 1st Qu.: 1.000   Collins, Lyn   : 114   Think (About It)            : 112
 Median : 1.000   Bliss, Melvin  :  94   Birth                       :  95
 Mean   : 2.246   Clinton, George:  82   Atomic Dog                  :  80
 3rd Qu.: 3.000   James, Bob     :  82   Nautilus                    :  62
 Max.   :20.000   J. B.'s, The   :  69   Easier to Love              :  61
                  (Other)        :4895   (Other)                     :4925

 Sampled.Album           Sampled.Label       Sampled.Publishing.Date
 single          : 618   People     : 207   1973   : 539
 ?               : 193   Capitol    : 203   1974   : 444
 Think (About It): 114   ?          : 196   1972   : 419
 Computer Games  :  82   Atlantic   : 156   1970   : 365
 One             :  82   Warner Bros: 148   1971   : 345
 Food for Thought:  69   RCA        : 131   1969   : 331
 (Other)         :4295   (Other)    :4412   (Other):3010

 Sampling.Artist         Sampling.Song
 Beastie Boys    :  90   ?              :  36
 De la Soul      :  85   The Number Song:  10
 Public Enemy    :  67   Shake Your Rump:   9
 Ice Cube        :  59   Intro          :   8
 Eric B and Rakim:  58   Move the Crowd :   8
 Big Daddy Kane  :  56   Hey Ladies     :   7
 (Other)         :5038   (Other)        :5375

This looks about right. The question marks represent missing information in the database, which as you can see is rather common. It also seems that R has interpreted every column as a factor except for genre, which was stored as a number in the database. To fix this, we can use the cut() function, which both replaces the numbers with their appropriate labels and converts the column to a factor. First, I checked the original site to find the label for each number, then created a character vector of the labels in order:

> genres = c("Soul,Funk,R&B", "Jazz", "Rock/Pop", "Blues", "Reggae", "Old School Rap", "Comedy", "Soundtracks", "Easy List.", "Electronica", "Childrens", "Latin", "World", "Classical", "Country", "Gospel", "Novelty", "Comps", "Library", "Various")

Then I applied cut() to the Sampled.Genre column, with 20 bins and this vector as the labels, and assigned the result back to the column:

> samples$Sampled.Genre = cut(samples$Sampled.Genre,20,genres)
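One caveat: cut() with a bin count of 20 lines up with the genre codes only because the column happens to span exactly 1 to 20. A more direct mapping, sketched here as an alternative and not as what the analysis on this page actually ran, is factor() with explicit levels. The toy vector codes is hypothetical; the labels are the ones listed above.

```r
genres <- c("Soul,Funk,R&B", "Jazz", "Rock/Pop", "Blues", "Reggae",
            "Old School Rap", "Comedy", "Soundtracks", "Easy List.",
            "Electronica", "Childrens", "Latin", "World", "Classical",
            "Country", "Gospel", "Novelty", "Comps", "Library", "Various")

# Hypothetical genre codes that do not cover the full 1..20 range.
codes <- c(1, 3, 2, 1, 6)

# factor() ties code k to the k-th label, whatever codes are present.
labelled <- factor(codes, levels = seq_along(genres), labels = genres)
as.character(labelled)

# cut() instead splits the observed range (here 1 to 6) into 20 equal bins,
# so the labels no longer line up with the genre codes.
binned <- cut(codes, 20, genres)
as.character(binned)
```

On the full dataset both approaches should agree, since its minimum is 1 and its maximum is 20; the factor() form simply does not depend on that.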

Let's check it quickly again with summary():

> summary(samples)

 Sampled.Genre         Sampled.Artist
 Soul,Funk,R&B :3761   Honey Drippers : 117
 Rock/Pop      :1068   Collins, Lyn   : 114
 Jazz          : 898   Bliss, Melvin  :  94
 Old School Rap: 173   Clinton, George:  82
 Reggae        :  90   James, Bob     :  82
 Soundtracks   :  88   J. B.'s, The   :  69
 (Other)       : 388   (Other)        :5908

 Sampled.Song                        Sampled.Album
 Innocent 'till Proven Guilty: 118   single          : 671
 Think (About It)            : 112   ?               : 258
 Birth                       :  95   Think (About It): 114
 Atomic Dog                  :  80   Computer Games  :  82
 Nautilus                    :  62   One             :  82
 Easier to Love              :  61   Food for Thought:  69
 (Other)                     :5938   (Other)         :5190

 Sampled.Label      Sampled.Publishing.Date   Sampling.Artist
 ?          : 277   1973   : 647              Beastie Boys    : 102
 Capitol    : 227   1974   : 514              De la Soul      :  99
 Warner Bros: 220   1972   : 486              Public Enemy    :  78
 Atlantic   : 210   1970   : 403              Eric B and Rakim:  68
 People     : 207   1971   : 397              Ice Cube        :  66
 Columbia   : 151   1969   : 392              Big Daddy Kane  :  63
 (Other)    :5174   (Other):3627              (Other)         :5990

 Sampling.Song
 ?                :  45
 In/Flux          :  10
 Move the Crowd   :  10
 The Number Song  :  10
 Jackin' for Beats:   9
 Shake Your Rump  :   9
 (Other)          :6373

Graphs:

Let's take a look at how this works out in a pie chart by genre:

> ggplot(samples, aes(x = factor(1), fill = Sampled.Genre)) + geom_bar(width = 1) + coord_polar(theta = "y")

Our resulting graph is as follows, and quickly shows us the mighty influence of Soul, Jazz, and Pop on hip hop:
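Since pie slices are hard to compare by eye, the shares can also be computed directly. The sketch below uses the genre counts printed by the second summary() call above, with the remaining genres lumped into "(Other)" as in that output, and not the full data frame.

```r
# Genre counts copied from the second summary() output on this page.
genre_counts <- c("Soul,Funk,R&B" = 3761, "Rock/Pop" = 1068, "Jazz" = 898,
                  "Old School Rap" = 173, "Reggae" = 90, "Soundtracks" = 88,
                  "(Other)" = 388)

# Convert the counts to percentages of all rows, rounded to one decimal.
genre_share <- round(100 * prop.table(genre_counts), 1)
genre_share
```

With the data frame loaded, the counts could instead come from table(samples$Sampled.Genre), which would keep all 20 genres separate.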