Part 3 - Factorization for Analysis

Where We Are:

- We have some properly formatted data in the data frame "samples"
- We are aware that Funk and its ilk are the most represented sampled genre
- We used some quick plots to identify important trends and seminal periods of work

What To Do:

- Use factorization to examine data for more detailed trends and analysis

Analysis:

One of the next major steps in analysis is factorization: analyzing data by sorting it into defined categories. To a minor extent, we did this when we used a histogram in Part 2, which binned our publication dates into 2-year intervals.
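
The binning a histogram does behind the scenes can also be done by hand with cut(). Here is a minimal sketch, assuming a handful of made-up publication years (they are not taken from our dataset) and 2-year bins:

```r
# Toy publication years (made up for illustration, not taken from the dataset)
years <- c(1969, 1970, 1971, 1973, 1973, 1974, 1978, 1982)

# Break points every 2 years, covering the full range of the toy data
breaks <- seq(1968, 1984, by = 2)

# right = FALSE makes each bin closed on the left: [1968,1970), [1970,1972), ...
# dig.lab = 4 keeps four-digit years from being printed in scientific notation
binned <- cut(years, breaks = breaks, right = FALSE, dig.lab = 4)

# Count how many toy years fall in each bin
bin_counts <- table(binned)
print(bin_counts)
```

The result is a factor whose levels are the intervals, which is exactly the kind of categorized data we are after in this part.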

One of the most striking features of our data is that the "Soul,Funk,R&B" genre dominates our dataset. While this category is certainly well-represented in sampling, part of its size likely comes from the fact that it is actually made up of four popular separate genres (disco has been folded in as well, although this is not explicitly stated), unlike most of the other genres. In fact, the only other multiple-genre genre is Rock/Pop, which is also quite large.

In a perfect world, we would be able to separate out these genres, or even re-categorize the samples by another genre-defining feature, like tempo (BPM). That is quite difficult, though, as there are few easily accessible and reliable music databases against which we could cross-reference. Still, such a cross-reference is worth investigating, especially for our sampling songs: this database contains far too little information on them, and the central purpose of this investigation is to understand the relationship between sampling and sampled songs.

So what are we left with? Why, creating our very own multiple-genre genres! To do so, we'll have to make some quick and dirty decisions about what should be categorized together that would likely make a Music major cry, but boo on them anyway. We can start by considering characteristics of tempo, rhythm and performing instruments, and try to reduce our original 20 categories to a more reasonable 10. Let's see how the breakdown should go.

To perform our factorization, we can't use our handy cut() function, because it is designed to split numeric values into adjacent intervals. Instead, we have to manually reassign each level that we want to re-categorize to its new Super Genre with levels(), supplying one new name for each existing genre, in order. This is annoying to the extreme, but there you go. From Part 1, our assigned categories were as follows:

> genres = c("Soul,Funk,R&B", "Jazz", "Rock/Pop", "Blues", "Reggae", "Old School Rap", "Comedy", "Soundtracks", "Easy List.", "Electronica", "Childrens", "Latin", "World", "Classical", "Country", "Gospel", "Novelty", "Comps", "Library", "Various")

So, let's create a new vector holding one Super Genre for each of the genres above, in the same order:

> supergenres = c("Funk and Soul(Soul,Funk,R&B)","Jazz and Blues(Jazz,Blues)","Pop Music(Rock/Pop)","Jazz and Blues(Jazz,Blues)","Raps over Beats(Reggae,Rap)","Raps over Beats(Reggae,Rap)","Spoken Word(Comedy)","Orchestral(Soundtracks,Classical,Easy List.)","Orchestral(Soundtracks,Classical,Easy List.)","Electronica(Electronica)","Soulful Choruses and Soloists(Childrens,Gospel,Country)","World Music(Latin,World)","World Music(Latin,World)","Orchestral(Soundtracks,Classical,Easy List.)","Soulful Choruses and Soloists(Childrens,Gospel,Country)","Soulful Choruses and Soloists(Childrens,Gospel,Country)","Miscellaneous(Comps,Library,Various,Novelty)","Miscellaneous(Comps,Library,Various,Novelty)","Miscellaneous(Comps,Library,Various,Novelty)","Miscellaneous(Comps,Library,Various,Novelty)")

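Because the reassignment is purely positional, it is worth eyeballing the pairing before applying it. A rough sketch of one way to do that, using only the first six entries of each vector to keep things short (the names genres_head, supergenres_head and lookup are made up for this illustration):

```r
# First six entries of the two vectors above, copied here for brevity
genres_head <- c("Soul,Funk,R&B", "Jazz", "Rock/Pop", "Blues", "Reggae", "Old School Rap")
supergenres_head <- c("Funk and Soul(Soul,Funk,R&B)", "Jazz and Blues(Jazz,Blues)",
                      "Pop Music(Rock/Pop)", "Jazz and Blues(Jazz,Blues)",
                      "Raps over Beats(Reggae,Rap)", "Raps over Beats(Reggae,Rap)")

# The assignment is positional, so the two vectors must be the same length
stopifnot(length(genres_head) == length(supergenres_head))

# Side-by-side lookup table: one row per original genre
lookup <- data.frame(genre = genres_head, supergenre = supergenres_head,
                     stringsAsFactors = FALSE)
print(lookup)

# How many original genres feed each Super Genre
print(table(lookup$supergenre))
```

The same check works on the full 20-entry vectors. It only confirms that the two vectors line up with each other; it assumes the factor's levels are stored in the same order as the genres vector from Part 1.
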
And then, let's create a new data frame with our definite.dates subset that we made in Part 2, check it with summary(), and assign our new factors with levels():

> factorized <- definite.dates

> summary(factorized)

        Sampled.Genre            Sampled.Artist
 Soul,Funk,R&B :3677     Honey Drippers : 117
 Rock/Pop      :1008     Collins, Lyn   : 114
 Jazz          : 855     Bliss, Melvin  :  94
 Old School Rap: 152     Clinton, George:  82
 Soundtracks   :  74     James, Bob     :  82
 Reggae        :  70     J. B.'s, The   :  69
 (Other)       : 288     (Other)        :5566

                      Sampled.Song            Sampled.Album
 Innocent 'till Proven Guilty: 118     single          : 649
 Think (About It)            : 112     Think (About It): 114
 Birth                       :  95     Computer Games  :  82
 Atomic Dog                  :  80     One             :  82
 Nautilus                    :  62     ?               :  69
 Easier to Love              :  61     Food for Thought:  69
 (Other)                     :5596     (Other)         :5059

     Sampled.Label     Sampled.Publishing.Date          Sampling.Artist
 Warner Bros: 220      Min.   :1928             Beastie Boys    :  95
 Capitol    : 218      1st Qu.:1970             De la Soul      :  87
 People     : 207      Median :1973             Public Enemy    :  74
 Atlantic   : 203      Mean   :1974             Ice Cube        :  65
 Columbia   : 149      3rd Qu.:1978             Eric B and Rakim:  62
 RCA        : 145      Max.   :2005             Big Daddy Kane  :  61
 (Other)    :4982                               (Other)         :5680

           Sampling.Song
 ?                :  44
 In/Flux          :   9
 Jackin' for Beats:   9
 Move the Crowd   :   9
 Shake Your Rump  :   9
 Fight the Power  :   8
 (Other)          :6036

> levels(factorized$Sampled.Genre) <- supergenres

Let's give it a quick check with summary():

> summary(factorized$Sampled.Genre)

 Funk and Soul(Soul,Funk,R&B)                               3677
 Jazz and Blues(Jazz,Blues)                                  911
 Pop Music(Rock/Pop)                                        1008
 Raps over Beats(Reggae,Rap)                                 222
 Spoken Word(Comedy)                                          43
 Orchestral(Soundtracks,Classical,Easy List.)                123
 Electronica(Electronica)                                     31
 Soulful Choruses and Soloists(Childrens,Gospel,Country)      21
 World Music(Latin,World)                                     43
 Miscellaneous(Comps,Library,Various,Novelty)                 45

Looks good!
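
What levels() is doing when it receives repeated names is easiest to see on a toy factor. The sketch below uses made-up values and shows two forms: the positional one used here, and a named-list form that base R also accepts, which spells out which old levels go into each new one.

```r
# Toy factor (made-up values) with three levels in a known order
toy <- factor(c("Jazz", "Blues", "Jazz", "Reggae"),
              levels = c("Jazz", "Blues", "Reggae"))

# Positional form: the i-th new name replaces the i-th old level,
# and repeated names merge their levels into one
merged <- toy
levels(merged) <- c("Jazz and Blues", "Jazz and Blues", "Reggae")
print(table(merged))

# Named-list form: each new level lists the old levels it absorbs,
# so the result does not depend on the order of the old levels
merged_by_name <- toy
levels(merged_by_name) <- list("Jazz and Blues" = c("Jazz", "Blues"),
                               "Reggae" = "Reggae")
print(table(merged_by_name))
```

The positional form is shorter, but it silently mislabels everything if the level order is not what you expect. The named-list form is more typing and harder to get wrong.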

Graphs:

Now, let's see what kind of effect this new factorization has had on our data. We can repeat the quick plots we did in Part 2 with our new data:

> qplot(x=Sampled.Publishing.Date, y=Sampled.Genre, data=factorized, color=Sampled.Genre, geom ="jitter", alpha = I(1/2))

With our much longer level names, we need to drag out the window a bit, but here's the result:

Interesting! Our new genres seem rather cohesive, and when you compare this graph to the Part 2 version, you can see that the grouped genres do sit in similar date ranges, although that wasn't considered when creating our new genres. Let's check the overall breakdown with a pie chart, a la Part 1:

> ggplot(factorized, aes(x = factor(1), fill = Sampled.Genre)) + geom_bar(width = 1) + coord_polar(theta = "y")

Again, a very similar look to our earlier graph. There's still a very dramatic Funk et al. influence, but we can see the newly created Jazz and Blues genre is now roughly equivalent to Pop Music (911 samples versus 1,008).
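
A pie chart makes it hard to read off exact proportions, so it can help to compute them directly. A quick sketch, with the counts copied from the summary() output above and the level names shortened for readability:

```r
# Counts copied from the summary() output above (names shortened)
counts <- c("Funk and Soul" = 3677, "Jazz and Blues" = 911, "Pop Music" = 1008,
            "Raps over Beats" = 222, "Spoken Word" = 43, "Orchestral" = 123,
            "Electronica" = 31, "Soulful Choruses and Soloists" = 21,
            "World Music" = 43, "Miscellaneous" = 45)

# Share of all samples held by each Super Genre, in percent
shares <- round(100 * counts / sum(counts), 1)

# Largest share first
print(sort(shares, decreasing = TRUE))
```

On these counts, Funk and Soul accounts for roughly 60% of the samples, with Pop Music and Jazz and Blues both in the mid-teens.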

Now, let's try the histogram, as in Part 2. As mentioned earlier, the histogram automatically creates bins for the publication years, thereby factorizing without making us mess about with the raw data too much. Let's give it a try using our original settings, then increase the bin width a bit, and see how it changes our picture of the data:

> qplot(Sampled.Publishing.Date, data=factorized, color=Sampled.Genre, fill=Sampled.Genre, geom ="histogram",binwidth=2)

Again, a similar look to our earlier sets. Let's see what we can get if we bin for every 5 years instead:

> qplot(Sampled.Publishing.Date, data=factorized, color=Sampled.Genre, fill=Sampled.Genre, geom ="histogram",binwidth=5)

It seems that as we factorize, we expose our initially observed trends more dramatically. Also, note the virtual elimination of Jazz and Blues after 1980. Our results seem to indicate that even with our factorization, Funk and related music is heavily represented, showing that it is not simply an issue of an expanded genre. If we can cross-compare genres in a more concrete sense, perhaps we can identify the effects, especially in comparison to overall samples. It would be excellent if we could find some figures regarding the overall popularity of these genres as well. It also seems prudent to investigate our sampling songs in a bit more detail, and see if we can determine a few further trends.
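
One possible way to cross-compare genres more concretely is a cross-tabulation of period against genre. The sketch below is only an illustration: the data frame toy contains made-up rows (it borrows our column names but none of our data), so its counts say nothing about the real dataset.

```r
# Toy data frame (made-up rows) using the column names from this series
toy <- data.frame(
  Sampled.Publishing.Date = c(1969, 1972, 1974, 1977, 1979, 1983, 1986, 1988),
  Sampled.Genre = c("Jazz and Blues", "Funk and Soul", "Jazz and Blues",
                    "Funk and Soul", "Pop Music", "Funk and Soul",
                    "Pop Music", "Funk and Soul")
)

# Bin years into 5-year periods, closed on the left: [1965,1970), [1970,1975), ...
toy$Period <- cut(toy$Sampled.Publishing.Date,
                  breaks = seq(1965, 1990, by = 5),
                  right = FALSE, dig.lab = 4)

# Rows are periods, columns are genres, cells are sample counts
period_by_genre <- table(toy$Period, toy$Sampled.Genre)
print(period_by_genre)

# Split at 1980 to see which genres remain afterwards
after_1980 <- table(toy$Sampled.Publishing.Date >= 1980, toy$Sampled.Genre)
print(after_1980)
```

Run against factorized instead of the toy frame, the same two tables would put numbers on what the histograms only suggest.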