Graph Problems

Groups:

Group 1 (xena01-03) - Burnett, Jesse; Whitten, Marcus; Reyes, Miguel

Group 2 (xena04-06) - Newton, Michael; Ang, Sam; Croxton, John

Group 3 (xena07-09) - Skogman, Brett; Fordin, Sarah; Viltoft, Jorgen

Group 4 (xena10-12) - Holloway, Taylor; Walker, Blair; Usiri, Calvin

Group 5 (xena13-15) - Andres, Robbie; Samoray, Nicholas; Witecki, Ian

Group 6 (xena16-18) - Koeller, Jordan; Chang, Stephen; Burton, Craig

Group 7 (xena19-21) - Yang, Mary; Herbert, Emily; Bomer, Dan

Data Set:

This week's data comes from MEDLINE, a data set that tracks publications in medicine and related areas of biology. The data is stored in a set of XML files in /data/BigData/Medline/. These files have a large number of entries with the tag "MedlineCitation". Each citation has associated "DescriptorName" entries that list the topic terms that are significant for it. Each descriptor is labeled as being a major topic or not with the "@MajorTopicYN" annotation.

For this week, you will be looking at pairs of terms that together describe an article. You will do this by making a graph of the terms that each citation is described by. The vertices are the descriptor names, and the edges connect terms that appear on the same citation. Use the combinations method on collections to get all the pairs. I will have you work with two different graphs: one that includes only major topics, and one that includes all topics.
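The pair generation described above can be sketched on plain Scala collections. This is only an illustration under stated assumptions: the `TermPairs` object, the `Citation` alias, and the function names are hypothetical, each citation is assumed to be already parsed into a list of descriptor names, and the terms are sorted so that the same two terms always produce the same tuple regardless of the order they appear in.

```scala
object TermPairs {
  // Hypothetical stand-in for one parsed citation: just its descriptor names.
  type Citation = Seq[String]

  // All pairs of distinct terms on a single citation.
  // distinct avoids pairing a term with itself; sorted gives each pair one canonical form.
  def pairsOf(citation: Citation): Seq[(String, String)] =
    citation.distinct.sorted
      .combinations(2)
      .map(pair => (pair(0), pair(1)))
      .toSeq

  // Distinct pairs across all citations; each one would become an edge.
  def distinctPairs(citations: Seq[Citation]): Set[(String, String)] =
    citations.flatMap(pairsOf).toSet
}
```

For example, `TermPairs.pairsOf(Seq("Humans", "Pregnancy", "Esophagus"))` yields three pairs. On the real data the same `flatMap` would likely run over an RDD of citations instead of a local `Seq`.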

In Class Questions:

1. How many distinct descriptor names are there?

2. What are the 10 most common descriptor names?

3. What if you restrict it to only major topics?

4. Theoretically, how many pairs of descriptor names could there be? (This is just a math question based on #1.)

5. How many actual pairs are there? (This is how many edges you will have in your graph for the out of class work.)

6. Should a graph of term pairs be directed? Why or why not? What does this imply for your graph?
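As a starting point for questions 1 and 2, here is a minimal sketch of counting terms on toy data with plain Scala collections. The `TermCounts` object and its function names are hypothetical, and the sketch assumes a term is counted once per citation that mentions it. Spark RDDs offer similar operations (`flatMap`, `distinct`, `count`, `reduceByKey`), so the same shape should carry over to the full data set.

```scala
object TermCounts {
  // Number of distinct descriptor names across all citations.
  def distinctCount(citations: Seq[Seq[String]]): Int =
    citations.flatten.distinct.size

  // The n most frequent names with the number of citations that mention each.
  // Ties are broken alphabetically so the result is deterministic.
  def topN(citations: Seq[Seq[String]], n: Int): Seq[(String, Int)] =
    citations
      .flatMap(_.distinct)
      .groupBy(identity)
      .map { case (term, hits) => (term, hits.size) }
      .toSeq
      .sortBy { case (term, count) => (-count, term) }
      .take(n)
}
```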

Before you leave class, one member of your group needs to send me an email with your group answers to these questions and the code you wrote to solve them. Make sure the email also includes the names of all the group members who were present to work on this.

Between Class Questions:

All the code that you write to answer these questions should be put in a package called sparkgraphx in the in-class repository. You should also make a file called sparkgraphx.md in the top level of your repository that includes a write-up with your answers to the questions and any requested plots.

For each of the following, give two answers. One answer is for the graph with only major topics, and the other is for the graph with all topics. In each case, I want you to remove the 10 most common terms from the graph.

1. How many connected components does the graph have? How big are they?

2. What are the top terms by PageRank? (Give the terms and their ranks.)

3. Make a histogram plot of the degree distribution.

4. How far apart are the following terms in your graph, measured by the length of the shortest path between them?

a. Pregnancy; Esophagus

b. Femoral Artery; Electroencephalography

c. Taxes; Guinea Pigs
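To make "how far apart" concrete, the following is a minimal sketch of a breadth-first search that counts edges on a shortest path in a small in-memory graph. The `HopDistance` object and its functions are hypothetical, and the sketch assumes edges can be followed in both directions. It is not the GraphX approach itself; for the full graph you would more likely use the Pregel API or the `ShortestPaths` helper in `org.apache.spark.graphx.lib`.

```scala
import scala.collection.immutable.Queue

object HopDistance {
  // Build an adjacency map from an edge list, following each edge both ways (an assumption).
  def adjacency(edges: Seq[(String, String)]): Map[String, Set[String]] =
    edges
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, ns) => v -> ns.map(_._2).toSet }

  // Breadth-first search: number of edges on a shortest path, or None if unreachable.
  def distance(adj: Map[String, Set[String]], from: String, to: String): Option[Int] = {
    @annotation.tailrec
    def loop(frontier: Queue[(String, Int)], seen: Set[String]): Option[Int] =
      frontier.dequeueOption match {
        case None => None
        case Some(((v, d), rest)) =>
          if (v == to) Some(d)
          else {
            // Only visit neighbors that have not been queued before.
            val next = adj.getOrElse(v, Set.empty[String]) -- seen
            loop(rest ++ next.map(n => (n, d + 1)), seen ++ next)
          }
      }
    loop(Queue((from, 0)), Set(from))
  }
}
```

A result of `None` means the two terms are in different connected components, which ties this question back to question 1.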