Lecture 5

Here is the Twitter data that we will discuss in class. You need to be signed in to your @UCSC.EDU account to access it. The data is roughly 500MB compressed, and I guess some 2GB uncompressed; it is divided into 24 files, each containing a 1% sample of the tweets posted during one hour. Each file consists of many lines, each of which is a JSON record; you can easily read it using Python, for example (many other languages work too).
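As a starting point, here is a minimal sketch of how one file could be read line by line in Python. It assumes the files are gzip-compressed and UTF-8 encoded, which is a guess (the compression format is not stated above); the function name read_tweets and the file name in the usage note are made up for illustration.

```python
import gzip
import json


def read_tweets(path):
    """Yield one dict per JSON line, skipping blank or malformed lines."""
    # Assumption: files ending in .gz are gzip-compressed; anything else is
    # read as plain text. Swap in bz2.open (or similar) if the format differs.
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                yield json.loads(line)
            except json.JSONDecodeError:
                continue  # skip truncated or malformed records
```

For example, `for record in read_tweets("hour_00.json.gz"): ...` would stream one hour of data without loading the whole file into memory, which matters at roughly 2GB uncompressed.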

Here is a link to IPython, which we will use for our reputation system experiments.

Homework Assignment 2

This is due one week from now, on Tuesday January 31. Please do the assignment as a Google Doc (you can easily include images, PDFs, etc. in a doc), so that we can post the links to all the docs, look at what we have done and the ideas we had, and discuss them in class. We will keep improving the reputation system and doing more measurements, but having some results in one week will help get us up to speed.

Problem 1

Consider the PageRank computation, which can be written symbolically as

x' = cPx + (1-c)/N * 1,

where 1 denotes the vector of all ones and N is the number of nodes.

Assume that instead of the stochastic matrix P, we use P_alpha = alpha * P + (1 - alpha) * I, where alpha \in [0, 1] is a scalar, and I is the identity matrix. How should we change c (that is, which c_alpha should we use), so that the PageRank computation is not affected?
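To experiment with this numerically, the iteration can be sketched in plain Python as below. This is only an illustration under stated assumptions: P is given as a list of rows and is column-stochastic (each column sums to 1), the default c = 0.85 is a conventional choice rather than anything fixed by the problem, and the names pagerank and mix_with_identity are made up here. The sketch does not answer the question; it only lets you compare the fixed points obtained with P and with P_alpha for a candidate c_alpha.

```python
def pagerank(P, c=0.85, tol=1e-12, max_iter=1000):
    """Iterate x' = c P x + (1 - c)/N * 1 until the L1 change is below tol.

    P is a list of rows, assumed column-stochastic (each column sums to 1).
    """
    n = len(P)
    x = [1.0 / n] * n  # start from the uniform distribution
    for _ in range(max_iter):
        new_x = [
            c * sum(P[i][j] * x[j] for j in range(n)) + (1.0 - c) / n
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(new_x, x)) < tol:
            return new_x
        x = new_x
    return x


def mix_with_identity(P, alpha):
    """Return P_alpha = alpha * P + (1 - alpha) * I."""
    n = len(P)
    return [
        [alpha * P[i][j] + (1.0 - alpha) * (1.0 if i == j else 0.0)
         for j in range(n)]
        for i in range(n)
    ]
```

A candidate c_alpha can then be checked by comparing `pagerank(P, c)` with `pagerank(mix_with_identity(P, alpha), c_alpha)` on a small test matrix.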

Problem 2

Look at the Twitter data and, using IPython:

Plot the distribution of the number of times each user posts.
Plot the distribution of followers -- the distribution of how many followers users have.
Is there a correlation between the number of followers and the number of posts per day? Can you measure it?
Describe and implement a simple reputation system for Twitter users, and:
1. Plot the distribution of ranks.
2. List the top 40 usernames.
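For the measurement parts of this problem, a possible starting point is sketched below. It assumes each record is an already-parsed dict with a "user" object carrying "screen_name" and "followers_count" fields, as in the standard Twitter API tweet format; check this against the actual data. The helper names user_stats and pearson are made up for illustration, and the reputation system itself is left to you.

```python
from collections import Counter
from math import sqrt


def user_stats(tweets):
    """Return (posts, followers): per-user tweet counts and follower counts."""
    posts = Counter()
    followers = {}
    for t in tweets:
        user = t.get("user")
        if not user:
            continue  # skip records that carry no user object
        name = user["screen_name"]
        posts[name] += 1
        # Keep the largest follower count seen for this user in the sample.
        followers[name] = max(followers.get(name, 0),
                              user.get("followers_count", 0))
    return posts, followers


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)
```

With `posts, followers = user_stats(records)`, the two distributions can be plotted from `posts.values()` and `followers.values()`, and `pearson` can be applied to the paired values for each user. Since the data is a 1% sample, the post counts are only proportional to the true posts per day.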
