Please feel free to edit this page in any way you see fit to share information or announcements about the class!
Thanks Alex. I was wondering why my algorithm was performing verrryyyy poorly. :)
-- Rekha
ALL (Alex, Monday March 19 but nearly Tuesday):
I found a bug in the sampling program. It mutated the histograms you passed in by calling .pop() on arrays around the median. I resolved this by making a deepcopy at the beginning of the function, so there is a throwaway data structure to work on. sampling.py is now updated to version 3.
This should only cause you problems if you do repeated sampling. I did 30 samples and had to figure out why my samples were dwindling to 0. The sample standard deviations were pretty hilarious.
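For illustration, here is a minimal sketch of the defensive-copy pattern (the function name and internals are made up, not the actual sampling.py code):

import copy

def sample_around_median(histogram, n):
    # Work on a throwaway deep copy so the caller's histogram is never mutated.
    hist = copy.deepcopy(histogram)
    samples = []
    for _ in range(min(n, len(hist))):
        samples.append(hist.pop(len(hist) // 2))  # .pop() now only shrinks the copy
    return samples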
--Alex
---------------
Ranjana: I'd heard the last class was going to be held from noon-3. I assume the room is going to be the same.
--Alex
When is our final class and where is it going to be held?
Ranjana
---------------
I heard that some of you were confused by my evaluation code. In my code, the r of a pair (x, r) can be anything, e.g., a review id. In my case, I used r to contain a whole review object, which I had defined as follows:
from datetime import datetime

class Review:
    def __init__(self, data):
        self.readable = str(data)
        for k, v in data.iteritems():  # Python 2; use data.items() in Python 3
            if k == 'useful-count':
                kk = 'useful_count'    # make it a legal attribute name
            else:
                kk = k
            setattr(self, kk, v)
            if kk == 'date':
                self.time = datetime.strptime(self.date.split()[-1],
                                              '%m/%d/%Y')

    def __str__(self):
        return self.readable

    def __repr__(self):
        return self.readable
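For example, pairing it with the per-restaurant data files might look like this (a sketch; the file name is just an example, and it assumes each line of the file is one review blob):

import json

reviews = [Review(json.loads(line)) for line in open("tacos-moreno-santa-cruz.json")]
print(reviews[0].time)   # datetime parsed from the 'date' field by the constructor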
Important on evaluation (Luca, Monday March 19):
I put up a new version of sampling.py: the previous version was flawed, since it could select more than 20 reviews (20 was being interpreted as the weight of the reviews to be selected, rather than as the total number of reviews).
Note that my code relies on a notion of hashing and equality for reviews; I form a set of reviews in the code to avoid sampling the same review more than once.
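If you reuse a Review class like the one above, one rough way to give it hashing and equality is to key on the raw data string (just a sketch; my code does not depend on this particular choice):

    # Additional methods inside the Review class:
    def __eq__(self, other):
        return isinstance(other, Review) and self.readable == other.readable

    def __hash__(self):
        return hash(self.readable)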
With these corrections, for my algorithm, the improvement on the baseline is about 40%.
Important on evaluation (Luca, Friday March 16):
A couple of things:
User bias:
While working on the problem, I noticed that trusting the votes of users who have voted only a few times is not a good idea, since there are not enough samples to judge their bias well. Users who have voted a large number of times reveal their bias better, so information about their bias can be trusted more when building the reputation system.
Given that, my opinion is that giving restaurants information about users who have reviewed only 1, 2, 3, or 4 times in total may be useless. However, information about more frequent reviewers, whose biases are more apparent (and who presumably frequent restaurants more often), would be a more valuable resource for restaurants.
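For instance, a sketch of dropping infrequent voters before estimating bias (the data structure and cutoff here are placeholders, not from my code):

def frequent_voters(votes_by_user, min_votes=5):
    # Keep only users with enough votes to estimate their bias; the cutoff is arbitrary.
    return dict((u, v) for u, v in votes_by_user.items() if len(v) >= min_votes)

# Toy example: {user id: list of that user's votes}
print(frequent_voters({'u1': [4, 5, 3, 4, 5, 2], 'u2': [5]}))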
------------------------------
The following is my understanding of how to find the bias of a user.
For each restaurant i, assume r(i) is an N(0, sigma^2) distribution with a large value of sigma (for greater spread). From the distribution of ratings for restaurant i, tune sigma to get a final distribution r(i) for each restaurant i.
We need to find the bias of each user j. For this, we need to find, for each user j, q(k) for k between -5 and 5, where q(k) = r((vote of user j) - k). To infer the bias of user j from q(k): if q(k) is zero for all k except k = 0, he is an unbiased user; otherwise he is biased, with the bias indicated by the values of k for which q(k) is non-zero.
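A rough sketch of what computing q(k) could look like, modeling r(i) as a normal density (the mean, sigma, and vote values below are placeholders, not fitted to the data):

from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (sigma * sqrt(2 * pi))

def q(k, user_vote, mu_i, sigma_i):
    # q(k) = r((vote of user j) - k): the restaurant's rating density
    # evaluated at the user's vote shifted by k.
    return normal_pdf(user_vote - k, mu_i, sigma_i)

# For an unbiased user, q(k) should peak at k = 0.
for k in range(-5, 6):
    print((k, round(q(k, user_vote=4, mu_i=4.0, sigma_i=1.0), 4)))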
Please comment.
Ranjana
-----------------------------
Hi Ranjana. The name of each file (without the .json extension) is Yelp's "alias" for the restaurant, which is a unique string that identifies the restaurant. If there is a number at the end, it typically means there are multiple restaurants with the same name.
To double check, you can always go to the restaurant's page on Yelp, which has the URL http://www.yelp.com/biz/alias_name. For example, http://www.yelp.com/biz/tacos-moreno-santa-cruz and http://www.yelp.com/biz/tacos-moreno-santa-cruz-2. In this case tacos-moreno is actually a taco shop on UCSC's main campus by Merrill College and tacos-moreno-2 is on Water Street.
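In Python terms, a file name maps to its Yelp page like this (the file name is just an example):

import os

alias = os.path.splitext("tacos-moreno-santa-cruz-2.json")[0]
print("http://www.yelp.com/biz/" + alias)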
Aryeh
Aryeh, that was really useful. I see that there are a few suffixed with 2, 3, etc. Are the reviews just split into two JSON files, or should we take the one with the higher number? E.g. tacos-moreno-santa-cruz-2.json and tacos-moreno-santa-cruz.json
Ranjana
Yelp plus how many users found "useful"
I have updated the data one more time to reflect how many users found each post "useful", "funny", and "cool". The data are in the same format as last time (i.e. non-large-blob), which means that each file needs to be read in line by line, like this:

import json

for line in open("some-restaurant.json"):   # file name is just an example
    record = json.loads(line)               # one small JSON blob per line
The data are in a file called "yelp_santa-cruz_588_feb11.zip", on the Yelp wiki page.
Yelp "Non-Large-Blob"
I am providing a "non-large-blob" version of the data, which means that instead of a file that is one large JSON blob, each file has a small JSON blob on each line. This should make the data much easier to work with, and may be more efficient for larger files. The file is on the "Yelp Data" wiki page!
-Aryeh
Update on the Yelp Data
Hi all. I have re-written my scraping routine and have a new version of the data which should feature information about every restaurant listed in and around Santa Cruz, including several dozen restaurants in surrounding areas such as Capitola, Bonny Doon, and Scotts Valley. I have uploaded the file to the Yelp Data page and to this main wiki page. The data are in the same format as before: there is a "restaurants.json" file, which describes each restaurant, and 558 accompanying restaurant files. The new file is about 8 MB compressed and around 24 MB uncompressed.
Aryeh
Question on Yelp Data
I have eaten a lot around Santa Cruz and there are many really good restaurants with good reviews and awards that are missing: Woodstock's Pizza, Hindquarter, 515 Kitchen and Cocktails, Crow's Nest, Betty's Burgers, and a lot more... I checked and they are on Yelp. So, why aren't they included in the list?
--Lenia
Complete Yelp Data
I have gathered and assimilated the Yelp information and have uploaded it to http://users.soe.ucsc.edu/~aryeh/yelp/yelp_data.zip. I can offer my source code at a later date. The data includes a README with suggestions about reading the data and describes some high-level aspects of the process used to gather the data.
Aryeh
Homework #2 Questions:
In case you were wondering, I asked Luca: the time zones of the tweets and the file names are all GMT. --Alex
Thanks. I spent many hours trying to get my machine set up, but I could not. So, I will stick with Ubuntu, where everything works smoothly.
--Lenia
I will second Luca and recommend installing Matplotlib in an Ubuntu VM. (Actually, the Gnome text editor became faulty after I updated Ubuntu, so you might want to do Fedora instead.) I think I had Matplotlib installed in Snow Leopard, but cruft leftover from that installation has prevented me from getting it working again through what seem to be the most recommended methods on blogs and forums.
These Ubuntu packages should get your graphics up and running:
sudo apt-get install python-matplotlib python-scipy python-numpy ipython
--Alex
Has anyone managed to install Matplotlib on Mac OS X Lion? Or, what other tool would you suggest for plotting results?
--Lenia
In case Rekha's method does not work, you might try downloading the latest source from git (git clone https://github.com/matplotlib/matplotlib.git), then compile it (cd matplotlib; python setup.py build) and install it (sudo python setup.py install).
Aryeh
Yes. It fails with errors because of missing freetype2 header files. Do the following before installing matplotlib:
1. Download and install freetype2
2. copy the directory /usr/local/include/freetype2/freetype/ to /usr/local/include/freetype/
3. 'cp /usr/X11/include/ft2build.h /usr/local/include/ft2build.h'
Rekha
Notes from Alex on scraping
These are supplied a little late for the folks actually working with Yelp, but in case anybody else wants to work with scraping some other site in the future, these lines of Python may be helpful to you.
Form requests
Using Firebug in Firefox, which monitors website resource requests and performance among many other things, I found this is the general format for the request made when you submit a search.
http://www.yelp.com/search/snippet?attrs=&cflt=&find_desc=burritos&find_loc=santa+cruz,+ca&main_places=CA:Santa_Cruz::,CA:Capitola::,CA:Soquel::,CA:Felton::&mapsize=small&parent_request_id=728aa71759312a9d&rpp=10&show_filters=1&sortby=best_match&start=10
The relevant query-string portions of that HTTP GET request seem to be:
start=10
- offset in the results list; something is returned even if the request is made for an absurd list offset.
parent_request_id=728aa71759312a9d
- this can be shared between multiple web sessions, e.g. Firefox and wget.
There does not appear to be a security token required to prevent scrapers. I copied the request URL from Firebug to wget.
So, one could use this curl command to download the 68 results for "burrito" spread across 7 files:
curl "http://www.yelp.com/search/snippet?attrs=&cflt=&find_desc={burritos,sushi}&find_loc=santa+cruz,+ca&main_places=CA:Santa_Cruz::,CA:Capitola::,CA:Soquel::,CA:Felton::&mapsize=small&parent_request_id=728aa71759312a9d&rpp=10&show_filters=1&sortby=best_match&start=[1-5]0" -o "#1_#2.txt"
It'll take a little more page parsing to find where the total number of results is.
Form response
The response content is HTML embedded in JSON. Python's json library handles the content just fine, and BeautifulSoup parses the embedded HTML content just fine. (BeautifulSoup may even be included with everyone's Python now.)
The relevant business page links appear to follow this format:
<a id="bizTitleLink18" href="/biz/el-palomar-santa-cruz#query:burritos">
19. El Palomar
</a>
So, any anchor link with an id matching "bizTitleLink\d+" should point to a business page.
Quick 'n' dirty Python code
These are the relevant parts of my Python session to see the above results.
import json
import BeautifulSoup   # BeautifulSoup 3 style import

# Load one saved page of search results; the HTML lives in the JSON's "body" field.
junk = open("burritos_1.txt", "r")
j = json.loads(junk.read())

bs = BeautifulSoup.BeautifulSoup(j["body"])
print(bs.prettify())
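And a rough sketch of pulling out the business-page links described above, continuing from the session (this part was not in my original session):

import re

links = bs.findAll("a", id=re.compile(r"bizTitleLink\d+"))
for a in links:
    print(a["href"])   # e.g. /biz/el-palomar-santa-cruz#query:burritos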