Hi Ranjana,
It looks like there may be some review files missing from the latest release. I will look into the issue soon. La Tosca is a restaurant in Santa Cruz that has no reviews (http://www.yelp.com/biz/la-tosca-santa-cruz). The former release (non-blob, the one without cool/useful) is stable.
Aryeh
---
Aryeh,
There is one restaurant I found in restaurants.json, la-posta-santa cruz, but in the reviews it is named la-tosca. Is restaurants.json updated with the new restaurants in the reviews? I am picking the review JSON file names from restaurants.json. Thanks a bunch, btw.
Ranjana
As mentioned on the main page, I have updated the Yelp data this evening. There are now 588 restaurants described, all of which are in or around Santa Cruz. Several dozen restaurants lie outside of Santa Cruz, in such cities as Capitola, Bonny Doon, and other locales. The data are about 8 MB compressed and a little over 20 MB uncompressed. Find the updated data below.
Aryeh
------
README
Aryeh Hillman
January 30th, 2012
--- THE FILES ---
In total, there are 39 files.
The first file, "restaurants.json", is raw information (more information below) from Yelp's website about the 38 restaurants in Santa Cruz. The data are rich, and each file has some very interesting information, such as:
- unique restaurant id (i.e. alias)
- average rating with high precision (e.g. 3.65384615384615)
- longitude and latitude
- average cost (e.g. two dollar signs, three dollar signs)
The remaining files are reviews for each of the 38 restaurants. They contain information such as:
- rating
- unique user id
- date of review
- full text of the review
The name of each file is the unique alias name for each restaurant, suffixed by the "json" extension.
--- READING IN DATA ---
Perhaps the easiest way to read a file is by using Python's JSON decoder. For example:
    import json

    # Each review file is named after the restaurant's alias.
    with open("el-palomar-santa-cruz.json") as f:
        reviews = json.load(f)
    for r in reviews:
        print(r['name'])  # name of the reviewer
... will give the names of all the people that reviewed the El Palomar restaurant.
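Since a release may occasionally lack a review file for some restaurant, it can help to check for the file before loading it. The following is a minimal sketch, assuming only the naming rule above (alias plus ".json"). The helper names and the data_dir parameter are hypothetical and not part of the data set.

```python
import json
import os


def review_filename(alias):
    """Return the review file name for a restaurant alias (alias + ".json")."""
    return alias + ".json"


def load_reviews(alias, data_dir="."):
    """Load the reviews for one alias, or return None if the file is absent."""
    path = os.path.join(data_dir, review_filename(alias))
    if not os.path.exists(path):
        return None
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def missing_review_files(aliases, data_dir="."):
    """List the aliases that have no matching review file in data_dir."""
    return [alias for alias in aliases
            if not os.path.exists(os.path.join(data_dir, review_filename(alias)))]
```

Calling missing_review_files with the aliases taken from "restaurants.json" would list any restaurants whose reviews are not in the directory.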
--- POTENTIAL QUESTIONS ---
Potentially interesting questions (more to come!):
- Does the cost of a restaurant affect its reputation?
- Which neighborhoods have the best culinary reputation?
- Length of review and the rating (perhaps a bi-modal distribution)
- Hometown and rating
- Is there a strongly connected component that links every restaurant in Santa Cruz? If so, how many degrees of separation are there between restaurants?
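One possible way to approach the last question is sketched below. It treats two restaurants as linked whenever at least one user reviewed both, and counts hops with a breadth-first search. This is an undirected simplification, so it finds ordinary connected components rather than strongly connected ones. The field name "user_id" is an assumption; substitute whatever key holds the unique user id in the review files.

```python
from collections import defaultdict, deque


def build_restaurant_graph(reviews_by_alias, user_key="user_id"):
    """Link two restaurants whenever at least one user reviewed both.

    user_key is a hypothetical field name; replace it with the real one.
    """
    restaurants_by_user = defaultdict(set)
    for alias, reviews in reviews_by_alias.items():
        for review in reviews:
            restaurants_by_user[review[user_key]].add(alias)

    # Every restaurant gets a node, even if it shares no reviewers.
    graph = {alias: set() for alias in reviews_by_alias}
    for aliases in restaurants_by_user.values():
        for alias in aliases:
            graph[alias] |= aliases - {alias}
    return graph


def degrees_of_separation(graph, start):
    """Breadth-first search: hops from start to every reachable restaurant."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in graph[current]:
            if neighbor not in distance:
                distance[neighbor] = distance[current] + 1
                queue.append(neighbor)
    return distance
```

If the dictionary returned by degrees_of_separation contains every restaurant, then all restaurants lie in one component, and the largest value is the greatest separation from the starting restaurant.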
--- TECHNICAL DETAILS ---
The restaurant reviews are split into separate files for ease of access. This is in part because there are caveats to working with large files in Python (e.g. I couldn't seem to easily work with strings longer than about 5574656 characters).
As mentioned, the list of restaurants ("restaurants.json") represents raw data directly from Yelp. Those data are the concatenation of 38 JSON files which are sent from http://www.m.yelp.com to mobile devices so that information and reviews can be parsed for the mobile version of Yelp. Thus, the data are fairly rich and are organized in a pattern that is most likely optimal for processing on mobile devices. Since the data were already in JSON format, relatively little parsing was necessary.
The 38 review datasets were gathered through queries to the main Yelp website, http://www.yelp.com. Each restaurant's URL has the form http://www.yelp.com/biz/ALIAS, where the alias is a unique name assigned to each restaurant. For example, The Crêpe Place uses the alias "the-crepe-place-santa-cruz". Webpages were downloaded using a combination of Bash with curl, and Python with httplib2 (httplib2 is a fantastic alternative to the built-in httplib).
Relatively extensive parsing was necessary to scrape reviews from the main Yelp website. First, data were downloaded using the primary URL. Many restaurants had more than one page of reviews, however, so there were ~N/40 other pages to download, where N is the number of reviews for the restaurant (page two for the Crêpe Place, for example, was obtained from http://www.yelp.com/biz/the-crepe-place-santa-cruz?start=40). Each page of reviews was parsed using the Python library called Beautiful Soup (BS). BS did an absolutely fabulous job parsing the HTML. The parsed HTML was then entered into a Python dictionary and serialized using the JSON encoder (i.e. file.write(json.dumps(dictionary))).
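The paging scheme described above can be sketched as follows. This is an illustration rather than the script that was actually used: the function names are hypothetical, the page size of 40 comes from the description above, and the site may paginate differently today.

```python
import json


def review_page_urls(alias, review_count, page_size=40):
    """Return the URL of every review page for one restaurant.

    page_size=40 follows the description above and is an assumption
    about how the site paginated at the time.
    """
    base = "http://www.yelp.com/biz/" + alias
    # The primary URL is page one; later pages start at 40, 80, 120, ...
    offsets = range(page_size, review_count, page_size)
    return [base] + ["%s?start=%d" % (base, start) for start in offsets]


def save_reviews(reviews, path):
    """Serialize a list of review dictionaries to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(json.dumps(reviews))
```

For a restaurant with 100 reviews, review_page_urls returns three URLs: the primary one, then "?start=40" and "?start=80".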
--- CAVEATS ---
Yelp censors a fair number of reviews and perhaps their heuristic is reasonable. Censored reviews are relatively easy to access, if you know where to look for them (there is a link on each page with the phrase "(n Filtered)"). Obtaining those data for analysis was not practical for the scope of this project, however, since Yelp only serves the data after the visitor successfully responds to a CAPTCHA. Since the primary purpose of the CAPTCHA may be to prevent web crawlers from accessing the content, there may be workarounds to automate the gathering of the data.
Within the data, there is one restaurant which is not in Santa Cruz proper, the Shadowbrook Restaurant. When limiting one's search to restaurants in Santa Cruz, the restaurant still comes up in search results. I can think of two potential reasons: (1) it is very well regarded in ratings and has many ratings; (2) advertising money. It is relatively easy to tell that the restaurant is not in Santa Cruz since the JSON file is named "shadowbrook-restaurant-capitola.json".
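Because the city appears at the end of the alias, restaurants outside Santa Cruz can be filtered by name alone. The sketch below is one hedged way to do it. The function name is hypothetical, and the optional trailing number (e.g. "-santa-cruz-2") is an assumption about how aliases are disambiguated, not something confirmed by these data.

```python
import re

# Matches aliases ending in "-santa-cruz", optionally followed by "-<number>".
# The numeric suffix is an assumed convention; drop it if it does not apply.
_SANTA_CRUZ_SUFFIX = re.compile(r"-santa-cruz(-\d+)?$")


def is_santa_cruz_alias(alias):
    """Return True if the alias appears to name a Santa Cruz restaurant."""
    return _SANTA_CRUZ_SUFFIX.search(alias) is not None


aliases = [
    "el-palomar-santa-cruz",
    "the-crepe-place-santa-cruz",
    "shadowbrook-restaurant-capitola",
]
outside = [a for a in aliases if not is_santa_cruz_alias(a)]
```

Here outside ends up containing only "shadowbrook-restaurant-capitola".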