
Self-Service Web Science API

posted Feb 16, 2011, 4:28 PM by mls mls   [ updated Mar 11, 2011, 2:48 PM by Abraham Bagherjeiran ]
A challenging problem for publishers is that ad networks lack the right information about individual web pages. Ad networks deal in inventory, which they treat as homogeneous chunks of web pages. They typically aggregate inventory at the level of publishers (domains) and stop there. Most ad networks won't even talk to a publisher that does not fall into a clean group.

Here at ThinkersRUs, we think this should change. We want to become the world's leading provider of web science solutions. As part of this ambitious goal, we are building a product to help small publishers better monetize their pages and help advertisers reach better users. A key component of this product is the ability to quickly determine what type of content is on a page.

While we are still building the product, we would like to release the page categorization API, the first component of our Science API. For now, it is freely available to anyone who wants to use it. All we ask is that you try it out and contact us with your comments.

  • Uses well-known standard topic taxonomies, which makes it easy to integrate with other products.
  • No software, setup, or training required. Just add it to any site and get started.
  • Real-time crawling. You do not have to wait hours to get started, as with AdSense. The API crawls and processes the page in real time.

  • Standard Topical Taxonomies
    • Interactive Advertising Bureau (IAB) Contextual Topics
      • The top-tier topics as defined by the IAB.
    • Open Directory
      • An editorially maintained directory organized for easy navigation. You can expect these categories to be very deeply nested, which is great for adding features to your own analysis. These categories are good for directory-based navigation.
    • Delicious Tags
      • Tags are words or phrases that individual users associate with pages through social bookmarking. You can expect these categories to be at a higher level of abstraction, without any hierarchy. They are great for showing a tag cloud or just displaying on a page.
  • Inappropriate Content
    • Interactive Advertising Bureau (IAB) Non-standard Content
      • Flag a page as containing inappropriate content, meaning it may not be safe for mainstream customers.
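As a rough illustration of how deeply nested Open Directory categories can feed your own analysis, the sketch below expands one category path into features for every level of the hierarchy. It assumes categories arrive as slash-separated paths such as "Top/Business/Investing"; that format and the function name ancestorFeatures are assumptions for illustration, not something the API is documented to return.

```javascript
// Expands a nested category path into itself plus all of its ancestors.
// The "/" separator is an assumption about how nested categories are written.
function ancestorFeatures(path, sep) {
  var separator = sep || "/";
  var parts = path.split(separator).filter(function (p) {
    return p.length > 0; // ignore empty segments from leading/trailing separators
  });
  var out = [];
  for (var i = 1; i <= parts.length; i++) {
    out.push(parts.slice(0, i).join(separator));
  }
  return out;
}

console.log(ancestorFeatures("Top/Business/Investing"));
```

Each prefix can then be used as a separate feature, so pages that share only a higher-level category still have something in common.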
Try out the demo
Check out the demo today. Enter a page and see the categories that appear.

If you do not want these taxonomies, try out the Real-Time Learning API.

Getting Started with the API

There are two ways to call the API: with an HTTP POST request or with an HTTP GET request.
  • Open Directory:
  • Delicious:
  • IAB Inappropriate Content:
  • IAB Contextual Topics:

POST Requests

Here is a sample cURL request to run the categorizer, illustrating the POST method:

curl -v -v '' -d '{"url" : ""}'

The '-v -v' options make curl print the request and response headers. The '-d' option provides the POST data in JSON format. Here is an excerpt of the response:

> POST /9 HTTP/1.1
< HTTP/1.1 200 ...
< X-Request-ID: 36581c47-7f27-4ccb-abd5-174174efeb71
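The same POST request can be sketched in JavaScript for server-side use. This is a minimal sketch under several assumptions: API_BASE is a hypothetical placeholder (the real endpoint URL is not shown here), the names buildPostRequest and categorize are made up for illustration, and the runtime provides a global fetch (for example Node 18+). The Content-Type header simply mirrors what curl sends by default with '-d'.

```javascript
// Hypothetical placeholder; substitute the real API endpoint.
var API_BASE = "https://api.example.invalid/";

// Builds the URL and fetch options for a POST to one taxonomy component.
function buildPostRequest(taxonomy, pageUrl) {
  return {
    url: API_BASE + taxonomy,
    options: {
      method: "POST",
      // curl -d sends this content type by default; mirrored here.
      headers: { "Content-Type": "application/x-www-form-urlencoded" },
      body: JSON.stringify({ url: pageUrl })
    }
  };
}

// Sends the request; returns the parsed JSON body and the X-Request-ID header.
async function categorize(taxonomy, pageUrl) {
  var req = buildPostRequest(taxonomy, pageUrl);
  var res = await fetch(req.url, req.options);
  return {
    requestId: res.headers.get("X-Request-ID"),
    body: await res.json()
  };
}
```

Keeping the request construction separate from the network call makes it easy to inspect exactly what will be sent before pointing the code at the real service.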

The fields of the response are defined as follows:
  • request: This contains the request as provided to the categorizer. The contents of the request vary depending on the method requested.
    • url: Input arguments provided by the client.
  • id: Request ID generated by the server, needed for updating the results. It is also included in the HTTP response header.
  • response: Output returned by server. Content depends on the API that is used. In this case, we are using component 5 (Delicious).
    • green: category of the page, value is the score. For the score, higher is better.
    • market: ...
As we see, each request is assigned a unique request ID, which also appears in the response header. You will need this request ID to provide feedback about the labels. A selection of 5 categories is returned here; the selection should always contain at least the top-scoring result, which is "stocks" in this case.
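To make the field layout concrete, here is a small sketch that picks the highest-scoring categories out of a parsed response. The sample object only mimics the fields described above; its scores are made up, and topCategories is a hypothetical helper name.

```javascript
// Returns the n highest-scoring [category, score] pairs from a parsed response.
function topCategories(data, n) {
  var scores = data.response || {};
  return Object.keys(scores)
    .map(function (k) { return [k, scores[k]]; })
    .sort(function (a, b) { return b[1] - a[1]; }) // higher score first
    .slice(0, n);
}

// Hypothetical sample shaped like the fields above; the scores are invented.
var sample = {
  request: { url: "http://example.com/" },
  id: "36581c47-7f27-4ccb-abd5-174174efeb71",
  response: { stocks: 0.9, market: 0.7, green: 0.4 }
};

console.log(sample.id, topCategories(sample, 2));
```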

GET Requests

Here is how the GET method works:

curl -v -v ''

Note that the 'callback' parameter is required. If you do not want to use a callback function, try the POST method described above.
The response is as follows:


Wrapping the response in a call to your callback function is a common trick for getting cross-site JavaScript to work. Follow these steps in your page to use the API:
  • Use the YUI JSONP method, which does something like this:
  • Generate a <script> tag with the src attribute pointing to the GET URL.
  • Define the callback function indicated in your request.
  • The same callback function could be used repeatedly or you can define a new one for each request.
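If you prefer a new callback for each request, one possible approach is to register a uniquely named, one-shot function and pass its name as the 'callback' parameter. The sketch below is an illustration only: registerCallback and the "categorizerCb" prefix are hypothetical names, and root would be window in a browser (a plain object works for testing).

```javascript
// Registers a one-shot callback under a unique name on `root`
// and returns that name for use as the 'callback' parameter.
function registerCallback(root, handler) {
  registerCallback.counter = (registerCallback.counter || 0) + 1;
  var name = "categorizerCb" + registerCallback.counter;
  root[name] = function (data) {
    delete root[name]; // remove itself so stale callbacks do not pile up
    handler(data);
  };
  return name;
}

// Usage with a plain object standing in for window:
var fakeWindow = {};
var cbName = registerCallback(fakeWindow, function (data) {
  console.log("request id:", data.id);
});
fakeWindow[cbName]({ id: "example-id", response: {} });
```

A unique name per request avoids mixing up responses when several requests are in flight at once.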

Here is a sample Ruby script that generates the query string to append to the API URL:

ruby -r cgi -r json -e 'j = {url: ""}; callback = "myFunc"; puts "?arg=#{CGI.escape(j.to_json)}&callback=#{callback}"'

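Here is a sample JavaScript function that illustrates the use of the callback:

function classify(url) {
  var taxonomy = 9;
  // The API expects a URL-encoded JSON object in the 'arg' parameter.
  var params = encodeURIComponent(JSON.stringify({url: url}));
  var api = "" + taxonomy + "?arg=" + params + "&callback=onPageCategorization";
  // Show the generated request URL in the 'call' form field.
  document.getElementById('call').value = api;
  // Load the URL through a <script> tag so the response invokes the callback.
  var s = document.createElement('script');
  s.src = api;
  document.getElementsByTagName('head')[0].appendChild(s);
}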

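This is the callback function that is called when the API call is complete.

function onPageCategorization(data) {
  // Keep the request ID; it is needed later to send feedback about the labels.
  req_id = data['id'];
  var map2 = data['response'];
  for (var k in map2) {
    var categ = k;       // category name
    var score = map2[k]; // score for that category; higher is better
    // Display or store categ and score here.
  }
}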


Because this is an alpha-stage release, we have a few limitations:
  • Request Rate: 10 requests per minute, per IP address. This is to prevent people from spamming the site.
  • Errors: Occasionally, there will be errors in processing the HTML. You should receive a timeout response after 30 seconds. Please bear with us and try again; keep in mind this is alpha software.
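To stay within the request rate limit, a client can throttle itself before calling the API. Below is a minimal sliding-window sketch, assuming the limit is counted over any rolling 60-second window (how the server actually counts is not stated here). The name createRateLimiter is hypothetical, and the clock is passed in so the logic can be checked without waiting.

```javascript
// Sliding-window limiter: allows at most `limit` calls per `windowMs`.
function createRateLimiter(limit, windowMs) {
  var stamps = []; // times (ms) of recently accepted calls, oldest first
  return {
    // Returns 0 if a request may be sent at time `now` (and records it);
    // otherwise returns how many milliseconds to wait before trying again.
    tryAcquire: function (now) {
      while (stamps.length > 0 && now - stamps[0] >= windowMs) {
        stamps.shift(); // drop calls that have left the window
      }
      if (stamps.length < limit) {
        stamps.push(now);
        return 0;
      }
      return windowMs - (now - stamps[0]);
    }
  };
}

// Usage: 10 requests per minute, as in the limit above.
var limiter = createRateLimiter(10, 60 * 1000);
var waitMs = limiter.tryAcquire(Date.now());
if (waitMs === 0) {
  // Safe to send a request now.
} else {
  // Wait waitMs milliseconds, then call tryAcquire again.
}
```

The same wait-and-retry pattern can be reused after a timeout response, since a retry also counts against the limit.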