Product Blog

We are committed to being the world's leading machine learned solutions provider. Here are some articles that explain our company and product vision.

Real-Time Learning API

posted Mar 7, 2011, 10:48 AM by mls mls   [ updated Mar 11, 2011, 9:31 AM by Abraham Bagherjeiran ]

At ThinkersRUs, our goal is to be the world's leading provider of machine learned solutions. When working with our clients, we find that many of them have what seem like different needs that can nonetheless be addressed with a common solution. Take page categorization, for example. It is known by different names in different fields.
  • Ad networks call this inventory filtering, supply classification, etc.
  • Publishers call this topics or navigational aids--such as Amazon product pages, Yahoo! / Google directories.
  • Forums call this threads or sub-forums, comment filtering (spam / advertising).
  • Users call this 'organization' such as OtherInbox, spam filtering, etc.
  • Bloggers, Tweeters, and Flickr'ers call this tagging.
In all these cases, the input is a web page (HTML), and the output is a score that gives you an indication of how relevant the category is to the page.

If all these problems are so similar, why isn't there a single common solution? Although there are companies that specialize in many of these fields (TextWise, Proximic), there is no good solution that is easily deployable. Until now.

ThinkersRUs is releasing a simple API that will learn to solve any of the above page categorization tasks in real time based on feedback you provide. Here are some high-level use cases. See more detailed examples below. To try it out, see here.

Here are the main features:
  • Create categorizers as and when you need them.
  • Returns a score for any URL in real time.
  • Updates instantly based on the feedback you provide.
  • Call it from ANY web page via JavaScript (using JSONP).
  • No page crawling needed.
  • No data preparation of any kind needed.
  • No special software needed.
  • It's FREE! (for now)


What if I'm too lazy to train my own models? Great, try our pre-trained models for page categorization and inventory management.

Can I check out the RTL API? Try your luck at training your own model. When you are done, feel free to use it on any site you choose. Check it out...

Don't I need to prepare the data or have custom features? No, that's what is great about the API. Just point it to good and bad pages, and it will take care of the rest.

Isn't this the same thing as Google Predict? Google Predict is a batch trainer, meaning that you upload your data and it gives you a model. This is nice, but you will need to do a lot of data preparation to make it work. With the RTL API, you just point it at a page and it takes care of the rest. Oh, and Google Predict is not a real-time learning API: you have to retrain the model manually. With RTL, just send some labels every now and then.

Isn't this the same thing as WEKA / R / Matlab? No. All of these packages require significant time to prepare data and are not suited to a production SLA. And no, we are not using them on the production server.

Isn't this the same thing as Open Calais, TextWise, Peer39, etc.? No. Although these services do provide page categorization, they are geared toward a specific pre-defined set of categories. The RTL API allows you to define your own categories while leveraging a similar SaaS architecture. FYI, our pre-trained solutions are better than those of these other companies.

Not sure how to use this? Here are some examples.

Example 1: Self-Training Page Categorizers

In this simple example, in about 10 minutes you can train a page categorizer that can serve your traffic. To build the categorizer, you need to train it with pages. How can you get training data? It is not that hard with some clever instrumentation.
  • Insert "smart" tags on your page. For example, this page is categorized as "API", "Real-Time", "Ruby", and "Finance".
  • As users see these tags, make it easy for them to remove the bad ones (how you do this depends on how you put the tags on the page). For example, add links that call the train API. Users will see that "Ruby" and "Finance" are not relevant to this page but "API" and "Real-Time" are.
  • Obviously not many users will edit the tags, but that does not matter. It does not take many users to do a good job of being editors.
You can be your own editor in our demo.
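
The instrumentation above can be sketched with two small helpers. This is only an illustration: it assumes one categorizer (modelid) per tag, and the 0.5 display threshold, the function names, the model IDs, and the scores are all hypothetical rather than part of the API.

```javascript
// Pick the tags to display: those whose score clears the threshold, best first.
// The threshold value is an assumption; tune it for your own site.
function selectTags(scoresByTag, threshold) {
  return Object.keys(scoresByTag)
    .filter((tag) => scoresByTag[tag] >= threshold)
    .sort((a, b) => scoresByTag[b] - scoresByTag[a]);
}

// Describe the train call to make when a user confirms or removes a tag.
// Label 1 means "this page should score high", -1 means "should score low".
function feedbackFor(tag, pageUrl, modelIds, userKeptTag) {
  return {
    action: "train",
    modelid: modelIds[tag], // hypothetical: one model per tag
    _url: pageUrl,
    _label: userKeptTag ? 1 : -1,
  };
}

// Made-up scores for the tags mentioned above, plus one that is filtered out.
const scores = { API: 0.9, "Real-Time": 0.8, Ruby: 0.6, Finance: 0.55, Sports: 0.1 };
const shown = selectTags(scores, 0.5);
console.log(shown);

// A user clicks "remove" on the "Ruby" tag.
const ids = { Ruby: "hypothetical-model-id-ruby" };
console.log(feedbackFor("Ruby", "http://example.com/post", ids, false));
```

Each object returned by feedbackFor would then be turned into a call to the train action.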

Example 2: Behavioral Targeting

The largest web companies, such as Yahoo!, Microsoft, and Google, all do some form of user targeting--showing ads to users based on their history. Yahoo! has a better targeting system (I am somewhat biased--this was my previous job). Here is how you can do behavioral targeting with the real-time learning API:
  • Each time a user visits any of your pages, score the page using the API against several categorizers you have trained.
  • Add the number of page views in each category and store this information in the user's cookie.
  • Next time the user visits your page, you will have a better profile.
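
A minimal sketch of the profile bookkeeping described above. The cookie name "bt_profile", the 0.5 threshold for counting a page view toward a category, and the helper names are assumptions made for illustration; the API itself only supplies the per-category scores.

```javascript
// Add one page view to every category whose score clears the threshold.
// Returns a new profile object and leaves the old one untouched.
function updateProfile(profile, pageScores, threshold) {
  const next = Object.assign({}, profile);
  for (const category of Object.keys(pageScores)) {
    if (pageScores[category] >= threshold) {
      next[category] = (next[category] || 0) + 1;
    }
  }
  return next;
}

// Serialize the profile into a cookie string (the cookie name is hypothetical).
function profileToCookie(profile) {
  return "bt_profile=" + encodeURIComponent(JSON.stringify(profile)) + "; path=/";
}

// Read the profile back from a document.cookie-style string.
function cookieToProfile(cookieString) {
  const match = cookieString.match(/(?:^|;\s*)bt_profile=([^;]*)/);
  return match ? JSON.parse(decodeURIComponent(match[1])) : {};
}

// Two page views with made-up scores from two categorizers.
let profile = {};
profile = updateProfile(profile, { sports: 0.8, finance: 0.2 }, 0.5);
profile = updateProfile(profile, { sports: 0.7, finance: 0.9 }, 0.5);
console.log(profile);
console.log(profileToCookie(profile));
```

In a browser, the string from profileToCookie would be assigned to document.cookie, and cookieToProfile(document.cookie) would restore the counts on the next visit.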

Example 3: Comment Spam / Filtering

A perennial problem for blogs and forums is cleaning up spam and bad comments. Common solutions often require a monthly fee and integration into your site. Now it is easy. Here is how:
  • Make each of your posts accessible via a URL.
  • For each post, pass the URL through the score API and filter randomly proportional to the score: Math.random() < response["_score"]. The randomization is recommended but not required.
  • Add a link to the post that calls the train API to correct the labels.
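
The randomized filter from the steps above can be sketched as follows. It assumes the model was trained so that a HIGH score means "spam"; the function name and the injectable random source are illustrative choices, not part of the API.

```javascript
// Decide whether to hide a post, given the response from the score action.
// A post with score s is hidden with probability s.
// `random` is injectable so the decision can be tested; it defaults to Math.random.
function shouldFilter(response, random = Math.random) {
  return random() < response["_score"];
}

// With a score of 0.5, roughly half of the calls should filter the post.
let hidden = 0;
for (let i = 0; i < 1000; i++) {
  if (shouldFilter({ _score: 0.5 })) hidden++;
}
console.log("hidden " + hidden + " of 1000");
```

One plausible benefit of the randomization is that some borderline posts still get shown, so readers have a chance to correct the label through the train link.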

Real-Time Learning API

All API calls are done through a GET request or by posting a JSON hash to the same URL. There are three actions you can perform with the API:
  • newmodel:
    • Here, the _ts is added to bust caches that would return the same model all the time.
    • The response is the JSON object sent to your callback (onNewModel):
    • onNewModel({"response":{"modelid":"17d1afd1-c79a-49b8-a1e9-5716991f6227"},"request":{"action":"newmodel","rand":"12345"},"id":"5d73d507-ce4c-4dab-9f71-545694d19165"});

    • Use this "modelid" in future score and train calls. Don't lose the modelid; there is no way to recover it. ;)
  • score:
    • Where, "_url" is the URL of a page to score. We only support pages at this point.
    • The response is a JSON object sent to your callback (onCategorize):
    • {"response":{"modelid":"17d1afd1-c79a-49b8-a1e9-5716991f6227","_score":0.5},"request":{"modelid":"17d1afd1-c79a-49b8-a1e9-5716991f6227","action":"score","url":"","rand":"0.457891345783"},"id":"009430ce-5c4d-4cb7-b883-81ba655c434e"}
    • Where "_score" is the score of your model: the higher the score, the more likely the example is positive.
  • train:
    • Where "_url" is the URL of a page to train on, and "_label" is the desired label, one of {-1, 1}. We only support pages at this point. The label 1 means that the example should have a high score the next time it is scored. The label -1 means that the example should have a low score the next time it is scored.
    • The response is a JSON object sent to your callback (onCategorize):
    • {"response":{"modelid":"17d1afd1-c79a-49b8-a1e9-5716991f6227","_score":0.5,"_loss":0.693147180559945,"_support":0.0},"request":{"modelid":"17d1afd1-c79a-49b8-a1e9-5716991f6227","action":"train","label":"1","url":"","rand":"0.457891345783"},"id":"c5680ef6-eb1a-4b35-96da-0ca670ee19fa"}
    • Where "_score" is the OLD score from BEFORE training, "_loss" is a measure of the error (this should go down over time), and "_support" is the number of examples that have been trained. In this case "_support" is 0.0, since this is the first example to be trained.
    • Now if we re-score the same example, the score has increased to 0.66.
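
As a rough sketch, the three calls could be assembled as JSONP GET URLs like this. The base URL is a placeholder (the real endpoint is not given here), and the exact query format is an assumption pieced together from the parameter names mentioned above ("action", "modelid", "_url", "_label", "_ts", and a callback). The second helper checks one detail of the sample train response: the reported "_loss" of 0.693147180559945 equals -ln(0.5), which is what logistic loss would give for a score of 0.5 and a label of 1. Whether the service actually uses logistic loss is not stated, so treat that as an observation rather than a specification.

```javascript
// Hypothetical endpoint: replace with the real API URL.
const BASE_URL = "https://api.example.com/rtl";

// Build a GET URL for one of the three actions (newmodel, score, train).
// "_ts" is a timestamp added to bust caches, as described for newmodel.
function buildRequestUrl(action, params, callback, ts = Date.now()) {
  const query = new URLSearchParams(
    Object.assign({ action: action }, params, { callback: callback, _ts: String(ts) })
  );
  return BASE_URL + "?" + query.toString();
}

// Logistic (log) loss for a score in (0, 1) and a label in {-1, 1}.
// Assumption: this is only a guess at how "_loss" might be computed.
function logLoss(score, label) {
  const p = label === 1 ? score : 1 - score;
  return -Math.log(p);
}

const modelid = "17d1afd1-c79a-49b8-a1e9-5716991f6227"; // from the sample response
const page = "http://example.com/page"; // placeholder page URL

console.log(buildRequestUrl("newmodel", {}, "onNewModel"));
console.log(buildRequestUrl("score", { modelid: modelid, _url: page }, "onCategorize"));
console.log(buildRequestUrl("train", { modelid: modelid, _url: page, _label: "1" }, "onCategorize"));

// Compare with the "_loss" value in the sample train response.
console.log(logLoss(0.5, 1));
```

Each URL would be used as the src of a script tag, so that the response invokes the named callback.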

Notes / Limitations

  • API currently scores and trains only on HTML pages accessible by public URL. We are working on generic text, users, etc.
  • The API crawls pages in real-time, so page load time may limit the response time of the API.
  • To get a good model, expect to need about 100 examples, both positive and negative. Please follow the concept guidelines:
    • A model should be about one (1) cohesive concept.
    • The concept should be derivable from the text of the page, not from comments, JavaScript, AJAX calls, iframes, etc.
    • Make sure to present examples in random order. The learner updates after each example, so it is sensitive to the order of examples.
  • In the alpha release, only 500 models are available in total, so don't be surprised if they are all taken by the time you get there.
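
Because the order of examples matters, it can help to shuffle your labeled pages before sending them to the train action. Below is a small sketch using the Fisher-Yates shuffle; the example URLs and the shape of the example objects are placeholders, not something the API prescribes.

```javascript
// Return a shuffled copy of the examples (Fisher-Yates).
// `random` is injectable for testing and defaults to Math.random.
function shuffle(examples, random = Math.random) {
  const out = examples.slice();
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(random() * (i + 1)); // 0 <= j <= i
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

// Placeholder training set: positives and negatives listed in blocks.
const examples = [
  { url: "http://example.com/good-1", label: 1 },
  { url: "http://example.com/good-2", label: 1 },
  { url: "http://example.com/bad-1", label: -1 },
  { url: "http://example.com/bad-2", label: -1 },
];

// Train in shuffled order so positives and negatives are interleaved.
for (const ex of shuffle(examples)) {
  console.log("train", ex.url, ex.label);
}
```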

Self-Service Web Science API

posted Feb 16, 2011, 4:28 PM by mls mls   [ updated Mar 11, 2011, 2:48 PM by Abraham Bagherjeiran ]

A challenging problem with advertising for publishers is the fact that ad networks do not have the right information about a web page. Ad networks deal in inventory, which they view as homogeneous chunks of web pages. They typically aggregate this at the level of publishers (domains) and stop there. Most ad networks won't even talk to a publisher that does not fall into a clean group.

Here at ThinkersRUs, we think this should change. We want to become the world's leading provider of web science solutions. As part of this ambitious goal, we are building a product to help small publishers better monetize their pages and help advertisers reach better users. A key component in this product is the ability to quickly determine what type of content is on a page.

While we are still building the product, we would like to release the page categorization API, the first in our Science API. For now, this is freely available to anyone who wants to use it. All we ask is that you try it out and Contact Us with your comments.

  • Uses known standard topic taxonomies, which integrate with other products.
  • No software, setup, or training required. Just stick this on any site and get started.
  • Real-time crawling. You do not wait for hours (as with AdSense) to get started. The API crawls and processes the page in real time.

  • Standard Topical Taxonomies
    • Interactive Advertising Bureau (IAB) Contextual Topics
      • The top-tier topics as defined by the IAB.
    • Open Directory
      • An editorially maintained directory organized for easy navigation. You can expect these categories to be very deeply nested, which is great for adding features to your own analysis. These categories are good for directory-based navigation.
    • Delicious Tags:
      • Tags--words or phrases that are associated with pages by individual users through social bookmarking. You can expect these categories to be at a higher level of abstraction without any hierarchy. These are great for showing a tag cloud or just displaying on a page.
  • Inappropriate Content
    • Interactive Advertising Bureau (IAB) Non-standard Content
      • Flag a page as containing inappropriate content, meaning it may not be safe for mainstream customers.
Try out the demo
Check out the demo today. Enter a page and see the categories that appear.

If you do not want these taxonomies, try out the Real-Time Learning API.

Getting Started with the API

There are two ways to use the API: as an HTTP POST request or as an HTTP GET request.
  • Open Directory:
  • Delicious:
  • IAB Inappropriate Content:
  • IAB Contextual Topics:

POST Requests

Here is a sample curl request to run the categorizer, illustrating the POST method:

curl -v -v '' -d '{"url" : ""}'

The '-v -v' options display the request and response headers. The '-d' option provides the POST data in JSON format. Here is an excerpt of the response:

> POST /9 HTTP/1.1
< HTTP/1.1 200 ...
< X-Request-ID: 36581c47-7f27-4ccb-abd5-174174efeb71

The fields of the response are defined as follows:
  • request: This contains the request as provided to the categorizer. The contents of the request vary depending on the method requested.
    • url: Input arguments provided by the client.
  • id: Request ID generated by the server, needed for updating the results. It is also included in the HTTP response header.
  • response: Output returned by server. Content depends on the API that is used. In this case, we are using component 5 (Delicious).
    • green: category of the page, value is the score. For the score, higher is better.
    • market: ...
As we see, each request is assigned a unique request ID in the response header. This request id will be needed in order to provide feedback about the labels. A selection of 5 categories is returned here, which should always contain at least the top-scoring result, which is "stocks" in this case.
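
To make the response layout concrete, here is a small sketch that pulls the top-scoring categories out of a response body. The sample body is illustrative only: the category names and the request ID are taken from the text above, while the scores and the page URL are invented placeholders.

```javascript
// Illustrative response body; the scores are made up for this example.
const sample = {
  request: { url: "http://example.com/" },
  id: "36581c47-7f27-4ccb-abd5-174174efeb71",
  response: { green: 0.12, market: 0.31, stocks: 0.57 },
};

// Return the n highest-scoring categories, best first (higher is better).
function topCategories(body, n) {
  return Object.entries(body.response)
    .sort((a, b) => b[1] - a[1])
    .slice(0, n)
    .map(([category, score]) => ({ category: category, score: score }));
}

console.log(topCategories(sample, 2));
console.log("request id for feedback:", sample.id);
```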

GET Requests

Here is how the GET method works:

curl -v -v ''

Note that the 'callback' parameter is required. If you do not want to use a callback function, try the POST method described above.
The response is as follows:


This is a common trick to get cross-site JavaScript to work. Follow these steps in your page to use the API:
  • Use the YUI JSONP method, which does something like this:
  • Generate a <script> tag with the src attribute pointing to the GET URL.
  • Define the callback function indicated in your request.
  • The same callback function could be used repeatedly or you can define a new one for each request.

Here is a sample Ruby script that can generate the URL:

ruby -r cgi -r json -e 'j = {url: ""}; callback = "myFunc"; puts "#{CGI.escape(j.to_json)}&callback=#{callback}"'

Here is a sample JavaScript function that illustrates the use of the callback:

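function classify(url) {
  var taxonomy = 9;
  // JSON.stringify escapes quotes and other special characters in the URL.
  var params = encodeURIComponent(JSON.stringify({url: url}));
  var api = "" + taxonomy + "?arg=" + params + "&callback=onPageCategorization";
  document.getElementById('call').value = api;  // show the request URL on the page
  // Load the URL in a <script> tag; the response then calls onPageCategorization.
  var s = document.createElement('script');
  s.src = api;
  document.body.appendChild(s);
}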

This is the callback function that is called when the api call is complete.

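function onPageCategorization(data) {
  // Left global so that later feedback calls can reuse the request id.
  req_id = data['id'];
  var map2 = data['response'];
  for (var k in map2) {
    var categ = k;        // category name
    var score = map2[k];  // score for that category; higher is better
    console.log(categ + ': ' + score);
  }
}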


Because this is an Alpha-stage release, we have a few limitations:
  • Request Rate: 10 requests per minute, per IP address. This is to prevent people from spamming the site.
  • Errors: Occasionally, there will be errors in processing the HTML. You should receive a timeout response after 30 seconds. Please bear with us and try again; keep in mind this is alpha software.
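
To stay under the request limit, a client can throttle itself before calling the API. The sketch below is one possible sliding-window limiter; the function names and the idea of throttling on the client side are illustrative assumptions, and the server enforces its own limit per IP address regardless.

```javascript
// Allow at most `maxRequests` calls within any window of `windowMs` milliseconds.
// `now` is injectable for testing and defaults to Date.now.
function makeRateLimiter(maxRequests, windowMs, now = Date.now) {
  const timestamps = []; // times of the calls still inside the window
  return function tryAcquire() {
    const t = now();
    // Drop calls that have fallen out of the window.
    while (timestamps.length > 0 && t - timestamps[0] >= windowMs) {
      timestamps.shift();
    }
    if (timestamps.length >= maxRequests) return false;
    timestamps.push(t);
    return true;
  };
}

// 10 requests per minute, matching the alpha limit above.
const tryAcquire = makeRateLimiter(10, 60 * 1000);
if (tryAcquire()) {
  console.log("ok to call the API");
} else {
  console.log("over the limit; wait and retry");
}
```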

Query Categorization

posted Aug 30, 2010, 6:50 PM by mls mls   [ updated Apr 6, 2011, 11:26 AM by Abraham Bagherjeiran ]

I. What is the query categorization service?
Given a search query, the service returns a number of categories that are semantically related to that query. Suppose, for instance, that you pass the query "formula 1 cars" as input. Categories that are semantically relevant to this query could be automotive/sportscars and sports/motorsports/F1.
II. Where can query categorization help?
Knowing the semantic categories related to a query has multiple uses. For example, an advertising company can use query categorization to perform better targeting by showing users ads related to the categories of their queries. The ads will then be more relevant to the users' interests, and the users will be more inclined to read and click on them.
III. How to invoke the service?

Calling the REST api

The REST API for query categorization is still in development. In the meantime, please try the page categorization API.


Calling the JavaScript API:
<INPUT TYPE="text" id="in_tag" VALUE="">
mls_request('query_categorization_api', document.getElementById('in_tag').value, function (oReq, oJSN){parseReqData( oReq , oJSN );});
