
Real-Time Learning API

posted Mar 7, 2011, 10:48 AM by mls mls   [ updated Mar 11, 2011, 9:31 AM by Abraham Bagherjeiran ]
At ThinkersRUs, our goal is to be the world's leading provider of machine-learned solutions. When working with our clients, we find that many of them have needs that seem different but can be addressed with a common solution. Take page categorization, for example. This is known by different names in different fields.
  • Ad networks call this inventory filtering, supply classification, etc.
  • Publishers call this topics or navigational aids--such as Amazon product pages, Yahoo! / Google directories.
  • Forums call this threads or sub-forums, comment filtering (spam / advertising).
  • Users call this 'organization' such as OtherInbox, spam filtering, etc.
  • Bloggers, Tweeters, and Flickr'ers call this tagging.
In all these cases, the input is a web page (HTML), and the output is a score that gives you an indication of how relevant the category is to the page.

If all these problems are so similar, why isn't there a single common solution? Although there are companies that specialize in many of these fields (Peer39.com, TextWise, Proximic), there is no good solution that is easily deployable. Until now.

ThinkersRUs is releasing a simple API that will learn to solve any of the above page categorization tasks in real time, based on feedback you provide. Here are some high-level use cases. See more detailed examples below to try it out.

Here are the main features:
  • Create categorizers as and when you need them.
  • Returns a score for any URL in real time.
  • Updates instantly based on feedback you provide.
  • Call it from ANY web page via JavaScript (using JSONP).
  • No page crawling needed.
  • No data preparation of any kind needed.
  • No special software needed.
  • It's FREE! (for now)
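To show what "call it from any web page via JSONP" looks like in practice, here is a minimal browser-side sketch. The helper names (`buildApiUrl`, `callApi`) are hypothetical; the endpoint and parameter names are taken from the API reference later in this post.

```javascript
// A minimal sketch of calling the RTL API from a page via JSONP.
function buildApiUrl(params) {
  // Assemble a GET URL for the RTL API from a hash of parameters.
  var base = "http://api.thinkersr.us/7";
  var parts = [];
  for (var key in params) {
    if (params.hasOwnProperty(key)) {
      parts.push(key + "=" + encodeURIComponent(params[key]));
    }
  }
  return base + "?" + parts.join("&");
}

function callApi(params, callbackName) {
  // JSONP: inject a <script> tag; the server wraps its JSON response
  // in a call to the global function named by the callback parameter.
  params.callback = callbackName;
  params._rand = Math.random(); // cache buster
  var script = document.createElement("script");
  script.src = buildApiUrl(params);
  document.getElementsByTagName("head")[0].appendChild(script);
}
```

Because the response arrives as a script that invokes your named callback, no special software or cross-domain proxy is needed.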

FAQ

What if I'm too lazy to train my own models? Great, try our pre-trained models for page categorization and inventory management.

Can I check out the RTL API? Try your luck at training your own model. When you are done, feel free to use this on any site you choose. Check it out...

Don't I need to prepare the data or have custom features? No, that's what is great about the API. Just point it to good and bad pages, and it will take care of the rest.

Isn't this the same thing as Google Predict? Google Predict is a batch trainer: you upload your data and it gives you a model. This is nice, but you will need to do a lot of data preparation to make it work. With the RTL API, you just point it at a page and it takes care of the rest. Oh, and Google Predict is not a real-time learning API, so you have to retrain the model manually. With RTL, just send some labels every now and then.

Isn't this the same thing as WEKA / R / Matlab? No. All of these tools require significant time to prepare data and are not suitable for production SLAs. No, we are not using these for the production server.

Isn't this the same thing as Open Calais, Textwise, Peer39, etc.? No. Although these services do provide page categorization, they are geared toward a specific pre-defined set of categories. The RTL API lets you define your own categories and leverages a similar SaaS architecture. FYI, our pre-trained solutions are better than these other companies'.


Not sure how to use this? Here are some examples.

Example 1: Self-Training Page Categorizers

In this simple example, you can train a page categorizer that can serve your traffic in about 10 minutes. To build the categorizer, it needs to be trained with pages. How can you get training data? It is not that hard with some clever instrumentation.
  • Insert "smart" tags on your page. For example, this page is categorized as "API", "Real-Time", "Ruby", and "Finance"
  • As users see these tags, depending on how you put them on the page, make it easy to remove the bad tags. For example, add links that call the train API. Users will see that "Ruby" and "Finance" are not relevant to this page but "API" and "Real-Time" are.
  • Obviously not many users will edit the tags, but that does not matter. It does not take many users to do a good job of being editors.
You can be your own editor in our demo.
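The remove-a-bad-tag step above can be sketched as follows. The tag-to-model mapping and the helper name `trainUrl` are assumptions; the endpoint and parameters come from the API section below, and in practice the URL would be loaded via a script tag (JSONP) rather than navigated to.

```javascript
// Hypothetical mapping from a tag shown on the page to its model ID.
var tagModels = { "API": "model-id-api", "Ruby": "model-id-ruby" };

function trainUrl(modelid, pageUrl, label, rand) {
  // Build the train call for one tag; a label of -1 tells the model
  // this tag is NOT relevant to the page. rand busts caches.
  return "http://api.thinkersr.us/7" +
    "?_modelid=" + encodeURIComponent(modelid) +
    "&_action=train&_label=" + label +
    "&_url=" + encodeURIComponent(pageUrl) +
    "&callback=onTrain&_rand=" + rand;
}
```

Wiring each tag's remove link to `trainUrl(tagModels[tag], location.href, -1, Math.random())` turns every user correction into a training example.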

Example 2: Behavioral Targeting

The largest web companies such as Yahoo!, Microsoft, and Google all do some form of user targeting--showing ads to users based on their history. Yahoo! has a better targeting system (I am somewhat biased--this was my previous job).  Here is how you can do behavioral targeting with the real-time learning API:
  • Each time a user visits any of your pages, score the page using the API against several categorizers you have trained.
  • Add the number of page views in each category and store this information in the user's cookie.
  • Next time the user visits your page, you will have a better profile.
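The per-visit profile update in the steps above might look like this. The function name, profile shape, and the 0.5 threshold are assumptions; the scores come from calling the score API against your trained categorizers.

```javascript
// Sketch of the per-visit profile update.
function updateProfile(profile, scores, threshold) {
  // profile: {category: pageViewCount}; scores: {category: _score}.
  // Count a page view for every category whose score clears threshold.
  for (var cat in scores) {
    if (scores.hasOwnProperty(cat) && scores[cat] >= threshold) {
      profile[cat] = (profile[cat] || 0) + 1;
    }
  }
  return profile; // serialize this into the user's cookie
}
```

On the next visit, the accumulated counts give you a per-category view of the user's interests without any server-side storage.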

Example 3: Comment Spam / Filtering

A perennial problem for blogs and forums is cleaning up spam and bad comments. Common solutions often require a monthly fee and integration into your site. Now it is easy. Here is how:
  • Make each of your posts accessible via a URL.
  • For each post, pass the URL through the score API and filter randomly proportional to the score: Math.random() < response["_score"]. The randomization is recommended but not required.
  • Add a link to the post that calls the train API to correct the labels.
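The randomized filter in step two can be sketched as below. This assumes a higher `_score` means "more likely spam", and the comment-object shape is hypothetical; the random source is injected so the logic is testable.

```javascript
// Sketch of the randomized comment filter.
function visibleComments(comments, rand) {
  // comments: array of {html, score}; rand is Math.random in production.
  // Drop each comment with probability equal to its spam score, which
  // smooths out borderline scores instead of using a hard threshold.
  return comments.filter(function (c) {
    return !(rand() < c.score);
  });
}
```

The corrective train links from step three then push any mistakes back into the model, so the filter improves as your users flag errors.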

Real-Time Learning API

All API calls are made through a GET request or by posting a JSON hash to the same URL. There are three actions you can perform with the API:
  • newmodel: http://api.thinkersr.us/7?_action=newmodel&callback=onNewModel&_rand=12345
    • Here, _rand is added to bust caches that would otherwise return the same model all the time.
    • The response is the JSON object sent to your callback (onNewModel):
    • {"response":{"modelid":"17d1afd1-c79a-49b8-a1e9-5716991f6227"},"request":{"action":"newmodel","rand":"12345"},"id":"5d73d507-ce4c-4dab-9f71-545694d19165"}

    • You can use "modelid" in future calls. Don't lose the modelid--there is no way to recover it. ;)
  • score: http://api.thinkersr.us/7?_modelid=17d1afd1-c79a-49b8-a1e9-5716991f6227&_action=score&_url=http://finance.yahoo.com&callback=onCategorize&_rand=0.457891345783
    • Here, "_url" is the URL of a page to score. We only support pages at this point.
    • The response is a JSON object sent to your callback (onCategorize):
    • {"response":{"modelid":"17d1afd1-c79a-49b8-a1e9-5716991f6227","_score":0.5},"request":{"modelid":"17d1afd1-c79a-49b8-a1e9-5716991f6227","action":"score","url":"http://finance.yahoo.com","rand":"0.457891345783"},"id":"009430ce-5c4d-4cb7-b883-81ba655c434e"}
    • Here, "_score" is the score from your model; the higher the score, the more likely the example is positive.
  • train: http://api.thinkersr.us/7?_modelid=17d1afd1-c79a-49b8-a1e9-5716991f6227&_action=train&_label=1&_url=http://finance.yahoo.com&callback=onTrain&_rand=0.457891345783
    • Here, "_url" is the URL of a page to train on, and "_label" is the desired score: {-1, 1}. We only support pages at this point.  The label 1 means the example should have a high score the next time it is scored; the label -1 means it should have a low score the next time it is scored.
    • The response is a JSON object sent to your callback (onTrain):
    • {"response":{"modelid":"17d1afd1-c79a-49b8-a1e9-5716991f6227","_score":0.5,"_loss":0.693147180559945,"_support":0.0},"request":{"modelid":"17d1afd1-c79a-49b8-a1e9-5716991f6227","action":"train","label":"1","url":"http://finance.yahoo.com","rand":"0.457891345783"},"id":"c5680ef6-eb1a-4b35-96da-0ca670ee19fa"}
    • Here, "_score" is the OLD score from BEFORE training, "_loss" is a measure of the error (this should go down over time), and "_support" is the number of examples that have been trained so far. In this case, this is the first example to be trained.
    • Now if we re-score the same example, the score has increased to 0.66.
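The three actions chain naturally through their JSONP callbacks. Here is a browser-side sketch of the full newmodel → train → score flow; `apiUrl`, `jsonp`, and `start` are hypothetical helpers, while the endpoint, parameter names, and callback names come from the calls above.

```javascript
var API = "http://api.thinkersr.us/7";

function apiUrl(params) {
  // Build a GET URL; _rand is appended as a cache buster.
  var parts = [];
  for (var k in params) {
    if (params.hasOwnProperty(k)) {
      parts.push(k + "=" + encodeURIComponent(params[k]));
    }
  }
  return API + "?" + parts.join("&") + "&_rand=" + Math.random();
}

function jsonp(url) {
  // Browser-only: the server replies with a script calling the named callback.
  var s = document.createElement("script");
  s.src = url;
  document.getElementsByTagName("head")[0].appendChild(s);
}

var modelid = null;
var page = "http://finance.yahoo.com";

function onNewModel(data) {
  modelid = data.response.modelid; // save this -- it cannot be recovered
  jsonp(apiUrl({_modelid: modelid, _action: "train", _label: 1,
                _url: page, callback: "onTrain"}));
}

function onTrain(data) {
  jsonp(apiUrl({_modelid: modelid, _action: "score",
                _url: page, callback: "onCategorize"}));
}

function onCategorize(data) {
  // data.response._score -- after one positive label it should rise above 0.5.
}

function start() {
  jsonp(apiUrl({_action: "newmodel", callback: "onNewModel"}));
}
```

Calling `start()` kicks off the chain; each callback fires when the previous JSONP response arrives.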

Notes / Limitations

  • API currently scores and trains only on HTML pages accessible by public URL. We are working on generic text, users, etc.
  • The API crawls pages in real-time, so page load time may limit the response time of the API.
  • To get a good model, you should expect to need about 100 examples, both positive and negative. Please follow these concept guidelines:
    • A model should be about one (1) cohesive concept.
    • The concept should be derived from the visible text of the page, not from comments, JavaScript, AJAX calls, iframes, etc.
    • Make sure to present examples in random order. The learner updates after each example, so it is sensitive to the order of examples.
  • In the alpha release, only 500 models are available in total, so don't be surprised if they are all taken by the time you get there.
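Since the learner updates after each example, the random-order guideline above matters. A standard Fisher-Yates shuffle of your labeled examples before sending them to the train API (the shape of the examples array is up to you):

```javascript
// Shuffle labeled examples in place so the learner does not see all
// positives, then all negatives (Fisher-Yates).
function shuffle(examples) {
  for (var i = examples.length - 1; i > 0; i--) {
    var j = Math.floor(Math.random() * (i + 1));
    var tmp = examples[i];
    examples[i] = examples[j];
    examples[j] = tmp;
  }
  return examples;
}
```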