Prediction API
http://code.google.com/apis/predict/
http://code.google.com/p/google-prediction-api-samples/source/checkout
https://groups.google.com/group/prediction-api-discuss?pli=1
Blog Moderation Using the Google Prediction API
This app uses the Google Prediction API <http://code.google.com/apis/predict> to
perform comment moderation on a blog web application. It is able to
differentiate regular comments from "spam" blog comments. The application is
written in Python and runs on Google App Engine. Feel free to use this code as
a template in your own applications.
Step 1: Upload data to Google Storage
To perform this task, we first need a set of data that fits our criteria. Our
dataset needs both spam and non-spam ("ham") examples. In this example, we used
public domain physics textbooks to provide our ham examples. The motivation was
that this blog was reporting on current physics research for a mainstream
audience. In theory, training on physics books would give an adequate dataset.
Once your ham examples have been collected, the script parasplit.py may help
transform your examples into the comma-separated format required:
$ python parasplit.py INPUT_TEXT ham
This command splits INPUT_TEXT paragraph by paragraph, turning each paragraph
into a single text feature labeled "ham".
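parasplit.py itself is not reproduced here, but the transformation it performs can be sketched roughly as follows (the function names and the exact paragraph-splitting rule are illustrative assumptions, not the script's real interface):

```python
import csv
import io


def split_paragraphs(text, label):
    """Split raw text on blank lines and pair each paragraph with a label."""
    paragraphs = [p.strip().replace('\n', ' ')
                  for p in text.split('\n\n') if p.strip()]
    return [(label, p) for p in paragraphs]


def to_csv(rows):
    """Render (label, text) rows as CSV, quoting every field."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_ALL)
    writer.writerows(rows)
    return buf.getvalue()


sample = 'Light bends in a gravitational field.\n\nEnergy is conserved.'
print(to_csv(split_paragraphs(sample, 'ham')))
```

Each output row puts the label first, followed by the quoted paragraph text, which is the shape the Prediction API expects for training data.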
Next, we would like some sample spam blog comments. There is a toy corpus of
spam blog comments hosted on ILPS:
http://ilps.science.uva.nl/resources/commentspam
We have provided a script spamparse.py:
$ python spamparse.py
When run in the root directory of the corpus (with blog-spam-assessments.txt
in the same directory), it will generate CSV-formatted spam comments.
You are, of course, encouraged to use your own datasets for your applications.
All data instances (ham, spam, any other classified data) must be aggregated
into one file. This file must be uploaded to Google Storage, either with
gsutil (for more details about gsutil, see
<http://code.google.com/apis/storage/docs/getting-started.html#getstart>) or
via the Google Storage Manager at <https://sandbox.google.com/storage/>.
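For example, assuming gsutil is installed and configured, and using FILE.csv and ENTER_YOUR_BUCKET as stand-ins for your aggregated data file and bucket name, the upload is a single command:

```
$ gsutil cp FILE.csv gs://ENTER_YOUR_BUCKET/FILE.csv
```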
Step 2: Train
You can make a training call either directly from a command line
(e.g., using cURL) or from your favorite programming language (see
<http://code.google.com/apis/predict/docs/libraries.html> for
third-party libraries). Here we make a training call using train.sh
from the code samples at
<http://code.google.com/apis/predict/docs/samples.html>:
$ get-auth-token.sh ENTER_MY_EMAIL ENTER_MY_PASSWORD
$ train.sh ENTER_YOUR_BUCKET/FILE.csv
Replace ENTER_MY_EMAIL, ENTER_MY_PASSWORD, and ENTER_YOUR_BUCKET with your own values.
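If you prefer not to use the helper scripts, the equivalent v1.1 training request can be issued directly with cURL; this mirrors the Train function in the Python module below (YOUR_AUTH_TOKEN stands for the token written to the auth-token file, and the bucket/file path is percent-encoded, so '/' becomes %2F):

```
$ curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: GoogleLogin auth=YOUR_AUTH_TOKEN" \
    -d '{"data":{}}' \
    "https://www.googleapis.com/prediction/v1.1/training?data=ENTER_YOUR_BUCKET%2FFILE.csv"
```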
You also need to copy your authentication token into the App Engine
application: copy the auth-token file to the src/ directory.
Step 3: Predict
Before you run the sample Python web app on the Google App Engine, you might
want to look at the App Engine documentation at
<http://code.google.com/appengine/docs/python/overview.html>.
Changes you will need to make to run this application on your own data:
- In app.yaml, you need to change APPLICATION_NAME to the name of your
application on App Engine
- You must copy your auth token to the file src/auth-token.
(generate using get-auth-token.sh)
- In blog.py, you need to change the MODEL_NAME (line 91) to your trained model
  (i.e., BUCKET/FILE.csv)
You can run your application locally to test it; see the instructions at:
<http://code.google.com/appengine/docs/python/gettingstarted/helloworld.html>
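Assuming the App Engine Python SDK is on your PATH and app.yaml lives in the src/ directory, the local development server can typically be started with:

```
$ dev_appserver.py src/
```

The app is then served at http://localhost:8080/ by default.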
Please send questions or comments about this demo app to the
prediction-api-discuss group at
<https://groups.google.com/group/prediction-api-discuss>.
#!/usr/bin/env python
#
# Copyright 2010 Brian McKenna, Google Inc. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# This code is based on a Python library written by Brian McKenna which
# can be found at http://gist.github.com/407134.
"""This module provides an interface to the Google Prediction API.
"""
__author__ = 'Brian McKenna, Robert Kaplow'
import cgi
from getpass import getpass
import urllib
import urllib2
try:
import json
except ImportError:
from django.utils import simplejson as json
def GetAuthentication(email, password):
"""Retrieves a Google authentication token.
"""
url = 'https://www.google.com/accounts/ClientLogin'
post_data = urllib.urlencode([
('Email', email),
('Passwd', password),
('accountType', 'HOSTED_OR_GOOGLE'),
('source', 'companyName-applicationName-versionID'),
('service', 'xapi'),
])
request = urllib2.Request(url, post_data)
response = urllib2.urlopen(request)
  # The ClientLogin response is newline-separated key=value pairs; join them
  # with '&' so the result can be parsed as a query string.
  content = '&'.join(response.read().split())
query = cgi.parse_qs(content)
auth = query['Auth'][0]
response.close()
return auth
def Train(auth, datafile):
"""Tells the Google Prediction API to train the supplied data file.
"""
url = ('https://www.googleapis.com/prediction/v1.1/training?data='
'%s' % urllib.quote(datafile, ''))
headers = {
'Content-Type': 'application/json',
'Authorization': 'GoogleLogin auth=%s' % auth,
}
post_data = json.dumps({
'data': {},
})
request = urllib2.Request(url, post_data, headers)
response = urllib2.urlopen(request)
response.close()
def Predict(auth, model, query):
"""
Makes a prediction based on the supplied model and query data. The query needs
to be a list, where the elements in the list are the features in the dataset.
The return is a tuple [prediction, scores], where:
In a classification task, prediction is the most likely label, and scores is
a dictionary mapping labels to scores.
In a regression task, prediction is the real-valued prediction for the input
data, and scores is an empty list.
"""
url = ('https://www.googleapis.com/prediction/v1.1/training/'
'%s/predict' % urllib.quote(model, ''))
headers = {
'Content-Type': 'application/json',
'Authorization': 'GoogleLogin auth=%s' % auth,
}
post_data = GetPostData(query)
request = urllib2.Request(url, post_data, headers)
response = urllib2.urlopen(request)
content = response.read()
response.close()
json_content = json.loads(content)['data']
scores = []
# classification task
if 'outputLabel' in json_content:
prediction = json_content['outputLabel']
jsonscores = json_content['outputMulti']
scores = ExtractDictScores(jsonscores)
# regression task
else:
prediction = json_content['outputValue']
return [prediction, scores]
def ExtractDictScores(jsonscores):
  """Converts [{'label': ..., 'score': ...}, ...] into a {label: score} dict."""
  scores = {}
  for pair in jsonscores:
    label = None
    score = None
    for key, value in pair.iteritems():
      if key == 'label':
        label = value
      elif key == 'score':
        score = value
    # Record the score for this label once both fields have been read.
    scores[label] = score
  return scores
def GetPostData(query):
  """Wraps the query features in the JSON request body expected by the API."""
data_input = {}
data_input['mixture'] = query
post_data = json.dumps({
'data': {
'input': data_input
}
})
return post_data
def main():
"""Asks for the user's Google credentials, Prediction API model and queries.
"""
google_email = raw_input('Email: ')
google_password = getpass('Password: ')
auth = GetAuthentication(google_email, google_password)
model = raw_input('Model: ')
query = []
message = 'Enter feature for classification. Type quit when done: '
while True:
feature = raw_input(message)
if feature == 'quit':
break
try:
float(feature)
query.append(float(feature))
except ValueError:
query.append(feature)
print query
print Predict(auth, model, query)
if __name__ == '__main__':
main()