Prediction API

Documentation: http://code.google.com/apis/predict/

Sample code: http://code.google.com/p/google-prediction-api-samples/source/checkout

Discussion: https://groups.google.com/group/prediction-api-discuss?pli=1

Blog Moderation Using the Google Prediction API

This app uses the Google Prediction API <http://code.google.com/apis/predict> to

perform comment moderation on a blog web application: it distinguishes

legitimate comments from "spam" comments. The application is

written in Python and runs on Google App Engine. Feel free to use this code as

a template in your own applications.

Step 1: Upload data to Google Storage

To train a model, we first need a dataset that fits our use case: it must

contain both spam and non-spam ("ham") examples. In this example, we used

public domain physics textbooks to provide the ham examples. The motivation was

that the blog reports on current physics research for a mainstream audience, so

text drawn from physics books should approximate legitimate comments reasonably well.

Once your ham examples have been collected, the script parasplit.py may help

transform your examples into the comma separated format required:

$ python parasplit.py INPUT_TEXT ham

This command splits INPUT_TEXT paragraph by paragraph and turns each paragraph

into a single text feature labeled "ham".
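
For instance, the resulting file should contain rows along these lines, with
the label in the first column (the text here is illustrative; the exact
quoting produced by parasplit.py may differ):

"ham","An object at rest stays at rest unless acted on by an external force."
"ham","The net force on a body equals its mass times its acceleration."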

Next, we would like some sample spam blog comments. There is a toy corpus of

spam blog comments hosted on ILPS:

http://ilps.science.uva.nl/resources/commentspam

We have provided a script spamparse.py:

$ python spamparse.py

When run in the root directory of the corpus (with blog-spam-assessments.txt

in the same directory), it generates the spam comments in the same CSV format.

You are, of course, encouraged to use your own datasets for your applications.

All data instances (ham, spam, any other classified data) must be aggregated

into one file. This file must be uploaded to Google Storage, either with gsutil

(for more details about gsutil, see

<http://code.google.com/apis/storage/docs/getting-started.html#getstart>) or

via the Google Storage Manager at <https://sandbox.google.com/storage/>.
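
For example, with gsutil installed and configured, the upload is a single copy
command (bucket and file names are placeholders):

$ gsutil cp FILE.csv gs://ENTER_YOUR_BUCKET/FILE.csv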

Step 2: Train

You can make a training call either directly from the command line

(e.g., using cURL) or from your favorite programming language (see

<http://code.google.com/apis/predict/docs/libraries.html> for

third-party libraries).  Here we make a training call using train.sh

from the code samples at

<http://code.google.com/apis/predict/docs/samples.html>:

$ get-auth-token.sh ENTER_MY_EMAIL ENTER_MY_PASSWORD

$ train.sh ENTER_YOUR_BUCKET/FILE.csv

Replace ENTER_MY_EMAIL, ENTER_MY_PASSWORD, and ENTER_YOUR_BUCKET with your own values.
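
For reference, the raw request that train.sh issues looks roughly like the
following (it mirrors the Train() function in the module below; YOUR_AUTH_TOKEN
is the token written by get-auth-token.sh, and %2F is the URL-encoded "/"
between bucket and file):

$ curl -X POST \
    -H "Content-Type: application/json" \
    -H "Authorization: GoogleLogin auth=YOUR_AUTH_TOKEN" \
    -d '{"data": {}}' \
    "https://www.googleapis.com/prediction/v1.1/training?data=ENTER_YOUR_BUCKET%2FFILE.csv"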

You also need to make your authentication token available to the App Engine

application: copy the auth-token file generated by get-auth-token.sh into the src/ directory.

Step 3: Predict

Before you run the sample Python web app on the Google App Engine, you might

want to look at the App Engine documentation at

<http://code.google.com/appengine/docs/python/overview.html>.

Changes you will need to make to run this application on your own data:

- In app.yaml, you need to change APPLICATION_NAME to the name of your

  application on App Engine

- You must copy your auth token to the file src/auth-token.

  (generate using get-auth-token.sh)

- In blog.py, you need to change MODEL_NAME (line 91) to your trained model

  (i.e., BUCKET/FILE.csv), as sketched below
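
Once those changes are in place, the moderation check itself is small. Here is
a minimal sketch, assuming the Predict() function from the module below is
importable; MODEL_NAME, the token path, and the comment text are placeholders:

  MODEL_NAME = 'ENTER_YOUR_BUCKET/FILE.csv'  # same value used for training
  auth = open('auth-token').read().strip()   # written by get-auth-token.sh
  comment = 'Buy cheap watches now!!!'       # hypothetical incoming comment
  label, scores = Predict(auth, MODEL_NAME, [comment])
  if label == 'spam':
    pass  # hide the comment or queue it for review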

You can run the application locally for testing; see the instructions at:

<http://code.google.com/appengine/docs/python/gettingstarted/helloworld.html>

Please send questions or comments about this demo app to

the prediction-api-discuss group at

<https://groups.google.com/group/prediction-api-discuss>.

#!/usr/bin/env python

#

# Copyright 2010 Brian McKenna, Google Inc. All Rights Reserved.

#

# Licensed under the Apache License, Version 2.0 (the "License");

# you may not use this file except in compliance with the License.

# You may obtain a copy of the License at

#

#     http://www.apache.org/licenses/LICENSE-2.0

#

# Unless required by applicable law or agreed to in writing, software

# distributed under the License is distributed on an "AS IS" BASIS,

# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.

# See the License for the specific language governing permissions and

# limitations under the License.

#

# This code is based on a Python library written by Brian McKenna which

# can be found at http://gist.github.com/407134.

"""This module provides an interface to the Google Prediction API.

"""

__author__ = 'Brian McKenna, Robert Kaplow'

import cgi

from getpass import getpass

import urllib

import urllib2

try:

  import json

except ImportError:

  # Python 2.5 (e.g., the App Engine runtime) ships without the json module.
  from django.utils import simplejson as json

def GetAuthentication(email, password):

  """Retrieves a Google authentication token.

  """

  url = 'https://www.google.com/accounts/ClientLogin'

  post_data = urllib.urlencode([

      ('Email', email),

      ('Passwd', password),

      ('accountType', 'HOSTED_OR_GOOGLE'),

      ('source', 'companyName-applicationName-versionID'),

      ('service', 'xapi'),

  ])

  request = urllib2.Request(url, post_data)

  response = urllib2.urlopen(request)

  # ClientLogin returns newline-separated key=value pairs (SID, LSID, Auth);
  # rejoin them with '&' so the response parses as a query string.
  content = '&'.join(response.read().split())

  query = cgi.parse_qs(content)

  auth = query['Auth'][0]

  response.close()

  return auth

def Train(auth, datafile):

  """Tells the Google Prediction API to train the supplied data file.

  """

  url = ('https://www.googleapis.com/prediction/v1.1/training?data='

         '%s' % urllib.quote(datafile, ''))

  headers = {

      'Content-Type': 'application/json',

      'Authorization': 'GoogleLogin auth=%s' % auth,

  }

  post_data = json.dumps({

      'data': {},

  })

  request = urllib2.Request(url, post_data, headers)

  response = urllib2.urlopen(request)

  response.close()

def Predict(auth, model, query):

  """

  Makes a prediction based on the supplied model and query data. The query needs

  to be a list, where the elements in the list are the features in the dataset.

  The return is a tuple [prediction, scores], where:

  In a classification task, prediction is the most likely label, and scores is

  a dictionary mapping labels to scores.

  In a regression task, prediction is the real-valued prediction for the input

  data, and scores is an empty list.

  """

  url = ('https://www.googleapis.com/prediction/v1.1/training/'

         '%s/predict' % urllib.quote(model, ''))

  headers = {

      'Content-Type': 'application/json',

      'Authorization': 'GoogleLogin auth=%s' % auth,

  }

  post_data = GetPostData(query)

  request = urllib2.Request(url, post_data, headers)

  response = urllib2.urlopen(request)

  content = response.read()

  response.close()

  json_content = json.loads(content)['data']

  scores = []

  # classification task

  if 'outputLabel' in json_content:

    prediction = json_content['outputLabel']

    jsonscores = json_content['outputMulti']

    scores = ExtractDictScores(jsonscores)

  # regression task

  else:

    prediction = json_content['outputValue']

  return [prediction, scores]

def ExtractDictScores(jsonscores):

  """Converts the API's list of {'label': ..., 'score': ...} pairs to a dict."""

  scores = {}

  for pair in jsonscores:

    scores[pair['label']] = pair['score']

  return scores

def GetPostData(query):

  """Builds the JSON request body for a prediction call."""

  data_input = {'mixture': query}

  post_data = json.dumps({
      'data': {
          'input': data_input,
      },
  })

  return post_data
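
# For example, GetPostData(['great article, thanks']) returns the body
#   '{"data": {"input": {"mixture": ["great article, thanks"]}}}'
# which is the request format the v1.1 predict endpoint expects.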

def main():

  """Asks for the user's Google credentials, Prediction API model and queries.

  """

  google_email = raw_input('Email: ')

  google_password = getpass('Password: ')

  auth = GetAuthentication(google_email, google_password)

  model = raw_input('Model: ')

  query = []

  message = 'Enter feature for classification. Type quit when done: '

  while True:

    feature = raw_input(message)

    if feature == 'quit':

      break

    # Numeric features are sent as numbers, everything else as text.
    try:

      query.append(float(feature))

    except ValueError:

      query.append(feature)

  print query

  print Predict(auth, model, query)

if __name__ == '__main__':

  main()