Collecting, Analyzing And Interacting With Data

Byte 1 v1

Description: Your final product will be a web-enabled application that can query and display an RSS feed (its address should be something like [yourname]-byte1.appspot.com).
Source Code: See https://github.com/jmankoff/data, in Assignments/jmankoff-byte1
For some assignments we will provide complete or partial source code that you can look at. It is recommended that you try to construct your own source code using the tutorial and only refer to the provided code as needed. This is especially important since we build up the source code iteratively in the tutorial, gradually replacing portions of it, and the provided source code only shows a single view (the final version). In addition, it will rarely be the case that you can use that source code entirely unmodified to complete an assignment.

Overview

In this project, you will create a small application that displays data from an RSS feed. The RSS feed will be provided by Yahoo Pipes. The work you do in this project byte is something you will build on throughout this class. This assignment has the following learning goals:

Setting up your environment
A first experience with Python
Learning how to acquire data from an external source
A first experience with the RSS format
A first experience with programmable HTML
A first experience with forms
A first experience setting up a question and deciding what data helps to answer the question.

Detailed instructions for Byte 1

This project requires you to use Python and some additional libraries that are available for Python. To learn more about Python, you may want to explore www.pythontutor.com. The textbook Introduction to Computing and Programming in Python is an excellent introductory book aimed at non-programmers.

Setting Up Pipes

The first part of this assignment involves constructing an RSS feed that provides you with some sort of interesting data. You can in principle write code to do this. However, we recommend that you use Yahoo Pipes instead. You are welcome to make your pipes as complex or simple as you please; however, this part of the assignment will be judged on creativity, so please do not just use a pipe someone else created. Once you have a pipe you are happy with, make note of the URL for that pipe. For example, this public News Collector in Yahoo will return a list of news items based on a search term. When I load it in my browser I see several options for accessing it:

Later in this assignment, we will use the "Get as RSS" link to access this feed programmatically.

Setting Up Python using Google App Engine

Google App Engine is a development environment that will let you place your code on the web with relative ease. An excellent "Getting Started" tutorial will walk you through the initial creation of a simple application that displays plain text on the web. Before doing that tutorial please read the notes below.

1) When you create a Google web application, you will need a unique identifier for it that no one else on the web has used. A good idea for the assignments in this class is to prefix them with a unique id you choose. You can use your username, but then students grading you may know who you are; an anonymous id is fine too. An example is shown below.

2) The tutorial jumps from application creation to editing files. The files you need to edit will be in a new directory that the Google App Engine Launcher creates in the application directory you specified, and the path is shown in the main launcher window.

I edited the file 'main.py' to show the name of this project.

When I finished the tutorial (including creating the application and uploading it for deployment at appspot.com) I had the following result:

Using Jinja to display HTML

To use Jinja, you first need to inform Google that you will be using it by modifying "app.yaml" to include Jinja in its libraries. A key reason for doing this is that, for libraries provided by Google, you can control the version number that is used with your code and thus ensure that the code you write will not suddenly stop working due to a change that arrives with a new version of the library. However, if you put 'latest' as we did below, then you always get the newest version and give up that protection.

libraries:
- name: webapp2
  version: "2.5.2"
- name: jinja2
  version: latest

Next, create a directory inside the [yourname]-byte1 folder named 'templates' and put a file named 'index.html' inside. 'index.html' should contain the following html.

<!DOCTYPE html>
<html>
<head>
  <title>Byte 1 Tutorial</title>
</head>
<body>
  <h1>Data Pipeline Project Byte 1 Example</h1>
  <h2>Feed Contents</h2>
</body>
</html>

Finally, modify main.py by adding the following just below your other imports. webapp2 provides this library in a special place called 'webapp2_extras' (documentation), which is why we import jinja2 with from webapp2_extras import jinja2. The code after the import creates a new class that uses Jinja templates to render information. The key method that we will use later is render_response.

# this is for displaying HTML
from webapp2_extras import jinja2

# BaseHandler subclasses RequestHandler so that we can use jinja
class BaseHandler(webapp2.RequestHandler):

    @webapp2.cached_property
    def jinja2(self):
        # Returns a Jinja2 renderer cached in the app registry.
        return jinja2.get_jinja2(app=self.app)

    # This will call self.response.write using the specified template and context.
    # The first argument should be a string naming the template file to be used.
    # The remaining keyword arguments are the context variables
    # that can be used for substitutions within the template.
    def render_response(self, _template, **context):
        # Renders a template and writes the result to the response.
        rv = self.jinja2.render_template(_template, **context)
        self.response.write(rv)
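
The **context in render_response is ordinary Python keyword-argument unpacking, not something specific to Jinja. The sketch below illustrates just that mechanism with a stand-in function; this render_response is a hypothetical stub that involves neither webapp2 nor Jinja and only reports which variable names it received.

```python
# Hypothetical stand-in for BaseHandler.render_response (no webapp2 or Jinja here).
def render_response(_template, **context):
    # Inside the function, all keyword arguments arrive as a single dict.
    names = sorted(context.keys())
    return "%s gets variables: %s" % (_template, ", ".join(names))


context = {"feed": [], "search": "dog"}

# These two calls are equivalent: ** unpacks the dict into keyword arguments.
unpacked = render_response("index.html", **context)
explicit = render_response("index.html", feed=[], search="dog")
```

Each key in the dictionary becomes a variable name that the template can refer to.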

You must now modify the MainHandler to subclass our new BaseHandler class. This helps to support a separation of concerns between generic rendering functions and those specific to our application.

# Class MainHandler now subclasses BaseHandler instead of webapp2.RequestHandler
class MainHandler(BaseHandler):

    # This method should return the html to be displayed
    def get(self):
        # this will eventually contain information about the RSS feed
        context = {}
        # here we call render_response instead of self.response.write.
        self.render_response('index.html', **context)

The result, when you load it, should look like this:

Collecting information from Yahoo Pipes in your application

Now that you can show static HTML on the web using Jinja, it is time to show dynamic information from your Yahoo Pipe. Recall that we can get the URL as RSS by copying the target of the link shown below:

For the news collector feed we have been exploring (with the search item "dogs") this link is: http://pipes.yahoo.com/pipes/pipe.run?_id=1nWYbWm82xGjQylL00qv4w&_render=rss&textinput1=dogs

You will need to modify main.py to take data from this feed and display it on the page. There are several ways to do this, ranging from downloading and parsing the raw feed XML yourself to using a third-party library that specializes in feeds. This homework will walk you through the latter solution. We will use the feedparser library. Feedparser documentation is available at http://pythonhosted.org/feedparser.
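
For comparison, the do-it-yourself end of that range might look roughly like the sketch below. It is not what this tutorial uses: it parses a small made-up RSS string (SAMPLE_RSS, with example.com links invented for illustration) using the standard library's xml.etree.ElementTree, and the helper name parse_items is hypothetical. It only shows the kind of fields (title, link, description) that an RSS item carries.

```python
import xml.etree.ElementTree as ET

# Made-up sample data for illustration; a real feed would be downloaded first.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>Dogs in the news</title>
      <link>http://example.com/dogs</link>
      <description>A story about dogs.</description>
    </item>
    <item>
      <title>More dogs</title>
      <link>http://example.com/more-dogs</link>
      <description>Another story.</description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    # Walk every <item> element and pull out the three fields we care about.
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title"),
            "link": item.findtext("link"),
            "description": item.findtext("description"),
        })
    return items


items = parse_items(SAMPLE_RSS)
```

A feed library saves you from handling the many variations real feeds have (different formats, missing fields, dates), which is why the tutorial takes that route.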

Because Feedparser is not part of the Python standard library, we will need to make sure Google has access to it. This requires downloading it and copying it (specifically, the file feedparser.py) into the same directory as main.py. Once this is done, you should be able to add import feedparser to the top of main.py. From the tutorial you completed earlier, you should know that you can quickly and easily test your scripts as you go using a local web page. The Google App Engine Launcher gives you the information you need to do this: edit main.py, save it, and check that your application is running (there should be a small green arrow to the left of it in the application launcher):

You can ignore "Admin Port" for now. "Port" (which you would have designated at setup time) is the port on which your local application is running. The default is 8080, in which case you can view the results of your code at http://localhost:8080/ [Note: my port in the image above is 8082 because I changed the default, and my corresponding URL would be localhost:8082]. The "Logs" window is also extremely helpful. You can output debugging text there by using the Python command logging.info().
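
As a minimal sketch of that kind of debugging output, the snippet below uses Python's standard logging module (note the lowercase module name). The summarize helper and the sample item are hypothetical, and the basicConfig call is only there so the message is visible when the snippet is run on its own, outside the launcher.

```python
import logging

# Needed only when running standalone, so that INFO messages are shown.
logging.basicConfig(level=logging.INFO)


def summarize(item):
    # Build a one-line description of an item and send it to the log.
    message = "title=%s link=%s" % (item["title"], item["link"])
    logging.info(message)
    return message


summary = summarize({"title": "Dogs", "link": "http://example.com/dogs"})
```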

Thus, a very good debugging and editing cycle is [Edit main.py] [reload local web page] [check results and log to make sure your code is doing what you think it is] [rinse and repeat]

When you are debugging, you may sometimes find it useful to deploy your application, and then run the deployed version ([yourappname].appspot.com) in Hack ST (formerly the API Kitchen).

Now that we have a way of testing the code that you write, let's talk about how to parse a feed. The basic approach is as follows:

import feedparser
import logging

feed = feedparser.parse("http://pipes.yahoo.com/pipes/pipe.run?_id=dd9c60718c2a0168ebeeef663e6c1b8f&_render=rss")

for item in feed["items"]:
    logging.info(item.published_parsed)
    logging.info(item.link)
    logging.info(item.title)
    logging.info(item.description)

You will of course need to put everything but the imports inside get(self) in main.py for this to work. Also, remember that Python cares a lot about indentation (this is part of its syntax and can be a source of errors). Once you have added this code, you should be able to see the contents of the feed in your logging window.

Displaying the Feed Contents in Jinja

Your final step will be to collect the information that is currently being sent to the log and pass it to Jinja to display. One of the most powerful aspects of Jinja is its ability to display dynamic information provided by Python. We can pass one or more variables to Jinja by placing them in the context variable. We will also update 'index.html' to handle that information.

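First, collect the information. We will take advantage of a Python shorthand for building a list with a loop (a list comprehension). Each item becomes a dictionary with named keys so that the template can refer to item.link, item.title and item.description:

feed = [{"link": item.link, "title": item.title, "description": item.description} for item in feed["items"]]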

Then pass the list to Jinja by placing it in context:

context = {"feed" : feed}
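
If the one-line loop is unfamiliar, the sketch below shows it next to the equivalent explicit loop. The Item class and the two sample entries are hypothetical stand-ins for the entries that feedparser returns; only the attribute names (link, title, description) are taken from the tutorial.

```python
class Item(object):
    # Hypothetical stand-in for a feed entry with the three attributes we use.
    def __init__(self, link, title, description):
        self.link = link
        self.title = title
        self.description = description


entries = [
    Item("http://example.com/1", "First story", "About dogs."),
    Item("http://example.com/2", "Second story", "About cats."),
]

# Explicit loop: start with an empty list and append one dict per entry.
feed_loop = []
for item in entries:
    feed_loop.append({"link": item.link,
                      "title": item.title,
                      "description": item.description})

# List comprehension: the same result in a single expression.
feed = [{"link": item.link, "title": item.title, "description": item.description}
        for item in entries]

context = {"feed": feed}
```

Using dictionaries with named keys is what lets the template write item.link or item.title for each element.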

Next, update the 'index.html' file to show the information:

<h2>Feed Contents</h2>
{% for item in feed %}
<a href="{{ item.link }}">{{ item.title }}</a><br>
{{ item.description|safe }}
<br>
{% endfor %}

Note the use of {% ... %}. This indicates some logic that should be executed (in this case a for loop). The contents of {{ ... }} are replaced with their value. The use of the term |safe after item.description tells Jinja that item.description may have HTML in it and it should pass that HTML through without escaping it. This introduces a possible security hole and should be used with caution. The resulting output looks like this:
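
To see what "escaping" means here, the sketch below uses the standard library's escape function as a stand-in for what a template engine does to a value that is not marked safe. This is an illustration of the idea only, not Jinja's own code, and the sample description is made up.

```python
try:
    from html import escape  # Python 3
except ImportError:
    from cgi import escape  # Python 2

# A made-up description containing HTML markup.
description = "<b>Dogs</b> and cats"

# Escaped: the markup is shown to the reader as literal text.
escaped = escape(description)

# Unescaped (the effect of |safe): the browser interprets the markup.
unescaped = description
```

Here escaped is the string &lt;b&gt;Dogs&lt;/b&gt; and cats, so a browser would display the tags as text instead of rendering bold type. Passing HTML through untouched is convenient for feed descriptions, but it also means any script embedded in the feed would reach the page.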

Letting the user ask their own question of your feed

Now that we have templates working, our next step is something slightly more complex: Asking the user to enter a search term in a form. We will start with a very simple form that you can add to your 'index.html' file:

<form action="search" method="POST">
  Search Term: <input name="search_term" value="dog"><br>
  <input type="submit" value="Enter Search Term">
</form>

We will also need a way to display the search results. This involves placing some additional logic inside the 'index.html' file to display the search terms (aids debugging) and results. We use the if / endif statements for error checking: If the term isn't present, the page will still render.

{% if search %}
<p>Searching for {{ search }}</p>
{% endif %}

Now we need to collect the form data. This involves adding a handler for post to 'main.py' as follows:

    def post(self):
        logging.info("post")
        terms = self.request.get('search_term')
        context = {"search": terms}
        self.render_response('index.html', **context)

Note that the input name specified in 'index.html' and the string used in self.request.get need to match for Jinja to show anything. In the code above, 'search_term' will show up (see below), but since we have not provided any results, that part of the web page will not render.
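
The matching requirement can be illustrated without a web server. The sketch below parses a form body the way a browser would send it and looks a field up by name; get_field is a hypothetical helper written for this example (it is not the webapp2 API), and returning an empty string for a missing field is an assumption made to mirror the behavior described above.

```python
try:
    from urllib.parse import parse_qs  # Python 3
except ImportError:
    from urlparse import parse_qs  # Python 2


def get_field(form_body, name, default=""):
    # Parse "key=value&key2=value2" and return the first value for `name`.
    values = parse_qs(form_body).get(name)
    return values[0] if values else default


# What a browser might send for <input name="search_term"> holding "hot dogs".
body = "search_term=hot+dogs"

matched = get_field(body, "search_term")    # name matches the form input
mismatched = get_field(body, "searchterm")  # typo in the name: nothing found
```

With the matching name the lookup returns the text the user typed; with the misspelled name it returns an empty string, and the template's if block would then hide the search line.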

When this is done, after you type a search term in, the web page at http://localhost:8080/ should show the following:

[... and so on]

Finally, we need to use the search term result. First we need to make sure that form submissions (which go to the 'search' URL) reach our application by changing 'app.yaml' to send all URLs to 'main' using '/.*':

- url: /.*
  script: main.app

Similarly, the last line of 'main.py' needs to be:

# this sets up the correct callback for jmankoff-byte1.appspot.com
# This is where you would add additional handlers if you
# wanted to have more subpages on that website.
app = webapp2.WSGIApplication([('/.*', MainHandler)], debug=True)
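
The pattern '/.*' matters because the form posts to 'search', not to the root URL. The sketch below shows the difference using the standard re module, under the assumption that the route pattern is treated as a regular expression that must match the whole request path; route_matches is a hypothetical helper for illustration and not part of webapp2.

```python
import re


def route_matches(pattern, path):
    # Assumption: the pattern has to match the entire path, start to end.
    return re.match("^" + pattern + "$", path) is not None


catch_all = "/.*"  # the pattern used in this tutorial
root_only = "/"    # a pattern that matches only the home page

results = {
    "catch_all /": route_matches(catch_all, "/"),
    "catch_all /search": route_matches(catch_all, "/search"),
    "root_only /": route_matches(root_only, "/"),
    "root_only /search": route_matches(root_only, "/search"),
}
```

Under that assumption, a root-only pattern would leave the form's 'search' URL without a handler, while the catch-all sends both pages to MainHandler.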

Now we encode the search terms so that they work in a URL, taking care of funny characters like spaces, using urllib.quote() (we will need to import urllib for this to work). Then we pass the encoded terms to Yahoo! and store the results in context. All of this is done inside the post method.

terms = self.request.get('search_term')
# encode the terms for use in the URL, but keep the original text for display
quoted_terms = urllib.quote(terms)
# This is the url for the yahoo pipe created in our tutorial
feed = feedparser.parse("http://pipes.yahoo.com/pipes/pipe.run?_id=1nWYbWm82xGjQylL00qv4w&_render=rss&textinput1=" + quoted_terms)
feed = [{"link": item.link, "title": item.title, "description": item.description} for item in feed["items"]]
context = {"feed": feed, "search": terms}
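
To see what the encoding step does to a search term, here is a small standalone sketch. The import is written so that it works both in the Python 2 environment this tutorial uses and in Python 3 (where the function lives in urllib.parse); build_feed_url is a hypothetical helper, and the base URL is the pipe URL from this tutorial.

```python
try:
    from urllib import quote  # Python 2, as used in this tutorial
except ImportError:
    from urllib.parse import quote  # Python 3

# The pipe URL from the tutorial, ending where the search term is appended.
BASE_URL = ("http://pipes.yahoo.com/pipes/pipe.run"
            "?_id=1nWYbWm82xGjQylL00qv4w&_render=rss&textinput1=")


def build_feed_url(search_term):
    # Percent-encode characters such as spaces and & before appending.
    return BASE_URL + quote(search_term)


url = build_feed_url("hot dogs & cats")
```

The term "hot dogs & cats" becomes hot%20dogs%20%26%20cats in the URL. Without the encoding, the & in the user's text would be read as the start of a new URL parameter.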

Finally, if we want to show the default search term in the form input, we change it to

<form action="search" method="POST">
  Search Term: <input name="search_term" value="{{ search }}"><br>
  <input type="submit" value="Enter Search Term">
</form>

and change the get method in 'main.py' to pass in "dog" as the default search term:

context = {"feed" : feed, "search" : "dog"}

Now we should have a working search form. Here is an example showing the results of a search for "cats".

Describe the flow of information from the end user (who enters a search term) until it is displayed back to the end user, in terms of the specific components relevant to the assignment (end user; jinja2; main.py; yahoo pipes).
