In this assignment you will create a simple corpus from raw text. The main part is to write regular expressions for word tokenization. You will use the NLTK Treebank corpus, which has a raw text version that you can work on. This corpus contains ca. 5% of the Penn Treebank, (c) LDC 1995, which means around 100,000 words in 4,000 sentences.
You should submit the whole lab as one single Python file called lab1_your_name.py. (If you opt to do the VG assignment too, you should submit that as another file lab1vg_your_name.py).
Submit the files as ordinary mail attachments to peter.ljunglof@gu.se.
The file should be runnable from the command line without arguments, and print out all answers on the terminal, like this:
$ python lab1_peter_ljunglof.py
Part 1
------------------
(...)
Precision: 91.83%
Recall: 93.63%
F-score: 92.72%
Part 2
------------------
Nr. tokens: 66666
Nr. types: 66666
(...)
The code must be well documented, and all functions (no exceptions!) must have docstrings. All computations must be done in functions, the only things that are allowed on the top-level are, in this order:
module imports
definitions of constants
function and/or class definitions
a final run-time clause if __name__ == '__main__'
This is the structure, and it's strict:
# module imports
import nltk
import another_possible_module
(...)

# constants
corpus_size = (...)
token_regexp = r"""(...)"""
(...)

# function/class definitions
def a_function(some, arguments):
    """A mandatory docstring"""
    (...)
    return some_return_value

def another_function(more, arguments):
    """This docstring is also compulsory"""
    (...)

# command line interpreter
if __name__ == '__main__':
    do_this(...)
    then_do_that(...)
    # don't do too much here; call functions instead!
First you need to get the raw text version and the gold standard list of tokens. They are in different NLTK corpora, treebank_raw and treebank_chunk, respectively. Furthermore, there are some differences that we need to fix: e.g., the gold standard uses two different quotation marks ('' and ``), whereas the raw text only has one ("), so to make them comparable we translate the gold standard quotes into the raw form. You can use the following functions to get the raw and the gold corpus:
def get_corpus_text(nr_files=199):
    """Returns the raw corpus as a long string.
    'nr_files' says how much of the corpus is returned;
    default is 199, which is the whole corpus.
    """
    fileids = nltk.corpus.treebank_raw.fileids()[:nr_files]
    corpus_text = nltk.corpus.treebank_raw.raw(fileids)
    # Get rid of the ".START" text in the beginning of each file:
    corpus_text = corpus_text.replace(".START", "")
    return corpus_text

def fix_treebank_tokens(tokens):
    """Replace tokens so that they are similar to the raw corpus text."""
    return [token.replace("''", '"').replace("``", '"').replace(r"\/", "/")
            for token in tokens]

def get_gold_tokens(nr_files=199):
    """Returns the gold corpus as a list of strings.
    'nr_files' says how much of the corpus is returned;
    default is 199, which is the whole corpus.
    """
    fileids = nltk.corpus.treebank_chunk.fileids()[:nr_files]
    gold_tokens = nltk.corpus.treebank_chunk.words(fileids)
    return fix_treebank_tokens(gold_tokens)
Create a function that tokenizes a given text:
def tokenize_corpus(text):
    """Don't forget the docstring!"""
    (...)
    return tokens
Use NLTK's regexp tokenizer as described in section 3.7 of the NLTK book. You can start from the example pattern and successively improve it as much as possible. Use this evaluation function to compare your result with the gold standard tokenization:
def evaluate_tokenization(test_tokens, gold_tokens):
    """Finds the chunks where test_tokens differs from gold_tokens.
    Prints the errors and calculates similarity measures.
    """
    import difflib
    matcher = difflib.SequenceMatcher()
    matcher.set_seqs(test_tokens, gold_tokens)
    error_chunks = true_positives = false_positives = false_negatives = 0
    print " Token%30s | %-30sToken" % ("Error", "Correct")
    print "-" * 38 + "+" + "-" * 38
    for difftype, test_from, test_to, gold_from, gold_to in matcher.get_opcodes():
        if difftype == "equal":
            true_positives += test_to - test_from
        else:
            false_positives += test_to - test_from
            false_negatives += gold_to - gold_from
            error_chunks += 1
            test_chunk = " ".join(test_tokens[test_from:test_to])
            gold_chunk = " ".join(gold_tokens[gold_from:gold_to])
            print "%6d%30s | %-30s%d" % (test_from, test_chunk, gold_chunk, gold_from)
    precision = 1.0 * true_positives / (true_positives + false_positives)
    recall = 1.0 * true_positives / (true_positives + false_negatives)
    fscore = 2.0 * precision * recall / (precision + recall)
    print "Test size: %5d tokens" % len(test_tokens)
    print "Gold size: %5d tokens" % len(gold_tokens)
    print "Nr errors: %5d chunks" % error_chunks
    print "Precision: %5.2f %%" % (100 * precision)
    print "Recall: %5.2f %%" % (100 * recall)
    print "F-score: %5.2f %%" % (100 * fscore)
Run the tokenizer and evaluate. Look through the error report and make appropriate changes to your regexp. Iterate this until you are satisfied. Note that you will not reach 100%, since this is virtually impossible! (But you should at least be able to get above 99%.)
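If you are unsure where to start, here is a minimal sketch of tokenize_corpus built on nltk.regexp_tokenize; the pattern below is only an illustration of mine, not part of the assignment, and it will need many refinements before it scores well:

import nltk

def tokenize_corpus(text):
    """Tokenize 'text' with a regular expression and return a list of tokens."""
    pattern = r"""(?x)          # verbose regexp: whitespace and comments are ignored
          \d+(?:[.,]\d+)*       # numbers, possibly with decimal points or thousands separators
        | \w+(?:[-']\w+)*       # words, possibly with internal hyphens or apostrophes
        | \.\.\.                # ellipsis
        | [.,;:"'?!()&%$]       # punctuation and other single-character tokens
    """
    return nltk.regexp_tokenize(text, pattern)

The error report from evaluate_tokenization will show which kinds of tokens (abbreviations, contractions, numbers, and so on) such a pattern still gets wrong.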
To make life easier, you can do something like this at the end of your file when you are working on your regexp:
if __name__ == "__main__":
    nr_files = 20
    corpus_text = get_corpus_text(nr_files)
    gold_tokens = get_gold_tokens(nr_files)
    tokens = tokenize_corpus(corpus_text)
    evaluate_tokenization(tokens, gold_tokens)
Start with a small nr_files such as 20. When you have completed that, you can increase the size to 50, then 100, and finally 199.
Now you can use the tokenized corpus to answer the following questions:
How big is the corpus (number of words), and how many word forms are there?
What is the average word length?
What is the longest word length, and which words have that length?
How many hapax words are there, and what percentage of the corpus do they make up?
Clarification: the percentage should be computed against the total size of the corpus (the number of tokens), not against the number of word forms. The same applies to the following questions.
Which are the 10 most frequent words, and what percentage of the corpus do they make up?
Divide the corpus into 10 equally large subcorpora, c[0]…c[9].
How many hapaxes are there in c[0], and what percentage of c[0] is that?
How many hapaxes are there in c[1], given that you have already seen c[0]? What percentage of c[1]?
How many in c[2], given that you have already seen c[0] + c[1]? What percentage?
(…)
How many in c[k], given c[0] + … + c[k-1]? What percentage?
Hint: write a function that solves the general problem, and then just loop from 0 to 9 (a sketch is given after this list of questions).
Draw the results from the previous question in a graph.
How many unique bigrams are there in the corpus, and what percentage of all bigrams is that?
Note: I don't mean character bigrams as in the previous course, but word bigrams.
How many unique trigrams, and what percentage?
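For the subcorpus question, here is a hedged sketch of the general function mentioned in the hint above. The names are my own placeholders, and "new hapax" is read here as a word that occurs exactly once in the subcorpus and not at all in the previously seen text:

from collections import Counter

def nr_new_hapaxes(subcorpus, seen_tokens):
    """Count the hapaxes of 'subcorpus' that do not occur in 'seen_tokens'."""
    counts = Counter(subcorpus)
    seen = set(seen_tokens)
    return sum(1 for word in counts if counts[word] == 1 and word not in seen)

def report_subcorpus_hapaxes(corpus, nr_subcorpora=10):
    """Split 'corpus' into equally large subcorpora and print, for each one,
    the number of new hapaxes and the percentage of that subcorpus."""
    size = len(corpus) // nr_subcorpora
    for k in range(nr_subcorpora):
        sub = corpus[k * size : (k + 1) * size]
        seen = corpus[: k * size]
        hapaxes = nr_new_hapaxes(sub, seen)
        print("c[%d]: %5d new hapaxes = %5.2f %%" % (k, hapaxes, 100.0 * hapaxes / len(sub)))

For the graph, you could collect the percentages in a list and plot them with, for example, matplotlib.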
Each of these questions should be implemented in a function that takes the corpus as argument and returns the answer:
def nr_corpus_words(corpus):
    """Don't forget to docstring me!"""
    nr_of_corpus_words = (...)
    return nr_of_corpus_words
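In the same style, here are hedged sketches of a few more of the statistics functions; the names are my own placeholders, and nltk.FreqDist would work just as well as collections.Counter:

from collections import Counter

def nr_corpus_wordforms(corpus):
    """Return the number of distinct word forms (types) in the tokenized corpus."""
    return len(set(corpus))

def average_word_length(corpus):
    """Return the average token length, in characters."""
    return sum(len(word) for word in corpus) / float(len(corpus))

def most_frequent_words(corpus, n=10):
    """Return the n most frequent words as a list of (word, frequency) pairs."""
    return Counter(corpus).most_common(n)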
You should write a function that takes the tokenized corpus as argument and prints the statistics on the terminal:
def corpus_statistics(corpus):
    """Docstring, docstring, docstring!"""
    do_some_calculations
    print "Here is an answer: %5.2f %%" % (100.0 * nr_occurrences / total_nr)
    print "And here is another: %s" % (", ".join(a_list_of_strings))
    (...)
When you are finished you should make the Python file callable from the command line, and it should first tokenize the raw corpus, then print the error report from before, and finally print the answers to the questions.
Clarification: by "word" I don't mean anything linguistic; everything that the tokenizer recognizes as a token is a word. E.g., in the following example text (a Swedish sentence playing on the word "såg"; its exact meaning does not matter here), there should be 16 words and 8 word forms:
En såg såg en såg en såg såg, en annan sågade sågen sågen såg.
The following are the word forms:
, . En annan en såg sågade sågen
I.e., "En" and "en" count as different word forms, but "såg" (verb) and "såg" (noun) count as the same. "," and "." are also words and word forms in this sense. The most common words are "såg" (6 occurrences), "en" (3), and "sågen" (2).
The bigrams in the corpus are (En såg), (såg såg), (såg en), (en såg), (såg en), (en såg), (såg såg), (såg ,), (, en), (en annan), (annan sågade), (sågade sågen), (sågen sågen), (sågen såg), (såg .), in total 15 of them. But since some of them occur more than once (en såg, såg en, såg såg), the number of unique bigrams is 12. The number of possible bigrams is (the number of word forms)² = 8 × 8 = 64.
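A short sketch of how the unique-bigram count could be computed (the function name is a placeholder; nltk.bigrams would work just as well as zip):

def nr_unique_bigrams(corpus):
    """Return the number of distinct word bigrams in the tokenized corpus."""
    bigrams = zip(corpus, corpus[1:])      # or: nltk.bigrams(corpus)
    return len(set(bigrams))

Trigrams can be handled analogously, e.g. with zip(corpus, corpus[1:], corpus[2:]).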
In the optional VG assignment, you will create a program that automatically downloads a given Wikipedia article, cleans it, tokenizes it, and prints statistics about it.
You should submit the VG assignment as one single Python file called lab1vg_your_name.py. The file should be runnable from the command line with one single argument, and print out all answers on the terminal, like this:
$ python lab1vg_peter_ljunglof.py "Natural language processing"
Statistics for the Wikipedia article "Natural language processing"
--------------------------------------------------------------------------
(...)
If the user forgets to give an argument, the program should print an informative help, such as:
$ python lab1vg_peter_ljunglof.py
Usage: python lab1vg_peter_ljunglof.py "The title of any Wikipedia article"
First you must download the article. See section 3.1 of the NLTK book on how to download the contents of a URL. Assuming that the article is "Natural language processing", you need the URL that returns the raw wiki-formatted text of that article (see the sketch below for one way of constructing it).
Create a function that takes the name of the article and returns the raw text:
def download_wikipedia_article(article):
    """Docstring"""
    (...)
    return the_raw_wiki_text
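Here is a hedged sketch of the download function. The URL pattern is an assumption based on the standard MediaWiki raw-text interface (the action=raw parameter); if the NLTK book suggests another way of opening a URL, use that instead:

try:                                        # Python 3
    from urllib.request import urlopen
    from urllib.parse import quote
except ImportError:                         # Python 2
    from urllib import urlopen, quote

def download_wikipedia_article(article):
    """Download and return the raw wiki-formatted text of the given article."""
    # Assumption: the English Wikipedia serves raw wikitext at this kind of URL.
    url = ("https://en.wikipedia.org/w/index.php?title=%s&action=raw"
           % quote(article.replace(" ", "_")))
    return urlopen(url).read().decode("utf-8")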
Before you can tokenize the text, you need to clean it of layout commands, formatting directives and other things that do not belong to the running text. Write a function that cleans the raw Wikipedia text:
def clean_wikipedia_article(wiki_text):
    """Docstring"""
    (...)
    return cleaned_text
You will probably have to iterate this a lot until you get a result that you are satisfied with. Compare the cleaned text with the same article as shown in Wikipedia. Try to get rid of things such as the sidebar text, image captions, tables etc., while still keeping as much as possible of the "real" article text. Wikipedia's own help pages on wiki markup describe what the formatting directives mean.
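As a starting point, a rough cleaning function could look like the sketch below; the patterns are my assumptions about common wiki markup and will certainly need tuning against real articles:

import re

def clean_wikipedia_article(wiki_text):
    """Remove (some of the) wiki markup from the raw article text."""
    text = wiki_text
    # Templates and infoboxes, {{...}}; they can be nested, so strip repeatedly.
    while re.search(r"\{\{[^{}]*\}\}", text):
        text = re.sub(r"\{\{[^{}]*\}\}", "", text)
    # Internal links: keep the label of [[target|label]] and the target of [[target]].
    text = re.sub(r"\[\[(?:[^\[\]|]*\|)?([^\[\]]*)\]\]", r"\1", text)
    # Bold/italic quotes, section headings, and HTML tags.
    text = re.sub(r"'{2,}", "", text)
    text = re.sub(r"^=+.*=+[ \t]*$", "", text, flags=re.MULTILINE)
    text = re.sub(r"<[^>]+>", "", text)
    return text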
When you are finished cleaning you can tokenize the cleaned text using the regexp you developed for the treebank corpus. Take a look at the resulting tokenization and try to figure out if you can improve the regexp in any way.
Finally, your program should print some interesting statistics for the given Wikipedia article:
def print_statistics(tokenized_text):
    "Docstring, docstring, docstriiiing!"
    (...)
At least the following should be printed, but you are also encouraged to come up with new interesting measures:
How many words and word forms are there?
What is the average word length, and which are the longest words?
How many hapax words are there, and what percentage of the text?
Which are the 10 most frequent words, and what percentage of the text?
How many unique bigrams are there, and what percentage of all bigrams?
The structure of the program must be the same as described under the section Submission above. To get the command-line arguments you can use the Python variable sys.argv:
if __name__ == '__main__':
    import sys
    if len(sys.argv) != 2:
        sys.exit("Wrong number of arguments, but you should give better help than this message!")
    (...)
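Putting the pieces together, the run-time clause of the VG file could look roughly like this, reusing the functions sketched above together with your tokenize_corpus from the main lab:

if __name__ == '__main__':
    import sys
    if len(sys.argv) != 2:
        sys.exit('Usage: python lab1vg_your_name.py "The title of any Wikipedia article"')
    article = sys.argv[1]
    wiki_text = download_wikipedia_article(article)
    cleaned_text = clean_wikipedia_article(wiki_text)
    tokens = tokenize_corpus(cleaned_text)
    print('Statistics for the Wikipedia article "%s"' % article)
    print("-" * 70)
    print_statistics(tokens)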