Lab 7: Hyperpartisan News Classification

Starter Code

Get the starter code on GitHub Classroom. You should also check out the section marked Before Thursday in the Introduction to get ready to work with our dataset this week (as it will take a bit to copy).

Caution: the analysis requires commands that will take O(an hour) on the whole dataset, so you'll need to finish the coding part (everything before Question 4) with time to spare. The code itself isn't much more involved than in past assignments, and there are instructions for testing that your code works on samples of the data to speed up your debugging (you'll rely on these instead of a Gradescope autograder). However, at some point, you'll need to set these long tasks running. If you feel like you're not making progress on the code, don't wait: reach out via Discord to the @grutors in #help.

Introduction (AND BIG DATASET LOGISTICS)

This week, you will explore the training and validation data for the Semeval 2019 Hyperpartisan News task. We saw a sample of that data in Lab 3.

Compared to the data sample you worked with earlier in the semester, the full training set is “big” in a couple of ways:

  • There are now ~600,000 training articles that you’ll need to process – way more than the 600 you used in Lab 3!

  • The large set of articles is labeled by publisher (source), not by article. In other words, articles from a publisher (e.g., "News Agency Alpha") are labeled true if News Agency Alpha is known to distribute hyperpartisan news, even if that label doesn't apply to every article it publishes. This simplifying assumption means you should expect the labels to be a lot noisier.

Consequently, we’ve reached the point where it really matters whether your code processes files efficiently: that is, in a way that minimizes unnecessary memory usage or computation.

The data files you'll work with this week are processed versions of the actual files released as part of the SemEval task. In particular, the text of every article has been pre-tokenized with spaCy, so you can recover the tokens just by splitting the text on whitespace. Additionally, all of the hyperlinks have been separated from the main text, so you don't have to worry about filtering out HTML from the middle of the articles. This gives you some sense of how useful it can be to use tools like spaCy to pre-process and then store your pre-processed data in a nice format. (If you need significant processing for your final project, you should make sure that you're not redoing that processing every time you run your experiments; saving a file like this is one way to do that!)
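For example, recovering the token list from a <spacy> or <lemma> field is just a whitespace split (a toy illustration, not part of the starter code):

    text = "LINCOLN , Neb. ( AP ) -EOS-"
    tokens = text.split()
    # ['LINCOLN', ',', 'Neb.', '(', 'AP', ')', '-EOS-']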

An example of one of these parsed articles (one of many stored inside a root-level articles tag) is below:

<article id="0000016" published-at="2017-12-27" title="Winning numbers drawn in &#8216;2 By 2&#8217; game">

<spacy>LINCOLN , Neb. ( AP ) -EOS- _ -EOS- The winning numbers in Tuesday evening &#8217;s drawing of the &#8220; 2 By 2 &#8221; game were : Red Balls : 9 - 21 , White Balls : 10 - 19 ( Red Balls : nine , twenty - one ; White Balls : ten , nineteen ) Estimated jackpot : $ 22,000 &#182; Top Prize $ 22,000 . -EOS- LINCOLN , Neb. ( AP ) -EOS- _ -EOS- The winning numbers in Tuesday evening &#8217;s drawing of the &#8220; 2 By 2 &#8221; game were : Red Balls : 9 - 21 , White Balls : 10 - 19 ( Red Balls : nine , twenty - one ; White Balls : ten , nineteen ) Estimated jackpot : $ 22,000 &#182; Top Prize $ 22,000 . -EOS-</spacy>

<lemma>lincoln , Nebraska ( ap ) -EOS- _ -EOS- the win number in tuesday evening &#8217;s drawing of the " 2 by 2 " game be : red balls : 9 - 21 , white balls : 10 - 19 ( red balls : nine , twenty - one ; white balls : ten , nineteen ) estimated jackpot : $ 22,000 &#182; top prize $ 22,000 . -EOS- lincoln , Nebraska ( ap ) -EOS- _ -EOS- the win number in tuesday evening &#8217;s drawing of the " 2 by 2 " game be : red balls : 9 - 21 , white balls : 10 - 19 ( red balls : nine , twenty - one ; white balls : ten , nineteen ) estimated jackpot : $ 22,000 &#182; top prize $ 22,000 . -EOS-</lemma>

<tag>NNP , NNP -LRB- NNP -RRB- -EOS- VBP -EOS- DT VBG NNS IN NNP NN POS NN IN DT `` CD IN CD '' NN VBD : NNP NNPS : CD SYM CD , NNP NNPS : CD SYM CD -LRB- NNP NNPS : CD , CD HYPH CD : NNP NNPS : CD , CD -RRB- JJ NN : $ CD `` NNP NNP $ CD . -EOS- NNP , NNP -LRB- NNP -RRB- -EOS- VBP -EOS- DT VBG NNS IN NNP NN POS NN IN DT `` CD IN CD '' NN VBD : NNP NNPS : CD SYM CD , NNP NNPS : CD SYM CD -LRB- NNP NNPS : CD , CD HYPH CD : NNP NNPS : CD , CD -RRB- JJ NN : $ CD `` NNP NNP $ CD . -EOS-</tag>

<title>

<spacy>winning numbers drawn in &#8216; 2 by 2&#8217; game -EOS-</spacy>

<lemma>win number draw in &#8216; 2 by 2&#8217; game -EOS-</lemma>

<tag>VBG NNS VBN IN CD CD IN CD NN -EOS-</tag>

</title>

</article>

You will work with the following files for this lab:

  • /cs/cs159/data/semeval/articles-training-byarticle-20181122.parsed.xml (7.0MB)

  • /cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml (6.8GB)

  • /cs/cs159/data/semeval/ground-truth-training-byarticle-20181122.xml (110KB)

  • /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml (100MB)

  • /cs/cs159/data/semeval/vocab.txt (19MB)

It's likely that for some of the larger files, the text editor you usually use won't be able to open them. You can use the light-weight less command on the command line if you need to take a peek. Alternately, if you just want to see a sample of the file, you can run head -n 1000 filename.xml to see the first 1000 lines of the file (or change the number of lines for your needs). You can even combine these two: head -n 20000 filename.xml | less will let you view just the first 20000 lines of the file.

Before Thursday: Getting file access

The biggest of these files is not in the Docker image because of its size. If you want to work in Docker, you should first copy that file to your working directory using your username for knuth:

scp <yourusername>@knuth.cs.hmc.edu:/cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml .

But also note that, even with careful data processing, this lab will use more memory than previous ones. You’ll see better performance if you give Docker access to as much memory as possible (16GB of RAM on my laptop). If you don’t have that much memory available on your own computer, you will probably want to run this lab on knuth. Whether you're downloading the data or not, please let me know right away if you still don't have knuth access!

If you do download the data to your own machine, do not add it to your GitHub repository! Git will run really slowly, and GitHub may reject your commits.

Before moving on, look through these files to familiarize yourself with them. They are similar to the ones you used in Lab 5, but have additional fields in them. As usual, you'll be writing the answers to periodic questions in the lab writeup in analysis.md, which you'll render as a PDF using pandoc. If you don't have pandoc running on your computer, you can use these instructions to get it set up.

Examining the Base Classes

Read through the code in the provided HyperpartisanNewsReader.py file. Add comments to the file, including docstrings for every function and block/line comments as necessary to demonstrate full understanding of how the code works.

There may be some Python code in this file that’s new to you, and that’s ok! Take some time now to read about any functionality you haven’t seen before so that you understand what it’s doing. As we approach final projects time, it's good to get comfortable looking up Python functions that can help with data processing.

Your comments should, specifically, demonstrate understanding of the roles of the following (a generic sketch showing how several of these fit together appears after the list):

  • islice

  • yield

  • .clear()

  • ABC

  • @abstractmethod
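To make the pattern concrete, here is a small, self-contained sketch of how these pieces often fit together when streaming a large XML file. The names are made up and this is not the provided do_xml_parse; treat it only as an illustration:

    from itertools import islice
    from abc import ABC, abstractmethod
    import xml.etree.ElementTree as ET

    def stream_articles(xml_path):
        """Lazily yield <article> elements, clearing each one after it is consumed."""
        for _, elem in ET.iterparse(xml_path, events=("end",)):
            if elem.tag == "article":
                yield elem      # a generator: hands back one article at a time
                elem.clear()    # drop the element's children so memory stays bounded

    class ExampleReader(ABC):   # an ABC cannot be instantiated directly...
        @abstractmethod
        def handle(self, article):
            """...and every concrete subclass must override this method."""

    # islice caps how many articles get pulled from the (lazy) generator:
    for article in islice(stream_articles("articles.parsed.xml"), 10):
        print(article.get("id"))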

Then, answer the following question in analysis.md.

  1. Compare the function do_xml_parse to the function short_xml_parse. In what sense do they do the same thing? In what ways do they differ? Which is more scalable, and why? Be specific about resource usage (e.g., memory and processing time): what properties of an XML file determine how much of each resource is needed, and how do the two functions trade those resources off?

Sparse Matrices

In this lab, you will write code that extracts features from articles and stores the features in a feature matrix X, with one row per article and one column per feature.

First, to set the stage, answer the following question in analysis.md.

  1. We know that the training set has ~600,000 articles in it. If for every article we store the counts for 10,000 features (perhaps the most common 10,000 unigrams) and each feature is stored as an 8-bit (1 byte) unsigned integer, how much space would be needed to store the feature matrix in memory?

If a matrix has a lot of zeros in it, we may not want to use lots of memory to store those zeros. An alternative is to use a sparse matrix representation, which skips writing out all of the zeros and instead records only the locations and values of the nonzero entries. There are several implementations of sparse matrices available in the SciPy library, with different benefits depending on what operations you want to support efficiently and how the nonzero entries are distributed across rows and columns. The one that will be most useful to us is the lil_matrix: it allows us to initialize an array efficiently and to assign values to specific elements of the matrix as we go along.
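As a quick illustration of that usage pattern (toy dimensions, nothing specific to this lab):

    import numpy as np
    from scipy.sparse import lil_matrix

    X = lil_matrix((5, 10), dtype=np.uint8)  # allocate an all-zero 5x10 matrix cheaply
    X[0, 3] = 2                              # assign entries one at a time as features arrive
    X[4, 7] = 1
    print(X.nnz)                             # 2 -- only the nonzero entries are stored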

But what is a lil_matrix? And what does "lil" stand for?

  1. Briefly answer the following questions about the lil_matrix data structure.

(a) How is the data in a lil_matrix stored? The SciPy documentation may be helpful here.

(b) Assume you're working with the same matrix as you were in Question 2: ~600,000 articles and 10,000 features. If only 1% of the entries in our X matrix are non-zero, what is the size of the resulting lil_matrix?

(c) What if instead of 1%, 10% of the elements are non-zero? To save memory, at what proportion of nonzero elements (if any) would it make sense to stop using the lil_matrix and instead use a “normal” numpy array?

(d) What kinds of operations would you expect to be slow in a lil_matrix?

Limiting the number of articles you read in

Notice that the process methods in HNFeatures and HNLabels take a max_instances optional parameter. In both cases, this argument helps determine the size of the matrix they create (the X and y matrices, respectively) and is also passed as an argument to the do_xml_parse function. When you are working on a task like this, whether it's for this lab, a future lab, or your final project, you should pass in a small value for max_instances to help you debug. For example, when you are first starting out, you might want to set max_instances to something very small, like 5 or 10. Once you're a bit more confident, you can set max_instances to a value that is small enough to run quickly but large enough that you're confident things are working, for example 500 or 1000. (Even at 10,000, it should still take only a minute or two.) Not until you're pretty confident that everything is working should you set max_instances to 600,000; you can also set it to None, which will read through the XML file and determine the largest possible value for max_instances (which in this case is 600,000).

Code you will need to implement

The following three subsections outline all of the code you’ll need to write this week. You should read through these three subsections before beginning any coding so that you have a big picture understanding of what you’re trying to build before you get started.

Sample output for this code is linked at the end of this section.

Implementing your own Labeler

At the bottom of HyperpartisanNewsReader.py, define your own derived class that inherits from HNLabels. (If you need a refresher on Python classes and inheritance, you can check out the Python documentation.) Your class should be called BinaryLabels. In your subclass, you will need to define the _extract_label function that is used by the process function in the original class. (As you'll recall, in Python there's no formal version of the private keyword like you might find in Java/C++, but we conventionally add an underscore prefix to member functions that aren't meant to be called from outside the class.) In this function, you should extract the hyperpartisan attribute stored in an article taken from the ground-truth XML file. The hyperpartisan attribute is stored as the string "true" or "false", hence the name of your subclass.

The skeleton of this class should look something like this:

class BinaryLabels(HNLabels):
    """docstring"""

    def _extract_label(self, article):
        """docstring"""
        # your code here

Implementing your own feature extractor

In the same way you used HNLabels to create the BinaryLabels class, you should add your own derived class to extract features that inherits from HNFeatures. Your class should be called BagOfWordsFeatures. It should implement a Bag of Words feature set – that is, the features returned by your _extract_features method should be the counts of words in the article. Only include the counts for words that are already stored in your input vocabulary vocab. Words that are not in the input vocabulary should be ignored. Be sure to read through the code in HNFeatures where _extract_features is called so you know what you should be returning from the _extract_features method.

The vocabulary you should use to initialize your vocab is stored in /cs/cs159/data/semeval/vocab.txt.

Note: You are only required to implement functions that match the interface of the HNFeatures class, but you’re encouraged to add extra helper functions to modularize your code. As mentioned before, the names of helper functions that you won’t call directly from outside of the class definition should start with a single underscore (e.g., _my_helper_function(self)).
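If it helps to see the counting logic in isolation, here is a generic bag-of-words sketch over a fixed vocabulary. The names are hypothetical and this does not match the HNFeatures interface, so treat it only as an illustration:

    from collections import Counter

    def bag_of_words(tokens, vocab_index):
        """Count only tokens that appear in the vocabulary; ignore everything else."""
        counts = Counter(tok for tok in tokens if tok in vocab_index)
        return [(vocab_index[tok], n) for tok, n in counts.items()]  # (column, count) pairs

    vocab_index = {"economy": 0, "election": 1, "game": 2}
    print(bag_of_words("the game the game election".split(), vocab_index))
    # [(2, 2), (1, 1)]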

Experiment Interface

To run prediction on the Hyperpartisan News task, you’ll use the hyperpartisan_main.py program. To get usage information, run the program with the -h flag:

$ python3 hyperpartisan_main.py -h

usage: hyperpartisan_main.py [-h] [-o FILE] [-v N] [-s N] [--train_size N]
                             [--test_size N] (-t FILE | -x XVALIDATE)
                             training labels vocabulary

positional arguments:
  training              Training articles
  labels                Training article labels
  vocabulary            Vocabulary

optional arguments:
  -h, --help            show this help message and exit
  -o FILE, --output_file FILE
                        Write predictions to FILE
  -v N, --vocab_size N  Only count the top N words from the vocab file
  -s N, --stop_words N  Exclude the top N words as stop words
  --train_size N        Only train on the first N instances. N=0 means use all
                        training instances.
  --test_size N         Only test on the first N instances. N=0 means use all
                        test instances.
  -t FILE, --test_data FILE
  -x XVALIDATE, --xvalidate XVALIDATE

Once it has parsed the command-line arguments, hyperpartisan_main calls the function do_experiment, which has not been implemented. You will use many of these arguments in the do_experiment function, which should have the following outline:

  • Create an instance of HNVocab.

  • Create an instance of (a derived class of) HNFeatures. For this lab, the derived class will be BagOfWordsFeatures.

  • Create an instance of (a derived class of) HNLabels. For this lab, the derived class will be BinaryLabels.

  • Create an instance of a classifier. You can use something from sklearn, such as MultinomialNB. (You can use your Decision List classifier from Lab 5, but it will require some reworking: while most scikit-learn classifiers support sparse matrix inputs, the DecisionList you made does not.)

  • Create feature and target (X and y) matrices from the training data.

  • Depending on the value of args.xvalidate and args.test_data, either:

    • If test data is given, create a feature matrix for the test data, fit your model to the training data, and get predictions (and probabilities) for each article in the test set.

    • If a number of folds x is given, perform x-fold cross validation on the training data, getting predictions (and probabilities) for each article in the training set. Hint: It may help to inspect what the "method" parameter does in scikit-learn's cross_val_predict function! (A hedged sketch of this branch appears after this outline.)

  • Regardless of which method was used to generate predictions, write out one line to args.output_file for each article with three values, separated by spaces:

    • the article id

    • the predicted class (true or false – do not include the quotes around the string)

    • your model’s confidence, which we’ll consider to be the probability of the predicted class
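As mentioned in the cross-validation step above, here is a hedged sketch of that branch and of the output format. The names article_ids, X, y, and out_file are assumptions standing in for whatever your own feature and label code produces, so adapt the details rather than copying them:

    import numpy as np
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.model_selection import cross_val_predict

    def write_xval_predictions(article_ids, X, y, folds, out_file):
        clf = MultinomialNB()
        # One row of class probabilities per article, predicted by the fold
        # that did not train on that article.
        probs = cross_val_predict(clf, X, y, cv=folds, method="predict_proba")
        classes = np.unique(y)    # column order scikit-learn uses for the probabilities
        best = probs.argmax(axis=1)
        for art_id, idx, row in zip(article_ids, best, probs):
            # article id, predicted class, probability of the predicted class
            out_file.write(f"{art_id} {classes[idx]} {row[idx]}\n")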

Sample output

Sample output is provided for several of the functions above to help check that your code is working as intended. (This is in lieu of using Gradescope autograders, which wouldn't handle the size of this dataset well.) Note that it is a challenge to provide samples for everything you will try, especially since the dataset is so large. If there are particular samples you would like to see, it's possible that they can be added as long as you give me some time to do so; please let me know on Discord if something you need seems to be missing.

Analysis

For each of the questions below, perform the analysis with the following settings:

  • excluding 100 stop words

  • using a vocabulary size of 30,000 (after excluding stopwords)

  • 10-fold cross-validation

  • trained on the full by-publisher training data file

Be sure to write out the labels and probabilities for the Multinomial Naïve Bayes classifier so you can inspect them: you will need those results to answer Questions 5-8 below.
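Assembled from the usage message above, a run with these settings might look something like the following (the output filename is just a placeholder):

    python3 hyperpartisan_main.py -s 100 -v 30000 -x 10 -o nb_bypublisher.txt \
        /cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml \
        /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml \
        /cs/cs159/data/semeval/vocab.txt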

Warning: You should only continue with these questions if you are 100% certain that your code is working up to this point. Running each classifier in Question 4 will take about 30 minutes! You can continue with Q5-Q8 after running just the Multinomial Naïve Bayes classifier, which lets you run the DummyClassifier portion of Q4 while you work on Q5-Q8.

Hint: If you're running long jobs like this over ssh to knuth, you may want to use the screen utility to start, and then detach from, your job. That way it won't be killed if you lose your ssh session, and you'll be free to do other things on knuth in the meantime.
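For example:

    screen -S lab7                        # start a named screen session
    python3 hyperpartisan_main.py ...     # launch the long-running job inside it
    # detach with Ctrl-a d; later, reattach with:
    screen -r lab7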

  1. Use the Multinomial Naïve Bayes classifier MultinomialNB, along with at least two different DummyClassifier options, which use really simple strategies to provide prediction baselines. Comment on their relative performance, and on what your results tell you about the data set. Briefly describe how these baselines compare to the baseline classifiers you considered in Lab 5.

  2. From the Multinomial Naïve Bayes classifier output, identify (by id) three articles that your model is confident are hyperpartisan. Comment on the contents of the articles: What do you think makes your classifier so confident that they are hyperpartisan? Is your classifier right?

  3. From the Multinomial Naïve Bayes classifier output, identify (by id) three articles that your model is confident are not hyperpartisan. Comment on the contents of the articles: what do you think makes your classifier so confident that they are not hyperpartisan? Is your classifier right?

  4. From the Multinomial Naïve Bayes classifier output, identify (by id) three articles that your model is not confident about – that is, articles for which your classifier's predicted probability is very close to 0.5. Comment on the contents of the articles: what do you think makes these articles hard for your classifier? Do you find them hard to classify as a human? If not, what aspects of the articles do you take into account that are not captured by the features available to your classifier?

  5. Based on your answers to the above, give a list of 3-5 additional features you could extract that might help the accuracy of your classifier. Make sure not just to list them, but also to comment briefly on how each of these would help.

Testing on the By-Article Labels

Run the Multinomial Naïve Bayes classifier similarly to how you did for Question 4, but this time, train on the by-publisher training data and test on the by-article training data.
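Concretely, this swaps the cross-validation flag for -t; the output filename below is again just a placeholder:

    python3 hyperpartisan_main.py -s 100 -v 30000 -o nb_byarticle.txt \
        -t /cs/cs159/data/semeval/articles-training-byarticle-20181122.parsed.xml \
        /cs/cs159/data/semeval/articles-training-bypublisher-20181122.parsed.xml \
        /cs/cs159/data/semeval/ground-truth-training-bypublisher-20181122.xml \
        /cs/cs159/data/semeval/vocab.txt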

How do your results compare? Is this surprising or not? You may also want to look through the results and think about how you would answer Questions 5-8 based on the output.

  1. From the Multinomial Naïve Bayes classifier output, identify (by id) three articles that your model is confident are hyperpartisan. Comment on the contents of the articles: What do you think makes your classifier so confident that they are hyperpartisan? Is your classifier right?

  2. From the Multinomial Naïve Bayes classifier output, identify (by id) three articles that your model is confident are not hyperpartisan. Comment on the contents of the articles: what do you think makes your classifier so confident that they are not hyperpartisan? Is your classifier right?

  3. From the Multinomial Naïve Bayes classifier output, identify (by id) three articles that your model is not confident about – that is, articles for which your classifier's predicted probability is very close to 0.5. Comment on the contents of the articles: what do you think makes these articles hard for your classifier? Do you find them hard to classify as a human? If not, what aspects of the articles do you take into account that are not captured by the features available to your classifier?

  4. Based on your answers to the above, comment on differences between the by-publisher and by-article data, the value of the by-publisher data as a training source, and anything else you observe.