Get the starter code on GitHub Classroom. Please make sure to pull the most recent version of the Docker image, as one of the data files has been updated for this lab.
This week, you will explore the training and validation data for the SemEval 2022 shared task on Patronizing and Condescending Language Detection. We saw a sample of that data in Lab 3.
The data files you'll work with this week are processed versions of the actual files released as part of the SemEval task; they contain 10,369 examples in total, each labeled as condescending or not. You will work with the data in /cs/cs159/data/patronize/patronize_full.xml (3.8MB). Two of these parsed examples (which sit inside a root-level examples tag) are shown below:
<example id="@@1824078" category="poor-families" country="tz" condescension="true" score="4">Camfed would like to see this trend reversed . It would like to see more girls in school . Basic Education Statistics in Tanzania ( BEST 2010 ) show that only 18 percent of girls have completed secondary school education . This is why Camfed supports girls from poor families to obtain secondary education and its efforts have seen many go to university .</example>
<example id="@@1921089" category="refugee" country="tz" condescension="false" score="0">Kagunga village was reported to lack necessary social services to meet the growing demand of refugees . The village has neither reliable , clean and safe water nor sanitation facilities that include latrines and critical medical services .</example>
Before moving on, look through this file to familiarize yourself with the structure. As usual, you'll be writing the answers to periodic questions in the lab writeup in analysis.md, which you'll render as a PDF using pandoc. If you don't have pandoc running on your computer, you can use these instructions to get it set up.
Read through the code in the provided PCLDataReader.py file. Add comments to the file, including docstrings for every function and block/line comments as necessary to demonstrate full understanding of how the code works.
There may be some Python code in this file that's new to you, and that's OK! Take some time now to read about any functionality you haven't seen before so that you understand what it's doing. As we approach final projects, it's good to get comfortable looking up Python functions that can help with data processing.
Your comments should specifically demonstrate understanding of the roles of the following (a short illustrative sketch appears after this list):
islice
yield
.clear()
ABC
@abstractmethod
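If it helps to see these constructs together, here is a minimal sketch in the spirit of (but not identical to) the provided parsing code; the names stream_examples and LabelReader are made up for illustration:

```python
from abc import ABC, abstractmethod
from itertools import islice
import xml.etree.ElementTree as ET

def stream_examples(xml_file, max_instances=None):
    """Lazily yield <example> elements one at a time (a generator)."""
    # iterparse walks the file incrementally instead of loading the whole tree
    examples = (elem for _, elem in ET.iterparse(xml_file)
                if elem.tag == "example")
    # islice caps how many items we pull from the underlying iterator
    for elem in islice(examples, max_instances):
        yield elem      # hand one element back to the caller, then pause here
        elem.clear()    # drop the element's contents so memory stays bounded

class LabelReader(ABC):
    """Abstract base class: it cannot be instantiated directly."""
    @abstractmethod
    def _extract_label(self, example):
        """Subclasses must override this to pull a label from an element."""
```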
Then, answer the following question in analysis.md.
Compare the function do_xml_parse to the function short_xml_parse. In what sense do they do the same thing? In what ways do they differ? Which is more scalable, and why? Be specific about resource usage (e.g., memory, processing time): what properties of an XML file determine how much of each resource is needed, and how do the two approaches trade off?
In this lab, you will write code that extracts features from text examples and stores the features in a feature matrix X, with one row per example and one column per feature.
First, to set the stage, answer the following question in analysis.md.
We know that the training set has ~10,000 examples in it. If for every example we store the counts for 10,000 features (perhaps the most common 10,000 unigrams) and each feature is stored as an 8-bit (1 byte) unsigned integer, how much space would be needed to store the feature matrix in memory?
If a matrix has a lot of zeros in it, we may not want to use lots of memory to store those zeros. An alternative is to use a sparse matrix representation, which skips writing out all of the zeros and instead encodes where in the matrix entries are nonzero. There are several implementations of sparse matrices available in the SciPy library with different benefits depending on what operations you want to support efficiently and how zeros are distributed across rows and columns. The one that will be most useful to us is the lil_matrix: it allows us to initialize an array efficiently and to assign values to specific elements of the matrix as we go along.
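For example, a lil_matrix can be allocated with a fixed shape and filled in one entry at a time (the sizes and indices below are just for illustration):

```python
import numpy as np
from scipy.sparse import lil_matrix

# Allocate a 10,000 x 10,000 sparse matrix without storing all the zeros.
X = lil_matrix((10_000, 10_000), dtype=np.uint8)
X[0, 42] = 3            # example 0 used feature 42 three times
X[1, 7] = 1             # element-by-element assignment is cheap in a lil_matrix
X_csr = X.tocsr()       # many sklearn estimators work best with CSR once built
```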
But what is a lil_matrix? And what does "lil" stand for?
Briefly answer the following questions about the lil_matrix data structure.
(a) How is the data in a lil_matrix stored? The SciPy documentation may be helpful here.
(b) Assume you're working with the same matrix as you were in Question 2: ~10,000 examples and 10,000 features. If only 1% of the entries in our X matrix are non-zero, what is the size of the resulting lil_matrix?
(c) What if instead of 1%, 10% of the elements are non-zero? To save memory, at what proportion of nonzero elements (if any) would it make sense to stop using the lil_matrix and instead use a “normal” numpy array?
(d) What kinds of operations would you expect to be slow in a lil_matrix?
Notice that the process methods in PCLFeatures and PCLLabels take an optional max_instances parameter. In both cases, this argument helps determine the size of the matrix each one creates (X and y, respectively), and it is passed along to the do_xml_parse function. While you are debugging, pass a small value for max_instances so you can iterate quickly: when you are first starting out, something very small like 5 or 10; once you're a bit more confident, a value that still runs quickly but is large enough to convince you things are working, such as 500. If you set max_instances to None, the code will read through the XML file and determine the largest possible value for max_instances (which in this case is just over 10,000).
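As a purely hypothetical illustration (the exact arguments that process takes come from the provided classes, so adjust the names accordingly), the debugging workflow might look like:

```python
# Hypothetical debugging calls; bow_features and train_xml_path are placeholder names.
with open(train_xml_path) as train_file:
    X_small = bow_features.process(train_file, max_instances=10)   # quick sanity check
# ... later, once things look right, switch to max_instances=500 or None
```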
The following three subsections outline all of the code you’ll need to write this week. You should read through these three subsections before beginning any coding so that you have a big picture understanding of what you’re trying to build before you get started.
Sample output for this code is linked at the end of this section.
At the top of pcl_main.py, define your own derived class that inherits from PCLLabels. (If you need a refresher on Python classes and inheritance, you can check out the Python documentation.) Your class should be called BinaryLabels. In your subclass, you will need to define the _extract_label function that is used by the process function in the original class. (As you'll recall, Python has no formal equivalent of the private keyword you might find in Java/C++, but by convention we add an underscore prefix to member functions that aren't meant to be called from outside the class.) In this function, you should extract the condescension attribute stored in an example taken from the ground-truth XML file. The condescension attribute is stored as the string "true" or "false", hence the name of your subclass.
The skeleton of this class should look something like this:
```python
class BinaryLabels(PCLLabels):
    """docstring"""

    def _extract_label(self, example):
        """docstring"""
        # your code here
```
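For reference, XML attributes on an ElementTree Element are read with .get(); here is a self-contained illustration (outside of any class, with a made-up example):

```python
import xml.etree.ElementTree as ET

# Illustration only: .get() reads an attribute from an Element as a string.
example = ET.fromstring(
    '<example id="@@1" category="refugee" condescension="false">some text</example>'
)
print(example.get("condescension"))   # -> 'false' (a string, not a bool)
```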
In addition to the "true"/"false" condescension labels, this data also has a categorization scheme under the 'category' attribute (e.g., 'refugee'). This allows us to take a different approach to evaluation: rather than holding out a random sample of the data, we can hold out an entire category of the data and see whether patterns learned from the other categories generalize!
Just below your BinaryLabels class, add another derived class that inherits from PCLLabels. This class should be called CategoryLabels, and should look very similar, but instead of pulling the true/false condescension label, it should pull the example's category.
In the same way you used PCLLabels to create the BinaryLabels class, add your own derived class to pcl_main.py that inherits from PCLFeatures and extracts features. Your class should be called BagOfWordsFeatures. It should implement a Bag of Words feature set – that is, the features returned by your _extract_features method should be the counts of the words in the example. Only include counts for words that are already stored in your input vocabulary vocab; words that are not in the vocabulary should be ignored. Be sure to read through the code in PCLFeatures where _extract_features is called so you know what _extract_features should return.
The vocabulary you should use to initialize your vocab is stored in /cs/cs159/data/patronize/vocab.txt.
Note: You are only required to implement functions that match the interface of the PCLFeatures class, but you’re encouraged to add extra helper functions to modularize your code. As mentioned before, the names of helper functions that you won’t call directly from outside of the class definition should start with a single underscore (e.g., _my_helper_function(self)).
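As a rough sketch of the counting step (the helper name and the word-to-column mapping here are hypothetical; check PCLVocab and PCLFeatures for the interfaces you actually have to work with):

```python
from collections import Counter

def _count_in_vocab(tokens, vocab_index):
    """Hypothetical helper: map feature column -> count for in-vocabulary words.

    Assumes vocab_index is a dict from word to feature column; words not in
    the vocabulary are simply ignored.
    """
    counts = Counter(tok for tok in tokens if tok in vocab_index)
    return {vocab_index[word]: n for word, n in counts.items()}

# e.g. _count_in_vocab("the girls go to school to school".split(),
#                      {"girls": 0, "school": 1})  ->  {0: 1, 1: 2}
```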
To run prediction on the Patronizing Data task, you’ll set up code in the pcl_main.py program. To get usage information, run the program with the -h flag:
```
$ python3 pcl_main.py -h
usage: pcl_main.py [-h] [-o FILE] [-v N] [-s N] [--train_size N]
                   [--test_size N] (-t FILE | -x XVALIDATE)
                   training labels vocabulary

positional arguments:
  training              Training instances
  labels                Training instance labels
  vocabulary            Vocabulary

optional arguments:
  -h, --help            show this help message and exit
  -o FILE, --output_file FILE
                        Write predictions to FILE
  -v N, --vocab_size N  Only count the top N words from the vocab file
  -s N, --stop_words N  Exclude the top N words as stop words
  --train_size N        Only train on the first N instances. N=0 means use all
                        training instances.
  --test_size N         Only test on the first N instances. N=0 means use all
                        test instances.
  -t FILE, --test_data FILE
  -x XVALIDATE, --xvalidate XVALIDATE
```
Note: in the current file setup, the file you'll pass as the training argument is the same as the labels file; you'll just open the same file twice. (Since you're opening it in read mode, this is fine.)
Once it has parsed the command-line arguments, pcl_main calls the function do_experiment, which has not been implemented. You will use many of these arguments in the do_experiment function, which should have the following outline:
Create an instance of PCLVocab.
Create an instance of BagOfWordsFeatures.
Create an instance of BinaryLabels and CategoryLabels. (These can be read from the same file, but you may need to call seek(0) from the file pointer.)
Create an instance of a classifier. You can use something from sklearn, such as MultinomialNB. (You can use your Decision List classifier from Lab 5, but it will require some reworking: while most scikit-learn classifiers support sparse matrix inputs, the DecisionList you made does not.)
Create feature and target (X and y) matrices from the training data.
Depending on the value of args.xvalidate and args.test_category, do one of the following (a simplified sketch of both branches appears after this outline):
If a test category is given, use all of the examples from that category as test data and only the examples from the other categories as training data. (You can use numpy array slicing or np.where to help with this.) Fit your model to the data and labels for the examples outside test_category, then get predictions (and probabilities) for each instance in the test category.
If a number of folds x is given, perform x-fold cross validation on the training data, getting predictions (and probabilities) for each instance in the training set. Hint: It may help to inspect what the “method” parameter does in scikit-learn's cross_val_predict function!
Regardless of which method was used to generate predictions, write out one line to args.output_file for each instance. The line should have three values in order, separated by spaces:
the instance id
the predicted class (true or false – do not include the quotes around the string)
your model’s confidence, which we’ll consider to be the probability of the predicted class
Please don't alter the order or format of the output file from the specification above, as semeval-pcl-2022-eval.py expects this format.
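To make the two branches concrete, here is a heavily simplified sketch. It assumes X (the sparse feature matrix), y (the label array), categories (each example's category, in the same order), and ids (each example's id, in the same order) are numpy/scipy objects you have already built, that args holds the parsed command-line arguments, and that held_out_category stands in for the test-category argument (or None); none of these names are dictated by the starter code, and this is not the only reasonable structure.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB

clf = MultinomialNB()
Xc = X.tocsr()                      # CSR rows are easy to select by index

if held_out_category is not None:   # hold out one category as test data
    train_idx = np.where(categories != held_out_category)[0]
    test_idx = np.where(categories == held_out_category)[0]
    clf.fit(Xc[train_idx], y[train_idx])
    probs = clf.predict_proba(Xc[test_idx])
    classes = clf.classes_
    out_ids = ids[test_idx]
else:                               # x-fold cross-validation
    probs = cross_val_predict(clf, Xc, y, cv=args.xvalidate,
                              method="predict_proba")
    classes = np.unique(y)          # columns of probs follow this sorted order
    out_ids = ids

preds = classes[np.argmax(probs, axis=1)]   # predicted class per instance
confidence = probs.max(axis=1)              # probability of the predicted class

with open(args.output_file, "w") as out:
    for ex_id, label, p in zip(out_ids, preds, confidence):
        print(ex_id, label, p, file=out)    # id, predicted class, confidence
```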
Sample output is provided for several of the functions above to help check that your code is working as intended. (This is in lieu of using Gradescope autograders, which wouldn't handle the size of this dataset well.) Note that it is a challenge to provide samples for everything you will try, especially since the dataset is so large. If there are particular samples you would like to see, it’s possible that they can be added as long as you give me some time to do so; please notify me on Discord if something conspicuous is missing.
For each of the questions below, perform the analysis with the following settings (a sample invocation follows the list):
excluding 100 stop words
using a vocabulary size of 10,000 (after excluding stopwords)
10-fold cross-validation
using the entire training data file
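Putting those settings together, and assuming the data and vocabulary paths from earlier in the lab, the invocation might look something like this (the output filename is just an example; add --train_size 0 if the full training set isn't already the default):

```
$ python3 pcl_main.py -s 100 -v 10000 -x 10 -o mnb_predictions.txt \
    /cs/cs159/data/patronize/patronize_full.xml \
    /cs/cs159/data/patronize/patronize_full.xml \
    /cs/cs159/data/patronize/vocab.txt
```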
Be sure to write out the labels and probabilities for the Multinomial Naïve Bayes classifier so you can inspect them: you will need those results to answer the 4 questions that follow.
Hint: If you're running jobs like this over ssh to knuth, you may want to use the screen utility to start your job and then detach from it. That way the job won't be killed if you lose your ssh session, and you'll be free to do other things on knuth.
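For example, a typical screen workflow looks like this:

```
$ screen -S pcl             # start a named screen session
$ python3 pcl_main.py ...   # launch your job inside the session
# detach with Ctrl-a d, log out if you like, then later reattach with:
$ screen -r pcl
```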
Use the Multinomial Naïve Bayes classifier MultinomialNB, along with at least two different DummyClassifier options, which use really simple strategies to provide prediction baselines. Comment on their relative performance using semeval-pcl-2022-eval.py to print statistics about the precision, recall, accuracy, and F1. (You can use the --help option to figure out the usage of the semeval-pcl-2022-eval.py script: it should take in an input data file, an input file with the predictions you generated in the previous section, and an optional output file to write the results.) What do your results tell you about the data set?
Note: because of the way the DummyClassifiers determine probabilities, using predict_proba may get you different results than expected, so it may make more sense to call predict directly.
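For example, two simple baselines you might try (other strategies exist; see the scikit-learn documentation):

```python
from sklearn.dummy import DummyClassifier

majority = DummyClassifier(strategy="most_frequent")   # always predicts the majority class
stratified = DummyClassifier(strategy="stratified")    # samples labels from the training distribution
```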
Consider the examples your model is confident about:
From the Multinomial Naïve Bayes classifier output, identify (by id) three examples that your model is confident are condescending. Comment on the contents of the examples: What do you think makes your classifier so confident? Is your classifier right?
From the Multinomial Naïve Bayes classifier output, identify (by id) three examples that your model is confident are not condescending. Comment on the contents of the examples: what do you think makes your classifier so confident that they are not? Is your classifier right?
From the Multinomial Naïve Bayes classifier output, identify (by id) three examples that your model is not confident about – that is, examples for which your classifier’s prediction is very close to 0.5. Comment on their contents: what do you think makes these examples hard for your classifier? Do you find them hard to classify as a human? If not, what aspects of the examples do you take into account that are not captured by the features available to your classifier?
By default, MultinomialNB automatically figures out class priors from the training data. You can disable this when you first initialize the MultinomialNB object by setting the keyword argument fit_prior=False. How does this affect the results? Would you recommend this configuration over the default? Why or why not?
Based on your answers to the above, give a list of 3-5 additional features you could extract that might help the accuracy of your classifier. Make sure not just to list them, but also to comment briefly on how each of these would help.
Train the model again using the training data. Keep the other arguments the same, but this time, rather than using cross validation, try holding out the vulnerable category. (Note that each of these categories represents roughly 1/10 of the dataset, though not exactly.)
From the Multinomial Naïve Bayes classifier output, identify (by id) three vulnerable examples that your model is not confident about – that is, examples for which your classifier’s prediction is very close to 0.5. Comment on their contents: what do you think makes these examples hard for your classifier? Do you find them hard to classify as a human? If not, what aspects of the examples do you take into account that are not captured by the features available to your classifier?
Based on your answers to the above, comment on whether this performed better or worse than cross validation. Did you find this result surprising?
Do other categories work better as held-out data sets? Try comparing the statistics between this category and a couple of others. Does the variation in results match what you would expect? (The other categories are 'disabled', 'homeless', 'hopeless', 'immigrant', 'in-need', 'migrant', 'poor-families', 'refugee', and 'women'.)
Reflect on what you think of this task. How well-defined is this task? How challenging? How useful? Is there anything you'd change?