For preprocessing plain text essays, Stab and Gurevych (page 633 of their paper) proposed using DKPro, an NLP library, to tokenize, lemmatize, extract part-of-speech (POS) tags, and parse constituents and dependencies, providing a wide variety of features to choose from and expand upon for argument identification.
For our software artifact, we pivoted away from DKPro because it was difficult to work with, largely outdated, and sparsely documented. Instead, we used CoreNLP (Manning et al., 2014) from Stanford University as our NLP library, which provided all of the same features as DKPro and proved easier to use with Python.
For each token extracted from each essay, we derived structural, syntactic, lexico-syntactic, and probability features that became training data for a Conditional Random Field (CRF) to predict whether a token was at the beginning of an argument component (labeled Arg-B), within an argument component (labeled Arg-I), or non-argumentative (labeled O).
Specifically, structural features for each token refer to positional data, such as the token's position relative to punctuation and its placement within its respective sentence, paragraph, and the overall essay. Syntactic features included the POS tag, the lowest common ancestor (LCA) in the constituency tree between the current token and the preceding token as well as between the current token and the following token, and the types of LCAs found. For lexico-syntactic features, we extracted the dependency head for each token and its constituency type, as well as its uppermost and right sibling nodes. The proposed probability feature was the conditional probability of a token being Arg-B, Arg-I, or O given its preceding tokens, for which Stab and Gurevych looked back at most three preceding tokens.
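To make this step concrete, the following is a minimal sketch of a token-level CRF trained on BIO-style labels, assuming the sklearn-crfsuite package; the token_features function covers only a simplified subset of the features described above, and the toy sentence and labels are illustrative rather than drawn from the real corpus.

```python
# Minimal sketch of the token-level CRF (assumes the sklearn-crfsuite package).
# Only a simplified subset of the structural/syntactic features is shown;
# token_features() and the toy input are hypothetical illustrations.
import sklearn_crfsuite

def token_features(sentence, i):
    """Build a feature dict for token i of a tokenized, POS-tagged sentence."""
    token, pos = sentence[i]
    return {
        "lower": token.lower(),
        "pos": pos,                               # syntactic feature
        "rel_position": i / len(sentence),        # structural feature
        "is_first": i == 0,
        "is_last": i == len(sentence) - 1,
        "precedes_punct": i + 1 < len(sentence) and sentence[i + 1][0] in ".,;:!?",
    }

def sentence_features(sentence):
    return [token_features(sentence, i) for i in range(len(sentence))]

# X: one feature-dict sequence per sentence; y: parallel Arg-B / Arg-I / O labels
train_sentences = [[("First", "RB"), (",", ","), ("cloning", "NN"), ("is", "VBZ"), ("risky", "JJ")]]
train_labels = [["O", "O", "Arg-B", "Arg-I", "Arg-I"]]

X_train = [sentence_features(s) for s in train_sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, train_labels)
print(crf.predict(X_train))
```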
Argument classification directly builds on argument identification: we extract the appropriately labeled tokens (Arg-B and Arg-I specifically) and derive lexical, structural, indicator, contextual, syntactic, probability, discourse, and embedding features, on which we train an SVM to classify each component as a Major Claim, Claim, or Premise.
Lexical features included the lemmatized component and preceding tokens, which we represented with the TF-IDF vectorizer from Scikit-learn (Pedregosa et al., 2011) for a more coherent and compact representation. We also extracted dependency pairs from each component and selected the two thousand most frequently occurring pairs from the training data. For our dataset, we counted the number of instances of each of the top two thousand dependency pairs in each argument component.
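Below is a brief sketch of how these lexical features might be assembled with Scikit-learn; the component strings and dependency pairs are toy placeholders rather than actual AAE data.

```python
# Sketch of the lexical features (assumes scikit-learn).  The component strings
# and dependency pairs below are toy placeholders, not the real corpus.
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer

components = ["cloning be risky", "we should ban cloning"]   # lemmatized components
tfidf = TfidfVectorizer(lowercase=True)
lexical_matrix = tfidf.fit_transform(components)             # sparse TF-IDF features

# Dependency pairs (head_lemma, dependent_lemma) per component, e.g. from CoreNLP
dep_pairs = [[("risky", "cloning"), ("risky", "be")],
             [("ban", "we"), ("ban", "cloning"), ("ban", "should")]]

# Keep the N most frequent pairs in the training data (2000 in the full system)
top_pairs = [p for p, _ in Counter(p for comp in dep_pairs for p in comp).most_common(2000)]

# Count occurrences of each top pair in every component
pair_counts = [[comp.count(p) for p in top_pairs] for comp in dep_pairs]
```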
Structural features included token statistics (number of preceding, component, and following tokens, labels for covering sentence and paragraph, and the ratio between the number of component tokens and sentence tokens) and component position information, such as whether it is first or last in paragraph or in the introduction or conclusion paragraphs and the number of preceding and following components in its covering paragraph.
Indicator features detailed whether certain indicator phrases (i.e. forward, backward, thesis, rebuttal, first-person) were present in the preceding or component tokens, and contextual features counted instances of indicator phrases in the component’s covering paragraph, as well as shared noun and verb phrases between the component and the introduction or conclusion paragraphs.
Syntactic features included the number of subclauses, which we extracted from the constituency tree by counting instances of the "S" label; the depth of the constituency tree; the tense of the main verb, which we took to be the first verb in the component's verb phrase and determined by its POS label (VB, VBZ, or VBP indicating present tense; VBN or VBD indicating past tense; and VBG, in our implementation, future tense); whether modal verbs (such as shall and will) were present; and the distribution of POS tags in the component.
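As a concrete illustration of the tense and modal features, here is a small sketch applying the labeling rules above to a component's verb phrase; the input format (a list of token and POS-tag pairs) is an assumption for illustration.

```python
# Sketch of the main-verb tense and modal features, assuming the verb-phrase
# tokens and their POS tags have already been pulled from the constituency tree.
MODALS = {"shall", "will", "can", "could", "may", "might", "should", "would", "must"}

def tense_and_modal(vp_tagged_tokens):
    """vp_tagged_tokens: list of (token, pos) pairs from the component's verb phrase."""
    tense = "none"
    for token, pos in vp_tagged_tokens:
        if pos in {"VB", "VBZ", "VBP"}:
            tense = "present"
            break
        if pos in {"VBN", "VBD"}:
            tense = "past"
            break
        if pos == "VBG":
            tense = "future"   # the label our implementation assigns to VBG
            break
    has_modal = any(token.lower() in MODALS for token, _ in vp_tagged_tokens)
    return tense, has_modal

print(tense_and_modal([("will", "MD"), ("become", "VB"), ("dangerous", "JJ")]))
```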
Probability features were conditional probabilities of a component being a Major Claim, Claim, or Premise given its preceding tokens, which we estimated from the preceding tokens observed in the training data.
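One plausible way to estimate these conditional probabilities is a simple count-based estimator over the training data, sketched below; the data structures and the toy training examples are hypothetical illustrations, not necessarily the exact estimator used.

```python
# Sketch of the probability features: estimate P(type | preceding token) from
# training counts and aggregate over a component's preceding tokens.
from collections import defaultdict

# Hypothetical training examples: (preceding tokens, component type)
training_components = [
    (["therefore", ","], "Claim"),
    (["because"], "Premise"),
    (["in", "my", "opinion", ","], "MajorClaim"),
]

counts = defaultdict(lambda: defaultdict(int))   # counts[token][component_type]
for preceding_tokens, comp_type in training_components:
    for tok in preceding_tokens:
        counts[tok.lower()][comp_type] += 1

def cond_prob(token, comp_type):
    total = sum(counts[token.lower()].values())
    return counts[token.lower()][comp_type] / total if total else 0.0

def probability_features(preceding_tokens):
    """Take, for each type, the maximum conditional probability over the preceding tokens."""
    return {t: max((cond_prob(tok, t) for tok in preceding_tokens), default=0.0)
            for t in ("MajorClaim", "Claim", "Premise")}

print(probability_features(["therefore", ","]))
```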
Discourse features specifically refer to discourse triples in the style of the Penn Discourse Treebank (PDTB) from the University of Pennsylvania, where each triple encodes the type of discourse relation present, whether it was implicit or explicit, and whether the component overlapped with the first or second argument of the discourse relation. This was a challenge to implement since the original PDTB parser exists in Ruby on Rails (Lin et al., 2014). To obtain these triples, we found a PDTB parser written in Java (Ilievski, 2015) based on the work of Lin et al., which we called from Python with plain-text essays as input. Embeddings were derived from Google's Word2Vec model (Mikolov et al., 2013) pre-trained on a dataset of news articles. It is worth mentioning that this embedding model could only be found in Google's archives and is at least a decade old.
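For the embedding features, the sketch below shows one common way to turn the archived Word2Vec model into a component-level vector, assuming gensim; averaging the token vectors is one reasonable choice rather than a prescribed one, and the file path is a placeholder.

```python
# Sketch of the embedding features, assuming gensim and the archived
# GoogleNews-vectors-negative300.bin file; the path is a placeholder.
import numpy as np
from gensim.models import KeyedVectors

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def component_embedding(tokens, dim=300):
    """Average the Word2Vec vectors of the in-vocabulary tokens in a component."""
    vectors = [w2v[t] for t in tokens if t in w2v]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

print(component_embedding(["cloning", "is", "risky"]).shape)
```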
Once we have obtained the components labeled with types, we can extract pairwise linguistic features from an essay to identify which components are related by a directed edge. A pair of components is a valid candidate for relation identification only if both components are located in the same paragraph.
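A minimal sketch of this candidate generation step is shown below; the component dictionaries are a hypothetical representation of the parsed essay.

```python
# Sketch of candidate pair generation: only ordered pairs of components that
# share a paragraph are considered for relation identification.
from itertools import permutations

def candidate_pairs(components):
    """components: list of dicts with at least 'id' and 'paragraph' keys (illustrative)."""
    return [(a["id"], b["id"])
            for a, b in permutations(components, 2)
            if a["paragraph"] == b["paragraph"]]

comps = [{"id": 0, "paragraph": 1}, {"id": 1, "paragraph": 1}, {"id": 2, "paragraph": 2}]
print(candidate_pairs(comps))   # [(0, 1), (1, 0)]
```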
Because the authors note that performance improves when lexical features are excluded, we do the same. Syntactic features include the merged POS dictionaries of both components and whether each of the 500 most common production rules in the dataset occurs in the component's covering sentence. To obtain production rules, we relate each node in a constituent parse tree to all of its children, provided that the node is neither the parent of a leaf (i.e., the POS tag of a token) nor a leaf itself (i.e., a token).
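The production-rule extraction can be illustrated with NLTK's tree utilities, as in the sketch below; the parse string is a toy tree rather than CoreNLP output, and filtering out lexical productions corresponds to skipping POS-tag nodes and tokens.

```python
# Sketch of production-rule extraction with NLTK; the parse string is a toy
# constituency tree rather than CoreNLP output.
from nltk import Tree

tree = Tree.fromstring("(ROOT (S (NP (DT The) (NN cat)) (VP (VBZ sits))))")

# Keep only non-lexical rules, i.e. drop POS-tag -> token productions
production_rules = [str(p) for p in tree.productions() if p.is_nonlexical()]
print(production_rules)   # ['ROOT -> S', 'S -> NP VP', 'NP -> DT NN', 'VP -> VBZ']
```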
In terms of essay structure, we count the total number of tokens that occur within each pair and the number of components that occur between them and overall in their covering paragraph, and we extract binary features regarding the two components’ locations within the paragraph and the essay. Indicator features are the types of indicators that occur within, between, following, or preceding the source and target components in their covering paragraph. We then find whether each discourse relation occurs in either the source or target, and we also identify which nouns are shared by the pair.
For each lemma occurring in a valid pair, we calculate the pointwise mutual information (PMI) between the lemma t and the direction d of a relation, either incoming or outgoing, by taking the log of the ratio between p(t,d) and p(t)p(d), which tells us whether a lemma is positively or negatively associated with a certain direction. For this step, we performed preprocessing to write out all the lemmas that occur in components with either incoming or outgoing relations according to the ground-truth annotations of the training data. For the test set of the AAE dataset and the subset of the ASAP Set 2 dataset, we calculated p(t,d) based on lemma occurrences in the training data. As pairwise features, we calculate the ratio of lemmas with positive or negative associations with a particular direction and whether such associations exist at all. To avoid running the CoreNLP annotators more than necessary, we output all production rules and lemma probabilities to the feature dictionary produced within the classification step.
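The sketch below shows how the PMI score can be computed from such counts; the observation list is a toy stand-in for the lemmas written out during preprocessing.

```python
# Sketch of the lemma-direction PMI score, computed from counts gathered over
# the training components; the toy observations below are illustrative only.
import math
from collections import Counter

# (lemma, direction) observations from components with incoming/outgoing relations
observations = [("because", "outgoing"), ("because", "outgoing"),
                ("therefore", "incoming"), ("because", "incoming")]

joint = Counter(observations)
lemma_counts = Counter(t for t, _ in observations)
dir_counts = Counter(d for _, d in observations)
n = len(observations)

def pmi(lemma, direction):
    p_td = joint[(lemma, direction)] / n
    p_t = lemma_counts[lemma] / n
    p_d = dir_counts[direction] / n
    return math.log(p_td / (p_t * p_d)) if p_td > 0 else float("-inf")

print(pmi("because", "outgoing"))   # > 0: positively associated with outgoing edges
```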
To globally optimize and subsequently revise the outputs of both the base classifiers for component types and relations, we follow the authors’ framework for constructing a joint ILP model that conflates major claims and claims together for linkage with premises. For each component that does not have an outgoing edge in the optimized tree, we revise its label to “Claim.” Whereas the authors use the lpsolve framework in Java, we use Gurobi in Python.
We calculate the weight matrix for the objective function with the same parameters as the authors, using scores computed from the results of the relation identification model. We also follow the exact same constraints documented on page 643 of their paper, except for the following constraint on the transitivity of relations, which uses a binary auxiliary variable b_ij for each pair such that b_ij = 1 means there is a path from component i to component j.
The constraint states that if there is a path from i to j and from j to k, then there must also be a path from i to k. However, when we implemented this expression exactly as printed, it always led to an empty set of relations. With the original inequality, the constraint demands that for each linked pair of components, there must always be a path between them through another component. Conceptually, it is also too restrictive because it makes it illegal for components to not be linked to any other component, i.e., when all three auxiliary values are 0. Instead, we experimented with flipping the inequality and achieved perfect results on the AAE dataset when we fed in the ground-truth relations. Finally, we must also enforce the constraint that claims cannot point to each other in the tree.
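The sketch below shows how a model of this kind can be set up with Gurobi's Python interface, using the flipped transitivity inequality; the weights, the at-most-one-outgoing-edge constraint, and the problem size are illustrative placeholders, and further constraints of the full model (such as the claim-to-claim restriction and the tree-structure requirements) are omitted.

```python
# Illustrative ILP sketch with gurobipy; not the full model described above.
import gurobipy as gp
from gurobipy import GRB

n = 3                      # number of components in the paragraph (toy size)
w = [[0.0, 0.8, 0.1],      # hypothetical relation scores from the base classifiers
     [0.2, 0.0, 0.7],
     [0.1, 0.3, 0.0]]

m = gp.Model("argument_structure")
m.Params.OutputFlag = 0

# x[i, j] = 1 if we predict a directed relation from component i to component j
x = m.addVars(n, n, vtype=GRB.BINARY, name="x")
# b[i, j] = 1 if there is a path from component i to component j
b = m.addVars(n, n, vtype=GRB.BINARY, name="b")

m.setObjective(gp.quicksum(w[i][j] * x[i, j] for i in range(n) for j in range(n)),
               GRB.MAXIMIZE)

for i in range(n):
    m.addConstr(x[i, i] == 0)                                  # no self-loops
    m.addConstr(gp.quicksum(x[i, j] for j in range(n)) <= 1)   # at most one outgoing edge
    for j in range(n):
        m.addConstr(b[i, j] >= x[i, j])                        # a direct edge implies a path
        for k in range(n):
            # transitivity with the flipped inequality discussed above:
            # if b[i,j] = 1 and b[j,k] = 1, then b[i,k] must also be 1
            m.addConstr(b[i, j] + b[j, k] - b[i, k] <= 1)

m.optimize()
edges = [(i, j) for i in range(n) for j in range(n) if x[i, j].X > 0.5]
print(edges)
```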
Stance recognition classifies each claim as either directly supporting the overall essay or supporting it through rebuttal or counterargument. Stab and Gurevych reuse nearly all features extracted for component classification except for the contextual, indicator, and probability features, and they add sentiment features, which include counts of negative, positive, and neutral words, the difference between the positive and negative counts, and production rules. They also use five sentiment scores from the Stanford sentiment analyzer: one each for strong negative, weak negative, neutral, weak positive, and strong positive sentiment in a sentence. These scores are inaccessible through CoreNLP's Python interface, so we needed to calculate them ourselves. We did this by finding the ratio of each sentiment category throughout the covering sentence's sentiment-annotated constituency tree.
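A small sketch of this ratio computation is given below; it assumes the sentiment-annotated constituency tree has already been converted into nested (label, children) tuples, since the exact access path through CoreNLP's interface is not shown here.

```python
# Sketch of the sentence-level sentiment ratios.  We assume the sentiment-
# annotated constituency tree has been converted into nested (label, children)
# tuples; the toy tree below is illustrative.
CATEGORIES = ["Very negative", "Negative", "Neutral", "Positive", "Very positive"]

def count_labels(node, counts):
    label, children = node
    counts[label] = counts.get(label, 0) + 1
    for child in children:
        count_labels(child, counts)
    return counts

def sentiment_ratios(tree):
    counts = count_labels(tree, {})
    total = sum(counts.values())
    return [counts.get(c, 0) / total for c in CATEGORIES]

toy_tree = ("Negative", [("Neutral", []), ("Negative", [("Very negative", []), ("Neutral", [])])])
print(sentiment_ratios(toy_tree))   # one ratio per sentiment category
```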
Argument Identification
The dataset provided by Stab and Gurevych is heavily imbalanced, which does not come as a surprise. Based on the nature of the labels, there should be very few Arg-B labels, since there is only one per argument component. Unsurprisingly, Arg-I is the most represented label, since most sentences in a persuasive essay serve as part of some argument, which also means that few tokens are labeled O, as there is little room for non-argumentative tokens. We observe that our results are strikingly similar to those of Stab and Gurevych.
Argument Classification
While our results for Major Claim and Premise are in line with those of Stab and Gurevych, our performance on classifying Claims leaves a lot to be desired. Based on the confusion matrix below, the biggest source of confusion was between Claims and Premises, which the annotators of the Stab and Gurevych dataset also mention as a point of conflict. It is certainly interesting that our model experiences similar difficulties differentiating between Claims and Premises, and it is also probable that the imbalanced nature of the dataset played a role, since there are more Premises than Claims or Major Claims.
Argument Relation Identification
We find that both of our F1 scores for linked and not-linked component pairs are unfortunately much lower than those of the authors' SVM that excludes lexical features. We followed the authors' feature generation process exactly for this step, so we are not certain what caused this discrepancy in performance. Fortunately, the results from this base classifier are not the final relations of the argument structures. Instead, we rely on the results of ILP, which we detail in the next section.
Optimization using Integer Linear Programming (ILP)
Following the authors' approach, we also revise the predicted components to only two categories, claim and premise, based on the final trees. Only nodes with no outgoing edges are labeled as claims, and all the rest are premises. In total, our ILP model revised 268 claims to premises but no premises to claims. Just like the authors, we observe that the identification of linked pairs improves significantly with the application of ILP compared to the base classifier, whereas the improvement for unlinked pairs is less dramatic. We find that we actually slightly outperform the authors in both categories for relation identification. However, our results for component revision are less fortunate, as we lag behind the authors' assessment scores for both component types, especially premises.
Stance Recognition
We trained an SVM on all features using training data that we balanced, due to the overwhelming number of "For" relations in Stab and Gurevych's dataset, and evaluated our model on test data that we left as is. We find that our results closely reflect those of Stab and Gurevych, as our model's performance at identifying "for" relations was significantly better than at identifying "against" relations.