This test case simulates the involuntary disruptions that frequently occur in the flow of spoken language. We induce disfluencies and filler words in the text to model this scenario. To introduce disfluency, we repeat a few words at the start of every alternate sentence in a paragraph; for example, "I like apples" becomes "I ... I like apples". To introduce filler words, we compile a list of common filler words and insert them randomly into sentences according to criterion c1; for example, "I would like to tell you a story!" becomes "Well...I would like to tell you hmm...a story!"
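The sketch below shows one way such perturbations might be implemented. The filler list, the repetition of only the single leading word, and the insertion probability p are all illustrative stand-ins for the paper's footnoted filler list and criterion c1, not the authors' exact procedure.

```python
import random

# Assumed filler list; the paper references a footnoted list of common fillers.
FILLERS = ["well...", "hmm...", "umm...", "you know..."]

def add_disfluency(sentence: str) -> str:
    """Repeat the leading word: 'I like apples' -> 'I ... I like apples'."""
    words = sentence.split()
    if not words:
        return sentence
    return words[0] + " ... " + sentence

def add_fillers(sentence: str, p: float = 0.2) -> str:
    """Insert a random filler before each word with probability p
    (p is a stand-in for the paper's criterion c1)."""
    out = []
    for w in sentence.split():
        if random.random() < p:
            out.append(random.choice(FILLERS))
        out.append(w)
    return " ".join(out)

def perturb_paragraph(sentences):
    """Apply disfluency to every alternate sentence; sprinkle fillers throughout."""
    return [add_disfluency(add_fillers(s)) if i % 2 == 0 else add_fillers(s)
            for i, s in enumerate(sentences)]

print(perturb_paragraph(["I like apples.", "I would like to tell you a story!"]))
```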
The model is based on handcrafted features and regression methods. The extracted features fall into four classes: length-based features, part-of-speech (POS) features, word overlap with the prompt, and bag of n-grams. After extracting these features, a regression algorithm is used to build a model from the training data. We use support vector regression (SVR) as our regression algorithm. SVR is a type of support vector machine that supports both linear and non-linear regression. The regression problem is to find a function that approximates the mapping from an input domain to real numbers on the basis of a training sample. The main aim is to fit a decision boundary at a fixed distance from the original hyperplane such that the data points closest to the hyperplane, the support vectors, lie within that boundary.

This model uses NLTK for POS tagging and stemming, aspell or pyspellchecker for spellchecking, and WordNet to extract synonyms. Correct POS tags are generated from grammatically correct text that was provided, or can alternatively be generated with a pretrained grammatical error correction model. POS tag sequences not included in the correct set are considered bad POS. We use the Python library scikit-learn to extract unigram and bigram features and to implement the regression-based methods.
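As a minimal sketch of the SVR step, the snippet below fits scikit-learn's SVR on a placeholder feature matrix. The feature values, the score scale, and the hyperparameters are illustrative assumptions, not the settings reported in the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder data: each row is one essay's handcrafted feature vector
# (length-based, POS, prompt-overlap, and n-gram counts concatenated).
rng = np.random.default_rng(0)
X_train = rng.random((100, 20))          # 100 essays, 20 features (illustrative)
y_train = rng.uniform(0, 12, size=100)   # human scores, e.g. on a 0-12 scale

# RBF-kernel SVR; these hyperparameters are defaults, not the paper's tuned values.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X_train, y_train)
print(model.predict(X_train[:3]))        # predicted scores for three essays
```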
CountVectorizer is used to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary. They use CountVectorizer to build two dictionaries: one from the raw text (unstemmed words and bigrams) and another from a stemmed, spell-corrected vocabulary restricted to useful words/n-grams. One can also use TfidfVectorizer, which accounts for term frequency (how often a given word appears in a document) and inverse document frequency (down-scaling words that appear across many documents), or HashingVectorizer, which applies a one-way hash to convert words to integers without building a vocabulary (efficient for large documents and vocabularies). Length-based features are also extracted from the essays: essay length, word count, comma count, apostrophe count, punctuation count, and characters per word. As listed in the table above, prompt-based features are generated by taking the overlap with the question prompt, i.e., counting the words (or synonyms of words) from the essay that also appear in the prompt.
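A small sketch of the two-dictionary setup follows, assuming NLTK's PorterStemmer and pyspellchecker for the cleaned vocabulary; the sample essays and the simple whitespace tokenization are illustrative, not the paper's preprocessing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from spellchecker import SpellChecker

essays = ["I like apples and oranges", "Aples are my favorite fruit"]

stemmer = PorterStemmer()
spell = SpellChecker()

def normalize(text):
    """Spell-correct then stem each token for the cleaned vocabulary."""
    tokens = text.lower().split()
    corrected = [spell.correction(t) or t for t in tokens]
    return " ".join(stemmer.stem(t) for t in corrected)

# Dictionary 1: raw (unstemmed) unigrams and bigrams
raw_vec = CountVectorizer(ngram_range=(1, 2))
X_raw = raw_vec.fit_transform(essays)

# Dictionary 2: stemmed, spell-corrected unigrams and bigrams
clean_vec = CountVectorizer(ngram_range=(1, 2))
X_clean = clean_vec.fit_transform([normalize(e) for e in essays])

print(raw_vec.get_feature_names_out())
print(clean_vec.get_feature_names_out())
```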
This model takes an essay directly as input and automatically learns features from it. For this, we implemented a recurrent neural network based architecture to grade the essays in an end-to-end manner. Recurrent neural networks have been among the most successful machine learning models for many tasks, especially in the NLP domain. They are considered more powerful than feed-forward neural networks because they can learn complex patterns from sequential data.
Ideally, the RNN should encode all the information and features required to grade an essay. However, since essays are usually long, on the order of hundreds of words, a single learnt vector representation may not be sufficient for accurate scoring. For this reason, they preserve all the intermediate states of the recurrent layer to keep track of the important bits of information encountered while processing the essay. They experimented with long short-term memory (LSTM) units to identify the best choice for this task.
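A minimal PyTorch sketch of this idea: instead of scoring from the final hidden state alone, all intermediate LSTM states are averaged (mean-over-time) before the regression layer. The vocabulary size, dimensions, and sigmoid-normalized score are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LSTMScorer(nn.Module):
    """Score an essay with an LSTM, averaging all intermediate hidden
    states (mean-over-time) rather than using only the final state."""
    def __init__(self, vocab_size=10000, embed_dim=50, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):                    # (batch, seq_len)
        h_all, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, hidden)
        pooled = h_all.mean(dim=1)                   # keep every timestep's info
        return torch.sigmoid(self.out(pooled)).squeeze(-1)

model = LSTMScorer()
dummy_essays = torch.randint(0, 10000, (2, 120))  # two essays, 120 tokens each
print(model(dummy_essays))                        # scores normalized to (0, 1)
```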
To enhance the coherence and relatedness that the model can infer, two design goals are kept in mind:
1. To alleviate the inability of the current neural network architecture to model the flow of the essay, the coherence of the text, and semantic relatedness over the course of the essay.
2. To ease the burden on the recurrent neural network based model.
To capture the semantic relationship between different points of an essay, the model reads the essay using a neural tensor layer. Multiple semantic-relatedness features are aggregated across the essay and used as features for prediction. The semantic relationship between these points is important because it is an indicator of writing flow and textual coherence. The auxiliary features aim to capture the logical and semantic flow of the essay. Secondly, the additional parameters of the external tensor serve as an auxiliary memory for the network, which helps improve the performance of the deep architecture. Overall, the proposed architecture performs sentence modelling and semantic matching in a single end-to-end network.
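The following is a hedged sketch of how such a neural tensor layer could compare hidden states sampled at fixed intervals across an essay; the slice count k, the sampling interval delta, and all dimensions are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class NeuralTensorLayer(nn.Module):
    """Bilinear tensor similarity between two hidden states; each of the
    k tensor slices yields one relatedness score for the pair."""
    def __init__(self, hidden_dim=300, k=6):
        super().__init__()
        self.M = nn.Parameter(torch.randn(k, hidden_dim, hidden_dim) * 0.01)
        self.V = nn.Linear(2 * hidden_dim, k)

    def forward(self, h_i, h_j):                     # each (batch, hidden)
        # Bilinear term: one score per tensor slice
        bilinear = torch.einsum("bd,kde,be->bk", h_i, self.M, h_j)
        linear = self.V(torch.cat([h_i, h_j], dim=-1))
        return torch.sigmoid(bilinear + linear)      # (batch, k) relatedness

# Compare hidden states sampled at a fixed interval across the essay, then
# concatenate the similarity vectors as auxiliary coherence features.
ntl = NeuralTensorLayer()
h = torch.randn(2, 120, 300)                         # LSTM outputs for 2 essays
delta = 40                                           # illustrative interval
feats = [ntl(h[:, t], h[:, t + delta]) for t in range(0, 120 - delta, delta)]
coherence_features = torch.cat(feats, dim=-1)
print(coherence_features.shape)
```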
Red indicates an increase in the adversarial score compared to the original.
Blue indicates no change in the adversarial score compared to the original.
Green indicates a decrease in the adversarial score compared to the original.
We have randomly picked a few samples from the majority of test cases across different prompts to give a clear idea of how the adversarial samples are constructed and how scoring is done. For detailed results, with a variety of metrics and all the prompts, please refer to the paper.