This test case simulates the involuntary disruptions that frequently occur in the flow of spoken language. We induce disfluencies and filler words in the text to model this scenario. To introduce disfluency, we repeat a few words at the start of every alternate sentence in a paragraph; for example, "I like apples" becomes "I ... I like apples". To introduce filler words, we compile a list of common filler words and insert them randomly into sentences according to criterion c1; for example, "I would like to tell you a story!" becomes "Well...I would like to tell you hmm...a story!"
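The sketch below shows one way such perturbations might be implemented. The filler list, the repetition of only the single leading word, and the insertion probability p are all illustrative stand-ins for the paper's footnoted filler list and criterion c1, not the authors' exact procedure.

```python
import random

# Assumed filler list; the paper references a footnoted list of common fillers.
FILLERS = ["well...", "hmm...", "umm...", "you know..."]

def add_disfluency(sentence: str) -> str:
    """Repeat the leading word: 'I like apples' -> 'I ... I like apples'."""
    words = sentence.split()
    if not words:
        return sentence
    return words[0] + " ... " + sentence

def add_fillers(sentence: str, p: float = 0.2) -> str:
    """Insert a random filler before each word with probability p
    (p is a stand-in for the paper's criterion c1)."""
    out = []
    for w in sentence.split():
        if random.random() < p:
            out.append(random.choice(FILLERS))
        out.append(w)
    return " ".join(out)

def perturb_paragraph(sentences):
    """Apply disfluency to every alternate sentence; sprinkle fillers throughout."""
    return [add_disfluency(add_fillers(s)) if i % 2 == 0 else add_fillers(s)
            for i, s in enumerate(sentences)]

print(perturb_paragraph(["I like apples.", "I would like to tell you a story!"]))
```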
The model is based on handcrafted features and regression methods. The extracted features fall into four classes: length-based features, part-of-speech (POS) features, word overlap with the prompt, and bag of n-grams. After extracting these features, a regression algorithm is used to build a model from the training data. We use support vector regression (SVR) as our regression algorithm. SVR is a type of support vector machine that supports both linear and non-linear regression. The regression problem is to find a function that approximates the mapping from an input domain to real numbers on the basis of a training sample. The main aim is to fit a decision boundary at a fixed distance from the original hyperplane such that the data points closest to the hyperplane, the support vectors, lie within that boundary.

This model uses NLTK for POS tagging and stemming, aspell or pyspellchecker for spellchecking, and WordNet to extract synonyms. Correct POS tags are generated from grammatically correct text that was provided, or can alternatively be generated with a pretrained grammatical error correction model. POS tag sequences not included in the correct set are considered bad POS. We use the Python library scikit-learn to extract unigram and bigram features and to implement the regression-based methods.
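As a minimal sketch of the SVR step, the snippet below fits scikit-learn's SVR on a placeholder feature matrix. The feature values, the score scale, and the hyperparameters are illustrative assumptions, not the settings reported in the paper.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder data: each row is one essay's handcrafted feature vector
# (length-based, POS, prompt-overlap, and n-gram counts concatenated).
rng = np.random.default_rng(0)
X_train = rng.random((100, 20))          # 100 essays, 20 features (illustrative)
y_train = rng.uniform(0, 12, size=100)   # human scores, e.g. on a 0-12 scale

# RBF-kernel SVR; these hyperparameters are defaults, not the paper's tuned values.
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
model.fit(X_train, y_train)
print(model.predict(X_train[:3]))        # predicted scores for three essays
```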
CountVectorizer is used to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary. They use CountVectorizer to build two dictionaries: one from the raw text (unstemmed words and bigrams) and another from a stemmed, spell-corrected vocabulary restricted to useful words/n-grams. One can also use TfidfVectorizer, which accounts for term frequency (how often a given word appears in a document) and inverse document frequency (down-scaling words that appear across many documents), or HashingVectorizer, which applies a one-way hash to convert words to integers without building a vocabulary (efficient for large documents and vocabularies). Length-based features are also extracted from the essays: essay length, word count, comma count, apostrophe count, punctuation count, and characters per word. As listed in the table above, prompt-based features are generated by taking the overlap with the question prompt, i.e., counting the words (or synonyms of words) from the essay that also appear in the prompt.
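A small sketch of the two-dictionary setup follows, assuming NLTK's PorterStemmer and pyspellchecker for the cleaned vocabulary; the sample essays and the simple whitespace tokenization are illustrative, not the paper's preprocessing.

```python
from sklearn.feature_extraction.text import CountVectorizer
from nltk.stem import PorterStemmer
from spellchecker import SpellChecker

essays = ["I like apples and oranges", "Aples are my favorite fruit"]

stemmer = PorterStemmer()
spell = SpellChecker()

def normalize(text):
    """Spell-correct then stem each token for the cleaned vocabulary."""
    tokens = text.lower().split()
    corrected = [spell.correction(t) or t for t in tokens]
    return " ".join(stemmer.stem(t) for t in corrected)

# Dictionary 1: raw (unstemmed) unigrams and bigrams
raw_vec = CountVectorizer(ngram_range=(1, 2))
X_raw = raw_vec.fit_transform(essays)

# Dictionary 2: stemmed, spell-corrected unigrams and bigrams
clean_vec = CountVectorizer(ngram_range=(1, 2))
X_clean = clean_vec.fit_transform([normalize(e) for e in essays])

print(raw_vec.get_feature_names_out())
print(clean_vec.get_feature_names_out())
```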
This model takes an essay directly as input and automatically learns features from it. For this, we implemented a recurrent neural network based architecture to grade the essays in an end-to-end manner. Recurrent neural networks have been among the most successful machine learning models for many tasks, especially in the NLP domain. They are considered more powerful than feed-forward neural networks because they can learn complex patterns from sequential data.
Ideally, the RNN should encode all the information and features required to grade an essay. However, since essays are usually long, on the order of hundreds of words, a single learnt vector representation may not be sufficient for accurate scoring. For this reason, they preserve all the intermediate states of the recurrent layer to keep track of the important bits of information encountered while processing the essay. They experimented with long short-term memory (LSTM) units to identify the best choice for this task.
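A minimal PyTorch sketch of this idea: instead of scoring from the final hidden state alone, all intermediate LSTM states are averaged (mean-over-time) before the regression layer. The vocabulary size, dimensions, and sigmoid-normalized score are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class LSTMScorer(nn.Module):
    """Score an essay with an LSTM, averaging all intermediate hidden
    states (mean-over-time) rather than using only the final state."""
    def __init__(self, vocab_size=10000, embed_dim=50, hidden_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 1)

    def forward(self, token_ids):                    # (batch, seq_len)
        h_all, _ = self.lstm(self.embed(token_ids))  # (batch, seq_len, hidden)
        pooled = h_all.mean(dim=1)                   # keep every timestep's info
        return torch.sigmoid(self.out(pooled)).squeeze(-1)

model = LSTMScorer()
dummy_essays = torch.randint(0, 10000, (2, 120))  # two essays, 120 tokens each
print(model(dummy_essays))                        # scores normalized to (0, 1)
```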
To enhance the coherence and relatedness that the model can infer, two design goals are kept in mind:
1. To alleviate the inability of the current neural network architecture to model the flow of the essay, the coherence of the text, and semantic relatedness over the course of the essay.
2. To ease the burden on the recurrent neural network based model.
To capture the semantic relationship between different points of an essay, the model reads the essay using a neural tensor layer. Multiple semantic-relatedness features are aggregated across the essay and used as features for prediction. The semantic relationship between these points is important because it is an indicator of writing flow and textual coherence. The auxiliary features aim to capture the logical and semantic flow of the essay. Secondly, the additional parameters of the external tensor serve as an auxiliary memory for the network, which helps improve the performance of the deep architecture. Overall, the proposed architecture performs sentence modelling and semantic matching in a single end-to-end network.
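The following is a hedged sketch of how such a neural tensor layer could compare hidden states sampled at fixed intervals across an essay; the slice count k, the sampling interval delta, and all dimensions are illustrative assumptions rather than the published architecture.

```python
import torch
import torch.nn as nn

class NeuralTensorLayer(nn.Module):
    """Bilinear tensor similarity between two hidden states; each of the
    k tensor slices yields one relatedness score for the pair."""
    def __init__(self, hidden_dim=300, k=6):
        super().__init__()
        self.M = nn.Parameter(torch.randn(k, hidden_dim, hidden_dim) * 0.01)
        self.V = nn.Linear(2 * hidden_dim, k)

    def forward(self, h_i, h_j):                     # each (batch, hidden)
        # Bilinear term: one score per tensor slice
        bilinear = torch.einsum("bd,kde,be->bk", h_i, self.M, h_j)
        linear = self.V(torch.cat([h_i, h_j], dim=-1))
        return torch.sigmoid(bilinear + linear)      # (batch, k) relatedness

# Compare hidden states sampled at a fixed interval across the essay, then
# concatenate the similarity vectors as auxiliary coherence features.
ntl = NeuralTensorLayer()
h = torch.randn(2, 120, 300)                         # LSTM outputs for 2 essays
delta = 40                                           # illustrative interval
feats = [ntl(h[:, t], h[:, t + delta]) for t in range(0, 120 - delta, delta)]
coherence_features = torch.cat(feats, dim=-1)
print(coherence_features.shape)
```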
Red indicates an increase in the adversarial score compared to the original.
Blue indicates no change in the adversarial score compared to the original.
Green indicates a decrease in the adversarial score compared to the original.
We have randomly picked a few samples from the majority of test cases across different prompts to give a clear idea of how the adversarial samples are constructed and how scoring is done. For detailed results, with a variety of metrics and all the prompts, please refer to the paper.