As promising as this technology seems, I have serious reservations about the proliferation of AES systems and algorithms for evaluating high-stakes written examinations. Klebanov and Madnani list several stakeholders, such as students taking the GRE or TOEFL, migrants whose visa applications may be processed by AES systems, and teachers looking to use AES in the classroom as a learning tool, all of whom stand to benefit from the scalability, reliability, and consistency of a well-trained AES system. That benefit, however, hinges on the system actually being well trained. I argue that AES systems are not yet proficient enough to be used in an unbiased manner and ought to be limited to serving as a learning resource and tool rather than the final arbiter of an essay score.
In our exploration of argument parsing, we worked with human-annotated datasets and encountered several issues with them. For example, Stab and Gurevych, on whose work we based our replication study, ran into confusion between claims and premises and had to revise their rubric, showing how fundamental a rubric is to the quality and reliability of annotations and how bias and skewed datasets can produce undesirable results. In our implementation, we observed similarly skewed predictions: our trained SVM model could distinguish Major Claims and Premises but was severely lacking in its ability to predict Claims. Relation identification and stance classification were also skewed, though it is worth noting that these datasets were originally quite unbalanced, and only after balancing our data were we able to achieve satisfactory results. Stance classification presented a unique challenge. Because most writing consists of arguments that directly support the stance of the essay, with counterarguments and rebuttals rarely employed, our model initially optimized toward labeling everything as a directly supporting claim; even after balancing the dataset, the model remained better at identifying supporting claims than counterarguments. Ideally, an AES system should be able to adapt to all forms of argumentation, and the fact that our implementation, as well as that of Stab and Gurevych, encountered serious issues related to annotator and sampling bias should be alarming for stakeholders whose lives may end up depending on a potentially biased algorithmic system.
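To make the class-imbalance problem concrete, the following is a minimal sketch, not our exact pipeline, of how an SVM component classifier can be trained with class weighting so that the rarer Claim label is not drowned out by Premises. It assumes scikit-learn, and the toy texts and labels are hypothetical stand-ins for spans from an argument-annotated corpus such as Stab and Gurevych's.

```python
# Sketch: SVM labeling of argument components with class weighting to offset skew.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

# Hypothetical toy data; real training would use annotated essay spans.
texts = [
    "Schools should ban junk food entirely.",           # MajorClaim
    "Junk food contributes to childhood obesity.",      # Premise
    "Banning it would improve student health.",         # Claim
    "Some students would simply buy snacks elsewhere.", # Claim
]
labels = ["MajorClaim", "Premise", "Claim", "Claim"]

# class_weight="balanced" reweights each class inversely to its frequency,
# one simple counterpart to the dataset balancing described above.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LinearSVC(class_weight="balanced"),
)
model.fit(texts, labels)
print(classification_report(labels, model.predict(texts), zero_division=0))
```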
Sources that build on Stab and Gurevych, such as Marro et al., also ran into issues with rubrics and scoring. Specifically, Marro et al. encountered differences in severity between annotators when assigning higher or lower scores for a given feature (i.e., cogency, reasonableness, and rhetoric). Most modern AES systems are developed with some form of supervised learning, meaning that human annotations are key to their success, and evaluating reliability and bias in these annotations ought to be a necessary step for any AES-related work. Artstein and Poesio define the commonly used coefficients for assessing the quality of annotations, with unweighted coefficients including Scott's pi, Cohen's kappa, and Fleiss' kappa, and weighted coefficients including Krippendorff's alpha, Krippendorff's alpha_u, and Cohen's weighted kappa. As an example of these coefficients in action, Marro et al. had an interesting finding with their annotations, for which they used Fleiss' kappa. Their annotations on cogency, reasonableness, and rhetoric were based on the scale [0, 10, 15, 20, 25]. They write, “Despite this substantial agreement, an issue for the annotators was the difficulty to opt for a precise score, like 25 or 20” (4187). As a result, the authors grouped together 10 and 15 as well as 20 and 25, given the general difficulty of deciding whether an essay deserved a 25 or a 20.
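The sketch below, using hypothetical ratings rather than Marro et al.'s actual data, illustrates how collapsing adjacent score categories of this kind can raise Fleiss' kappa. It assumes the statsmodels implementation of the coefficient.

```python
# Sketch: effect of collapsing adjacent score categories on Fleiss' kappa.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: rows are essays, columns are three annotators,
# scores drawn from the scale [0, 10, 15, 20, 25].
ratings = np.array([
    [25, 20, 25],
    [20, 25, 20],
    [15, 10, 15],
    [10, 15, 10],
    [ 0,  0,  0],
])

def kappa(r):
    table, _ = aggregate_raters(r)  # subjects x categories count table
    return fleiss_kappa(table)

print("kappa on the full scale:", round(kappa(ratings), 3))

# Collapse 10 with 15 and 20 with 25, mirroring the grouping described above.
collapse = {0: 0, 10: 1, 15: 1, 20: 2, 25: 2}
collapsed = np.vectorize(collapse.get)(ratings)
print("kappa after collapsing:", round(kappa(collapsed), 3))
```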
While argument parsing and AES in general are still works in progress and should not be used to arbitrate significant writing samples, they are nonetheless valuable learning tools and may provide insights into essay writing. For example, Wambsganss et al. used the work of Stab and Gurevych to parse argumentation structures and provide individualized suggestions to improve writing quality. Specifically, their proposed argumentation learning tool parses a piece of writing and scores it on readability, using the Flesch Reading Ease score; coherence, measured as the proportion of sentences connected by a discourse marker; and persuasiveness, measured as the proportion of claims that are well supported. Persuasiveness and coherence clearly benefit from the argumentation graph structure, in which relation identification indicates whether premises support claims. The authors note that the technology was not only well received but also produced better writing quality within the experiment group, highlighting the usefulness of AES.
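As a rough illustration of these three dimensions, here is a simplified sketch of each metric as described above. This is not Wambsganss et al.'s code; the discourse-marker list and the syllable heuristic are assumptions for illustration only.

```python
# Sketch: simplified readability, coherence, and persuasiveness metrics.
import re

# Assumed, non-exhaustive list of discourse markers.
DISCOURSE_MARKERS = {"therefore", "however", "because", "moreover", "thus", "consequently"}

def flesch_reading_ease(text: str) -> float:
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    # Crude syllable estimate: count vowel groups per word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    return 206.835 - 1.015 * (len(words) / len(sentences)) - 84.6 * (syllables / len(words))

def coherence(sentences: list[str]) -> float:
    # Proportion of sentences opening with a discourse marker.
    marked = sum(1 for s in sentences if s.split()[0].lower().strip(",") in DISCOURSE_MARKERS)
    return marked / len(sentences)

def persuasiveness(claims: list[str], support: dict[str, list[str]]) -> float:
    # support maps each claim to the premises linked to it by relation identification.
    return sum(1 for c in claims if support.get(c)) / len(claims)
```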
As long as argumentation-centric essay evaluation models provide formative feedback and are trained upon annotated datasets that are reliable and fair, they should be widely incorporated into classroom settings as a tool to improve students’ writing skills and assist educators with grading and curricular design. I first advocate for the concrete advantages of applying such systems before highlighting evidence-based methods to ensure high-quality human annotations for the models’ training data.
For students, research has shown that the feedback generated by an automated argument assessment tool leads to an observable improvement in the quality of their persuasive writing when this feedback is formative, meaning continuous and individualized throughout the writing process. Wambsganss et al. created a tool called Argumentation Learning (AL) that identifies argument structures using the method designed by Stab and Gurevych (hereafter S&G), which we replicate in our project, and predicts scores for essay persuasiveness, coherence, and readability. In an empirical study, they found that participants using AL produced more compelling arguments, featuring a higher degree of formality and persuasiveness (which they define as the proportion of claims supported by evidence), than the control group without AL. Drawing on cognitive psychology, the authors theorize that a system providing formative feedback on argumentation quality creates cognitive dissonance between the student's perceived and actual writing ability, which motivates them to improve. Without automated tools, it is challenging for educators to provide such formative feedback to all students in large classroom settings due to temporal and pedagogical limitations. On the other hand, Conijn et al. observe in their study that although individuals' trust in an AES system varied with the numerical grades assigned to their argumentative writing, this trust was not influenced by the presence of explanatory feedback at all. However, a key limitation of their system was its inability to provide local commentary on specific aspects of an essay, which AL does. Thus, we should encourage broader implementation and adoption of systems that increase the quality and accessibility of writing education by providing formative feedback with local, representational guidance on essays.
Such systems may also aid pedagogy. As Klebanov and Madnani (hereafter K&M) note in their comprehensive, book-length literature review, argument-based AES systems can measure the extent of structural deficiencies in students' writing. In creating the Argument Annotated Essays dataset, S&G find that of the more than one thousand individual arguments (a major claim supported by at least one premise) across four hundred essays, only 140 argument structures feature any attack relations indicating the presence of a rebuttal or counterargument. This deficiency made it very difficult for Marro et al. to assess the reasonableness of students' arguments in their study building on S&G's dataset, because their metric for this dimension depends on the cogency of students' rebuttals. In fact, there is a general consensus in the literature that students neglect to address different or opposing perspectives in their writing, which may be indicative of a critical weakness in strategic decision-making and problem-solving. We live in an age of information in which higher-order thinking skills matter far more than the ability to simply regurgitate knowledge from memory, which is why we need efficient and effective educational tools that leverage argumentation mining to help students with dialectic and critical thinking. As we show in our demo, argument-based scoring models can also identify poor-quality essays that contain many unsupported claims, which means educators can easily identify students who need more attention on improving their writing. Thus, these models may provide instructors valuable direction in revising and planning curricula for writing education.
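The sketch below shows, under stated assumptions rather than as any published system's logic, how a parsed argument graph could be used to triage essays for an instructor: essays with many unsupported claims or no attack relation at all are flagged for attention. The `Relation` structure and the `max_unsupported` threshold are hypothetical.

```python
# Sketch: flag essays with many unsupported claims or no counterargument.
from dataclasses import dataclass

@dataclass
class Relation:
    source: str  # premise id
    target: str  # claim id
    kind: str    # "support" or "attack"

def triage(claims: list[str], relations: list[Relation], max_unsupported: int = 2) -> dict:
    supported = {r.target for r in relations if r.kind == "support"}
    unsupported = [c for c in claims if c not in supported]
    has_rebuttal = any(r.kind == "attack" for r in relations)
    return {
        "unsupported_claims": unsupported,
        "has_rebuttal": has_rebuttal,
        "needs_attention": len(unsupported) > max_unsupported or not has_rebuttal,
    }

# Example: two claims, only one supported, no attack relation -> flagged.
print(triage(["c1", "c2"], [Relation("p1", "c1", "support")]))
```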
For scores and formative feedback to be accurate, we must ensure that these models are trained on reliable human annotations, which is a significant challenge because raters are susceptible to fatigue, inexperience, time-dependent effects, and differing linguistic backgrounds. Rater severity is a rater's overall tendency to score harshly or leniently irrespective of the specific essay, whereas rater bias is the extent to which a rater assigns higher or lower scores for construct-irrelevant reasons. For automated essay scoring, the construct of a test is the specific skill or ability it is designed to measure, and fairness is defined as the extent to which a test measures the same construct for all test-takers (K&M). One alarming construct-irrelevant factor that threatens fairness is the time of grading: raters' severity increases significantly over the course of a week of annotation, and these temporal effects are stubbornly resilient to intervention (Congdon and McQueen). In such cases, the training data would be annotated in a way that unfairly favors essays graded on an earlier day, which may lead the model to correlate construct-irrelevant features learned from those essays with higher scores. Because such fluctuations in severity do not occur within the span of a single day, it is vital that researchers limit essay scoring and annotation tasks to as few days as possible for greater consistency.
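As one simple way to monitor this effect, the following hypothetical check (not Congdon and McQueen's analysis) compares each rater's mean score per annotation day; a steady decline would suggest increasing severity over time. The annotation log is invented for illustration and assumes pandas.

```python
# Sketch: checking for rater severity drift across annotation days.
import pandas as pd

# Hypothetical annotation log with one row per scored essay.
log = pd.DataFrame({
    "rater": ["A"] * 6 + ["B"] * 6,
    "day":   [1, 1, 2, 2, 3, 3] * 2,
    "score": [4.2, 4.0, 3.8, 3.7, 3.4, 3.3,
              4.0, 3.9, 3.6, 3.6, 3.2, 3.1],
})

# Mean score per rater per day; falling means across days indicate drift.
drift = log.groupby(["rater", "day"])["score"].mean().unstack("day")
print(drift)
```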
There are also rubric- and experience-related rater effects, as a trade-off exists between the granularity or sophistication of the rubric and the reproducibility of annotators' decisions. Fine-grained, nuanced, or high-level judgments made by expert annotators may be harder to replicate across studies and domains than the decisions made by naïve raters who adhere strictly to a written set of guidelines. Moreover, the more complex or numerous the categories of the annotation task, the higher the proportion of disagreement as confusion increases. In fact, S&G found that their annotators showed the most confusion over whether a component is a claim or a premise. Likewise, Marro et al. observed their raters having a difficult time opting for precise scores in certain ranges, so they collapsed certain categories to increase agreement between annotators. Besides rubric modification, another effective strategy to counteract individual rater effects is to increase the number of raters. Annotator bias is the difference between a multi-distribution agreement coefficient, which attempts to account for individual rater habits, and a single-distribution one, which ignores individual bias (Artstein and Poesio, 2008). Importantly, the difference between these coefficients shrinks as the number of raters increases, meaning that individual biases and severity have less impact on the final scores. It is thus advantageous that the performance of AES systems is comparable to that of a group of expert raters, so machine-scored outputs can be more consistent than individual human ratings (Wachsmuth and Werner).
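To make the multi- versus single-distribution distinction concrete, here is a small two-rater sketch: Cohen's kappa computes expected agreement from each rater's own label distribution, Scott's pi uses the pooled distribution, and the gap between them reflects individual annotator bias in the sense described above. The labels are hypothetical, and this is an illustration of the definitions rather than Artstein and Poesio's own code.

```python
# Sketch: Cohen's kappa (multi-distribution) vs. Scott's pi (single-distribution).
from collections import Counter

def agreement_coefficients(r1: list[str], r2: list[str]):
    n = len(r1)
    a_obs = sum(x == y for x, y in zip(r1, r2)) / n
    p1, p2 = Counter(r1), Counter(r2)
    pooled = Counter(r1 + r2)
    # Expected agreement: per-rater distributions (kappa) vs. pooled (pi).
    ae_kappa = sum((p1[k] / n) * (p2[k] / n) for k in pooled)
    ae_pi = sum((pooled[k] / (2 * n)) ** 2 for k in pooled)
    kappa = (a_obs - ae_kappa) / (1 - ae_kappa)
    pi = (a_obs - ae_pi) / (1 - ae_pi)
    return kappa, pi, kappa - pi  # the last value reflects annotator bias

# Hypothetical component labels from two raters with different labeling habits.
r1 = ["Claim", "Premise", "Premise", "Claim", "Premise", "Premise"]
r2 = ["Claim", "Claim",   "Premise", "Claim", "Claim",   "Premise"]
print(agreement_coefficients(r1, r2))
```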
In essence, dimension-specific AES systems can feasibly and effectively contribute to writing education if and only if they are designed to provide detailed formative feedback and are trained on human annotations that minimize inconsistency and bias by having multiple raters, a narrow time frame of annotation, and a well-designed rubric that minimizes confusion. While there are admittedly prominent concerns regarding their usefulness, there are efficacious methods to limit issues of reliability and design to make them powerful tools in classrooms and beyond.
Automated Essay Scoring (AES) systems have the potential to alleviate the burden on graders and enhance access to high-quality feedback. However, they should be employed only to a certain extent and should never fully replace expert human graders due to the potential for bias, which can lead to inaccurate results.
The implementation of AES systems, whether partial or complete, can significantly reduce the workload and stress on human graders in various sectors, including educators and admissions officers in academic institutions, as well as recruiters assessing job applicants' cover letters. As highlighted by Klebanov et al. (2022), AES has been integrated into the grading processes of prominent writing examinations like Pearson's Test of English, GRE, and TOEFL since the 1990s, aiming to ease the load on graders who must review a vast number of essays. This approach not only benefits educational assessments but also plays a crucial role in college admissions and hiring processes, lessening rater fatigue and enabling better-informed decisions that can profoundly affect individuals' lives.
Moreover, AES serves as a valuable tool in improving students' writing skills by functioning as a personalized tutor, granting students greater access to tailored feedback. As described by Klebanov et al. (2022), educators often struggle to provide consistently detailed and personalized feedback for every student on each assignment. The scalability of AES allows it to provide detailed, helpful feedback to every student in need. Additionally, as demonstrated by Stab et al. (2017), who emphasized its ability to pinpoint weaker essays, AES can make it easier for educators to direct their attention to students in need of further assistance. The integration of AES in education promotes the enhancement of writing skills for a broader audience.
However, regardless of how advanced AES systems become, the human-graded essays in their training data remain susceptible to issues of rater bias and reliability, which can in turn affect the overall performance and trustworthiness of the AES system.
Various studies have highlighted the vulnerability of human-graded essays in training data. Klebanov et al. (2022) stated that AES models "may inadvertently encode discrimination into their decisions due to biases or other imperfections in the training data, spurious correlations, and other factors." Bias in human-graded essays can arise for numerous reasons. For instance, time is a critical factor affecting the reliability of human graders. Research by Congdon et al. (2000) demonstrated that graders' severity, or a rater's general tendency to assign higher or lower scores, can fluctuate significantly within a week. This temporal variation alone can introduce inconsistencies in scoring, as graders may be influenced by external factors or experience fatigue, further undermining the reliability of their assessments.
To address these issues, researchers have explored strategies to minimize bias and improve reliability in training data. Artstein and Poesio (2008) emphasized the importance of employing expert annotators, as they are better equipped to follow complex annotation guidelines and produce more reliable judgments. If expert annotators are not available, increasing the number of annotators can help mitigate individual rater bias by minimizing the impact of individual tendencies (Klebanov et al. 2022). Moreover, Amorim et al. (2018) introduced the concept of a "norm" for subjectivity in comments, allowing essays linked to biased ratings to be identified and removed. By establishing clear guidelines for subjectivity, researchers can enhance the consistency and reliability of training data, leading to more accurate and fair AES evaluations.
Rater bias and reliability issues in human-graded essays in training data can significantly impact the performance and trustworthiness of AES systems. These issues can be addressed through the use of expert annotators, increasing the number of annotators, or establishing norms for subjectivity, but it is not possible to eliminate bias completely. Therefore, achieving a perfect AES system is not feasible, and even though it is a very good helper and learning aid, no student’s grade should ever depend solely on an AES system.