History of Grades in Public Schools
A Brief History
Before 1850, grading and reporting were virtually unknown in U.S. schools. Most schools grouped students of all ages and backgrounds together with one teacher in a one-room schoolhouse, and few students went beyond the elementary level. The teacher commonly reported students' learning progress orally to parents during visits to students' homes. As enrollments increased in the late 1800s, however, schools began to group students in grade levels according to age (Edwards & Richey, 1947) and to use formal progress evaluations. In most cases, these were narrative reports in which teachers described the skills each student had mastered and those on which additional work was needed. The main purpose of such reports was to inform students when they had demonstrated mastery of the current performance level and were ready to move on to the next level.
With the passage of compulsory school attendance laws in the late 19th and early 20th centuries, high school enrollments increased rapidly. Between 1870 and 1910, the number of public high schools in the United States rose from 500 to 10,000 (Gutek, 1986). Subject-area instruction became increasingly specific, and student populations became more diverse. Although elementary teachers continued to use narrative reports to document student learning, high school teachers began using percentages and other similar markings to certify accomplishment in different subject areas (Kirschenbaum, Simon, & Napier, 1971).
The shift to percentage grades was gradual, and few U.S. educators questioned it. The practice seemed a natural result of the increased demands on high school teachers, who now served growing numbers of students. But in 1912, a study by two Wisconsin researchers seriously challenged the reliability and accuracy of percentage grades. Daniel Starch and Edward Charles Elliott found that 147 high school English teachers in different schools assigned widely different percentage grades to two identical student papers. Scores on the first paper ranged from 64 to 98, and scores on the second paper ranged from 50 to 97. One paper was given a failing mark by 15 percent of the teachers and a grade of over 90 by 12 percent of the teachers. Some teachers focused on elements of grammar, style, neatness, spelling, and punctuation, whereas others considered only how well the paper communicated its message. With more than 30 different percentage grades assigned to a single paper and a range of more than 40 points, it is easy to see why this study created a stir among educators.
Starch and Elliott's study was immediately criticized by those who claimed that judging good writing is, after all, highly subjective. But when the researchers repeated their study using geometry papers graded by 128 math teachers, they found even greater variation. Scores assigned by teachers to one of the math papers ranged from 28 to 95 percent. Some of the teachers deducted points only for a wrong answer. Others gave students varying amounts of partial credit for their work. Still others considered neatness, form, and spelling in the grades they assigned (Starch & Elliott, 1913).
These demonstrations of wide variation in grading practices among teachers led to a gradual move away from percentage grades to scales that had fewer and larger categories. One was a three-point scale that employed the categories Excellent, Average, and Poor. Another was the familiar five-point scale of Excellent, Good, Average, Poor, and Failing, or A, B, C, D, and F (Johnson, 1918; Rugg, 1918). This decrease in the number of score categories led to greater consistency across teachers in the grades assigned to student performance.
A Modern Resurgence
Percentage grades continued to be relatively rare in U.S. schools until the early 1990s, when grading software and online grade books began to gain popularity among educators. Today, schools can choose from more than 50 electronic grading software programs. Because these programs are developed primarily by computer technicians and software engineers rather than educators, they incorporate scales that appeal to technicians, specifically percentages. Like monetary systems based on the dollar, percentages have 100 levels that are easy to divide into increments of halves, quarters, and tenths. Percentages are also easy to calculate and easy for most people to understand. Thus, the resurgence of percentage grades appears to come mainly from the increased use of technology and the partialities of computer technicians, not from the desire of educators for alternative grading scales or from research about better grading practice.
Modern percentage grading scales differ significantly, however, from those that were used in the past. The 100-point scale that teachers employed in the early 20th century was based on an average grade of 50, and grades above 75 or below 25 were rare (Smallwood, 1935). In contrast, most modern applications of percentage grades set the average grade at 75 (which translates to a letter grade of C) and establish 60 or 65 as the minimum threshold for passing. This practice dramatically increases the likelihood of a negatively skewed grade distribution that is "heavily gamed against the student" (Carey & Carifio, 2012, p. 201). Ironically, neither this narrower grade distribution nor a century of research and experience in scoring students' writing seems to have improved the reliability of the percentage grades assigned by teachers.
Recently, Hunter Brimi (2011) replicated Starch and Elliott's 1912 study and attained almost identical results. Brimi asked 90 high school teachers—who had received nearly 20 hours of training in a writing assessment program—to grade the same student paper on a 100-point percentage scale. Among the 73 teachers who responded, scores ranged from 50 to 96. And that's among teachers who received specific professional development in writing assessment! So even if one accepts the idea that there are truly 100 discernible levels of student writing performance, it's clear that even well-trained teachers cannot distinguish among those different levels with much accuracy or consistency.