How vocabulary test data from word-frequency bands are interpreted depends on the purpose for giving the test. Thus, it is necessary first to explain the difference between vocabulary size and vocabulary levels.
The following text was taken from McLean, S., & Kramer, B. (2015). The Creation of a New Vocabulary Levels Test. Shiken, 19(2), 1-11.
The full paper can be found here: http://teval.jalt.org/node/33
Measuring vocabulary size and interpreting vocabulary size test scores
Vocabulary size tests are intended to estimate the total number of words a learner knows. This estimate can be useful when comparing groups of learners, measuring long-term vocabulary growth, or providing “one kind of goal for learners of English as a second or foreign language” (Nation, 2013, p. 522). The Vocabulary Size Test (VST; Nation & Beglar, 2007), for example, is a measure of written receptive word knowledge based on word family frequency estimates derived from the spoken subsection of the BNC (Nation, 2006). Each item on the VST presents the target word first in isolation and then in a non-defining context sentence, with four answer choices presented either in English or in the learners’ L1. Results of the VST among samples with a wide range of ability have shown that the test can reliably distinguish between learners of different vocabulary proficiency, whether using the monolingual version (Beglar, 2010) or the various bilingual variants (Elgort, 2013; Karami, 2012; Nguyen & Nation, 2011).
Despite the VST’s utility in separating students as a general measure of the breadth of written receptive vocabulary knowledge, inferences based on its results should be made with caution. For example, one of the stated interpretations of the VST is as an approximate estimate of known vocabulary. As the test samples 10 words from each of the most frequent 1,000-word frequency bands (up to the 14th or 20th band, depending on the version), “a test taker’s score needs to be multiplied by 100 to get their total vocabulary size” (Nation, 2013, p. 525). A score of 30 out of 140, for example, would produce a size estimate of 3,000 known word families. While this score interpretation seems straightforward, it carries two assumptions which must be addressed: (a) the target words on the VST are representative of the frequency bands from which they were sampled, so that each target word can be considered to represent 100 others, and (b) correctly answering an item implies written receptive knowledge of that target word. The first assumption can reasonably be accepted because, according to Nation and Beglar (2007), the target words were randomly sampled from their frequency bands. The second assumption is more problematic: because each item uses a four-option multiple-choice format, there is a 25% chance of answering correctly even when the examinee has no knowledge of the target word. While Nation (2012) recommends that all participants complete the entire 14,000-word version of the VST, McLean, Kramer, and Stewart (2015) showed that most correct answers by low-proficiency students at the lowest-frequency bands could be attributed to chance rather than lexical knowledge.
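As a rough illustration of the score interpretation and the guessing problem described above, consider the following minimal sketch. The ×100 multiplier and the 25% chance rate come from the text; the function names and example figures are illustrative assumptions, not part of the VST itself.

```python
def vst_size_estimate(raw_score: int) -> int:
    """Each sampled item is taken to represent 100 word families."""
    return raw_score * 100

def expected_correct_by_chance(unknown_items: int) -> float:
    """With four answer choices, an examinee has a 1-in-4 chance of
    answering an item correctly with no knowledge of the target word."""
    return unknown_items * 0.25

print(vst_size_estimate(30))             # 3000 word families (the 30/140 example)

# If a learner knew none of the 140 target words, guessing alone would
# yield about 35 correct answers, i.e., an apparent "size" of 3,500
# word families despite no lexical knowledge at all.
print(expected_correct_by_chance(140))   # 35.0
```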
In order to increase the accuracy of VST results, Beglar (2010), Elgort (2013), and McLean, Kramer, and Stewart (2015) recommend that students take the test only up to two levels above their ability. While this would reduce the score inflation caused by guessing on mismatched items, the resulting score would still hold little pedagogical value. While some suggest that a VST score can be used to assign reading materials (Nation, 2013; Nguyen & Nation, 2011), this claim ignores the properties of the construct being measured (vocabulary breadth) as well as findings that comprehension of reading materials requires learners to know at least 95% of the words within them (e.g., Hsueh-chao & Nation, 2000; Laufer, 1989; van Zeeland & Schmitt, 2013). This is because, while a vocabulary size score can give a rough estimate of the number of words known, it does not imply knowledge of all vocabulary within that size estimate. For example, McLean, Hogg, and Kramer (2014) reported a mean vocabulary size of 3,396 word families (SD = 1,268) for Japanese university students (N = 3,427) using the VST. These same learners, however, could not be said to know the most frequent 3,396 word families: all but the most able students had gaps in their knowledge of items from the first 1,000 words of English, and all students failed to correctly answer some multiple-choice items at the second and third 1,000-word bands.
Similar gaps within the first and second 1,000-word frequency bands have been found by Beglar (2010), Elgort (2013), Karami (2012), and Nguyen and Nation (2011). To measure knowledge of the most frequent vocabulary levels, a test made for that purpose is more appropriate.
Measuring knowledge of vocabulary levels and interpreting vocabulary levels test scores
While the VST may be an appropriate instrument for separating students across a wide range of proficiencies, a more pedagogically useful measure of lexical knowledge is a test designed to measure the degree of mastery of the most frequent words of English. The best known of such tests, the Vocabulary Levels Test (VLT; Nation, 1990; Schmitt et al., 2001), was designed to provide richer information about learners’ knowledge of the second, third, fifth, and tenth 1,000-word frequency bands, as well as Coxhead’s (2000) Academic Word List (AWL). The primary purpose of a levels test such as this is to estimate learners’ mastery of the most frequent vocabulary so that appropriate learning materials can be assigned. For example, Nation (2013) states that meaning-focused reading input, which includes activities such as extensive reading and many kinds of task-based instruction, requires instructional materials to be written at a level of 95% known vocabulary. The test scores and their interpretations reflect this purpose: results are usually reported as a score out of 30 items for each level of the test, with mastery defined as a high proportion of correct answers at that level. Teachers can then use these results to help students focus on the most frequent unknown words until mastery is achieved.
Interpreting vocabulary levels tests in line with Nation’s (2007) four strands
Fluency development activities involve learners reading materials that contain no unknown words (Nation, 2007); thus, a mastery threshold of 100% is necessary, that is, correctly answering 100% of the questions representing the target word band. Meaning-focused input materials (including extensive reading) require learners to know 98% of the tokens within them (Nation, 2007; Webb & Nation, 2017), so an appropriate mastery threshold is 98% of the questions representing the target word band. If the purpose is reading comprehension, research suggests an appropriate mastery threshold of 95% (Laufer, 1989; Schmitt et al., 2011). If the purpose for reading is language-focused instruction, Stoeckel et al. (2020) and Schmitt et al. (2011) suggest a threshold no lower than 85%.
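To make the purpose-to-threshold mapping concrete, here is a hedged sketch. The thresholds are those cited above; the dictionary and function names are illustrative assumptions, not part of any published test.

```python
# Purpose-specific mastery thresholds from the text.
MASTERY_THRESHOLDS = {
    "fluency development": 1.00,          # Nation, 2007
    "meaning-focused input": 0.98,        # Nation, 2007; Webb & Nation, 2017
    "reading comprehension": 0.95,        # Laufer, 1989; Schmitt et al., 2011
    "language-focused instruction": 0.85, # Stoeckel et al., 2020; Schmitt et al., 2011
}

def meets_mastery(correct: int, total: int, purpose: str) -> bool:
    """Has the learner mastered this word band for the given purpose?"""
    return correct / total >= MASTERY_THRESHOLDS[purpose]

# Example: 29 of 30 items correct (96.7%) at a given 1,000-word band.
print(meets_mastery(29, 30, "meaning-focused input"))  # False: 96.7% < 98%
print(meets_mastery(29, 30, "reading comprehension"))  # True: 96.7% >= 95%
```

The same band score can therefore count as mastered for one reading purpose but not for another, which is why the purpose must be stated before a threshold is chosen.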
The following text was taken from McLean, S. (2021). The coverage comprehension model, its importance to pedagogy and research, and threats to the validity with which it is operationalized. Reading in a Foreign Language, 33(1), 126-140. https://nflrc.hawaii.edu/rfl/item/528
The Coverage Comprehension Model
When learners know 98% or more of the tokens within a text, the lexical difficulty of the text is unlikely to inhibit reading comprehension (Schmitt et al., 2011). This phenomenon will be referred to as the Coverage Comprehension Model (CCM). The CCM is present in countless articles that describe the percentage of known tokens necessary to comprehend reading materials (e.g., Nation, 2006). Further, numerous studies operationalize the CCM to provide evidence that participants were able to comprehend reading materials (e.g., Feng & Webb, 2020) by estimating (a) the lexical difficulty of a text and (b) the lexical mastery level of a learner.
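A minimal sketch of step (a), estimating a text's lexical coverage for a given learner, follows. The whitespace tokenization and the toy word list are simplifying assumptions; a real profiler would lemmatize tokens and check them against frequency-band lists.

```python
def lexical_coverage(text: str, known_words: set[str]) -> float:
    """Proportion of the text's tokens assumed known to the learner."""
    tokens = [t.strip(".,;:!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return 0.0
    return sum(t in known_words for t in tokens) / len(tokens)

known = {"the", "cat", "sat", "on", "mat", "a"}
print(f"{lexical_coverage('The cat sat on the mat.', known):.1%}")    # 100.0%
print(f"{lexical_coverage('The cat pondered the mat.', known):.1%}")  # 80.0%
```

Under the CCM, the first text's coverage would not inhibit comprehension, while the second text's 80% coverage falls well below the 98% figure cited above.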
Applying research when deciding purpose-specific mastery thresholds
Schmitt et al. (2001, p. 67) state that “[l]ike Read, we carried out a Guttman scalability analysis (Hatch and Lazaraton, 1991), using a criterion of mastery of 26 out of the 30 possible per level. (This figure was chosen to be as close as possible to Read’s criterion of 16 out of 18.)” Read (1988, p. 17) examined the scalability of scores from the 1,000-word bands of Nation’s (1983) levels tests and states that “[a] score of 16 was taken as the criterion for mastery of the vocabulary at a particular level.” Read “set the cut score at 16/18 based on [his] reading on criterion-referenced testing at the time, which indicated that a score equivalent to 90% was widely accepted as the criterion for mastery, so 16/18 represented 90% for a VLT level” (J. Read, personal communication, January 28, 2021). However, this criterion-referenced research was not related to the lexical knowledge necessary for reading. Thus, “contemporary vocab researchers need to revisit the mastery cut-off in light of recent developments in the field, their research aims and the targeted purposes for reading, rather than just quoting Read (1988) or Schmitt et al. (2001) as authorities” (J. Read, personal communication, January 28, 2021). One issue with the 26/30 threshold is that 27/30 (90.0%) is closer to 16/18 (88.9%) than 26/30 (86.7%) is. A further issue with using the 26/30 mastery threshold in reading research is that learners need to comprehend 98% of the tokens within a text to read it with ease (Schmitt et al., 2011). Furthermore, one purpose for giving levels tests is to match learners with lexically appropriate materials (McLean & Kramer, 2015; Webb et al., 2017), and research suggests that there are several purpose-dependent mastery thresholds.
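The fraction comparison above is simple arithmetic, shown here rounded to one decimal place:

```python
for correct, total in [(16, 18), (26, 30), (27, 30)]:
    print(f"{correct}/{total} = {correct / total:.1%}")
# 16/18 = 88.9%
# 26/30 = 86.7%
# 27/30 = 90.0%
```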
When matching learners with reading materials through the application of the CCM, the purpose of the reading determines the most appropriate lexical mastery threshold on levels tests, as well as the coverage threshold when profiling a text. Speed reading involves learners reading materials that contain no unknown words (Nation, 2007); thus, a mastery threshold of 100% is necessary. Meaning-focused input materials (including extensive reading) require learners to know 98% of the tokens within them (Nation, 2007; Webb & Nation, 2017); thus, an appropriate threshold is 98%. If the purpose is reading comprehension, research suggests an appropriate threshold of 95% (Laufer, 1989; Schmitt et al., 2011). If the purpose for reading is language-focused instruction, Stoeckel et al. (2020) and Schmitt et al. (2011) suggest a threshold no lower than 85%. While the precision of these figures might be questioned, because they are based on research they can be evaluated rationally. More important than the figures themselves is that authors, readers, reviewers, and editors evaluate and justify the appropriateness of lexical thresholds based on the purpose for reading.
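Putting the pieces together, here is a hedged sketch of matching a learner with a text under the CCM. The thresholds are those cited above; the names and the 96% example coverage are illustrative assumptions.

```python
# Coverage thresholds by reading purpose, from the text.
PURPOSE_THRESHOLDS = {
    "speed reading": 1.00,                # Nation, 2007
    "meaning-focused input": 0.98,        # Nation, 2007; Webb & Nation, 2017
    "reading comprehension": 0.95,        # Laufer, 1989; Schmitt et al., 2011
    "language-focused instruction": 0.85, # Stoeckel et al., 2020; Schmitt et al., 2011
}

def text_is_suitable(purpose: str, coverage: float) -> bool:
    """True if the learner's estimated coverage of the text meets the
    threshold for the stated reading purpose."""
    return coverage >= PURPOSE_THRESHOLDS[purpose]

coverage = 0.96  # e.g., the learner knows 96% of a candidate text's tokens
for purpose, threshold in PURPOSE_THRESHOLDS.items():
    print(f"{purpose} (needs {threshold:.0%}): {text_is_suitable(purpose, coverage)}")
```

For the 96% example, the same text is suitable for reading comprehension and language-focused instruction but not for meaning-focused input or speed reading, illustrating why the purpose must be named before a threshold is defended.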