Why does VocabLevelTest.org not use multiple-choice or Yes/No tests (reading/orthographic)?

Simply put, both theory and data suggest that meaning-recognition (multiple-choice) tests measure the type of vocabulary knowledge learners need when reading less well than meaning-recall tests do.


The theory


Text taken from Stoeckel, T., Stewart, J., McLean, S., Ishii, T., Kramer, B., & Matsumoto, Y. (2019). The relationship of four variants of the Vocabulary Size Test to a criterion measure of meaning recall vocabulary knowledge. System, 87, 102161.


There is general agreement among second language (L2) vocabulary scholars that a fundamental aspect of lexical knowledge is the form-meaning link, or the ability to associate meaning with the spoken or written form of a word (Laufer & Goldstein, 2004; Nation, 2013; Schmitt, 2010). Schmitt (2010) has described a framework for assessing form-meaning knowledge in which distinctions are made regarding (a) the aspect of lexical knowledge that is assessed, word meaning or word form, and (b) whether the learner must recall this knowledge from memory or simply recognize it from a list of choices.

From these two dichotomies, four kinds of form-meaning knowledge are possible. For each of these, a brief description together with an example test item are shown in Table 1. Laufer and Goldstein (2004) observed a hierarchy of these kinds of form-meaning knowledge, in which L2 learners are likely to first acquire meaning-recognition, followed sequentially by form-recognition, meaning-recall, and form-recall. From the framework shown in Table 1, the types of lexical knowledge associated with reading are meaning-recall and meaning-recognition because in both of these, as in reading, there is first an encounter with the form of an L2 word, and the learner must then associate a meaning to the word. Studies that have compared the meaning-recognition and meaning-recall aspects of lexical knowledge have consistently found the former to be significantly easier than the latter (Gyllstad et al., 2015; Laufer & Goldstein, 2004; Zhang, 2013). Because of this and because the stated purpose of the VST is to measure the vocabulary knowledge needed for reading (Nation, 2012), it is worthwhile asking which of these two constructs is more relevant in reading. Nation and Webb (2011) have stated:

Sitting a multiple-choice vocabulary item is not like normal language use. When we read and meet an unknown word, we are not faced with given choices about its meaning. In this respect, a vocabulary translation test is more like normal language use and is closer to the difficulty level of normal language use (Waring and Takaki, 2003). (p. 286)

Adding to this, Kremmel and Schmitt (2016) assert:

Fluent reading (and listening) requires quick recognition of the word form, and automatic recall and retrieval of the corresponding meaning, so that cognitive resources can be applied to meaning construction from the text (Grabe, 2009). Thus, vocabulary needs to be known to the meaning recall level (Schmitt, 2010) to reach lexical employability (i.e., make fluent reading possible). In a reading situation the authentic task for a learner to perform is to recall the meaning of the word form they are exposed to without any help or meaning options to choose from. But matching and multiple-choice items are recognition formats, where options are given and must be selected from. Such recognition formats are clearly incongruent with real-world reading, because no book provides multiple definitions to choose from for (unknown) words in the text (Nation & Webb, 2011). (p. 378)

The evidence


Text taken from Zhang, S., & Zhang, X. (2020). The relationship between vocabulary knowledge and L2 reading/listening comprehension: A meta-analysis. Language Teaching Research, 1362168820913998.


This study set out to investigate the relationship between L2 vocabulary knowledge (VK) and second-language (L2) reading/listening comprehension. More than 100 individual studies were included in this meta-analysis, which generated 276 effect sizes from a sample of almost 21,000 learners. The current meta-analysis had several major findings. First, the overall correlation between VK and L2 reading comprehension was .57 (p < .01) and that between VK and L2 listening was .56 (p < .01). If the attenuation effect due to reliability of measures was taken into consideration, the ‘true’ correlation between VK and L2 reading/listening comprehension may likely fall within the range of .56–.67, accounting for 31%–45% variance in L2 comprehension. Second, all three mastery levels of form–meaning knowledge (meaning recognition, meaning recall, form recall) had moderate to high correlations with L2 reading and L2 listening. However, meaning recall knowledge had the strongest correlation with L2 reading comprehension and form recall had the strongest correlation with L2 listening comprehension, suggesting that different mastery levels of VK may contribute differently to L2 comprehension in different modalities.
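The variance figures quoted above follow from squaring the correlation coefficient. The short Python sketch below reproduces that arithmetic. It also shows the standard Spearman correction for attenuation in its general form. The function names are illustrative, and the reliability values passed to `disattenuate` are hypothetical placeholders, because the passage does not report the reliabilities the authors used.

```python
import math


def variance_explained(r: float) -> float:
    """Share of variance accounted for by a Pearson correlation r (r squared)."""
    return r ** 2


def disattenuate(r_observed: float, rel_x: float, rel_y: float) -> float:
    """Spearman's correction for attenuation: r / sqrt(rel_x * rel_y)."""
    return r_observed / math.sqrt(rel_x * rel_y)


# Bounds of the 'true' correlation range reported in the passage
for r in (0.56, 0.67):
    print(f"r = {r:.2f} -> {variance_explained(r):.1%} of variance")

# Hypothetical reliabilities (NOT from the source), for illustration only
corrected = disattenuate(0.57, rel_x=0.90, rel_y=0.85)
print(f"corrected r = {corrected:.2f}")
```

Because reliabilities cannot exceed 1, the corrected correlation is never smaller than the observed one, which is why the reported range extends above the observed values.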


a. Form–meaning knowledge. Since VK has multiple constructs (Nation, 2001), moderator variable analysis was performed to compare the effect of different types of VK on L2 reading and listening, including three mastery levels of form–meaning knowledge: meaning recognition, meaning recall, and form recall. All three types of form–meaning knowledge had moderate to high correlations with L2 reading comprehension, with meaning recall having the strongest correlation (r = .66, p < .01), followed by form recall (r = .55, p < .01), and meaning recognition (r = .53, p < .01). Regarding the question of which type of form–meaning knowledge is most critical for L2 reading comprehension, the result suggests that meaning recall may be the most important type of form–meaning knowledge for L2 reading as it had a higher correlation with L2 reading comprehension than meaning recognition (Qbetween = 9.44, p < .01) and form recall (Qbetween = 5.88, p < .05). Indeed, among various types of VK surveyed in the current meta-analysis (including vocabulary depth knowledge), meaning recall knowledge explained the largest proportion of variance in L2 reading comprehension (43.6%).




Text taken from McLean, S., Stewart, J., & Batty, A. O. (2020). Predicting L2 reading proficiency with modalities of vocabulary knowledge: A bootstrapping approach. Language Testing, 37(3), 389-411.


Abstract

Vocabulary’s relationship to reading proficiency is frequently cited as a justification for the assessment of L2 written receptive vocabulary knowledge. However, to date, there has been relatively little research regarding which modalities of vocabulary knowledge have the strongest correlations to reading proficiency, and observed differences have often been statistically non-significant. The present research employs a bootstrapping approach to reach a clearer understanding of relationships between various modalities of vocabulary knowledge to reading proficiency. Test-takers (N = 103) answered 1000 vocabulary test items spanning the third 1000 most frequent English words in the New General Service List corpus (Browne, Culligan, & Phillips, 2013). Items were answered under four modalities: Yes/No checklists, form recall, meaning recall, and meaning recognition. These pools of test items were then sampled with replacement to create 1000 simulated tests ranging in length from five to 200 items and the results were correlated to the Test of English for International Communication (TOEIC) Reading scores. For all examined test lengths, meaning-recall vocabulary tests had the highest average correlations to reading proficiency, followed by form-recall vocabulary tests. The results indicated that tests of vocabulary recall are stronger predictors of reading proficiency than tests of vocabulary recognition, despite the theoretically closer relationship of vocabulary recognition to reading.
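The bootstrapping procedure described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes a simple person-by-item matrix of 0/1 responses and uses randomly generated stand-in data (a hypothetical ability score, item difficulties, and a noisy criterion in place of TOEIC Reading scores). It draws items with replacement to build simulated tests of a given length, scores each test-taker, and averages the Pearson correlations with the criterion.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def mean_bootstrap_correlation(responses, criterion, test_length, n_tests, rng):
    """Average correlation between simulated test scores and a criterion.

    responses[p][i] is 1 if person p answered item i correctly, else 0.
    """
    n_items = len(responses[0])
    correlations = []
    for _ in range(n_tests):
        # Draw items WITH replacement to form one simulated test
        items = [rng.randrange(n_items) for _ in range(test_length)]
        scores = [sum(person[i] for i in items) for person in responses]
        if len(set(scores)) < 2:
            continue  # skip the rare draw where every score is identical
        correlations.append(pearson(scores, criterion))
    return statistics.fmean(correlations)


# Synthetic stand-in data (hypothetical, NOT the study's data)
rng = random.Random(0)
n_people, n_items = 103, 200
ability = [rng.gauss(0, 1) for _ in range(n_people)]
criterion = [a + rng.gauss(0, 0.5) for a in ability]  # stand-in reading score
difficulty = [rng.gauss(0, 1) for _ in range(n_items)]
responses = [
    [1 if rng.random() < 1 / (1 + math.exp(-(a - d))) else 0 for d in difficulty]
    for a in ability
]

# The study used 1000 simulated tests per length; 200 keeps this sketch fast
results = {
    length: mean_bootstrap_correlation(responses, criterion, length, 200, rng)
    for length in (5, 40, 100)
}
for length, r in results.items():
    print(f"{length:>3} items: mean r = {r:.3f}")
```

Running the same routine once per modality, each with its own response matrix, would give the per-modality curves of mean correlation against test length that the study reports.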


Correlations to reading proficiency by test length

In order to answer RQ2, the various test forms were correlated to learners’ TOEIC Reading section scores. Figure 1 and Table 5 depict the average correlations of the four test modalities to reading proficiency (vertical axis) as a function of test length (horizontal axis). The Supplementary File shows scatterplots illustrating mean correlations for various test modalities and test lengths. For all test lengths, meaning-recall tests had the highest average correlation to reading ability, followed by form recall. For tests under 30 items, Yes/No tests had a slightly higher mean correlation to reading than meaning-recognition tests. However, for tests of 30 items or more, meaning-recognition tests pulled ahead. After this point, Yes/No tests held the weakest correlations to reading proficiency of all modalities examined, despite boasting the highest internal reliability. All observed correlations were significant (p < .001). Table 5 displays a limited number of the mean Pearson’s correlations of the four vocabulary formats to reading proficiency for the various item lengths calculated, and the Supplementary File displays all 84 of them.

Differences in distributions of correlations between reading proficiency and modality

We conducted ANOVAs for 40- and 100-item versions of the tests in order to establish the significance of the difference in correlations. Both were statistically significant [F(3, 3996) = 4074, p < .001; F(3, 3996) = 11164, p < .001] and Tukey post-hoc tests indicated differences were significant between all modalities for both test lengths (p < .001).

Effect sizes and mean differences can be seen below in Tables 6 and 7. Using Plonsky and Oswald’s (2014) empirically based cutoffs for L2 research using within-subject designs, a Cohen’s d effect size of 1.00 can be considered to be “moderate” and 1.40 or higher can be considered to be “large.” As such, the listed effect sizes are almost uniformly “large.” The sole exception, an effect size of 0.901 for the difference between meaning recognition and Yes/No modalities at a length of 40 items, rises to an effect size of 2.434 when test lengths are increased to 100 items.
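As a rough illustration of how these cutoffs apply, the Python sketch below labels a Cohen's d value using only the two thresholds quoted above. It also includes one common way of computing d for two equal-sized groups (mean difference divided by the pooled standard deviation). That formula is an assumption for illustration, since the passage does not state which variant the authors used, and the function names are hypothetical.

```python
import math
import statistics


def cohens_d(group_a, group_b):
    """Cohen's d with a pooled SD, assuming two groups of equal size."""
    mean_diff = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled_sd = math.sqrt(
        (statistics.variance(group_a) + statistics.variance(group_b)) / 2
    )
    return mean_diff / pooled_sd


def classify_within_subject_d(d: float) -> str:
    """Label d using the within-subject cutoffs quoted in the text."""
    size = abs(d)
    if size >= 1.40:
        return "large"
    if size >= 1.00:
        return "moderate"
    return "below moderate"


# The two effect sizes named in the passage
for d in (0.901, 2.434):
    print(d, classify_within_subject_d(d))
```

Applied to the two values in the passage, 0.901 falls under the "moderate" cutoff while 2.434 is well above the "large" one.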


Relationship between test time and correlation to reading

Finally, to answer RQ3, the relationship between the time required to take tests in each modality and those tests’ correlations to reading proficiency was examined. The time required to take such tests is an important consideration for learners, researchers, and educators. Even if tests with Yes/No and meaning-recognition modalities have lower correlations to reading proficiency than meaning-recall tests of the same length, since learners can complete Yes/No or meaning-recognition tests at a faster rate, they may be able to take longer tests within a given time period, which could potentially yield higher correlations to reading than tests using more time-intensive item modalities.

Table 8 shows the mean number of test items that the 103 participants can complete in various time periods under each test modality. Using this information, correlations to reading proficiency were examined again using time as a variable rather than item counts (Figure 4 and Table 9).

Although Yes/No items required the least time to complete, savings in time did not appear to positively affect correlations to reading; correlations to Yes/No modality tests consistently lagged the other modalities examined and reached a near peak of approximately .67 after 20 minutes, with only marginal increases after this point. In contrast, at the 20-minute mark correlations to meaning recognition, form recall and meaning recall were higher at .71, .75, and .78, respectively. The correlations of these three modalities began to peak at 30 minutes, with correlations of .72, .76, and .79, respectively.