As with so much in testing, the answer to "How many items should I have per band?" depends on the purpose of the testing and on the setting. However, in general ...
The more, the better: the more items per band, the better the data will represent learners’ knowledge of all the words in the target band.
The representativeness of a sample is determined much more by the number of items than by the size of the band.
The more items per band, the longer the test.
Where possible, testing a single band with many items (between 30 and 200) is preferable.
The gain in representativeness and accuracy from each additional item decreases as the number of items increases (see the sketch below).
Where possible, I (Stuart McLean) try to represent a band with at least 40 items.
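To get a feel for this diminishing return, the short sketch below (a back-of-the-envelope illustration, not the method used in the studies cited further down) applies the normal approximation to the binomial to estimate the 95% margin of error for a 50% score at various test lengths, scaled to a 1,000-word band:

```python
import math

# Rough illustration: approximate 95% margin of error (normal approximation)
# for a score of 50%, expressed as words out of a 1,000-word band.
BAND_SIZE = 1000
p = 0.5  # proportion correct at which uncertainty is widest

for n_items in (10, 20, 30, 40, 50, 100, 200):
    margin = 1.96 * math.sqrt(p * (1 - p) / n_items)  # CI half-width (proportion)
    print(f"{n_items:>3} items: about ±{margin * BAND_SIZE:.0f} words")
```

Because the margin of error shrinks in proportion to the square root of the number of items, halving the uncertainty requires roughly quadrupling the test length, which is why the gains flatten out after the first few dozen items.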
Articles that have investigated the representativeness, reliability (internal consistency), and/or accuracy of vocabulary knowledge estimates from vocabulary levels tests:
Gyllstad, H., McLean, S., & Stewart, J. (2020). Using confidence intervals to determine adequate item sample sizes for vocabulary tests: An essential but overlooked practice. Language Testing. https://doi.org/10.1177/0265532220979562
McLean, S., Stewart, J., & Batty, A. O. (2020). Predicting L2 reading proficiency with modalities of vocabulary knowledge: A bootstrapping approach. Language Testing, 37(3), 389-411. https://doi.org/10.1177/0265532219898380
Stoeckel, T., McLean, S., & Nation, P. (2020). Limitations of size and levels tests of written receptive vocabulary knowledge. Studies in Second Language Acquisition, 1-23. https://doi.org/10.1017/S027226312000025X
Gyllstad, H., Vilkaitė, L., & Schmitt, N. (2015). Assessing vocabulary size through multiple-choice formats: Issues with guessing and sampling rates. ITL-International Journal of Applied Linguistics, 166(2), 278-306. https://doi.org/10.1075/itl.166.2.04gyl
Extracts from Stoeckel, T., McLean, S., & Nation, P. (2020). Limitations of size and levels tests of written receptive vocabulary knowledge. Studies in Second Language Acquisition, 1-23. https://doi.org/10.1017/S027226312000025X
We now turn to the issue of target word sample size. In the development of vocabulary size and levels tests, target items are sampled from large sets of words, often 1,000-word frequency-based bands. To achieve high estimates of internal reliability and a strong correlation with a criterion measure, it appears that a target word sample size of 30 items is sufficient (Gyllstad et al., 2015; Schmitt et al., 2001).
A separate question, however, is how well a sample represents the population of words from which it was drawn. Vocabulary knowledge as a tested construct differs from assessment in many other areas of learning. Lexical knowledge is not a skill that, once acquired, can be applied to understand unrelated new words. Knowledge of lexis is built word by word, differing for example from basic addition and subtraction, areas of learning for which demonstrated mastery in a well-made test can be interpreted as the ability to solve a universe of similar problems. Thus, there are limitations regarding the inferences that can be made from knowledge of a random sample taken from a larger set of words (Gyllstad et al., 2015). A high estimate of internal reliability on a size or levels test indicates that the instrument can reliably measure knowledge of the target words in the test. Likewise, a strong correlation between scores on a vocabulary test and a criterion measure of the same words indicates that there is a strong relationship between measures of only the sampled words. This is separate from whether test scores accurately represent knowledge of an entire word band. Such distinctions are important because if the small number of words that are assessed is significantly more (or less) likely to be known than the population of words from which they were sampled, results will systematically over- (or under-) estimate vocabulary knowledge even if estimates of internal reliability or correlations with a criterion measure are high.
Perhaps this issue has received little attention because of an implicit assumption that the items within a frequency-derived word band are of similar difficulty. If this were the case, it would not matter which words were selected; they would be equally representative of the entire band. The empirical evidence does not support this assumption, however. Though mean scores on large, frequency-based word bands decrease with frequency (Aizawa, 2006; Beglar, 2010; Brown, 2013; Milton, 2007), there is considerable variation in difficulty for individual words within frequency bands (Beglar, 2010; Bennett & Stoeckel, 2014). This calls into question the accuracy with which small samples represent the average difficulty of a large population of words.
Though vocabulary levels and size tests are commonly used to estimate the total number of words known (e.g., Webb & Chang, 2012) or to determine level mastery (e.g., Chang, 2012), confidence intervals (CI) for individual scores are rarely if ever reported, and the number of items required for desired levels of confidence is under-explored. Using meaning-recall data from all 1,000 words at the 3K frequency level, Gyllstad et al. (2019) found that for a test consisting of 20 randomly selected items, the 95% CI was as much as 400 (i.e., ±200) of the 1,000 words assessed. Thus, a learner who correctly answers 10 out of the 20 items (50%) might know anywhere between 300 and 700 words in the 1,000-word band. The maximum 95% CIs for 30, 50, and 100-item tests were 326, 248, and 172 words, respectively.
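The scale of this uncertainty is easy to reproduce with a small simulation. The sketch below is an illustration with an invented learner, not the authors’ data or exact procedure: it assumes a learner who knows exactly 500 of the 1,000 words in a band and repeatedly sits randomly sampled 20-item tests drawn from that band.

```python
import random

# Simulated learner (hypothetical): knows exactly 500 of 1,000 words in a band.
# We repeatedly administer random 20-item tests and record the band-size
# estimate implied by each score.
random.seed(1)
BAND_SIZE, KNOWN, N_ITEMS, N_TESTS = 1000, 500, 20, 10_000

band = [1] * KNOWN + [0] * (BAND_SIZE - KNOWN)  # 1 = word known, 0 = unknown
estimates = []
for _ in range(N_TESTS):
    sample = random.sample(band, N_ITEMS)        # one randomly sampled test
    estimates.append(sum(sample) / N_ITEMS * BAND_SIZE)

estimates.sort()
lo, hi = estimates[int(0.025 * N_TESTS)], estimates[int(0.975 * N_TESTS)]
print(f"95% of estimates fall between {lo:.0f} and {hi:.0f} words")
```

With these assumptions, the middle 95% of estimates spans roughly 300 to 700 words, in line with the ±200-word figure reported above.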
Confidence intervals can also be estimated based on sample size and the proportion of correct responses. Using this approach, we calculated Clopper-Pearson (Clopper & Pearson, 1934) CIs for a hypothetical test with different target word sample sizes from a 1,000-word band. The Clopper-Pearson method is appropriate when the proportion of correct responses approaches or reaches 1 (McCracken & Looney, 2017), which corresponds with common mastery criteria for levels tests. We calculated CIs for two scoring outcomes. The first is at a score of 50% because this is where the CI is widest, revealing the largest potential difference between test scores and actual knowledge. The second is where scores approach and reach 100% because this is where mastery has been defined in levels tests, and it is where CIs are smallest.
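A sketch of this calculation (our reconstruction of the described approach, not the authors’ own code) uses SciPy’s beta distribution to obtain exact Clopper-Pearson intervals and scales them to a 1,000-word band:

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.05):
    """Exact (Clopper-Pearson) confidence interval for a binomial proportion."""
    lower = 0.0 if k == 0 else beta.ppf(alpha / 2, k, n - k + 1)
    upper = 1.0 if k == n else beta.ppf(1 - alpha / 2, k + 1, n - k)
    return lower, upper

BAND_SIZE = 1000
for n_items, score in [(30, 15), (30, 29), (30, 30), (100, 50), (100, 100)]:
    lo, hi = clopper_pearson(score, n_items)
    print(f"{score:>3}/{n_items:<3}: {lo * BAND_SIZE:.0f}-{hi * BAND_SIZE:.0f} "
          f"words known in the band (95% CI)")
```

For example, a score of 29 out of 30 yields a 95% interval of roughly 828 to 999 known words, and a perfect 30 out of 30 yields roughly 884 to 1,000, the figures discussed below.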
The values in Table 10, consistent with Gyllstad et al. (2019), indicate that when levels tests are used to estimate the number of words that a learner knows in a level (rather than complete mastery), massive error is possible for scores around 50%. Although the CIs for perfect or near-perfect scores are narrower, they are unsatisfactory when we consider the importance of knowing 95 or 98–99% of the words in a text. The most items per level in any existing levels test of the form-meaning link is 30 in both the VLT and UVLT, with the strictest mastery criterion set at 29 for the 1K and 2K levels of the UVLT. The 95% CI for that score (828–999 words) or even for a perfect score (884–1,000 words) casts doubt on whether a mastery score consistently corresponds with knowledge of 95% or more of the words in the level. The values in Table 10 also suggest that the current approach to test construction, in which items are randomly sampled, may be untenable for the needed level of precision for many testing purposes, even when a 90% CI is used. When a learner achieves a perfect score on a test of 100 items sampled from a 1,000-word band, the 90% CI falls outside of the level of 98% coverage. It is hard to imagine how a multiple-level test with 100-plus items per level could be practically used in most educational settings. For existing instruments, combining items from multiple test forms and, when time is limited, administering fewer than the full complement of levels, is a good way to increase precision.