Bibliographies

Language Assessment Literacy

Deygers, B., & Malone, M. E. (2019). Language assessment literacy in university admission policies, or the dialogue that isn’t. Language Testing, 36(3), 347-368. https://doi.org/jjvf

The authors provide valuable insight into how language requirements are set at the government and institutional levels in Flanders, Belgium, drawing on interviews with senior government administrators and policy makers at five Dutch-medium universities in Flanders. Some key findings of the study for me are that policy makers and language testers hold different views of language requirements, that change is enacted at the university rather than the government level, that exceptions to requirements can be made on multiple grounds, that administrators prioritize documents that help them with the registration process, and that language testers need to learn policy literacy just as administrators and policy makers need to learn language assessment literacy. Language testers tend to care more about making evidence-based decisions, whereas policy makers care more about reaching political compromises and solving real-world problems. Exceptions to requirements can be driven by powerful faculty or other internal actors, competition with neighboring universities, changes in student demographics, or the need to enroll more students.

Repeated Test-Taking

Green, T., & Van Moere, A. (2020). Repeated test-taking and longitudinal test score analysis. Language Testing, 37(4), 475-481. https://doi.org/10.1177/0265532220934202

Green and Van Moere (2020) introduce the collection of articles in this special issue on repeated test-taking (RTT) and the issues associated with it. Some of the themes that emerge from the articles include RTT as an equity issue (e.g., not everyone can afford to take tests multiple times or may have limited access to test centers), differences between test-takers' and test users' views of RTT (e.g., students probably just want to "pass" the test), and ways of analyzing test data so that decisions based on test scores are more equitable. The authors also introduce the term "super-scoring" (p. 479) and note that "changes" in scores across administrations may be due to measurement error.

Kokhan, K., & Lin, C.-K. (2014). Test of English as a Foreign Language (TOEFL): Interpretation of multiple score reports for ESL placement. Papers in Language Testing and Assessment, 3(1), 1-22. https://bit.ly/3PsfATx

The authors' goal in this study was to provide guidelines for score users at universities about what to do when students submit multiple score reports. They note that one challenge for score users is that test providers rarely offer recommendations about how to handle multiple score reports. The authors therefore explored different ways of combining scores from repeat administrations, including taking the average, highest, latest, or self-reported score. They examined whether the score types differed and looked at the relationship between score type and test-takers' subsequent placement into ESL writing courses at their university (Levels 1-3). They found differences among the score types (averaged scores were the lowest), that highest and self-reported scores had the strongest relationship with ESL placement, and that highest and self-reported scores had the best classification efficiency. However, all score types had weak predictive power.
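
To make the comparison concrete, here is a minimal Python sketch of the four score-combination rules the authors compared, using hypothetical score reports for a single applicant (the field names and values are invented for illustration).

# Sketch of the score-combination rules compared by Kokhan and Lin (2014).
# The score reports below are hypothetical; in practice these would come
# from an applicant database with one row per TOEFL sitting.
from statistics import mean

# (test_date, total_score) pairs for one applicant, plus a self-reported score
reports = [("2013-04-12", 79), ("2013-08-30", 84), ("2013-11-02", 82)]
self_reported = 84

scores = [score for _, score in reports]

combined = {
    "average": mean(scores),         # mean across sittings
    "highest": max(scores),          # best sitting
    "latest": max(reports)[1],       # most recent sitting (ISO dates sort lexicographically)
    "self_reported": self_reported,  # what the applicant claims
}
print(combined)  # average is roughly 81.7; highest and self-reported are both 84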

Monfils, L. F., & Manna, V. F. (2021). Time to achieving a designated criterion score level: A survival analysis study of test taker performance on the TOEFL iBT test. Language Testing, 38(1), 154-176. https://doi.org/gmczmp

The authors investigated how long it takes test-takers to reach CEFR B2 and C1 criterion scores by analyzing the TOEFL iBT scores of about 100,000 test-takers; TOEFL iBT scores of 72 and 95 correspond to the B2 and C1 benchmarks. They also looked at the extent to which gender, age, time spent studying English, time spent in English content classes, time spent in an English-speaking country, reason for taking the TOEFL, language group, and number of test sittings affected the likelihood of achieving the criterion levels. A larger percentage of test-takers reached the B2 level by the second sitting (75%) than the C1 level (25%), and doing so took less time for B2 (78 days) than for C1 (92 days). The likelihood of reaching the minimum levels was also affected by language group (NIE vs. IE), time spent studying English, graduate vs. undergraduate program, and business vs. non-business program. Females were less likely than males to reach C1 (more investigation is needed here), but the chances were about the same for B2. The findings support the claim that less time is needed to achieve lower levels of proficiency than higher ones. This study was one of the first to use survival analysis to analyze the scores of international EFL test-takers.
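
As a rough illustration of the time-to-event framing, the sketch below fits a Kaplan-Meier curve to hypothetical "days until a test-taker first reaches 72" data using the lifelines library; the published study used a fuller survival model with the covariates listed above.

# Minimal Kaplan-Meier sketch of "time to reaching a criterion score"
# (here, the B2 benchmark of 72 on the TOEFL iBT). Column names and data
# are hypothetical; the published study used a larger model with covariates.
import pandas as pd
from lifelines import KaplanMeierFitter

df = pd.DataFrame({
    "days_to_b2": [78, 120, 45, 200, 365],  # days from first sitting to first score >= 72
    "reached_b2": [1, 1, 1, 0, 1],          # 0 = censored (never reached B2 in the observation window)
})

kmf = KaplanMeierFitter()
kmf.fit(durations=df["days_to_b2"], event_observed=df["reached_b2"])
print(kmf.median_survival_time_)  # median number of days to reach the criterion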

Zhang, Y. (2008). Repeater analyses for TOEFL iBT. ETS Research Memorandum, ETS RM-08-05. https://bit.ly/3W2asaR

Zhang examined the relationship between the scores of test-takers who sat the TOEFL iBT twice within a month in 2007. The scores of about 12,300 repeaters, who self-selected for the study, were analyzed; about 30 test-takers with large score differences were identified as outliers and removed from the data set. Zhang found differences in mean scores on each of the four sections of the TOEFL iBT (reading, writing, listening, and speaking), medium-to-strong correlations between scores from the first and second sittings in all sections, and small effect sizes for the mean-score differences. Zhang noted that the small effect sizes indicate that differences between scores from the first and second sittings, taken within one month, were minimal.
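
A minimal sketch of this kind of repeater analysis for one section, with invented scores, might look like the following (the exact effect-size formula in Zhang's memorandum may differ).

# Sketch of a basic repeater analysis for one TOEFL iBT section.
# Scores are hypothetical; Zhang's memorandum analyzed about 12,300 repeaters.
import numpy as np
from scipy.stats import pearsonr

first = np.array([21, 18, 25, 30, 17, 22])   # section scores, first sitting
second = np.array([23, 19, 24, 30, 20, 23])  # same test-takers, second sitting

mean_diff = second.mean() - first.mean()
r, _ = pearsonr(first, second)               # correlation between sittings

# Cohen's d using the pooled standard deviation of the two sittings
# (the effect-size formula used in the report may differ).
pooled_sd = np.sqrt((first.var(ddof=1) + second.var(ddof=1)) / 2)
d = mean_diff / pooled_sd

print(f"mean gain = {mean_diff:.2f}, r = {r:.2f}, d = {d:.2f}")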

Test Prep or Coaching Centers

Hu, R., & Trenkic, D. (2019). The effects of coaching and repeated test-taking on Chinese candidates' IELTS scores, their English proficiency, and subsequent academic achievement. International Journal of Bilingual Education and Bilingualism, 24(10), 1486-1501. https://doi.org/gm3zms

The authors of this study investigated how students who were enrolled in IELTS test-prep courses performed on (1) other measures of English proficiency and (2) their academic programs, compared to students who were not enrolled in test-prep courses. The participants were incoming international students from China enrolling in one-year master's programs in the UK; some programs were considered more demanding (e.g., those in English) than others (e.g., engineering). About half of the 153 students had completed test-prep courses, most were female, and most had completed the courses months before applying for admission. Students were given the Duolingo test and an English C-test at the time of admission, and performance in their academic programs was operationalized as their weighted grade at the end of their master's programs. The authors controlled for working memory and non-verbal communication. Some of the authors' major findings were that (a) the test-prep group scored lower on the Duolingo test and the C-test than the non-test-prep group, suggesting that their IELTS scores were inflated without a corresponding gain on other measures; (b) there was a negative relationship between the number of IELTS sittings and score gains within half bands; and (c) the "demand" of the program predicted differences in program grades (4 points for less demanding programs and 9 for more demanding ones), meaning we should consider program demand when setting minimum score requirements. A sketch of the group-comparison logic appears below.
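
Assuming a hypothetical data file and column names (c_test, prep_group, working_memory), a bare-bones version of the group comparison behind finding (a) could be sketched like this; it mirrors the logic rather than the authors' actual models.

# Sketch of the group-comparison logic: does the test-prep group score
# lower on an independent proficiency measure once a covariate is held
# constant? The file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("students.csv")  # hypothetical file: one row per student

model = smf.ols("c_test ~ prep_group + working_memory", data=df).fit()
print(model.summary())  # a negative prep_group coefficient would mirror finding (a)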

General English Assessments and University Admissions

Feast, V. (2002). The impact of IELTS scores on performance at university. International Education Journal, 3(4), 70-85. https://bit.ly/3H55O7J

Feast looked at the relationship between IELTS scores and the academic performance of international students at Australian universities. Feast noted that many studies have examined the relationship between IELTS scores and academic performance (most finding negative to moderate relationships) but argued that institutions should conduct their own studies to determine the appropriate cut scores used to inform admissions decisions. Feast used a multilevel hierarchical linear model to analyze IELTS and academic performance (GPA) data; GPA scores were collected across five semesters, or occasions. After controlling for the effect of occasion in the model, Feast looked at the impact of other variables on the relationship between IELTS scores and GPA, including students' program, age, gender, level of study, and so on. Feast found an effect of .39 after accounting for occasion; all other variables had small to moderate effects. Feast also explored five methods for altering cut-score levels and looked at what those changes would mean in terms of GPA improvements and the proportion of international students who would not have been admitted.
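
A minimal sketch of this kind of two-level model, assuming a hypothetical long-format data file with one row per student per semester, might look like the following (Feast's actual model specification may differ).

# Sketch of a two-level model in the spirit of Feast's analysis: repeated
# GPA measurements (occasions) nested within students, with IELTS as a
# student-level predictor. The file and column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("gpa_long.csv")  # hypothetical file: one row per student per semester

model = smf.mixedlm("gpa ~ ielts + occasion", data=df, groups=df["student_id"]).fit()
print(model.summary())  # the ielts coefficient is the effect of interest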

Settles, B. (2016). The reliability of Duolingo English Test scores. Duolingo Research Report, DRR-16-02. https://bit.ly/3Hodp1d

Settles reports on the reliability of scores from the Duolingo English Test's (DET) certified and practice tests (the scores analyzed were from a slightly older version of the test). Settles looked at several sources of error in DET scores, including error due to items (split-half reliability and Cronbach's alpha), measurement error as a function of test-takers' ability level (conditional SEM), and error due to time-related and trait factors (test-retest reliability and the effect size of the mean difference in scores of those who sat the DET twice within 30 days). Split-half and Cronbach's alpha estimates were high (certified: .96 and .93; practice: .83 and .81), more error was found in the extreme ranges (<20 and >85) of the DET operational score range (1-100), test-retest reliability was high (.84), and the effect size of the mean-score difference for those who sat the test twice within 30 days was small (Cohen's d = .10). Settles noted that the small gains in mean scores from the first to the second DET sitting were likely due to a practice effect, but the small effect size indicates that the differences are negligible.
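
For reference, the two internal-consistency estimates mentioned above can be computed as in the sketch below; the item-score matrix here is simulated, not DET data.

# Sketch of two internal-consistency estimates: split-half reliability
# (with the Spearman-Brown correction) and Cronbach's alpha. The item
# scores are simulated from a single latent trait for illustration.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=(200, 1))                      # latent ability of 200 test-takers
items = theta + rng.normal(size=(200, 40))             # 40 noisy item scores per test-taker

# Split-half: correlate odd- and even-item half-scores, then step up.
odd, even = items[:, ::2].sum(axis=1), items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
split_half = 2 * r_half / (1 + r_half)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of totals)
k = items.shape[1]
alpha = k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum() / items.sum(axis=1).var(ddof=1))

print(f"split-half = {split_half:.2f}, alpha = {alpha:.2f}")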

Standard Setting

Fulcher, G., & Svalberg, A. (2013). Limited aspects of reality: Frames of reference in language assessment. International Journal of English Studies, 13(2), 1-19.

Fulcher and Svalberg discuss the relationship between (1) setting cut scores on a test and (2) norm-referenced testing (NRT) and criterion-referenced testing (CRT). In particular, they emphasize that setting a cut score on a CRT is not the same thing as defining an actual point on the continuum of skills or tasks that makes up the domain of the CRT. They argue that setting a cut score on a CRT without considering the underlying skills or tasks that make up the CRT domain (e.g., by linking scores on the CRT to an outside frame of reference, such as the CEFR or some other, unrelated set of descriptors that define a different domain) corrupts our understanding of what is meant by "criterion" in a CRT. Crucially, a CRT, which is particularly suited to specific-purposes testing and whose tasks gauge readiness for the workplace, is defined by a specific set of tasks or skill levels that are needed to accomplish workplace tasks successfully or to perform skills at proficiencies that correspond to different levels of competency (e.g., novice, skilled, master). Only with a clear understanding of the domain of a CRT can we interpret the relationship between a cut score on the CRT and the tasks or levels of competency a student can complete or has mastered. Retrofitting an NRT domain onto a CRT by using a standard-setting study to define a cut score on the CRT in terms of the NRT is probably nonsense (e.g., linking CRT performance to levels of competency reflected by bands of TOEFL scores). The CRT domain is "validated" through the process of designing, revising, and using test specifications, which link the CRT domain to actual test items.

Griffee, D. T., & Gevara, J. R. (2011). Standard setting in the post-modern era for an ITA performance test. Texas Papers in Foreign Language Education, 15(1), 4-17. 

In this study, Griffee and Gevara discuss a standard-setting study that they conducted to set cut scores for an International Teaching Assistant (ITA) performance test. They were motivated to carry out the study because the previous cut scores were based on teachers' intuitions about the point at or above which incoming international graduate (master's and doctoral) students needed to perform to be eligible to teach classes and labs. The authors used the Contrasting Groups method (Livingston & Zieky, 1982) to set cut scores for their ITA test. The test consisted of ITA candidates giving a presentation, which was then rated by two raters on ten criteria; each criterion was scored out of 5. ITA candidates were classified as "masters" or "non-masters" by the program director, and the ITA test scores (out of 50) for these two groups were plotted against each other. The authors identified two candidate cut scores, 39 and 41, and ultimately went with the slightly lower cut score of 39 because a cut score of 41 would have left few ITA candidates eligible for positions as teaching assistants. Notably, the authors discussed the politics associated with standard-setting studies: in their case, department faculty wanted a lower cut score so that they would have more ITAs to teach classes and labs, while undergraduate students, who are taught by ITAs, wanted a higher cut score because their academic performance may be affected by ITAs who are less qualified to teach or less comprehensible when teaching.
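
One common way to operationalize the Contrasting Groups logic is to scan candidate cut scores and count misclassifications, as in the hypothetical sketch below; the authors themselves located their cut points by plotting the two groups' score distributions.

# Sketch of the Contrasting Groups logic: given total test scores (out of 50)
# and the director's master / non-master classifications, scan candidate cut
# scores and report how each would classify the two groups. The data are
# hypothetical; the authors arrived at candidate cut scores of 39 and 41.
masters = [42, 45, 39, 44, 41, 47, 40]
non_masters = [35, 38, 30, 40, 36, 33]

def misclassified(cut):
    # masters falling below the cut + non-masters at or above it
    return sum(s < cut for s in masters) + sum(s >= cut for s in non_masters)

for cut in range(30, 51):
    print(cut, misclassified(cut))
# The cut score with the fewest misclassifications is a candidate standard;
# practical consequences (e.g., how many ITAs remain eligible) still matter.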

Verhelst, N., Figueras, N., Prokhorova, E., Takala, S., & Timofeeva, T. (2019). Standard setting for writing and speaking: The Saint Petersburg experience. In Developments in Language Education: A Memorial Volume in Honour of Sauli Takala (pp. 278-302). 

The authors of this study report on a series of workshops held to set cut-score standards for an English test at Saint Petersburg State University (SPSU) in Russia. The test was used as an exit test for students graduating from bachelor's degree programs in English at SPSU, and the study was conducted to gather evidence supporting the claim that graduating students were performing at the CEFR B2 level. The real meat of this study is the authors' description of the standard-setting method they used to set cut scores on the speaking and writing portions of the test: the Body of Work (BoW) method (Kingston, Kahl, Sweeny, & Bay, 2001). The study included a number of judges, who were grouped into internal and external panels. Judges were given a set of speaking and writing performances to examine and were asked, "Is this performance at the CEFR B2 level?" Each performance had a rating attached to it, which was not shared with the judges. The authors note that the BoW method can be tricky, involving many steps and data or task requirements (see Cizek & Bunch, 2007), and they made a few modifications to it (e.g., they did not include a range-finding stage). They were able to pinpoint cut scores by examining the results of several analyses, such as plotting performance scores against the logit values of those scores and taking the point at which a performance had a 50-50 chance of being rated above or below the B2 level (logit = 0). This study is a great example of a rigorous standard-setting study in which the researchers made informed decisions "on the ground" to adapt the steps of the BoW method to their own test, participants, and circumstances.
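
The 50-50 decision point described above can be illustrated with a small logistic-regression sketch on invented judgment data: the cut score is the score at which the predicted probability of a "B2 or above" judgment is .5.

# Sketch of the final step described above: regress judges' B2 decisions
# (1 = at B2 or above, 0 = below) on performance scores and solve for the
# score with a 50-50 chance of a positive judgment (logit = 0).
# The data are hypothetical.
import numpy as np
import statsmodels.api as sm

scores = np.array([10, 12, 13, 14, 15, 16, 17, 18, 19, 20])
b2_judgment = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

X = sm.add_constant(scores)
fit = sm.Logit(b2_judgment, X).fit(disp=False)
intercept, slope = fit.params

cut_score = -intercept / slope  # score at which P(B2) = .5
print(round(cut_score, 1))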