GENERATIVE ARTIFICIAL INTELLIGENCE TOOLS
Co-authored with Charlene Polio
This chapter discusses how generative artificial intelligence (GenAI) tools, particularly large language models (LLMs) such as ChatGPT, are emerging as powerful web-based tools for research tasks, such as data analysis, in applied linguistics. While much attention has focused on pedagogical applications, we review how GenAI can be leveraged to support various stages of the research process in empirical studies, including instrument design, automated coding, text annotation, and qualitative data analysis. We address key concerns around validity and reliability as well as ethical considerations related to transparency, data privacy, and potential bias in AI-generated output. Because GenAI is at an early stage of research application, we describe its current capacities and limitations based on emerging empirical research and propose promising directions for future studies.
OPTIMIZING AI FOR ASSESSING L2 WRITING ACCURACY: AN EXPLORATION OF TEMPERATURES AND PROMPTS
Co-authored with Charlene Polio and Adam Pfau
This study investigates the impact of temperature and prompt settings on ChatGPT-4's assessment of second language (L2) writing accuracy. Building on Pfau et al. (2023), we used a corpus of 100 essays by L2 writers of English and examined how three temperature settings (0, 0.7, 1) and two prompt types (defined, undefined) influenced ChatGPT-4's performance in error detection compared to human coding. Results indicated that ChatGPT-4, while generally underestimating error counts compared to human coders, showed a strong positive correlation with human coding across various settings. Notably, prompts with a detailed definition of errors yielded higher correlation coefficients (ρ = 0.826 to 0.859) than those without (ρ = 0.692 to 0.702), suggesting that more detailed prompts enhance ChatGPT-4's performance. Descriptive statistics showed that with a less-detailed prompt, ChatGPT-4's error detection was nearly identical across temperature settings, yet with a more detailed prompt, its performance was slightly better at higher temperatures. We discuss the importance of temperature in relation to prompt specificity for reliable L2 writing accuracy assessment and provide suggestions for optimizing AI tools such as ChatGPT-4 for this purpose.
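The agreement statistic reported above, Spearman's ρ between per-essay error counts from a human coder and from the model, can be sketched as follows. This is a minimal illustration with invented error counts (not the study's data or actual correlation values), using a plain rank-based implementation rather than any particular statistics library:

```python
from statistics import mean

def ranks(values):
    # Assign 1-based ranks, averaging ranks for tied values,
    # as Spearman's rho requires.
    sorted_vals = sorted(values)
    rank_map = {}
    for v in set(values):
        first = sorted_vals.index(v) + 1
        count = sorted_vals.count(v)
        rank_map[v] = first + (count - 1) / 2
    return [rank_map[v] for v in values]

def spearman_rho(x, y):
    # Spearman's rho = Pearson correlation of the rank-transformed data.
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical per-essay error counts (illustrative only):
human_counts = [12, 5, 9, 3, 15, 7]
gpt_counts = [10, 4, 8, 3, 12, 9]  # model mostly undercounts
print(round(spearman_rho(human_counts, gpt_counts), 3))
```

Because ρ depends only on rank order, a model that systematically undercounts errors can still correlate strongly with human coding, which is consistent with the pattern of underestimation alongside high ρ described above.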
EXPLORING THE POTENTIAL OF CHATGPT IN ASSESSING L2 WRITING ACCURACY FOR RESEARCH PURPOSES
Co-authored with Adam Pfau and Charlene Polio
Research Methods in Applied Linguistics, 2023
This study investigates ChatGPT's potential for measuring linguistic accuracy in second language writing for research purposes. We processed 100 L2 essays across five proficiency levels with ChatGPT-4 and manually coded the output to calculate precision and recall for ChatGPT's identification of errors. Our findings indicate a strong correlation (ρ = 0.97 using one method and 0.94 using another) between ChatGPT's error detection and human coding, although this correlation diminishes at lower proficiency levels. While ChatGPT infrequently misidentifies errors, it often underestimates the total error count. The study also highlights ChatGPT's limitations, such as issues with output consistency, and provides guidelines for future research applications.
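The precision/recall framing above can be made concrete with a small sketch. The error labels and counts here are hypothetical, chosen only to illustrate the reported pattern: few false alarms (high precision) but missed errors (lower recall):

```python
# Hypothetical sets of coded errors for one essay (illustrative only).
# Each item stands for one error a coder marked, e.g. error type + location.
human_errors = {"subj-verb agr (s1)", "article (s2)", "tense (s3)",
                "preposition (s4)", "word form (s5)"}
gpt_errors = {"subj-verb agr (s1)", "article (s2)", "tense (s3)"}

true_pos = human_errors & gpt_errors  # errors both the model and human flagged

# Precision: of the errors the model flagged, how many the human confirms.
precision = len(true_pos) / len(gpt_errors)
# Recall: of the human-coded errors, how many the model also caught.
recall = len(true_pos) / len(human_errors)
print(precision, recall)  # 1.0 0.6
```

On this toy data the model flags nothing spurious (precision 1.0) yet finds only three of five human-coded errors (recall 0.6), mirroring the finding that ChatGPT rarely misidentifies errors but underestimates the total error count.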