A Quantitative Option
Quantitative Comparison of Chatbot Assessment Items
Designing an analytical study to measure how well dynamic, conversation-based assessment questions perform relative to static assessment questions is a viable alternative to documenting the design process itself. It is a natural extension of the topic, and it will be essential to the long-term success of the chatbot construct as an assessment. Chatbot questions could be compared to either selected-response or constructed-response questions, although they more closely resemble constructed-response items. A study in this domain could examine both reliability and sensitivity to instruction, each of which would contribute to a measure of validity or identify areas for further development of the item type.
Reliability refers to the consistency with which a test item elicits the same result each time it is administered. A natural language chatbot adds the complication of dynamic conversation: the chatbot might produce a slightly different exchange each time it is used, even with the same student. Design considerations must take this into account and build in assurances that every student has an equitable opportunity to meet the requirements of the task. Demonstrating that the chatbot performs consistently therefore requires a somewhat more complex examination of reliability than traditional test items demand.
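One simple way to quantify the kind of consistency described above is a test-retest correlation: score the same students on two administrations of the same chatbot item and correlate the results. The sketch below is illustrative only; the rubric scores are invented example data, not results from any actual study.

```python
# Hypothetical sketch: estimating test-retest reliability for a chatbot item.
# The 0-4 rubric scores below are invented for illustration, not real data.
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson correlation between paired score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)
    return cov / (stdev(xs) * stdev(ys))

# Each position is one student; same students scored on two administrations.
first_attempt = [3, 2, 4, 1, 3, 2, 4, 0]
second_attempt = [3, 3, 4, 1, 2, 2, 4, 1]

reliability = pearson_r(first_attempt, second_attempt)
print(f"test-retest reliability estimate: {reliability:.2f}")
```

A high correlation would suggest the dynamic conversation is not introducing excessive noise; a low one would flag exactly the equity concern raised above, since it would mean different sessions are giving students unequal opportunities to demonstrate the same knowledge.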
There is a growing effort to construct and test items designed to measure against the Next Generation Science Standards with reliability and sensitivity to instruction, which in turn establishes validity (Penuel, Harris, & DeBarger, 2015). One possible method would be to convert an item with established validity into a chatbot-based conversational question and compare performance across the two formats.
Assessing the Next Generation Science Standards is a challenging task. Items must not only measure recall and understanding of a body of knowledge; they must also evaluate how well a student can apply facts to explain, justify, and reason. These unique challenges require looking beyond the multiple-choice item type. James Pellegrino (2015) describes one possible process by which items could be developed with validity in mind. He advocates developing an assessment argument, which consists of a set of relationships among what he refers to as claims, evidence, and tasks. Claims refer to what we want to be able to say that students know and can do. Evidence is what would demonstrate to us that a student has that knowledge and those skills. The task, commonly referred to as a learning performance, is what we would consider the item question or prompt. Once these claims, evidence statements, and learning performances are established, rubrics are developed by which responses can be evaluated.
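The claims-evidence-tasks structure lends itself to a small data model. The sketch below is one hypothetical way to represent an assessment argument in code; the class, field names, and all example strings are illustrative assumptions, not content drawn from Pellegrino's framework itself.

```python
# Illustrative sketch only: representing an assessment argument
# (claim, evidence, task/learning performance, rubric) as data.
# All names and example strings are hypothetical.
from dataclasses import dataclass, field

@dataclass
class AssessmentArgument:
    claim: str       # what we want to say students know and can do
    evidence: str    # what would demonstrate that knowledge and skill
    task: str        # the learning performance: the item prompt itself
    rubric: dict = field(default_factory=dict)  # score -> descriptor

argument = AssessmentArgument(
    claim="Students can construct an evidence-based explanation of a phenomenon.",
    evidence="The response cites observed data and links it to a scientific principle.",
    task="Explain, in conversation with the chatbot, why ice melts faster on metal.",
    rubric={
        0: "No explanation, or explanation unrelated to the evidence.",
        1: "Explanation offered but not tied to the observed data.",
        2: "Explanation cites data and connects it to a relevant principle.",
    },
)
print(argument.task)
```

Making the rubric an explicit part of the same record keeps each chatbot item traceable back to the claim and evidence it was written to support, which is the relationship the assessment argument is meant to preserve.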