MC Answer Boxes Dataset

The Achievement of Higher Flexibility in Multiple Choice-based Tests Using Image Classification Techniques

Mahmoud Afifi(1,2) and Khaled F. Hussain(2)

1 York University, Canada 2 Assiut University, Egypt

Introduction

Despite the high accuracy of existing optical mark reading (OMR) systems and devices, a few restrictions remain. In this work, we aim to reduce the restrictions on multiple choice questions (MCQ) within tests. Unlike other systems that rely on simple image processing steps to recognize the extracted answer boxes, we approach the problem from another perspective by using machine learning techniques to classify each answer box. Because machine learning classifiers typically require a large number of training examples, we present a dataset that consists of five real MCQ tests and a quiz, with different answer sheet templates.

Our dataset contains 6 different real multiple choice (MC)-based exams (735 answer sheets and 33,540 answer boxes) that are FREE for reasonable academic fair use. The dataset is presented to evaluate computer vision techniques and systems developed for MC test assessment systems.

The main features of the MC Answer Boxes Dataset are:

Free for reasonable academic fair use.
Real MC tests.
Different templates.
Covers three types of answer boxes: confirmed, crossed out (canceled), and empty (blank).
Detailed metadata is provided.

The dataset comprises five real MC tests and one quiz that were held by three faculties at Assiut University: Faculty of Social Work, Faculty of Specific Education, and Faculty of Computers and Information. The model answer of each exam is provided. The documents were scanned using an HP Scanjet Enterprise Flow N9120 Flatbed Scanner. The scanned documents were saved in the XML Paper Specification (XPS) file format and then converted to PNG images. Both sets (i.e., the XPS files and the PNG images) are available for free download. The tests have different styles, as shown in Figure 1.

Figure 1: Samples of the answer sheets

The metadata of each exam was created manually by 7 volunteers. The ROIs of the answer boxes were determined in a semi-automated way by marking the answer boxes on the attached model answer sheet. Each answer sheet was then aligned with the reference image (i.e., the image of the model answer sheet) to compensate for any misalignment introduced during scanning. The volunteers labeled each extracted ROI (i.e., answer box) as confirmed, crossed out (canceled), or empty (blank). Additional information is reported for each exam, such as the written language, the correct answer, the student ID location, the number of pages, the page number of the current PNG image, and the grade of each answer sheet.
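The reason alignment matters is that the same box coordinates, marked once on the model answer sheet, are reused for every answer sheet. The following is a minimal sketch of that idea, not the authors' Matlab implementation: the `Box` type, the `crop_boxes` function, and the list-of-rows image representation are all hypothetical and chosen only to keep the example dependency-free.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical ROI rectangle, in pixel coordinates of the reference image."""
    x: int       # leftmost column
    y: int       # topmost row
    width: int
    height: int


# Assumption: a grayscale image stored as a list of rows of pixel values.
Image = List[List[int]]


def crop_boxes(aligned_sheet: Image, boxes: List[Box]) -> List[Image]:
    """Cut every ROI out of a sheet that is already aligned to the reference."""
    crops = []
    for b in boxes:
        rows = aligned_sheet[b.y : b.y + b.height]
        crops.append([row[b.x : b.x + b.width] for row in rows])
    return crops


# Toy 4x6 "sheet" whose pixel value encodes its position (10 * row + column).
sheet = [[10 * r + c for c in range(6)] for r in range(4)]
boxes = [Box(x=1, y=0, width=2, height=2), Box(x=3, y=2, width=3, height=2)]
crops = crop_boxes(sheet, boxes)
```

If a sheet were shifted by even a few pixels relative to the reference, the same rectangles would cut through neighboring boxes, which is why the alignment step precedes extraction.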

Table 1: Summary of the five exams and the quiz

Table 2: Description of the XPS version of the dataset

Formatting

As mentioned earlier, the dataset comes in two formats: (1) XPS documents and (2) PNG images. The XPS version consists of 13 XPS files. In the PNG version, each image represents a single page of an answer sheet. The filename contains the exam number, the answer sheet number, and the page number. For example, "exam0_13_2" refers to page 2 of answer sheet 13 of exam0. The metadata describes each answer sheet by a set of variables whose details are given in the paper.
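The naming convention can be decoded programmatically. The sketch below assumes the pattern is exactly `exam<number>_<sheet>_<page>`, optionally followed by a `.png` extension; the `PageName` type and `parse_page_name` function are hypothetical helpers, not part of the released code.

```python
import re
from typing import NamedTuple


class PageName(NamedTuple):
    exam: int    # exam number, e.g. 0 for "exam0"
    sheet: int   # answer sheet number within the exam
    page: int    # page number within the answer sheet


# Assumption: names look like "exam0_13_2", with an optional ".png" suffix.
_PATTERN = re.compile(r"^exam(\d+)_(\d+)_(\d+)(?:\.png)?$", re.IGNORECASE)


def parse_page_name(name: str) -> PageName:
    """Split a PNG page name into exam, answer sheet, and page numbers."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unexpected page name: {name!r}")
    return PageName(*(int(group) for group in match.groups()))


info = parse_page_name("exam0_13_2")
```

Here `info` holds exam 0, answer sheet 13, and page 2, matching the example in the text. Names that do not follow the assumed pattern raise a `ValueError` rather than being silently misread.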

The paper

The paper is available here.

Citation

Cite as:

Mahmoud Afifi and Khaled F. Hussain, “The Achievement of Higher Flexibility in Multiple Choice-based Tests Using Image Classification Techniques,” International Journal on Document Analysis and Recognition (IJDAR), 2019.

Bibtex:

@article{afifi2019MCQ,
  title={The Achievement of Higher Flexibility in Multiple Choice-based Tests Using Image Classification Techniques},
  author={Afifi, Mahmoud and Hussain, Khaled F.},
  journal={International Journal on Document Analysis and Recognition},
  year={2019}
}

Download

Answer sheets (6 exams) | 527 MB: Download
Original XPS files of the answer sheets (6 exams) | 90.7 MB: Download
Model Answers (6 exams) | 4.84 MB: Download
Extract ROI (answer boxes) from answer sheets | 2 KB: Download. Before running the code, copy all model answers into a directory named "images". Alternatively, download all confirmed, crossed out, and empty answer boxes from the following links:
All confirmed answer boxes | 137 MB : Download
All crossed out answer boxes | 2.22 MB: Download
All empty answer boxes | 65.4 MB: Download
5-fold cross-validation sets for classification | 90.1 MB: Download
Metadata and ground-truth of answer sheets | 449 KB: Download
Metadata and ground-truth of model answers | 3.56 KB: Download
Source code (Matlab) | 20 KB: Download, the guide file (txt) | 3 KB: Download
GUI demo (Matlab) | 7,606 KB: Download, the guide file (txt) | 2 KB: Download
New set for cross-dataset evaluation | 22.2 MB: Download
Baseline model for classification | 202 KB: Download
Source code for training the baseline model (Matlab) | 1.98 KB: Download
Demo code to report the classification results, e.g., accuracy and recall (Matlab) | 4.39 KB: Download

Contact us

Questions and comments can be sent to:

m.afifi[at]aun[dot]edu[dot]eg or mafifi[at]eecs[dot]yorku[dot]ca

Results

This section presents our results and the top results obtained using the dataset. For comparison, we suggest following the evaluation criteria presented in the paper for grading accuracy and using the 5-fold cross-validation sets, available in the Download section above, for classification accuracy.
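As a rough illustration of how classification accuracy and per-class recall might be reported over the cross-validation folds, here is a minimal sketch. It is not the released Matlab demo: the label strings, function names, and the choice to average accuracy uniformly across folds are assumptions made for this example.

```python
from typing import Dict, List, Sequence, Tuple

# Assumption: the three answer box types are encoded as these label strings.
CLASSES = ("confirmed", "crossed_out", "empty")


def accuracy(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Fraction of answer boxes whose predicted type matches the ground truth."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """For each class, the share of its true boxes that were predicted correctly."""
    recalls = {}
    for cls in CLASSES:
        total = sum(t == cls for t in y_true)
        hits = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        recalls[cls] = hits / total if total else float("nan")
    return recalls


def mean_accuracy(folds: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Unweighted average of the per-fold accuracies."""
    return sum(accuracy(t, p) for t, p in folds) / len(folds)


# Toy example with two folds; the second fold is predicted perfectly.
truth = ["confirmed", "confirmed", "empty", "crossed_out"]
preds = ["confirmed", "empty", "empty", "crossed_out"]
fold_scores = mean_accuracy([(truth, preds), (truth, truth)])
```

Per-class recall is worth reporting alongside accuracy because crossed out boxes are far rarer than confirmed or empty ones in this dataset, so overall accuracy alone can hide poor performance on that class.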

[1] H. Deng, F. Wang, and B. Liang, “A low-cost OMR solution for educational applications,” in Parallel and Distributed Processing with Applications, 2008. ISPA’08. International Symposium on. IEEE, 2008, pp. 967–970.

[2] P. Sanguansat, “Robust and low-cost optical mark recognition for automated data entry,” in Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), 2015 12th International Conference on. IEEE, 2015, pp. 1–5.

[3] Mahmoud Afifi and Khaled F. Hussain, “The Achievement of Higher Flexibility in Multiple Choice-based Tests Using Image Classification Techniques,” International Journal on Document Analysis and Recognition (IJDAR), 2019.

Acknowledgement

We would like to thank Rana Mostafa, Tasneem Ahmed, Lamiaa Mohsen, Heba Hashem, Amera Talaat, Noha Hassan, and Alaa Saad for their help in generating metadata of our dataset.

The dataset and accompanying files are provided here for reasonable academic fair use.
