When seeking data for our problem, we found that detailed, open-source data on student-course enrollment at UVA was not available due to privacy and security concerns. We therefore created our own simplified data set to model reality, using the Computer Science Department as an example.
To do this, we created a list of 600 student observations and randomly assigned each one a year of enrollment, which gave us an approximately even distribution of students across years for comparison. We then reviewed Lou’s list and selected UVA computer-science courses and the most popular electives to use as our course list.
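The student list described above can be sketched in a few lines of Python. This is only an illustration, not the code used for the project: the four-year coding (1 through 4), the fixed seed, and the names `make_students`, `NUM_STUDENTS`, and `YEARS` are assumptions introduced here.

```python
import random
from collections import Counter

NUM_STUDENTS = 600
YEARS = (1, 2, 3, 4)  # assumption: years of enrollment coded 1 (first year) to 4 (fourth year)


def make_students(num_students=NUM_STUDENTS, seed=0):
    """Return a dict mapping a student id to a randomly assigned year."""
    rng = random.Random(seed)  # seeded only so the sketch is reproducible
    return {student_id: rng.choice(YEARS) for student_id in range(num_students)}


students = make_students()
# Count students per year to check that the split is roughly even.
year_counts = Counter(students.values())
```

Because each year is drawn with equal probability, `year_counts` should show roughly 150 students per year, with some random variation.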
To generate students' rankings of the 10 classes offered, we recognized that students' course selections are not random and are often based on unobservable characteristics, such as personal interests and the grade distribution of the class. However, from reviewing Lou’s list and from personal experience, we found that students are more likely to take courses at the level matching their year of enrollment because of the prerequisites needed for more advanced classes. This is not the case for all students, but we found that 1000-level courses typically consisted of over 50% first years and 4000-level courses had more fourth years than any other year. To account for this, we used a weighted random number generator: for each student-course pair it generated a number between 0 and 1, and we multiplied that number by 2 if the student's year and the course level corresponded. These values were then ranked to create each student's final course preference list.
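The weighting scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the course numbers are hypothetical placeholders, the year-to-level mapping (first year to 1000-level, and so on) and the use of uniform draws are our reading of the description, and we assume a higher score means a more preferred course. The names `rank_courses`, `course_level`, and `YEAR_TO_LEVEL` are introduced here for illustration.

```python
import random

# Hypothetical placeholder course numbers (10 courses), not the real course list.
COURSES = [1001, 1002, 2001, 2002, 2003, 3001, 3002, 3003, 4001, 4002]

# Assumption: a student's year corresponds to the course level with the same leading digit.
YEAR_TO_LEVEL = {1: 1000, 2: 2000, 3: 3000, 4: 4000}


def course_level(course_number):
    """Map a course number such as 2003 to its level, 2000."""
    return (course_number // 1000) * 1000


def rank_courses(year, course_numbers, rng, boost=2.0):
    """Return course numbers ordered from most to least preferred."""
    scores = {}
    for number in course_numbers:
        score = rng.random()  # random number in [0, 1)
        if course_level(number) == YEAR_TO_LEVEL[year]:
            score *= boost  # double the score when year and course level correspond
        scores[number] = score
    # Assumption: a higher score means the course is ranked higher.
    return sorted(scores, key=scores.get, reverse=True)


rng = random.Random(0)  # seeded only so the sketch is reproducible
first_year_ranking = rank_courses(1, COURSES, rng)
```

Under this scheme a matching course can score up to 2 while a non-matching course stays below 1, so matching courses tend to land near the top of a student's list without being guaranteed the first position.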
While our data set does not perfectly model reality, we recognize that all models must trade off simplicity against accuracy. We believe that our data set models reality effectively given the information constraints present and the need for simplicity to minimize costs. Furthermore, our methods and models will provide similar results regardless of the methodology used to generate student-course rankings, so the accuracy of our test data set does not significantly affect the evaluation of our models. Since the aim of our research project is to gain a broader understanding of different scheduling algorithms rather than to prove that our model is more effective than the current one, a more detailed data set was not needed to support our results. In addition, with a larger data set that more accurately represents the entire UVA scheduling system, our algorithm would not be able to process all inputs into feasible outputs in a timely manner.