2015 Wang, Heffernan & Heffernan

  1. Wang, Y., Heffernan, N., & Heffernan, C. (2015). Towards better affect detectors: effect of missing skills, class features and common wrong answers. In Proceedings of the Fifth International Conference on Learning Analytics And Knowledge, pp. 31-35. See data here and here.

This is related to the other affect-related papers.

FAQ

Professor Heffernan was asked:

We are trying to use the ASSISTments data for a course project. We found the hand-coded affect labels of 3,075 students here: https://drive.google.com/folderview?id=0B9MXO4ELrnzyUjdlVE1PWElHSE0&usp=sharing. We are looking at your recent paper "Towards better affect detectors: effect of missing skills, class features and common wrong answers" and would need the same dataset to work on. Our goal is to build better affect detectors, so we plan to use this data as training data for our model.

We had a couple of questions for you:

  1. This looks like raw data, and affect is coded for only a few rows of the problem log. Our assumption is that we should apply aggregator functions (min, max, mean, and sum) over the uncoded problem-log rows up to each row with a coded affect, so that every coded row gets a full set of features. This would then serve as the training data for our models. Is our understanding correct?

  2. We also see that the urban data has only 47 features, whereas your papers say there are 58 features, which when aggregated would give 232. Are we looking at the right data?

It would be very helpful if you could share the transformed data used in your paper, so we have good baseline measures.
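The windowed aggregation described in question 1 can be sketched in plain Python. This is only an assumption about the intended scheme, not the authors' actual pipeline, and the single numeric field `value` is a hypothetical stand-in for the real feature columns:

```python
# Sketch of the assumed aggregation: for each hand-coded affect label,
# collapse all uncoded rows up to and including the coded row into one
# training example using min / max / mean / sum of each numeric feature.

def aggregate_to_labels(rows):
    """rows: list of dicts with a numeric 'value' and an optional
    'affect' label (None for uncoded rows). Returns one aggregated
    example per coded row."""
    examples, window = [], []
    for row in rows:
        window.append(row["value"])
        if row.get("affect") is not None:
            examples.append({
                "min": min(window),
                "max": max(window),
                "mean": sum(window) / len(window),
                "sum": sum(window),
                "affect": row["affect"],
            })
            window = []  # start a fresh window after each coded row
    return examples

logs = [
    {"value": 2.0, "affect": None},
    {"value": 4.0, "affect": None},
    {"value": 6.0, "affect": "BORED"},
    {"value": 1.0, "affect": None},
    {"value": 3.0, "affect": "CONCENTRATING"},
]
print(aggregate_to_labels(logs))
# → two examples, one per coded row
```

Each hypothetical window here resets after a coded row; whether windows should instead overlap or span clip boundaries is exactly the kind of detail the question asks the authors to confirm.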

Part 2

Hi Prof Neil and YuTao,

To help debug, here are the observations we had with the shared data:

1. Raw data (total number of features):

Urban - 19

Rural - 19

Suburban - 43

2. Combined and resampled data after aggregation (number of useful features):

Confused - 204

Bored - 188

Frustrated - 204

Concentrating - 204

Bored is missing: SumsumHelp, SumsumofRightPerSkill, SumsumRight, SumsumTimePerSkill

But Bored still has other sum features: SumsumSameSkillWrongonFirstAttempt, SumsumTime3SDWhen3RowRight, SumsumTime5SDWhen5RowRight
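The per-affect feature gaps above can be computed mechanically by diffing each file's column list against the union of all of them. A minimal sketch in plain Python, with short hypothetical column lists standing in for the real 188–204-column headers:

```python
# Sketch: given the column names of each per-affect feature file, report
# which columns each one is missing relative to the union of all files.
# The feature lists below are hypothetical stand-ins for the real headers.

def missing_features(feature_files):
    """feature_files: dict mapping file name -> iterable of column names.
    Returns dict mapping file name -> sorted list of columns it lacks."""
    union = set()
    for cols in feature_files.values():
        union |= set(cols)
    return {name: sorted(union - set(cols))
            for name, cols in feature_files.items()}

feature_files = {
    "Confused": ["SumsumHelp", "SumsumRight", "SumsumTimePerSkill"],
    "Bored": ["SumsumTimePerSkill"],  # hypothetically missing two columns
}
print(missing_features(feature_files))
# → {'Confused': [], 'Bored': ['SumsumHelp', 'SumsumRight']}
```

Running this over the real Confused/Bored/Frustrated/Concentrating headers would pinpoint exactly which four sum features the Bored file lacks.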

We are primarily interested in the raw data, but we have two major issues:

1. With the differences in the number of features across urban, rural, and suburban, we cannot combine the data efficiently.

2. Even if we combine the data while accounting for the missing features, the number of features does not match what is mentioned in the paper, so we cannot reproduce your results as a starting point.

Your guidance would be greatly appreciated!
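One common workaround for issue 1 is to combine the three datasets on the intersection of their feature sets, dropping columns that only some files have. A minimal sketch in plain Python, with hypothetical toy rows (the real files have 19 / 19 / 43 raw features):

```python
# Sketch: merge datasets with mismatched feature sets by keeping only the
# columns common to all of them. Toy rows below are hypothetical; this is
# one possible workaround, not the authors' procedure.

def combine_on_common(datasets):
    """datasets: list of datasets, each a non-empty list of dicts mapping
    feature name -> value. Returns one combined list of rows restricted
    to the features present in every dataset."""
    common = set(datasets[0][0])
    for rows in datasets[1:]:
        common &= set(rows[0])
    combined = []
    for rows in datasets:
        for row in rows:
            combined.append({k: row[k] for k in sorted(common)})
    return combined

urban = [{"time": 1.0, "hints": 2, "urban_only": 9}]
rural = [{"time": 3.0, "hints": 0}]
print(combine_on_common([urban, rural]))
# → [{'hints': 2, 'time': 1.0}, {'hints': 0, 'time': 3.0}]
```

The trade-off is that intersecting columns discards information (here the hypothetical `urban_only` feature), which is why the feature counts still would not match the paper's 58.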

To clarify a little, the (1) and (2) data sets that Shamya described are from the two links under the top paper on https://sites.google.com/site/assistmentsdata/home/2012-13-school-data-with-affect (the "here and here" part). As I understand it, the "rawest" data is the main data set on that page. But to distill that into the data set used to train the models in the paper, we would need to engineer the 58 features from the action-field contents ourselves, which I think is a bit beyond the scope of our class project, unless there's a detailed description of each of the features somewhere.

Professor Heffernan and YuTao Wang responded:

1. The feature extraction and affect detection methodology can be found in the original affect detection paper (link). The features came from various prior studies, and we used different pieces of Java code from them (link to Java code).

2. The 58 features and 232 aggregated features were the numbers given by the original paper. Another student did the feature extraction work for the ASSISTments dataset before me, and shared with me the features and the steps to get them. Since he didn't mention any change to the feature set, I didn't notice that some of the features were missing. My guess is that, since the original feature set was built on the PSLC DataShop dataset, some features might be hard to get or not useful for our dataset; also, the Java code might run into issues on some ill-formatted data and generate non-useful features, so they were deleted as a preprocessing step before feature selection.

These two docs might be helpful if you want to regenerate these features:

1) The steps that the student previously working on this shared with me: link.

2) Since the steps in 1) are not very clear (there are no input and output file formats, and no description of how the files transform between steps), I wrote a more detailed doc with all the assumptions I made while trying to go through this process on a sample dataset. The assumptions are not guaranteed to be correct (my attempt to replicate the exact feature set from the raw dataset was not successful), but I'm still putting it here because it might help you understand the code a little better: document link, sample files link.

Since most of my work can be done on the final feature set, I didn't spend too much time on feature extraction. As Ryan mentioned, given the current state of the data and code management, perfect replication is not feasible.