Data

Training data: https://github.com/malvinanissim/gxg/tree/master/Data/Training

Overview of the datasets: https://github.com/malvinanissim/gxg/wiki/Data-description

To test the portability and stability of profiling models across genres, we need datasets from different sources. Some of these sources may be closer to one another than others, which would itself provide interesting information on model portability.

We use data from the following genres:

Twitter
YouTube
Children's writing
News/journalism
Personal diaries

For each genre we will provide a training portion and a test portion. The gender distribution will be controlled (50/50). We will also aim to provide datasets of comparable size, so that training set size does not become a confounding factor.
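
As an illustration of the 50/50 constraint, the following is a minimal sketch of how a participant might verify the gender balance of one portion of the data. The label values "M" and "F", the function name is_balanced, and the tolerance parameter are assumptions made for this example; they are not part of the official data description.

```python
from collections import Counter


def is_balanced(labels, tolerance=0.0):
    """Return True if exactly two gender labels occur and each one
    accounts for half of the documents, within `tolerance`.

    `labels` is any iterable of gender labels, one per document.
    The label values themselves are assumed, not prescribed.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    # An empty portion, or one with a single label, is never balanced.
    if total == 0 or len(counts) != 2:
        return False
    return all(abs(n / total - 0.5) <= tolerance for n in counts.values())


# Hypothetical label lists for two portions of the data.
balanced_portion = ["M", "F", "M", "F"]
skewed_portion = ["M", "M", "F"]
```

A non-zero tolerance can be useful for portions with an odd number of documents, where an exact 50/50 split is impossible.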

The datasets are composed of texts written by multiple users, with possibly multiple documents per user. The number of users per genre does not need to be balanced, nor does the number of documents per user.

Format

Ti sei spiegata benissimo complimenti

</doc>

Although there are five separate files, one per genre, we still include the genre information in the CSV so as to ease the combination of the different files, in case participants want to use the whole dataset at once.
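
Because every record carries its genre, the per-genre files can be concatenated and later split again, for example to train on some genres and test on another. The sketch below assumes that each file has already been parsed into a list of dictionaries with a "genre" key. The key names, the genre values, and the helper names combine and hold_out_genre are hypothetical and only serve this illustration.

```python
from itertools import chain


def combine(*genre_datasets):
    """Concatenate several per-genre record lists into one list."""
    return list(chain.from_iterable(genre_datasets))


def hold_out_genre(records, genre):
    """Split records into (train, test), where test holds only the
    given genre and train holds every other genre."""
    train = [r for r in records if r["genre"] != genre]
    test = [r for r in records if r["genre"] == genre]
    return train, test


# Hypothetical parsed records; real field names may differ.
twitter = [{"genre": "twitter", "gender": "F", "text": "..."}]
youtube = [{"genre": "youtube", "gender": "M", "text": "..."}]
diaries = [{"genre": "diary", "gender": "F", "text": "..."}]

all_records = combine(twitter, youtube, diaries)
train, test = hold_out_genre(all_records, "youtube")
```

Keeping the genre inside each record means the cross-genre split needs no knowledge of which file a record came from.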
