Data

To test the portability and stability of profiling models across genres, we need datasets from different sources. Some of these sources may be closer to one another than others, which also provides useful information on model portability.

We use data from the following genres:

        • Twitter
        • YouTube
        • Children's writing
        • News/journalism
        • Personal diaries

For each genre there will be a training portion and a test portion. The distribution of genders will be controlled for (50/50). We will also aim to provide datasets of comparable sizes, so that training size does not become a relevant factor.

The datasets are composed of texts written by multiple users, with possibly multiple documents per user. The number of users per genre does not need to be balanced, nor does the number of documents per user.

Format

The data is distributed in the form of one XML-like file per genre, with one sample per element and attributes specifying an id, the genre (children|diary|journalism|twitter|youtube), and the gender (F|M). This is a sample:

<doc id="3140" genre="youtube" gender="M">

Ti sei spiegata benissimo complimenti

</doc>

Although there are five separate files, one per genre, we still include the genre information in each element, to make it easier to combine the different files in case participants want to use the whole dataset at once.
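As an illustration of how the format described above might be read, here is a minimal Python sketch. It rests on several assumptions that are not part of the task definition: the attributes always appear in the order id, genre, gender, as in the sample; the files are UTF-8 encoded; and the function names parse_docs and load_all are hypothetical. A regular expression is used rather than an XML parser because the files are only XML-like, with no single root element.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: attributes appear in the order id, genre, gender, as in the sample.
DOC_RE = re.compile(
    r'<doc\s+id="(?P<id>[^"]*)"\s+genre="(?P<genre>[^"]*)"'
    r'\s+gender="(?P<gender>[^"]*)"\s*>(?P<text>.*?)</doc>',
    re.DOTALL,
)


def parse_docs(raw):
    """Return one dict per <doc> element found in the raw file content."""
    return [
        {
            "id": m["id"],
            "genre": m["genre"],
            "gender": m["gender"],
            "text": m["text"].strip(),
        }
        for m in DOC_RE.finditer(raw)
    ]


def load_all(paths):
    """Combine several genre files into one list (UTF-8 encoding is assumed)."""
    docs = []
    for path in paths:
        docs.extend(parse_docs(Path(path).read_text(encoding="utf-8")))
    return docs


# The sample element shown above.
sample = """<doc id="3140" genre="youtube" gender="M">

Ti sei spiegata benissimo complimenti

</doc>
"""

docs = parse_docs(sample)
gender_counts = Counter(d["gender"] for d in docs)
print(docs[0]["genre"], docs[0]["gender"], docs[0]["text"])
```

Because every element carries its own genre attribute, the list returned by load_all can be filtered or grouped by genre after the files have been merged, and a count such as gender_counts can be used to check the gender balance of any subset.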