The ARRAU Corpus is annotated for anaphoric information, focusing in particular on the ‘difficult’ cases of anaphora: plural anaphora, anaphora to abstract objects, and ambiguous anaphoric expressions.


See the Publications page.


Two coding manuals were written for ARRAU 1: one for the spoken dialogue data and one for the text data.

The ARRAU 2 release was annotated using the ARRAU 1 guidelines, supplemented by instructions from GNOME for grammatical function, semantic category, and genericity.

The guidelines were extensively tested and revised for ARRAU 3.


The ARRAU Corpus is available as follows:

  • The sub-corpora that can be freely distributed (at the moment, GNOME and Pear Stories) can be downloaded directly from the ARRAU corpus page on GitHub.

  • The Penn Treebank and TRAINS data are available from the LDC.

  • Any requester who can show they have purchased the Penn Treebank and TRAINS-93 from the LDC can also request the full corpus from the authors (contact: