
Project: Automatic Compound Processing (AuCoPro)

Name: Automatic Compound Processing

Duration: 2012-2013

Funded by:

  • Dutch Language Union (Belgium, The Netherlands)
  • Department of Arts and Culture (South Africa)
  • National Research Foundation (South Africa) (Grant number: 81794)
  • European Network on Word Structure (NetWordS) (European Science Foundation) (Grant number: 5570)

Project leaders:


  • http://www.clips.uantwerpen.be/
    Walter Daelemans - Project Leader: Semantics
    CLiPS - Computational Linguistics Group
    University of Antwerp, Belgium


  • http://www.tilburguniversity.edu/research/institutes-and-research-groups/ticc/cc.htm
    Menno van Zaanen - Project Leader: Segmentation
    TiCC - Tilburg Center for Cognition and Communication
    Tilburg University, The Netherlands


Project Collaborators


Other Collaborators:

  • North-West University (South Africa)
    Roald Eiselen,
    Benito Trollip, Joani Liversage, Zandre Botha, Martin Puttkammer, Martin Schlemmer, Carli de Wet, Nadia Schultz, Nanette Van Den Berg, Sansi Eiselen

  • Tilburg University (The Netherlands)
    Rick Smetsers, Nanne van Noord, Vincent Lichtenberg, Bas Goris, Sylvie Bruys, Suzanne Aussems
  • University of Antwerp (Belgium)
    Natasja Loyens, Maxim Baetens

Short URL: http://tinyurl.com/aucopro

Sourceforge URL: https://sourceforge.net/projects/aucopro/


Overview

In many human language technology (HLT) applications (e.g. machine translation systems, spelling checkers), concatenatively written compounds (e.g. “skrywerspen”/“schrijverspen” ‘writer’s pen’) are often processed incorrectly (e.g. they are not found in a lexicon). From a technological perspective, deficiencies in automatic compound segmentation are particularly problematic, since concatenative compounding is a highly productive process in many languages, including Dutch and Afrikaans. Although a compound splitter has already been developed for Afrikaans (Van Huyssteen and Van Zaanen, 2004), its reported accuracy of circa 90% could be improved, and the annotation protocol and data need to be revised. More importantly, no stand-alone compound splitter is available for Dutch: existing research in this field is more than ten years old (e.g. Pohlmann and Kraaij, 1996), uses expensive resources (e.g. Ordelman et al., 2003), performs complete morphological analysis (e.g. De Pauw et al., 2004), and/or has not been released for re-use in the open-source domain. In subproject 1, we will therefore attempt to develop robust compound splitters for both Afrikaans and Dutch through a combination of technology recycling (Pilon et al., 2010) and data pooling (i.e. joining (converted) training material for the two languages in one training set), as well as experimentation with sequence classification (Van Zaanen & Gaustad, 2010; Van Zaanen et al., 2011).
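As a rough illustration of how segmentation can be framed as a classification problem over pooled data, the sketch below turns segmented compounds into one training instance per character position and joins the Afrikaans and Dutch instances in a single set. It is not the project's implementation: the "+" boundary notation, the context window of three characters, and the function names `boundary_instances` and `pool` are all assumptions made for this example, and the conversion step mentioned above (technology recycling) is left out.

```python
from typing import Iterable, List, Tuple

Instance = Tuple[str, str, int]  # (left context, right context, boundary label)


def boundary_instances(segmented: str, window: int = 3, sep: str = "+") -> List[Instance]:
    """Turn one segmented compound into per-position training instances.

    The "+" notation is a hypothetical annotation format. Label 1 means a
    constituent boundary falls between the left and right context.
    """
    word = segmented.replace(sep, "")
    # Collect the character offsets in the unsegmented word where a boundary occurs.
    boundaries = set()
    offset = 0
    for part in segmented.split(sep)[:-1]:
        offset += len(part)
        boundaries.add(offset)
    instances = []
    for i in range(1, len(word)):
        left = word[max(0, i - window):i]
        right = word[i:i + window]
        instances.append((left, right, 1 if i in boundaries else 0))
    return instances


def pool(*datasets: Iterable[str]) -> List[Instance]:
    """Data pooling: join the training material of several languages in one set."""
    pooled = []
    for data in datasets:
        for segmented in data:
            pooled.extend(boundary_instances(segmented))
    return pooled


# The example compound from the text, with the linking "s" as its own segment.
afrikaans = ["skrywer+s+pen"]
dutch = ["schrijver+s+pen"]
training_set = pool(afrikaans, dutch)
```

Any classifier that accepts such context features could then be trained on `training_set`; which learner and which features work best for the two languages is exactly what the subproject sets out to investigate.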

In addition to segmentation, a second part of the project focuses on the semantic analysis of compounds, i.e. determining that “boekrak” construes ‘case for books’, while “houtrak” means ‘case made of wood’. More advanced HLT applications, such as information extraction, question answering and machine translation systems, require proper semantic analysis of compounds. Internationally, research on automatic compound analysis has focused almost exclusively on English; no such work has been done for either Afrikaans or Dutch, so this project will be pioneering in this regard.

Although linguistic research on the topic has been done for both these languages, a uniform, cross-lingual framework does not exist yet, nor does an understanding of how compounding in these two languages differs systematically (see examples above). An attempt will therefore be made to consolidate existing research on both these languages (and other languages), and to postulate a cross-lingual annotation scheme compatible with the work of Ó Séaghdha (2008). Since no semantic analyser exists for either language, in subproject 2 we will then develop first-generation analysers for Afrikaans and Dutch simultaneously, using bootstrapping and data pooling (i.e. first develop a small training set of Afrikaans data, then train an Afrikaans analyser, then analyse Dutch data with the Afrikaans analyser, and subsequently join the data to train a next Afrikaans and/or Dutch analyser; this process continues in small increments until the desired performance has been reached). We will start with techniques that work well for English (based on distributional semantics and machine learning); see Hendrickx et al. (2010) for an overview of the current state of the art. We will try to improve these techniques and adapt them to the specific requirements of Afrikaans and Dutch.
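The bootstrapping and data pooling cycle described above can be sketched as a simple loop. This is only a toy illustration under several assumptions: the relation labels `FOR` and `MADE_OF` are hypothetical names derived from the glosses above, the Dutch compounds are our own illustrative examples, the "analyser" is a trivial frequency lookup rather than a distributional model, and the manual `review` step is assumed rather than stated in the text.

```python
from collections import Counter, defaultdict


def train(examples):
    """Toy analyser: predict the most frequent relation seen for a modifier,
    falling back to the overall most frequent relation."""
    by_modifier = defaultdict(Counter)
    overall = Counter()
    for (modifier, _head), relation in examples:
        by_modifier[modifier][relation] += 1
        overall[relation] += 1
    default = overall.most_common(1)[0][0]

    def analyse(compound):
        modifier, _head = compound
        seen = by_modifier.get(modifier)
        return seen.most_common(1)[0][0] if seen else default

    return analyse


def bootstrap(seed, unlabelled_batches, review):
    """Train on the seed set, label each new batch, review it, pool it, retrain."""
    pooled = list(seed)
    analyser = train(pooled)
    for batch in unlabelled_batches:
        proposed = [(compound, analyser(compound)) for compound in batch]
        pooled.extend(review(proposed))  # assumed human correction step
        analyser = train(pooled)         # next-generation analyser
    return analyser, pooled


# Small Afrikaans seed set, using the two examples from the text.
seed = [(("boek", "rak"), "FOR"), (("hout", "rak"), "MADE_OF")]
# Illustrative Dutch batch (boekenkast 'bookcase', houtlijm 'wood glue').
dutch_batch = [("boek", "kast"), ("hout", "lijm")]
# Hypothetical corrections an annotator might supply during review.
corrections = {("hout", "lijm"): "FOR"}


def review(proposed):
    return [(c, corrections.get(c, label)) for c, label in proposed]


analyser, pooled = bootstrap(seed, [dutch_batch], review)
```

In this run the Afrikaans-trained analyser proposes `MADE_OF` for the second Dutch compound, the review step corrects it, and the corrected pair joins the pooled set before retraining. In practice the loop would stop once a held-out evaluation shows the desired performance.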

References

Daelemans, W., Buchholz, S. and Veenstra, J. 1999. Memory-Based Shallow Parsing. In: Proceedings of CoNLL-99. Bergen, Norway. June 12, 1999. 

Davel, M. and Barnard, E. 2004. A default-and-refinement approach to pronunciation prediction. In: Proceedings of PRASA. South Africa, November 2004, pp. 119–123. 

De Knop, S. and Dirven, R. 2008. Motion and location events in German, French and English: A typological, contrastive and pedagogical approach. In: De Knop, S. and De Rycker, T. (eds.) Cognitive Approaches to Pedagogical Grammar: A Volume in Honour of René Dirven. Berlin: Mouton de Gruyter. 

De Pauw, G., Laureys, T., Daelemans, W. and Van Hamme, H. 2004. A Comparison of Two Different Approaches to Morphological Analysis of Dutch. In: Proceedings of the Workshop of the ACL Special Interest Group on Computational Phonology (SIGPHON). Barcelona, Spain. pp. 62-69. 

Gast, V. forthcoming. Contrastive analysis: Theories and methods. In: Kortmann, B. and Kabatek, J. (eds.). Dictionaries of Linguistics and Communication Science: Linguistic theory and methodology. Berlin: Mouton de Gruyter. 

González, M. D. L. Á. G., Mackenzie, J. L. and Álvarez, E. M. G. 2008. Current Trends in Contrastive Linguistics: Functional and cognitive perspectives. Amsterdam: John Benjamins. 

Hendrickx, I, Kim, SM, Kozareva, Z, Nakov, P, Ó Séaghdha, D, Padó, S, Pennacchiotti, M, Romano, L & Szpakowicz, S. 2010. SemEval-2010 Task 8: Multi-Way Classification of Semantic Relations Between Pairs of Nominals. In: Proceedings of the SemEval-2 Workshop. Uppsala, Sweden. 

Hüning, M. 2009. Semantic niches and analogy in word formation: Evidence from contrastive linguistics. Languages in Contrast. 9(2): 183-201. 

Hüning, M. 2010. Diachronie in de synchronie. Over contrastieve taalkunde en taal(veranderings)theorie. In: Fenoulhet, J. and Renkema, J. (eds.) Internationale neerlandistiek: een vak in beweging. Gent: Academia Press. 

Mitchell, T.M. 1997. Machine learning. Boston: McGraw-Hill. 

Ó Séaghdha, D. 2008. Learning compound noun semantics. Technical report 735. Cambridge: University of Cambridge. 

OECD. 2002. Proposed standard practice for surveys on research and experimental development (Frascati Manual). Eurostat. 

Ordelman, R., Van Hessen, A. and De Jong, F. 2003. Compound decomposition in Dutch large vocabulary speech recognition. In: Proceedings of Eurospeech 2003. Geneva, Switzerland. 225–228. 

Pilon, S, Van Huyssteen, GB and Augustinus, L. 2010. Converting Afrikaans to Dutch for technology recycling. In: Proceedings of the 21st Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-7992-2470-2. 22-23 November. Stellenbosch, South Africa. pp 219-224. 

Pohlmann, R and Kraaij, W. 1996. Improving the precision of a text retrieval system with compound analysis. In: Proceedings of the 7th Computational Linguistics in the Netherlands (CLIN 1996). pp. 115-129. 

Quinlan, J.R. 1987. Generating production rules from decision trees. In: McDermott, J. Proceedings of the Tenth International Joint Conference on Artificial Intelligence (IJCAI-87): 304–307. 

Van Huyssteen, GB and Van Zaanen, MM. 2004. Learning Compound Boundaries for Afrikaans Spelling Checking. In: Proceedings of First Workshop on International Proofing Tools and Language Technologies. Patras, Greece. pp. 101-108. 

Van Huyssteen, GB. 2005. ’n Kognitiewe gebruiksgebaseerde beskrywingsmodel vir die Afrikaanse grammatika. [A Cognitive Usage-Based Description Model for Afrikaans Grammar]. Southern African Linguistics and Applied Language Studies. 23(2): pp. 125-137. 

Van Zaanen, M & Gaustad T. 2010. Grammatical Inference as Class Discrimination. In: Sempere, J & García, P. (eds.). Grammatical Inference: Theoretical Results and Applications. 6339, 245–257. 

Van Zaanen, M, Gaustad T & Feijen J. 2011. Influence of Size on Pattern-based Sequence Classification. In: Van der Putten, P, Veenman, C, Vanschoren, J, Israel, M & Blockeel, H. (eds.). Proceedings of the 20th Belgian-Dutch Conference on Machine Learning. The Hague, The Netherlands. pp 53–60. 

Veenstra, J., Van den Bosch, A., Buchholz, S., Daelemans, W. and Zavrel, J. 2000. Memory-Based Word Sense Disambiguation. Computers and the Humanities. 34(1-2): 171-177. 

Aims

The primary aim of this project is to develop resources (including annotation protocols, and training and testing data) for the development of:
  • robust compound splitters (subproject 1); and 
  • first-generation compound analysers (subproject 2); 
for Afrikaans and Dutch, through a combination of cross-language transfer (i.e. technology recycling), data pooling, and various machine learning approaches.

Secondary aims include:
  • to report on the research and development process in the form of: 
    • one Master’s degree dissertation; 
    • two fourth-year students’ projects (mini-dissertations); 
    • at least two scholarly papers, to be published in relevant journals or peer-reviewed conference proceedings; and 
    • various annotation protocols, made publicly available; 
  • to contribute towards human capital development and growth of the pool of experts in descriptive linguistics and computational linguistics in South Africa, Belgium and The Netherlands by offering bursaries, grants or contract work to undergraduate and post-graduate students; 
  • to extend the collaboration network between North-West University (NWU), Tilburg University (TU) and University of Antwerp (UA), by introducing young scholars and students to each other (i.e. extending the existing collaboration beyond Van Huyssteen–Van Zaanen–Daelemans); 
  • to identify new research issues as they unfold in the research and development process; and 
  • to contribute to the HLT-enabling of the languages of South Africa.

Outputs

Publications

  • van Zaanen, M., van Huyssteen, G., Aussems, S., Emmery, C., & Eiselen, R. In Press. The Development of Dutch and Afrikaans Language Resources for Compound Boundary Analysis. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014). May. Reykjavik, Iceland.
  • Verhoeven, B., van Huyssteen, G., van Zaanen, M., & Daelemans, W. In Press. Annotation Guidelines for Compound Analysis. In: CLiPS Technical Report Series (CTRS), 5. ISSN: 2033-3544.
  • Verhoeven, B., & Daelemans, W. 2013. Semantic Classification of Dutch Noun-Noun Compounds: A Distributional Semantics Approach. In: CLIN Journal, 3: 2-18. ISSN: 2211-4009. [paper]
  • Botha, Z., Eiselen, R., & van Huyssteen, G. 2013. Automatic Compound Semantic Analysis using Wordnets. In: Proceedings of the Twenty-Fourth Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-86970-771-5. 3 December. Pretoria, South Africa. pp. 1-6. [paper]

  • Aussems, S., Goris, B., Lichtenberg, V., van Noord, N., Smetsers, R., & van Zaanen, M. 2013. Unsupervised identification of compounds. In: Proceedings of the 22nd Annual Belgian-Dutch Conference on Machine Learning (Benelearn). 3 June. Nijmegen, The Netherlands. [paper]

  • Verhoeven, B., & van Huyssteen, G.B. 2013. More Than Only Noun-Noun Compounds: Towards an annotation scheme for the semantic modelling of other noun compound types. In: Proceedings of the Ninth Joint ACL - ISO Workshop on Interoperable Semantic Annotation. 19-20 March. Potsdam, Germany. [paper] [presentation]
  • Verhoeven, B., Daelemans, W., & van Huyssteen, G.B. 2012. Classification of Noun-Noun Compound Semantics in Dutch and Afrikaans. In: Proceedings of the Twenty-Third Annual Symposium of the Pattern Recognition Association of South Africa. ISBN: 978-0-620-54601-0. 29-30 November. Pretoria, South Africa. pp. 121-125. [paper] [presentation]

Resources

    • Annotation Guidelines for Compound Segmentation.
    • Annotation Guidelines for the Semantic Analysis of Noun-Noun Compounds in English, Dutch and Afrikaans.
      Including: Decision Tree and Paraphrasing Table

    • Annotation Guidelines for the Semantic Analysis of Other Nominal Compounds in Dutch and Afrikaans
      Specifically: Adjective-Noun, Verb-Noun, Quantifier-Noun and Preposition-Noun

  • Compound Semantics Dataset (compounds with semantic annotation)
    • Afrikaans
      • Afr-NN-FirstRound (1449 compounds)
      • Afr-NN-SecondRound (2328 compounds)
      • Afr-XN (4553 compounds)
    • Dutch
      • Ned-NN-FirstRound (1766 compounds)
      • Ned-NN-SecondRound (2000 compounds)
      • Ned-XN (600 compounds)
  • Compound Splitting Dataset (compounds annotated with constituent boundaries and linking elements)
    • Afrikaans (25,266 compounds)
    • Dutch (26,000 compounds)
For more information on these resources, contact Ben Verhoeven.

Talks 

  • Verhoeven, B., Daelemans, W., & van Huyssteen, G.B. 2013. Semantic Classification of Dutch and Afrikaans Noun-Noun Compounds. Presented at the 5th Workshop on African Language Technology (AfLaT 2013), Ghent, Belgium. 6 December 2013.
  • van Huyssteen, G.B., Verhoeven, B., & Daelemans, W. 2013. Bringing together interdisciplinary perspectives on compound semantics: Examples from Afrikaans and Dutch in the CompoNet database. Presented at the South African Microlinguistics Workshop (SAMWOP 2013), Vanderbijlpark, South Africa. 1 November 2013. [prezi]
  • Liversage, J., & van Huyssteen, G.B. 2013. Verifiëring van semantiese verhoudings in Afrikaanse naamwoord-naamwoordsamestellings. [Verification of semantic relations in Afrikaans noun-noun compounds.] Presented at the South African Microlinguistics Workshop (SAMWOP 2013), Vanderbijlpark, South Africa. 1 November 2013. [prezi]
  • van den Berg, N., & van Huyssteen, G.B. 2013. Samestellings met en afleidings van meerledige eiename. [Compounds of and derivations with multi-part proper names.] Presented at the South African Microlinguistics Workshop (SAMWOP 2013), Vanderbijlpark, South Africa. 1 November 2013. [prezi]
  • Trollip, B., & van Huyssteen, G.B. 2013. Herbeskouing van die interfiks in Afrikaans. [Reconsideration of the interfix in Afrikaans.] Presented at the South African Microlinguistics Workshop (SAMWOP 2013), Vanderbijlpark, South Africa. 1 November 2013.
  • Verhoeven, B., van Huyssteen, G.B., & Daelemans, W. 2013. Samenstellingen in het Afrikaans en Nederlands: Automatische semantische analyse en taalkundige implicaties. [Compounding in Afrikaans and Dutch: Automatic semantic analysis and linguistic implications.] Presented at the Graduate Conference of the Department of Linguistics, University of Antwerp, Belgium. 2 October 2013.
  • Verhoeven, B., van Huyssteen, G.B., & Daelemans, W. 2013. Samenstellingen in het Afrikaans en Nederlands: Automatische semantische analyse en taalkundige implicaties. [Compounding in Afrikaans and Dutch: Automatic semantic analysis and linguistic implications.] Presented at the Internationaal Seminarie Afrikaans, Ghent, Belgium. 9 September 2013.

  • Verhoeven, B., Daelemans, W., & van Huyssteen, G.B. 2013. Semantic Classification of Dutch and Afrikaans Noun-Noun Compounds. Presented at the 23rd Meeting of Computational Linguistics in the Netherlands (CLIN 2013), Enschede, The Netherlands. 18 January 2013. [pdf]

  • Aussems, S., Bruys, S., Goris, B., Lichtenberg, V., van Noord, N., Smetsers, R., & van Zaanen, M. 2013. Automatically Identifying Compounds. Presented at the 23rd Meeting of Computational Linguistics in the Netherlands (CLIN 2013), Enschede, The Netherlands. 18 January 2013.

  • Verhoeven, B., & Daelemans, W. 2012. Automatic Compound Processing (AuCoPro) - Semantic Analysis. Presented at ATILA 2012, Groesbeek, The Netherlands. 23 November 2012. [pdf]

  • Van Zaanen, M. 2012. Automatic Compound Processing (AuCoPro) - Identification for Segmentation. Presented at ATILA 2012, Groesbeek, The Netherlands. 23 November 2012. [pdf]

  • Verhoeven, B. 2012. AuCoPro: Project Presentation and Recent Developments. Presented at the Centre for Text Technology (CTexT), North-West University, Potchefstroom, South Africa. 7 September 2012. [pdf]

Dissertations (unpublished)

MASTER
  • Verhoeven, B. 2012. A Computational Semantic Analysis of Noun Compounds in Dutch. MA Thesis, University of Antwerp, Belgium. [pdf]

HONORS

  • Trollip, B. 2013. Herbeskouing van die interfiks in Afrikaanse komposita. [Reconsidering the interfix in Afrikaans compounds]. Honors Dissertation, North-West University, Potchefstroom, South Africa.
  • Liversage, J. 2013. Verifiëring van semantiese verhoudings in Afrikaanse naamwoord-naamwoordsamestellings. [Verification of semantic relations in Afrikaans noun-noun compounds]. Honors Dissertation, North-West University, Potchefstroom, South Africa.
  • van den Berg, N. 2013. Samestellings met en afleidings van meerledige eiename in Afrikaans en Nederlands. [Compounds with and derivations of multiple proper names in Afrikaans and Dutch]. Honors Dissertation, North-West University, Potchefstroom, South Africa.

BACHELOR

  • Trollip, B. 2012. Die klassifikasiemoontlikhede van nie-prototipiese samestellings. [The classification possibilities of non-prototypical compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.
  • De Wet, C. 2012. Semantiese ontleding van Afrikaanse NN-samestellings. [Semantic analysis of Afrikaans NN-compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.
  • Schultz, N. 2012. Die ontwikkeling van 'n verteenwoordigende verwysende datastel van Afrikaanse samestellings. [The development of a representative referential dataset of Afrikaans compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.
  • Liversage, J. 2012. Voorgestelde protokol vir die verwerking van X+N samestellings. [Proposed protocol for the processing of X+N compounds]. BA Dissertation, North-West University, Potchefstroom, South Africa.

Related projects/links

  • Scalise, S. CompoNet. University of Bologna, Italy. http://componet.sslmit.unibo.it
    CompoNet is a descriptive compound database for 27 languages, including Dutch and Afrikaans.




