Tutorial at ICON2022

Discourse Structure Analysis of Indian Languages (DSAIL2022)

December 18, 2022 | IIIT Delhi, India

Tutorial goals

In this full-day tutorial, we focus on the analysis of the discourse structure of Indian languages. We will introduce the basic concepts of discourse structure, such as discourse segments and discourse relations, and explain how these concepts are handled in discourse structure theories like RST and PDTB. In addition, we will give participants hands-on experience in applying these theories to text analysis, using examples from various Indian languages. Finally, we will discuss the prospects of using these theories to conduct discourse research in Indian languages and to develop the necessary linguistic resources and infrastructure.

Motivation

A discourse, or multi-sentential text, displays systematic patterns or structures, which are primarily established through discourse relations (also called coherence or rhetorical relations). These relations encode a semantic or pragmatic relationship between two discourse segments (e.g., cause, condition, contrast, elaboration). The discourse structure of a text can be explained by different theories, such as RST or PDTB. RST (Rhetorical Structure Theory) (Mann and Thompson, 1988; Taboada and Mann, 2006) is a descriptive theory of text organization that accounts for the complete structure of a text, representing both local relations (holding between basic discourse units) and global relations (holding between larger text units) in a tree structure. RST has been used to develop discourse-annotated corpora and discourse parsers for English and many other (non-Indian) languages. PDTB takes its name from the Penn Discourse Treebank corpus (Prasad et al., 2008; Webber et al., 2018) and is also widely known for its well-developed annotation scheme. PDTB provides annotation for discourse connectives (e.g., but, if, therefore), their arguments (discourse segments), and discourse relations. However, unlike RST, PDTB annotates each relation individually, without regard to the surrounding structure. In addition to discourse corpora in several languages, the PDTB framework has been used extensively to develop shallow discourse parsers.
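To make the contrast above concrete, the following is a minimal Python sketch and is not part of either framework's tooling. The class names PDTBRelation and RSTNode, the English toy text, and the simplified relation labels are all assumptions made for illustration; RST nuclearity and the full PDTB sense hierarchy are deliberately left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PDTBRelation:
    """One PDTB-style annotation (hypothetical, simplified):
    a connective, its two arguments, and a relation label."""
    connective: str
    arg1: str
    arg2: str
    sense: str


@dataclass
class RSTNode:
    """A node in an RST-style tree (hypothetical, simplified):
    either a leaf holding a basic discourse unit, or a relation over children."""
    relation: Optional[str] = None  # None for leaves
    text: Optional[str] = None      # set only for leaves
    children: List["RSTNode"] = field(default_factory=list)

    def leaves(self) -> List[str]:
        """Return the basic discourse units in left-to-right order."""
        if not self.children:
            return [self.text] if self.text is not None else []
        units: List[str] = []
        for child in self.children:
            units.extend(child.leaves())
        return units

    def depth(self) -> int:
        """Number of relation levels above the deepest leaf."""
        if not self.children:
            return 0
        return 1 + max(child.depth() for child in self.children)


# Invented toy text: "It rained heavily, so the match was cancelled,
# but the fans stayed in the stadium."
seg1 = "It rained heavily"
seg2 = "the match was cancelled"
seg3 = "the fans stayed in the stadium"

# PDTB style: a flat list of relations, each annotated on its own.
pdtb_relations = [
    PDTBRelation(connective="so", arg1=seg1, arg2=seg2, sense="cause"),
    PDTBRelation(connective="but", arg1=seg2, arg2=seg3, sense="contrast"),
]

# RST style: one tree covering the whole text, with a local relation
# (cause) embedded inside a more global one (contrast).
rst_tree = RSTNode(
    relation="contrast",
    children=[
        RSTNode(
            relation="cause",
            children=[RSTNode(text=seg1), RSTNode(text=seg2)],
        ),
        RSTNode(text=seg3),
    ],
)
```

In this sketch, the PDTB-style list records two independent relations, whereas the RST-style tree additionally states that the contrast holds between the whole cause span and the final segment.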

Unlike research on many other (mostly European) languages, research on discourse in Indian languages suffers, by and large, from a lack of reliable linguistic resources and digital infrastructure. Nevertheless, some initiatives have been taken to improve the situation. For example, Hindi has a medium-sized PDTB-style corpus (Oza et al., 2009), which has inspired a few studies on the language, mostly on connective disambiguation (Jain et al., 2016) and argument identification (Jain and Sharma, 2016). Similar annotation initiatives, albeit on a smaller scale, have also been undertaken for Dravidian languages like Tamil (Rachakonda and Sharma, 2011) and Malayalam (Gopalan et al., 2017). For Bangla, a lexicon of discourse connectives has been developed (Das et al., 2020), and work on an RST corpus is in progress (Das and Stede, 2018).

We strongly believe that discourse-based research has a promising future for Indian languages, for two main reasons. First, such research will contribute to developing the necessary resources and infrastructure for different low- and under-resourced Indian languages. Second, the resulting linguistic analyses will not only shed light on the discourse properties of those languages, but will also offer a strong testing ground for the major discourse structure theories, which have so far been applied mainly to non-Indian languages. In the DSAIL2022 tutorial, we aim to address these goals in greater detail.


References

Lynn Carlson, Daniel Marcu, and Mary Ellen Okurowski. 2002. RST Discourse Treebank, LDC2002T07.

Debopam Das, Tatjana Scheffler, Peter Bourgonje, and Manfred Stede. 2018. Constructing a Lexicon of English Discourse Connectives. In Proceedings of the 19th Annual SIGdial Meeting on Discourse and Dialogue, pages 360–365, Melbourne, Australia. Association for Computational Linguistics.

Debopam Das and Manfred Stede. 2018. Developing the Bangla RST Discourse Treebank. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), pages 1832–1838, Miyazaki, Japan. European Language Resources Association (ELRA).

Debopam Das, Manfred Stede, Soumya Sankar Ghosh, and Lahari Chatterjee. 2020. DiMLex-Bangla: A lexicon of Bangla discourse connectives. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1097–1102, Marseille, France. European Language Resources Association.

Debopam Das, Maite Taboada, and Paul McFetridge. 2015. RST Signalling Corpus, LDC2015T10. Philadelphia. Linguistic Data Consortium.

Sindhuja Gopalan, Lakshmi S, and Sobha Lalitha Devi. 2017. Cross linguistic variations in discourse relations among Indian languages. In Proceedings of the 14th International Conference on Natural Language Processing (ICON-2017), pages 402–407, Kolkata, India. NLP Association of India.

Rohit Jain and Dipti Sharma. 2016. Explicit argument identification for discourse parsing in Hindi: A hybrid pipeline. In Proceedings of the NAACL Student Research Workshop, pages 66–72, San Diego, California. Association for Computational Linguistics.

Rohit Jain, Himanshu Sharma, and Dipti Sharma. 2016. Using lexical and dependency features to disambiguate discourse connectives in Hindi. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1750–1754, Portorož, Slovenia. European Language Resources Association (ELRA).

William Mann and Sandra Thompson. 1988. Rhetorical Structure Theory: Towards a Functional Theory of Text Organization. Text, 8:243–281.

Umangi Oza, Rashmi Prasad, Sudheer Kolachina, Dipti Misra Sharma, and Aravind Joshi. 2009. The Hindi discourse relation bank. In Proceedings of the Third Linguistic Annotation Workshop (LAW III), pages 158–161, Suntec, Singapore. Association for Computational Linguistics.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse Treebank 2.0. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), pages 2961–2968, Marrakech, Morocco.

Ravi Teja Rachakonda and Dipti Misra Sharma. 2011. Creating an annotated Tamil corpus as a discourse resource. In Proceedings of the 5th Linguistic Annotation Workshop, pages 119–123, Portland, Oregon, USA. Association for Computational Linguistics.

Maite Taboada and William C. Mann. 2006. Rhetorical structure theory: Looking back and moving ahead. Discourse Studies, 8(3):423–459.

Bonnie Webber, Rashmi Prasad, Alan Lee, and Aravind Joshi. 2018. The Penn Discourse Treebank 3.0 Annotation Manual. Technical report, The University of Pennsylvania.