RST++: A Signaled Graph Theory of Discourse Relations and Organization (Project PI: Dr. Amir Zeldes; January 2023 – Present; Georgetown University)

This project presents RST++, a new theoretical framework for computational discourse analysis based on an expansion of Rhetorical Structure Theory (RST). It targets discourse relation graphs with tree-breaking, non-projective discourse relations, as well as implicit and explicit discourse marking signals that give explainable rationales for our analyses. We survey shortcomings of RST and other existing frameworks, such as Segmented Discourse Representation Theory (SDRT), the Penn Discourse Treebank (PDTB) and Discourse Dependencies, and address these using constructs in the proposed theory. We provide annotation, search and visualization tools for data, and present and evaluate a freely available corpus of English annotated according to our framework, encompassing 12 spoken and written genres with over 200K tokens.


Continuity in Discourse Relations (Project PIs: Dr. Debopam Das and Dr. Markus Egg; November 2020 – Present; Humboldt University of Berlin)

This project investigates the role of (dis)continuity in discourse relations (relations between propositions, beliefs or speech acts, such as Condition or Claim-Argument). The notion of (dis)continuity in discourse occupies a central place in deictic shift theory. Discourse relations are considered either continuous (e.g., Continuation, Elaboration) or discontinuous (e.g., Contrast, Comparison), based on whether they preserve or shift deictic centres along dimensions such as spatio-temporal setting, topicalized referents or perspective. In our present work, we re-evaluate the definition of continuity in discourse relations, and examine Givón’s (1993) seven dimensions of deictic shifts (time, space, reference, action, perspective, modality and speech act). Our preliminary results, based on the analysis of Causal and Contrastive relations in the RST Discourse Treebank, show that (dis)continuity in coherence relations operates more as a multifaceted phenomenon than a categorical one. A relation can show evidence of discontinuity along certain dimensions but not necessarily along others (e.g., Contrast, otherwise deemed a discontinuous relation, exhibits referential continuity). Discourse relations also show different degrees of (dis)continuity, so continuity functions more as a gradient phenomenon than a bipolar one. In the next phase of this work, we will investigate the influence of (dis)continuity on the signalling of discourse relations.


Discourse Strategies across Social Media (Project PIs: Dr. Manfred Stede and Dr. Tatjana Scheffler; August 2018 – December 2018; University of Potsdam)

This project applies PDTB-style annotation to Twitter conversations, enabling detailed investigations of the discourse structure of conversations on social media. We develop a corpus of 185 Twitter conversation threads to investigate how Twitter discourse differs from written news text with respect to discourse connectives and relations. Results show that discourse relations in written social media conversations are expressed differently than in (news) text. More specifically, connective arguments are frequently not full syntactic clauses, and a few general connectives expressing Expansion and Contingency make up the majority of the explicit relations in our data.


The Bangla Discourse Connective Lexicon (Project PIs: Dr. Debopam Das and Dr. Manfred Stede; January 2018 – November 2018; University of Potsdam)

This project develops a lexicon of discourse connectives for Bangla. Discourse connectives are lexical expressions that express a two-place relation and take abstract objects (propositions, events, states, or processes) as their arguments. We compile a list of over 100 Bangla connectives, and provide information on their syntactic categories, discourse semantics and non-connective uses (if any). The format follows the German connective lexicon DiMLex, which provides a crosslinguistically applicable XML schema.


The Bangla RST Discourse Treebank (Project PIs: Dr. Debopam Das and Dr. Manfred Stede; May 2017 – July 2018; University of Potsdam)

This project aims to develop a corpus in Bangla (an Indo-Aryan language) annotated for coherence relations (according to RST) and relational signals. The corpus contains 266 texts, comprising 71,009 words, with an average of 267 words per text. The corpus represents the newspaper genre. The texts have been collected from Anandabazar Patrika, a popular Bangla daily published in India. The corpus started with the annotation of 16 texts, which were evaluated for agreement among the annotators. The present work includes annotation of the remaining 250 texts, representative of different sub-genres within the newspaper genre.


Underspecification and RST (Project PI: Dr. Manfred Stede; September 2016 – September 2017; University of Potsdam)

This project examines disagreement in Rhetorical Structure Theory annotation, taking into account what we consider "legitimate" disagreements. In rhetorical analysis, as in many other pragmatic annotation tasks, a certain amount of disagreement is to be expected, and it is important to distinguish true mistakes from legitimate disagreements due to different possible interpretations of the structure and intention of a text. Using different sets of annotations in German and English, we present an analysis of such possible disagreements, and propose an underspecified representation that captures the disagreements.


Discourse Relations and Appraisal (Project PI: Dr. Maite Taboada; September 2014 – August 2016; Simon Fraser University)

This project investigates the relationship between coherence relations (relations between propositions) and appraisal. In particular, we examine the role of coherence relations in the interpretation of evaluative words. By combining RST and Appraisal Theory, we analyze how different types of coherence relations influence the evaluative content expressed by nouns, adjectives, adverbs and verbs found in the relational unit. We find that relations such as Concession, Elaboration, Evaluation, Evidence and Restatement most frequently intensify the polarity of opinion words. We also find that most opinion words (about 70 percent) are positioned in the nucleus.


Signalling of Coherence Relations in Discourse (PhD Project Supervisors: (The late) Dr. Paul McFetridge and Dr. Maite Taboada; September 2009 – August 2014; Simon Fraser University)

This project (also my PhD project) investigates how coherence relations are signalled in discourse, and what signals are used to indicate them. A secondary goal of this study is to examine whether coherence relations are more frequently explicit or implicit in terms of the type of signalling involved. I conducted a corpus study examining the RST Discourse Treebank, which includes a collection of 385 Wall Street Journal articles annotated for rhetorical (or coherence) relations. I examined every relation in that corpus, identified the signals for those relations, and finally added a new layer of annotation to include signalling information. Results from my corpus study show that the majority of relations (over 90%) in a discourse are signalled (sometimes by multiple signals), and also that the majority of signalled relations (over 80%) are indicated by signals other than discourse markers, such as lexical, semantic, syntactic and graphical features.


Computational Analysis of Text Sentiment (Project Supervisor: Dr. Maite Taboada; September 2009 – December 2012; Simon Fraser University)

The goal of this project is to develop a computational system for automatically extracting sentiment from any given text. Sentiment is characterized as positive or negative views expressed by the subjective content of a text (e.g., an opinion piece in a newspaper or a movie review). We hypothesize that, given a text, we can determine whether it contains sentiment or subjective content, and if it does, we can also determine the type of the sentiment (categorically positive or negative) based on the analysis of the discourse structure of the text. In this project, my contributions were related to developing resources for discourse parsing. Specifically, I conducted a corpus study in order to extract relevant linguistic signals (e.g., discourse markers) of coherence relations, and then formulated rules for identifying coherence relations in unseen texts based on the contextual information about the occurrence of those signals.


A Modern Dictionary for Readers with Vernacular Different than Bengali (Project PI: Prof. Pabitra Sarkar; January 2009 – July 2009; The Asiatic Society, India)

This project developed a detailed encyclopedic bilingual dictionary (Bengali to English) in six volumes, with a view to facilitating understanding of the Bengali language by providing elaborate but precise information on Bengali words and their usages. In this project, I worked on entries dealing with biographical sketches of important personalities who made significant social, cultural and political contributions to Bengal and its people.


Defining Key Concepts in Linguistics: A Bilingual Approach with Text-Machine Interface (Project PI: Dr. Krishna Bhattacharya; August 2007 – December 2008; University of Calcutta, India)

This project developed a precise and convenient bilingual dictionary (in Bengali and English) of Linguistics to cover common concepts and frequently used terms in that discipline, citing examples from Indian as well as other languages to illustrate concepts. In addition, it addressed the problems of standardizing linguistic terminology in Bengali. In this project, my contributions were related to (i) collecting, scrutinizing and justifying the English and Bengali entries (relevant linguistic key terms) for the dictionary, (ii) defining those entries in both English and Bengali, and (iii) citing appropriate examples from various languages to illustrate those linguistic concepts.