Verb-initiality in AA

The development of verb-initial structures cross-linguistically: insight from Austroasiatic

2018-2021 (SNF 100015_176264)

Mathias Jenny, Hiram Ring, Wei-Wei Lee

Project description

In linguistics, the verb is traditionally considered to be the core element of the clause. Its core status is emphasized in psycho- and neurolinguistic research and forms the basis of many theories of the syntax/semantics interface (see Levin & Rappaport-Hovav 2005 for an overview). At the same time, basic verb-initial (V1) word order reportedly occurs in only ca. 10-19% of the world's languages (cf. Dryer 2013, Clemens & Polinsky to appear). Explanations for this may come from recent experiments and large-scale comparisons, which show that V1 sequences in natural languages violate apparent cognitive or processing biases on at least three levels: information structure (“ɢɪᴠᴇɴ before ɴᴇᴡ principle”; Chafe 1994, Lambrecht 1994, Junge et al. 2015), dependency (“dependency length minimization”; Hawkins 2004, Newmeyer 2005, Futrell et al. 2015), and parsing (“Actor-first”; cf. Choudhary 2010).

This research suggests that V1 structures are dispreferred cross-linguistically since a V1 language will necessarily violate at least two of these biases. However, 10% (or more) of the world’s languages translates to quite a large number of languages that have V1 structures, and such languages can have quite stable word order. Further, the extremely robust Actor-first bias does not seem to apply in V1 sentences (Schlesewsky & Bornkessel 2006, Bickel et al. 2015), suggesting the possibility of a (not yet identified) driving force that counteracts the Actor-first preference. The most likely way to identify such a force is to trace the evolution of V1 structures in the history of language families and to compare the findings to specific effects of horizontal transfer (areal skewings), universal drifts, and chance.

Word order is highly prone to change, from both internal (e.g. reanalysis, pragmatics) and external (e.g. areal and contact) influence (Harris & Campbell 1995, Aikhenvald & Dixon 2007), so all possible factors need to be considered. Accordingly, this project takes a broad, corpus-based approach to study the evolution and distribution of V1 structures in Austroasiatic (AA) languages, which are of particular interest for the study of V1 configurations (Jenny et al. 2014). We then compare the findings with V1 structures in other language families to investigate V1 development (e.g. Insular Celtic, Modern Welsh), V1 maintenance (e.g. AA, Austronesian, Modern Irish), and V1 loss (e.g. Afroasiatic, western Austronesian, Modern Breton, Middle Welsh). This project focuses on AA languages from the Khasian, Palaungic, Nicobarese, Aslian, and Katuic groups because 1) they exhibit V1 patterns to different degrees and in different structures, 2) the V1 patterns cannot be explained as a result of language contact, and 3) they are spoken on the periphery of the AA area as local rather than state languages.

This last point (3) is particularly salient, since peripheral (residual) languages are expected to be more conservative (and thus more likely to contribute to reconstruction) than central (or “spread”) languages (Nichols 1992, Dixon 1997; but cf. Celtic and Tocharian). These peripheral languages are understudied, so we are more likely to gain new insight into the effects and development of V1 structures. The inability to explain these structures through language contact (point 2) means we must consider a diachronic source. The range of word orders in these languages (point 1) means that we can more likely account for V1 features diachronically, particularly since they are from the same tree. This in turn will give us indications of likely developmental pathways of word order for other language phyla and families.

The basic question we are asking is: “what V1 structures do we find where and when, what motivates their existence, and how are they maintained, developed, or lost?” Secondary questions are: “how does verb-initiality interact with or motivate subsystems of grammar?”, and “to what degree do languages differ with regard to V1 structures?” These are difficult questions, particularly due to the complexity of the research, what is known of the languages, and the history of the areas under study.


Final report 2021

1. Preliminaries, starting point, objectives of the project

The centrality of the verb in language, and particularly its placement within a sentence, is of utmost importance to our understanding of languages, yet some verbal positions seem to be preferred over others. For example, verb-initiality (placing the verb at the beginning of the clause) is not widely reported for the world’s languages, occurring in only 10-19% of the 7,000+ languages in the world. Several cognitive/processing constraints suggest why this is the case, yet such structures can be remarkably stable within a language. The overall goal of this project was therefore to investigate V1 structures in multiple languages within multiple branches of a single language family/phylum with a view toward understanding the development and loss of verb-initiality more generally. Secondary goals were 1) to build a database of transcribed and translated texts in multiple languages that could be used for future research, and 2) to develop computational tools and methods for comparing syntactic structures between languages.

For the languages under investigation, we chose the Austroasiatic family/phylum, since verb-initial structures are prevalent in more than one of the branches within the family. Also, these languages are under-studied and are therefore more likely to provide important insights into verb-initial structures. Initial investigation also revealed the presence of verb-initial structures in branches where they had not been previously reported, suggesting an important historical dimension to verb-initiality in the Austroasiatic family/phylum that could only be understood through comparison of the languages in question.

Our approach was corpus-based, building a database of transcribed and translated texts. We developed computational tools to search through this database for comparable structures. We then compared structures and identified patterns in order to propose hypotheses regarding the structural changes between related languages.

2. Data collection, annotation, and analysis

We collected data from 15 languages in 6 branches of Austroasiatic, digitizing existing material as well as collecting new data through fieldwork on languages in Myanmar, two of which (Rucing, Htanaw) had hardly been described before. While collecting data, we also began to annotate the clause structure of the languages and to analyze and compare structures between the languages both within and across branches.

The fieldwork in Myanmar was made possible by the cooperation with two local universities, namely the University of Mandalay, Department of Anthropology, and the Yangon University of Foreign Languages. The team members spent several weeks in Myanmar on different field trips, collecting audio and video recordings of Palaungic languages in different locations. The recordings included elicitation, guided and spontaneous conversations, narratives, as well as background interviews on language use and attitude. Native language consultants ranged in age between 18 and 70+ years and included both female and male speakers in all locations. The recorded data was transcribed, translated (partly with the help of native speakers) and screened before sample texts were selected for further detailed analysis. Given the restrictions in time and personnel, a selection was unavoidable, but the range of genres and speakers ensured that we gathered a representative part of the recordings.

The goal of this process was to build a database of texts and grammatical structures that was large and diverse enough to be filtered using computational tools and methods. Since this kind of comparative linguistic work is highly complex, it has traditionally been done manually by individual domain experts working with small (largely undigitized) corpora, which is extremely time-consuming and difficult. Because of this, no computer-aided methods that would speed up comparative historical syntax have been developed. By constructing a large and diverse dataset, we were able to pioneer and test new methodologies and tools for comparing grammatical structures both within and between languages.

3. Methods applied and developed

We devised a new workflow using existing open-source tools to interlinearize and annotate our data, along with cloud computing services to collaborate effectively even when in disparate locations. Additionally, we created new tools using the Python programming language to automatically compare clauses and explored the use of word vectors for sorting verbs into categories.

Our workflow involved 1) digital transcription and translation of recordings with language consultants, 2) interlinearization and annotation of the transcriptions using Toolbox,[1] 3) conversion of Toolbox data to XML format (for robust data preservation) using Xigt,[2] 4) conversion of XML data to spreadsheets for clause-level annotation of grammatical structures, 5) database creation from multiple texts in multiple languages which preserved all levels of annotation, 6) sorting and filtering of the database in order to explore and observe grammatical patterns that could be compared within and between languages. Of these steps, while 1-3 used existing tools, tools for steps 4-6 had to be newly developed for this project.
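The conversion from interlinear data to spreadsheet rows (steps 2-4) can be illustrated with a minimal sketch. It assumes Toolbox-style records in which each field starts with a backslash marker; the marker names (`\ref`, `\tx`, `\ge`, `\ps`) and the function names are assumptions for illustration, since the report does not list the project's marker set, and the sketch goes directly to CSV rather than through Xigt. The sample values are taken from Table 1.

```python
import csv
import io

# Hypothetical Toolbox-style export. Marker names are assumed, not the project's.
SAMPLE = r"""\ref NDE.007_4
\tx jap kɔ ka=wi
\ge die 3sg.F F=one
\ps v pro clitic=num

\ref IPI.021_1
\tx jăm bɔ ʔa do
\ge die PROB two CLF
\ps v aspect num clsf
"""


def parse_records(text, record_marker="ref"):
    """Split backslash-marked lines into one dict per record."""
    records, current = [], {}
    for line in text.splitlines():
        if not line.startswith("\\"):
            continue  # skip blank lines and anything that is not a field
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker and current:
            records.append(current)  # a new record marker closes the previous record
            current = {}
        current[marker] = value.strip()
    if current:
        records.append(current)
    return records


def records_to_csv(records, columns=("ref", "tx", "ge", "ps")):
    """Write one spreadsheet row per record, ready for clause-level annotation."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        writer.writerow({col: record.get(col, "") for col in columns})
    return buffer.getvalue()


if __name__ == "__main__":
    print(records_to_csv(parse_records(SAMPLE)))
```

Keeping one row per record leaves room for additional annotation columns (for example grammatical roles) to be added by hand in the spreadsheet.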

We maintained a cloud-based project folder which contained regular uploads of sound recordings, transcriptions, interlinearized Toolbox data, annotated spreadsheets, and results of analysis. Using this folder, spreadsheets could be edited from any location by multiple project members, allowing for rapid annotation. These spreadsheets were then used as the basis for database building, filtering, and analysis tools maintained in a separate folder. This meant that as the spreadsheets were gradually annotated, the database could be easily updated and filtering/analyses could be re-run, enabling some degree of real-time testing of multiple computational approaches.

We experimented with several different means of sorting/filtering the data, refining our process as we went. Since we were annotating at multiple levels within a clause, we initially attempted to automatically compare clauses based on four tiers: phonetic/phonological, morphological/lexical, semantics, and grammatical role. Each of the four tiers for a single clause was compared pairwise with each of the four tiers of every other clause in the dataset (both within and between languages) using the Monge-Elkan algorithm (see Jimenez et al. 2009; Monge & Elkan 1996, 1997) which measures textual similarity (Table 1). We were able to demonstrate that this was effective in reducing the overall number of clauses for manual comparison (from roughly 23 million to a bit over 1,000 in an early iteration of this approach) when we set a threshold of 70% similarity. We also showed that this comparative method performed as expected (Fig. 1): in a randomized dataset the algorithm showed a normal distribution of grammatical structures, whereas in our non-randomized dataset it revealed a non-normal distribution, highlighting the presence of common grammatical structures, which is to be expected for related languages.

Table 1: Example of Monge-Elkan comparison of clauses

Tier               Pnar (NDE.007_4)               DaraangPalaung (IPI.021_1)        Monge-Elkan score
ipa                ['jap', 'kɔ', 'ka=wi']         ['jăm', 'bɔ', 'ʔa', 'do']         0.67
pos                ['v', 'pro', 'clitic=num']     ['v', 'aspect', 'num', 'clsf']    0.72
gloss              ['die', '3sg.F', 'F=one']      ['die', 'PROB', 'two', 'CLF']     0.50
struct             ['mV', 'mS']                   ['mV', 'mS']                      1.00
Normalized total                                                                    0.72
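The tier-by-tier comparison can be sketched as follows. This is a minimal illustration, not the project's script: the inner token similarity uses `difflib.SequenceMatcher` from the Python standard library as a stand-in (the report does not say which inner measure was used, so the scores will not reproduce Table 1), and the function names are hypothetical. Combining tiers by an unweighted mean is also an assumption, although the normalized total in Table 1 is consistent with it: (0.67 + 0.72 + 0.50 + 1.00) / 4 ≈ 0.72.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

TIERS = ("ipa", "pos", "gloss", "struct")

# The two clauses from Table 1.
PNAR = {
    "ipa": ["jap", "kɔ", "ka=wi"],
    "pos": ["v", "pro", "clitic=num"],
    "gloss": ["die", "3sg.F", "F=one"],
    "struct": ["mV", "mS"],
}
PALAUNG = {
    "ipa": ["jăm", "bɔ", "ʔa", "do"],
    "pos": ["v", "aspect", "num", "clsf"],
    "gloss": ["die", "PROB", "two", "CLF"],
    "struct": ["mV", "mS"],
}


def token_similarity(a, b):
    """Assumed inner measure: character-level similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def monge_elkan(tokens_a, tokens_b, sim=token_similarity):
    """Average, over tokens in A, of the best match found in B (asymmetric)."""
    if not tokens_a or not tokens_b:
        return 0.0
    return mean(max(sim(a, b) for b in tokens_b) for a in tokens_a)


def clause_similarity(clause_a, clause_b):
    """Unweighted mean of the per-tier Monge-Elkan scores (an assumption)."""
    return mean(monge_elkan(clause_a[t], clause_b[t]) for t in TIERS)


def similar_pairs(clauses, threshold=0.7):
    """Keep only clause pairs at or above the threshold for manual comparison."""
    kept = []
    for (id_a, a), (id_b, b) in combinations(clauses.items(), 2):
        score = clause_similarity(a, b)
        if score >= threshold:
            kept.append((id_a, id_b, score))
    return kept


if __name__ == "__main__":
    for tier in TIERS:
        print(tier, round(monge_elkan(PNAR[tier], PALAUNG[tier]), 2))
    print("total", round(clause_similarity(PNAR, PALAUNG), 2))
```

Because every clause is compared with every other clause, the number of pairs grows quadratically with corpus size, which is why a similarity threshold is needed to keep manual comparison feasible.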

Figure 1: Pairwise similarity between clauses

However, this approach has a number of shortcomings: 1) it does not account for all differences in semantics, 2) it does not reduce the complexity of comparative forms sufficiently, and 3) it is not always clear that the comparisons being made between clauses are valid from a linguistic perspective. For example, in linguistic annotation the glosses NMZ ‘nominalizer’ and NOM ‘nominative argument’ have rather different meanings. Under the Monge-Elkan algorithm, however, these are considered more similar than NOM ‘nominative argument’ and ACC ‘accusative argument’, which a linguist might consider to be more grammatically similar. We hoped that comparing multiple tiers would mitigate this somewhat, but the number of results filtered for manual comparison was still unmanageable given our conditions and requirements.
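The gloss problem can be reproduced with a few lines of code. The sketch below again uses `difflib.SequenceMatcher` as an assumed string measure, and contrasts it with a purely hypothetical category lookup (`LABEL_CLASS`) that is not part of the project's tooling; it only illustrates what a linguistically informed comparison would need to know.

```python
from difflib import SequenceMatcher


def string_similarity(a, b):
    """Character-based similarity: knows nothing about what a gloss means."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical grouping of gloss labels by grammatical category.
LABEL_CLASS = {"NOM": "case", "ACC": "case", "NMZ": "derivation"}


def category_similarity(a, b):
    """1.0 for identical labels, 0.5 for labels of the same category, else 0.0."""
    if a == b:
        return 1.0
    class_a, class_b = LABEL_CLASS.get(a), LABEL_CLASS.get(b)
    return 0.5 if class_a is not None and class_a == class_b else 0.0


if __name__ == "__main__":
    for pair in (("NMZ", "NOM"), ("NOM", "ACC")):
        print(pair, string_similarity(*pair), category_similarity(*pair))
```

With the string measure, NMZ and NOM share two letters and score higher than NOM and ACC, which share none; the category lookup reverses that ranking, at the cost of requiring a manually curated inventory of labels.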

As a result of these important findings, while we continue to consider the possibility of automatic clause comparison, we decided to pursue a computer-assisted rather than fully automated approach. This involved identifying verbs and groups of verbs which showed multiple verbal positions within clauses of individual languages in several branches of Austroasiatic. Since our corpus is relatively small, we considered that grouping verbs into semantic types might result in better comparative data, allowing us to reconstruct clauses at a more abstract level of representation. We used semantic groupings for English from Levin (1993) as the basis for our own groupings, making adjustments for the semantics of Austroasiatic languages. We also created groupings ‘from the data’ by using the statistical co-occurrence of verbs with neighboring words to train a neural network to produce “word vectors” that serve as mathematical representations of distributional semantics (see Mikolov et al. 2013). Both of these approaches are promising, but neither of them reduced the amount of effort needed to sort through the resulting example clauses, and in some cases they resulted in misclassification of clause similarity. Further, since the amount of data needed to train robust distributional semantic models for lexical items is extremely large (Google’s “Word2Vec” is trained on billions of words), we considered that the resulting word vectors from our small corpus were largely uninformative for identifying verbal groupings (Fig. 2).

Figure 2: Principal component analysis visualization of verbal groupings from word vectors
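The idea behind distributional groupings can be shown with a count-based stand-in. The sketch below does not train a neural network as Word2Vec does; it simply counts which words occur near each verb and compares the resulting vectors by cosine similarity. The toy gloss sequences are invented for illustration and are not from the corpus.

```python
from collections import Counter, defaultdict
from math import sqrt

# Invented toy data: each clause is a list of gloss tokens.
CLAUSES = [
    ["go", "3sg.M", "market"],
    ["come", "3sg.M", "market"],
    ["eat", "3sg.F", "rice"],
    ["cook", "3sg.F", "rice"],
]


def cooccurrence_vectors(clauses, window=2):
    """Map each word to counts of the words found within `window` positions."""
    vectors = defaultdict(Counter)
    for tokens in clauses:
        for i, word in enumerate(tokens):
            start, stop = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    vectors = cooccurrence_vectors(CLAUSES)
    print("go ~ come:", cosine(vectors["go"], vectors["come"]))
    print("go ~ eat: ", cosine(vectors["go"], vectors["eat"]))
```

With only a handful of contexts per verb, a single shared or missing neighbor changes the similarity drastically, which illustrates why vectors derived from a small corpus are unreliable for grouping verbs.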

Due to these concerns, we determined that the best use of our resources was to directly compare individual verbs in the languages within our corpus, using scripts to automatically extract examples. In order to identify which verbs to focus on, we first filtered clauses based on whether verbs occurred in multiple languages within the corpus. We then plotted (by language) the number of occurrences of each verbal position in clauses which contained each of these verbs, extracting each set of clauses from the languages in question so that the variation in verb placement could be directly observed. While this reduced the overall comparanda for our investigation, it also enabled us to understand (for particular verbs) the allowable variation in a particular language, what constraints there might be on such variation, and then to compare one language’s variation with others.
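The filtering and counting steps described above can be sketched as follows. The clause records, language names ("A", "B"), and function names are invented for illustration, and the sketch assumes that the label `mV` in the structure tier marks the main verb, as the `struct` row of Table 1 suggests.

```python
from collections import Counter, defaultdict

# Invented clause records: language, verb gloss, and ordered structure labels.
CLAUSES = [
    {"language": "A", "verb": "die", "struct": ["mV", "mS"]},
    {"language": "A", "verb": "die", "struct": ["mS", "mV"]},
    {"language": "B", "verb": "die", "struct": ["mV", "mS"]},
    {"language": "A", "verb": "sleep", "struct": ["mV", "mS"]},
]


def shared_verbs(clauses, min_languages=2):
    """Return the verbs attested in at least `min_languages` languages."""
    languages = defaultdict(set)
    for clause in clauses:
        languages[clause["verb"]].add(clause["language"])
    return {verb for verb, langs in languages.items() if len(langs) >= min_languages}


def verb_position_counts(clauses, verb_label="mV"):
    """Count, per (verb, language), how often the verb fills each clause position."""
    keep = shared_verbs(clauses)
    counts = defaultdict(Counter)
    for clause in clauses:
        if clause["verb"] in keep and verb_label in clause["struct"]:
            position = clause["struct"].index(verb_label) + 1  # 1 means verb-initial
            counts[(clause["verb"], clause["language"])][position] += 1
    return counts


if __name__ == "__main__":
    for key, positions in sorted(verb_position_counts(CLAUSES).items()):
        print(key, dict(positions))
```

The per-language counters can be plotted directly, and the clauses behind any count can be pulled out again with the same filter so that the variation in verb placement is inspected by hand.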

4. Results (preliminary)

Our preliminary results show that verb-initial structures are indeed widespread throughout many of the Austroasiatic languages under study, but that their occurrence in some languages is constrained to certain clause types or particular kinds of constructions. In some of the languages under investigation, verb-initial structures do not occur at all. The presence of verb-initial structures in some of the languages, however, means that we must posit verb-initial structures as being acceptable for many clauses at the Proto-Austroasiatic stage.

Individual research by the PhD student Wei-Wei Lee within the project has led to alternative explanatory possibilities for verb-initial clauses in one of the focus languages, namely Palaung. In her dissertation, Ms. Lee shows convincingly that reanalysis of nominal forms as clauses can be seen as the origin of verb-initiality in this group. This highlights a strength of our bottom-up approach: it supports robust scientific inquiry rather than circularity, in that our working hypothesis can be falsified.

The project team is in the process of writing a monograph on the scope, background, and development of the project that has garnered interest from Brill. This book is intended to lay out the theoretical basis for our approach to syntactic reconstruction, to expand on the methods and experiments presented briefly in the current report, and to provide initial reconstructions of a series of clauses and clause types for three branches of Austroasiatic: Khasian, Palaungic, and Katuic. The book will serve as a means of broadening the discussion of possibilities in syntactic reconstruction and automated methods, as well as garner feedback and critique from other researchers working in this area. The first four introductory chapters laying out our theoretical and methodological approach and detailing specific automated methods have been drafted. We hope to complete the final chapters by the middle of 2021, presenting hypotheses and reconstructions of clauses in several branches of Austroasiatic.

5. Outcomes: databases, scripts, publications (finished and in progress)

The project team produced a number of publications over the course of the project, some already published in their final form, some in the final steps of publication, and others currently under review. The main publication that has appeared in its final form is a volume on Austroasiatic syntax in a diachronic and areal perspective (Brill 2020). This collection of papers was initiated before the start of the project but finalized during the project. All relevant chapters by PI Mathias Jenny and Postdoctoral researcher Hiram Ring are available on academia.edu repositories as pre-publication off-prints.

Forthcoming publications, including direct outputs of the project (apart from the above-mentioned monograph) and indirectly connected to the research activities of the project team, are:

● Lee, Wei-Wei. Under review. Subordination strategies in Rucing Palaung. JSEALS Special Publication: Proceedings of the 8th International Conference on Austroasiatic Linguistics (ICAAL8), August 29-31, 2019, Chiang Mai.

● Lee, Wei-Wei. Under review. Nominalization through verb-subject order in Rucing Palaung. Studies in Language.

● Lee, Wei-Wei & Mathias Jenny. Under review. Syntactic change in Palaungic: exploring the origins of an atypical Austroasiatic relative construction. Linguistics of the Tibeto-Burman Area.

● Lee, Wei-Wei. Verb-initial word order in Rucing Palaung: when nominal syntax goes clausal. PhD thesis, University of Zurich.

The Languages and Linguistics of Mainland Southeast Asia. Edited by Paul Sidwell & Mathias Jenny. Berlin/New York: de Gruyter Mouton. Expected to be published in 2021.

● Jenny, Mathias (2021). The national languages of Mainland Southeast Asia: Burmese, Thai, Lao, Khmer, Vietnamese. In The Languages and Linguistics of Mainland Southeast Asia. Edited by Paul Sidwell & Mathias Jenny (to be published 2021). Berlin/New York: de Gruyter Mouton.

● Jenny, Mathias (2021). Pragmatics and syntax in the languages of Mainland Southeast Asia. In The Languages and Linguistics of Mainland Southeast Asia. Edited by Paul Sidwell & Mathias Jenny (to be published 2021). Berlin/New York: de Gruyter Mouton.

● Jenny, Mathias (2021). Writing systems of Mainland Southeast Asia. In The Languages and Linguistics of Mainland Southeast Asia. Edited by Paul Sidwell & Mathias Jenny (to be published 2021). Berlin/New York: de Gruyter Mouton.

● Sidwell, Paul & Jenny, Mathias (2021). Mainland Southeast Asian epigraphy. In The Languages and Linguistics of Mainland Southeast Asia. Edited by Paul Sidwell & Mathias Jenny (to be published 2021). Berlin/New York: de Gruyter Mouton.

● Sidwell, Paul & Jenny, Mathias (2021). Introduction. In The Languages and Linguistics of Mainland Southeast Asia. Edited by Paul Sidwell & Mathias Jenny (to be published 2021). Berlin/New York: de Gruyter Mouton.

● Sidwell, Paul & Jenny, Mathias (2021). History of Tai-Kadai studies. In The Languages and Linguistics of Mainland Southeast Asia. Edited by Paul Sidwell & Mathias Jenny (to be published 2021). Berlin/New York: de Gruyter Mouton.

● Pacquement, Jean; Sidwell, Paul & Jenny, Mathias (2021). French contributions to the study of Mainland Southeast Asian languages and linguistics. In The Languages and Linguistics of Mainland Southeast Asia. Edited by Paul Sidwell & Mathias Jenny (to be published 2021). Berlin/New York: de Gruyter Mouton.

● Allassonnière-Tang, Marc & Hiram Ring. In Press. Sociocultural gender in nominal classification: A study of grammatical gender. Indian Linguistics.

● Ring, Hiram. 2020. Word order and the grammaticalization of gender in Khasian. In Alves, Mark, Mathias Jenny, & Paul Sidwell (eds.), Austroasiatic Syntax in Areal and Diachronic Perspective, Ch. 4, pp. 107–134. Leiden: Brill.

● Ring, Hiram. In Press. Gender, classifiers, and diachrony in Khasian. In Allassonnière-Tang, Marc & Marcin Kilarski (eds.), Nominal Classification in Asia: Functional and diachronic perspectives. Amsterdam: John Benjamins.

A major outcome of the project is a set of databases of AA languages, partly digitized and annotated, partly raw recordings with metadata. All data have been temporarily archived on the server of the Department of Comparative Language Science, University of Zurich, until a final storage solution is found (possibly RWAAI[3] at the University of Lund). In the meantime, the data are made available upon request to interested researchers (possibly subject to personal and copyright restrictions in some cases).

The majority of the Palaung data was collected within the present project, while for other languages different sources, including the team members’ earlier fieldwork, were used. Some of the language data was digitized from existing sources. The Khmuic data was sourced from the RWAAI archive (Lund University). Other data was digitized from printed sources (Katuic, Nicobarese), and yet other data was digitized and interlinearized with assistance from language specialists (Aslian).

To date, we have digitized and annotated data from 15 languages in 6 branches of Austroasiatic, amounting to 20,718 sentences in 188 texts across the family/phylum. Digitized data include languages from the Aslian branch (1 language, 8 texts, 896 sentences), the Katuic branch (2 languages, 47 texts, 5905 sentences), the Khasian branch (3 languages, 49 texts, 6121 sentences), the Khmuic branch (1 language, 15 texts, 1045 sentences), the Nicobarese branch (2 languages, 3 texts, 450 sentences), and the Palaungic branch (6 languages, 66 texts, 6301 sentences). Of this digitized data, 87 texts have been fully annotated for clause structure, resulting in 8753 sentences (16,980 clauses) tagged with argument/role identifiers at the lexical level. Data continues to be digitized for some of these languages and we plan to add more in the future.
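The per-branch figures can be checked against the reported totals with a short script. The numbers are taken directly from the paragraph above; the variable and function names are illustrative only.

```python
# Per-branch figures as reported in the text: (languages, texts, sentences).
CORPUS = {
    "Aslian": (1, 8, 896),
    "Katuic": (2, 47, 5905),
    "Khasian": (3, 49, 6121),
    "Khmuic": (1, 15, 1045),
    "Nicobarese": (2, 3, 450),
    "Palaungic": (6, 66, 6301),
}


def totals(corpus):
    """Sum languages, texts, and sentences across all branches."""
    languages, texts, sentences = (sum(column) for column in zip(*corpus.values()))
    return {"languages": languages, "texts": texts, "sentences": sentences}


def sentence_share(corpus):
    """Percentage of all sentences contributed by each branch."""
    total = totals(corpus)["sentences"]
    return {branch: round(100 * s / total, 1) for branch, (_, _, s) in corpus.items()}


if __name__ == "__main__":
    print(totals(CORPUS))
    print(sentence_share(CORPUS))
```

The sums agree with the stated totals of 15 languages, 188 texts, and 20,718 sentences.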

Another important outcome is the set of Python scripts written to facilitate conversion and analysis of the digitized data. As part of the workflow developed for this project, we wrote a set of scripts that convert and manipulate data from Toolbox (the tool used for interlinearizing texts) to spreadsheets for editing. Another script collates the annotated spreadsheets into a database, preserving branch and language information along with the annotations. This database can then be queried with several other Python scripts to generate statistical tables and graphs that illustrate the word order patterns of the languages and branches within the dataset. It can also be queried to automatically compare clauses, to extract verbs, and to extract complete clauses which are then reformatted as Excel spreadsheets for visual comparison. The scripts are currently being made ready for public use and will be added to a GitHub repository (https://github.com/lingdoc/V1_AA_project_scripts) as they become available.
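The collation step can be sketched as follows. This is a minimal illustration rather than the project's script: the column names, the in-memory CSV sheets, and the reading of a reference such as NDE.007_4 as "text NDE.007, clause 4" are assumptions, and the functions `collate` and `query` are hypothetical.

```python
import csv
import io

# Hypothetical annotated sheets keyed by (branch, language, text id).
SHEETS = {
    ("Khasian", "Pnar", "NDE.007"):
        "clause_id,gloss,struct\n4,die 3sg.F F=one,mV mS\n",
    ("Palaungic", "DaraangPalaung", "IPI.021"):
        "clause_id,gloss,struct\n1,die PROB two CLF,mV mS\n",
}


def collate(sheets):
    """Merge all sheets into one list of rows, keeping branch and language."""
    rows = []
    for (branch, language, text_id), content in sheets.items():
        for row in csv.DictReader(io.StringIO(content)):
            rows.append({"branch": branch, "language": language, "text": text_id, **row})
    return rows


def query(rows, **criteria):
    """Return the rows whose fields match every keyword criterion exactly."""
    return [row for row in rows if all(row.get(k) == v for k, v in criteria.items())]


if __name__ == "__main__":
    database = collate(SHEETS)
    for row in query(database, struct="mV mS"):
        print(row["branch"], row["language"], row["gloss"])
```

Because the database is rebuilt from the spreadsheets each time, newly annotated rows are picked up simply by re-running the collation before any query.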

6. Involvement of team members and cooperations

The PI Mathias Jenny was mainly in charge of the overall design and workflow of the project, including liaising between the project team and institutions in Myanmar and Thailand. He was the primary supervisor of the PhD student and supervisor of the BA and MA theses of two of the student assistants. Other tasks of the PI were planning and organization of workshops and conference panels, as well as individual and joint conference participation and publications.

The Postdoctoral researcher Hiram Ring was primarily tasked with operationalizing the computational comparative framework that the team devised (by writing and implementing Python scripts to [semi-]automatically compare/analyze the data) and managing the overall annotation process. As a specialist in Khasian languages of India he also managed the annotation of data from this branch and conducted fieldwork in Myanmar with the team to familiarize himself with the Palaungic context. He was also responsible for cleaning Khmuic and Katuic data from the RWAAI archive (Lund University), and for the digitization and annotation of Temiar data. He devoted much of his time to developing the team’s theoretical and methodological approach to syntactic reconstruction (in collaboration with the PI and PhD student) and was responsible for various project presentations and publications.

The PhD student Wei-Wei Lee wrote a cumulative dissertation within the framework of the project, consisting of three peer-reviewed papers. She also collaborated with the PI and post-doc on other project-related issues, including the development of the methodology and of the corpus annotation protocol. Moreover, the PhD student was the primary supervisor of student assistants who digitized, glossed, and annotated Palaung texts. She presented the project as well as her own PhD research at workshops and conferences. The PhD student also organized a two-day workshop together with the post-doc and two other early career researchers at the department.

The student assistants were not only actively involved in data standardization and annotation, but also conducted fieldwork together with the team or on their own in Myanmar and Thailand, and presented at an international conference (ICAAL 2019). The work of the student assistants resulted in one BA and one MA thesis on subjects related to the project.

Apart from the project team, we benefited from the cooperation within the host institute at the University of Zurich. Of special importance was the cooperation of Rachel Weymuth, PhD student at the Department of Comparative Language Science, who has been working on Palaungic languages for several years. Ms. Weymuth joined in many field trips to Myanmar and shared data from her individual research with the team. Also of great benefit were discussions with other department members who shared their expertise with us, including Prof. Balthasar Bickel and Prof. Paul Widmer.

Internationally, we cooperated with Dr. Paul Sidwell (University of Sydney), who gave invaluable input on many aspects of AA languages and linguistics, Geoffrey Benjamin (NTU Singapore), who shared his Temiar data with us and helped in translating and annotating it, and our partner institutions in Thailand and Myanmar. The following people also shared their data with us: Elizabeth Hall (Muak Sa-aak), Ma Seng Mai (Wa), Greg Blok (Lawa), Emily Lewis (Man Noi Plang), and Justin Watkins (Wa). Additionally, we collaborated closely with Nicole Kruspe, Niclas Burenhult, Joanne Yager, and others at the RWAAI archive (Lund University) to interlinearize and annotate existing digital language materials.

7. Challenges and difficulties encountered

Challenges and difficulties encountered include visa and travel restrictions, sickness and health related issues, and personal/family related concerns that have in some cases affected time that could be spent working. Some of these issues are to be expected when engaging in travel and fieldwork in developing countries like Myanmar and Thailand and were taken into account at the beginning of the project. Most of the practical issues could be solved thanks to the close cooperation with local universities in Myanmar (University of Mandalay, Yangon University of Foreign Languages), who took over the task of obtaining permission to conduct fieldwork in the country and made first contact with the local communities.

The major challenge to our work in the final phase of the project has been the COVID-19 pandemic. This has meant that we have been unable to travel, a key part of our data-gathering work and our work with language consultants. The pandemic has also kept our team apart for long periods of time, with one member remaining in Singapore for the final year of the project. Despite the difference in time zones and work hours, the project team has managed to largely stay on track and work collaboratively using online tools.

8. Conclusions, overall benefit of the project

In spite of the tight time frame and challenges in conducting fieldwork, the three-year project succeeded in collecting a substantial amount of primary data, annotating and analyzing existing material, and designing a machine-assisted methodology to work on diachronic syntax in the absence of historical data. Verb-initial word order as an inherited feature could be convincingly shown for the Khasian group, and the project showed how the research can be continued across the whole family by applying the methodology developed.

The more fine-grained research conducted by Ms. Lee in the Palaungic languages, Rucing Palaung in particular, led to an alternative explanation for the occurrence of verb-initial structures at least in this group. Rather than being inherited from Proto-Austroasiatic, these may have been innovated in Palaungic through finitization of originally nominal structures. This hypothesis certainly deserves further investigation, and at the same time highlights the challenges in a highly hypothetical field of research like diachronic syntax in the absence of historical data.

In the course of creating databases of several Austroasiatic languages, the project team ventured into Htanaw/Danau (dnu), a hitherto barely described Palaungic language spoken in South Shan State, Myanmar. The objective was to add a new datapoint to complement the Palaungic database. Due to the short time we could spend in the village in Myanmar in January 2020 and subsequent travel restrictions, only preliminary data could be collected. The research is nevertheless continuing with online data collection, and has resulted in plans to devise a writing system for the so far unwritten language in cooperation with native speakers, especially of the Htanaw Youth Group. Although the recent military takeover in Myanmar and subsequent internet and travel restrictions are threatening continued cooperation with local institutions and native speakers in the country, it is hoped that the situation in Myanmar will again become more conducive to academic exchange and research in the not too distant future.

We have additionally shown that there is promise in (semi-)automated approaches to syntactic reconstruction. Besides devising a workflow that facilitates the systematic development of a large and diverse database of syntactic data (to which data can continually be added), the scripts allow statistical information to be easily extracted from the dataset. This enables rapid assessment of where to focus efforts and allows comparanda to be easily observed by language experts, greatly speeding up the process of comparative syntactic reconstruction. Additionally, such a dataset allows new computational approaches to be devised and easily tested. As automated approaches for comparing or grouping clauses become better, more of the process can then be automated, increasing the speed of syntactic reconstruction even further, enabling greater insight into the histories of the people who speak these languages and into the development of verb-initial structures cross-linguistically.

References

Jimenez, Sergio, Claudia Becerra, Alexander Gelbukh, & Fabio Gonzalez. 2009. Generalized Monge-Elkan method for approximate text string comparison. In Gelbukh, Alexander (ed.), Computational Linguistics and Intelligent Text Processing, Vol. 5449 of Lecture Notes in Computer Science, pp. 559–570. CICLing 2009, Berlin: Springer. Doi: https://doi.org/10.1007/978-3-642-00382-0_45.

Levin, Beth. 1993. English verb classes and alternations: A preliminary investigation. Chicago: University of Chicago Press.

Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg Corrado, & Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. CoRR, abs/1310.4546. URL http://arxiv.org/abs/1310.4546.

Monge, Alvaro E. & Charles P. Elkan. 1996. The field-matching problem: algorithm and applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, pp. 267–270. AAAI.

Monge, Alvaro E. & Charles P. Elkan. 1997. An efficient domain-independent algorithm for detecting approximately duplicate database records. In The proceedings of the SIGMOD 1997 Workshop on Data Mining and Knowledge Discovery. ACM.



[1] https://software.sil.org/toolbox/

[2] http://xigt.org/

[3] https://projekt.ht.lu.se/rwaai/