Preprint here.
Abstract:
URIEL is a knowledge base offering geographical, phylogenetic, and typological vector representations for 7970 languages. It includes distance measures between these vectors for 4005 languages, which are accessible via the lang2vec tool. Despite being frequently cited, URIEL is limited in terms of linguistic inclusion and overall usability. To address these limitations, we introduce URIEL+, an enhanced version of URIEL and lang2vec. In addition to expanding typological feature coverage for 2898 languages, URIEL+ improves the user experience with robust, customizable distance calculations that better suit users' needs. These upgrades also offer competitive performance on downstream tasks and provide distances that better align with linguistic distance studies.
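As an illustration, below is a minimal sketch of querying typological vectors and precomputed distances through the original lang2vec package; it assumes lang2vec's published Python interface, and the URIEL+ interface may differ.

```python
# A minimal sketch (assuming the published lang2vec Python API;
# pip install lang2vec) of querying URIEL vectors and distances.
import lang2vec.lang2vec as l2v

# Typological feature vectors for two languages (ISO 639-3 codes);
# "syntax_knn" fills missing values with k-nearest-neighbour predictions.
features = l2v.get_features(["eng", "fra"], "syntax_knn")
print(len(features["eng"]))  # dimensionality of the syntax vector

# Precomputed distance between the two languages.
print(l2v.distance("genetic", "eng", "fra"))
```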
Presented at the Student Research Workshop of the 2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL SRW).
Our paper here.
Our poster here.
Abstract:
In the pursuit of supporting more languages around the world, tools that characterize properties of languages play a key role in expanding existing multilingual NLP research. In this study, we focus on a widely used typological knowledge base, URIEL, which aggregates linguistic information into numeric vectors. Specifically, we examine the soundness and reproducibility of URIEL's approach to quantifying language similarity. Our analysis reveals ambiguity in how URIEL calculates language distances and handles missing values. Moreover, we find that URIEL provides no typological feature information for 31% of the languages it represents, undermining the reliability of the database, particularly for low-resource languages. Our literature review shows that URIEL and lang2vec are used in papers on diverse NLP tasks, which motivates us to rigorously verify the database, as the effectiveness of these works depends on the reliability of the information the tool provides.
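To make the missing-value concern concrete, here is a hypothetical illustration (not URIEL's actual code) of how two plausible conventions for missing feature values yield different distances between the same pair of vectors, which is exactly the kind of ambiguity that threatens reproducibility.

```python
# A hypothetical illustration (not URIEL's actual implementation) of why
# missing-value handling matters: two plausible conventions for a missing
# entry give different cosine distances for the same vector pair.
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy binary typological vectors with one missing value (np.nan).
a = np.array([1.0, 0.0, 1.0, np.nan])
b = np.array([1.0, 1.0, 1.0, 1.0])

# Convention 1: treat missing values as 0.
a_zero = np.nan_to_num(a, nan=0.0)
print(cosine_distance(a_zero, b))  # ~0.293

# Convention 2: drop dimensions where either vector is missing.
mask = ~(np.isnan(a) | np.isnan(b))
print(cosine_distance(a[mask], b[mask]))  # ~0.184
```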
Presented at the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2024), March 2024, St. Julian's, Malta.
Our paper here.
Our poster here.
Our presentation here.
Abstract:
Fine-tuning and testing a multilingual large language model is challenging for low-resource languages (LRLs) because the process is expensive. While previous studies have used machine learning methods to predict the performance of natural language processing (NLP) tasks, they primarily focus on high-resource languages, overlooking LRLs and shifts across domains. Focusing on LRLs, we use classical regression models to investigate three factors that can potentially impact model performance: the size of the fine-tuning corpus, the domain similarity between the fine-tuning and testing corpora, and the language similarity between the source and target languages. Our results indicate that domain similarity is the most important factor in predicting the performance of machine translation models.
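For readers unfamiliar with this setup, the following sketch (not the paper's actual pipeline; all data made up for illustration) shows what fitting a classical regression model on the three factors could look like.

```python
# A hypothetical sketch (not the paper's pipeline) of a classical regression
# model predicting MT performance from the three studied factors.
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy, made-up data: [log corpus size, domain similarity, language similarity]
X = np.array([
    [4.0, 0.9, 0.8],
    [5.0, 0.5, 0.6],
    [3.5, 0.2, 0.9],
    [4.5, 0.7, 0.3],
])
y = np.array([22.1, 18.4, 9.7, 16.0])  # e.g., BLEU scores (illustrative only)

model = LinearRegression().fit(X, y)
print(model.coef_)  # per-factor weights hint at relative importance
```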
Published in Educational Studies in Mathematics, March 2024.
Our paper here.
Abstract:
Using Balacheff’s (2013) model of conceptions, we inferred potential conceptions in three examples presented in the spanning sets section of an interactive linear algebra textbook. An analysis of student responses to two similar reading questions revealed additional strategies that students used to decide whether a vector was in the spanning set of a given set of vectors. An analysis of the correctness of the application of these strategies provides a more nuanced understanding of student responses that might be more useful for instructors than simply classifying the responses as right or wrong. These findings add to our knowledge of the textbook’s presentation of span and student understanding of span. We discuss implications for research and practice.
Presented at MAA MathFest, August 2023, Florida, USA.
Presented at the Undergraduate Research Opportunity Program’s (UROP) Spring Research Symposium, April 2023, Michigan, USA.
Received Blue Ribbon Outstanding Presenter Award.
Our presentation here.
Our poster here.
Abstract:
Reading questions are an interactive feature added to textbooks to inform instructors of their students' understanding of the material based on their reading of a section before a lesson. Using Balacheff's (2009) model of conceptions, we analyzed three examples on spanning sets in a linear algebra textbook (Beezer, 2021) and compared them with the conceptions that emerged from students' responses to two reading questions in the section. Our analysis revealed additional control structures illustrating further conceptions beyond those proposed by the textbook. By evaluating the correctness with which each control structure was used, we uncovered potential issues related to their applicability. Future work includes analyzing conceptions across different sections to track how students' conceptions evolve over time and how their development impacts correctness in application.
Presented at The 13th Congress of the European Society for Research in Mathematics Education (CERME13), July 2023, Budapest, Hungary.
Our paper here.
Abstract:
Using Balacheff’s (2013) model of conceptions, we analyzed textbook examples in two sections that modeled the mathematical work needed to answer two reading questions, and we used the intended conceptions to identify control structures in student responses to those questions. Reading questions are an interactive textbook feature meant to entice students to read the textbook before attending the lesson in which those ideas will be discussed; as students provide responses in their interactive textbook, instructors can learn how students are thinking about the content before a lesson. We found additional control structures, which suggest conceptions beyond the ones promoted in the textbook. We discuss implications for designing these types of questions in interactive textbooks.