Critique of a research paper related to my project.
For this week, I chose to critique Anders Søgaard's 2020 paper, Some Languages Seem Easier to Parse Because Their Treebanks Leak. The paper asks whether overlap between unlabeled graphs in the training and test sections of a treebank (leakage) affects dependency parser performance. It uses the VF2 algorithm to compute graph isomorphisms, considering both undirected and directed unlabeled graphs, and it compares treebank leakage against other factors known to affect parser performance.

Many factors can influence a parser's performance. The most important, with the strongest correlation, is treebank size; it is so dominant that all other factors are evaluated in combination with it. Morphology is another important factor: in morphologically rich languages, grammatical relations that would otherwise be expressed implicitly through word order are encoded in morphological affixes. Sentence length also affects parser performance, since the search space of possible parses grows with input length. Open class ratio is a further predictor, because words in certain open classes (like nouns and verbs) are especially hard to attach to other parts of speech. Finally, POS bigram perplexity, domain divergence, and graph properties are other predictors explored in the paper. The results show that treebank size correlates strongly (as is well known); morphological complexity and open class ratio are not very predictive; domain divergence correlates strongly; dependency length is weakly correlated; sentence length is more strongly correlated; and leakage is more predictive than any of these factors aside from treebank size. Furthermore, leakage computed over directed rather than undirected graphs was slightly more strongly correlated with parser performance.
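To make the leakage measurement concrete, here is a minimal stdlib-only sketch of the idea: the fraction of test-set trees whose unlabeled structure is isomorphic to some training-set tree. The paper uses VF2 on general unlabeled graphs; since dependency structures are rooted trees, this sketch swaps in AHU canonical encoding, which decides rooted-tree isomorphism exactly. The toy treebanks and the edge-list input format are my own illustration, not the paper's data or code.

```python
from collections import defaultdict

def canon(children, node):
    """AHU canonical string for the subtree rooted at `node`."""
    return "(" + "".join(sorted(canon(children, c) for c in children[node])) + ")"

def tree_signature(edges, root=0):
    """Canonical form of a rooted tree given (head, dependent) edges."""
    children = defaultdict(list)
    for head, dep in edges:
        children[head].append(dep)
    return canon(children, root)

def leakage(train_trees, test_trees):
    """Share of test trees isomorphic to at least one training tree."""
    train_sigs = {tree_signature(t) for t in train_trees}
    leaked = sum(tree_signature(t) in train_sigs for t in test_trees)
    return leaked / len(test_trees)

# Toy treebanks: each tree is a list of (head, dependent) edges, root = 0.
train = [[(0, 1), (0, 2)], [(0, 1), (1, 2), (2, 3)]]
test = [[(0, 1), (1, 2), (2, 3)],    # chain of 4: matches the 2nd train tree
        [(0, 1), (0, 2), (0, 3)]]    # star of 3 dependents: no match
print(leakage(train, test))  # → 0.5
```

Because each tree collapses to a single canonical string, comparing a test tree against the whole training set is one set-membership check rather than a pairwise isomorphism test per training tree.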
From this paper I learned about many (though probably not all) of the factors that can affect parser performance, and it was interesting to see how important each factor is relative to the others: treebank size is the most important, while several others do not affect parser performance in any significant way. In our research, we aim to extend this paper's work. We have already (for the most part) replicated its results, and now we want to know: Does leakage computed over labeled graphs (edge labels and POS labels) predict parser performance? How do subgraphs (labeled and unlabeled) affect parser performance? Which types of subgraphs are most predictive of parser performance (a node with its predecessor or its many successors, the graph stripped of modifiers, two subgraphs connected by conjunctions…)? These are the questions we aim to explore in our research in the coming weeks.
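As a starting point for the labeled variant we want to study, here is a hedged sketch of tree isomorphism that must also preserve POS tags on nodes and dependency relations on edges. The `pos` dictionaries and relation strings below are illustrative inputs of my own choosing (loosely Universal Dependencies style), not a format from the paper.

```python
from collections import defaultdict

def labeled_canon(children, pos, node):
    """Canonical string of the subtree at `node`, keeping labels."""
    parts = sorted(rel + labeled_canon(children, pos, child)
                   for child, rel in children[node])
    return "(" + pos[node] + "".join(parts) + ")"

def labeled_signature(edges, pos, root=0):
    """edges: (head, dependent, relation) triples; pos: node -> POS tag."""
    children = defaultdict(list)
    for head, dep, rel in edges:
        children[head].append((dep, rel))
    return labeled_canon(children, pos, root)

# Two trees with the same unlabeled shape but different relation labels:
a = labeled_signature([(0, 1, "nsubj"), (0, 2, "obj")],
                      {0: "VERB", 1: "NOUN", 2: "NOUN"})
b = labeled_signature([(0, 1, "nsubj"), (0, 2, "obl")],
                      {0: "VERB", 1: "NOUN", 2: "NOUN"})
print(a == b)  # → False
```

Under the unlabeled definition these two trees would count as leaked; under the labeled definition they do not, which is exactly the distinction whose effect on predictiveness we want to measure.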
References:
Anders Søgaard. 2020. Some Languages Seem Easier to Parse Because Their Treebanks Leak. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2765–2770.