I am a first-year PhD student at USC/ISI; my advisor is Prof. David Chiang.
Open Source For NLP Research
- Egret: A set of syntactic parsing tools that I wrote, including inside-outside grammar estimation, a pure PCFG parser, and a latent-annotation-based syntactic parser (a reimplementation of the Berkeley Parser; F1: Chinese 84.16%, English 89.02%) that features static and dynamic pruning and can output n-best trees and packed forests.
- ExtractForest: A modification of Prof. Charniak's parser that outputs packed forests.
- Chorus: A Multiple-Parsers Combination Engine. (available upon request)
Publications
- Non-isomorphic Forest Pair Translation.
Hui Zhang, Min Zhang, Haizhou Li, Eng Siong Chng. EMNLP-2010. [paper] [slides]
This paper studies two issues, non-isomorphic structure translation and target syntactic structure usage, for statistical machine translation in the context of forest-based tree to tree sequence translation. For the first issue, we propose a novel non-isomorphic translation framework to capture more non-isomorphic structure mappings than traditional tree-based and tree-sequence-based translation methods. For the second issue, we propose a parallel space searching method to generate hypothesis using tree-to-string model and evaluate its syntactic goodness using tree-to-tree/tree sequence model. This not only reduces the search complexity by merging spurious-ambiguity translation paths and solves the data sparseness issue in training, but also serves as a syntax-based target language model for better grammatical generation. Experiment results on the benchmark data show our proposed two solutions are very effective, achieving significant performance improvement over baselines when applying to different translation models.
- Convolution Kernel over Packed Parse Forest.
Min Zhang, Hui Zhang, Haizhou Li. ACL-2010. [paper]
This paper proposes a convolution forest kernel to effectively explore rich structured features embedded in a packed parse forest. As opposed to the convolution tree kernel, the proposed forest kernel does not have to commit to a single best parse tree, and is thus able to explore very large object spaces and many more structured features embedded in a forest. This makes the proposed kernel more robust against parsing errors and data sparseness issues than the convolution tree kernel. The paper presents the formal definition of the convolution forest kernel and also illustrates an algorithm to compute it efficiently. Experimental results on two NLP applications, relation extraction and semantic role labelling, show that the proposed forest kernel significantly outperforms the baseline of the convolution tree kernel.
- Forest-based Tree Sequence to String Translation Model.
Hui Zhang, Min Zhang, Haizhou Li, Aiti Aw, Chew Lim Tan. ACL-2009. [paper] [slides]
This paper proposes a forest-based tree sequence to string translation model for syntax-based statistical machine translation, which automatically learns tree sequence to string translation rules from word-aligned source-side-parsed bilingual texts. The proposed model leverages the strengths of both tree sequence-based and forest-based translation models. Therefore, it can not only utilize forest structure that compactly encodes an exponential number of parse trees but also capture non-syntactic translation equivalences with linguistically structured information through tree sequence. This makes our model potentially more robust to parse errors and structure divergence. Experimental results on the NIST MT-2003 Chinese-English translation task show that our method statistically significantly outperforms the four baseline systems.
- Fast Translation Rule Matching for Syntax-based Statistical Machine Translation.
Hui Zhang, Min Zhang, Haizhou Li, Chew Lim Tan. EMNLP-2009. [paper] [slides]
In a linguistically-motivated syntax-based translation system, the entire translation process is normally carried out in two steps, translation rule matching and target sentence decoding using the matched rules. Both steps are very time-consuming due to the tremendous number of translation rules, the exhaustive search in translation rule matching and the complex nature of the translation task itself. In this paper, we propose a hyper-tree-based fast algorithm for translation rule matching. Experimental results on the NIST MT-2003 Chinese-English translation task show that our algorithm is at least 19 times faster in rule matching and is able to help to save 57% of overall translation time over previous methods when using large fragment translation rules.
- K-Best Combination of Syntactic Parsers.
Hui Zhang, Min Zhang, Chew Lim Tan, Haizhou Li. EMNLP-2009. [paper]
In this paper, we propose a linear model-based general framework to combine k-best parse outputs from multiple parsers. The proposed framework leverages on the strengths of previous system combination and re-ranking techniques in parsing by integrating them into a linear model. As a result, it is able to fully utilize both the logarithm of the probability of each k-best parse tree from each individual parser and any additional useful features. For feature weight tuning, we compare the simulated-annealing algorithm and the perceptron algorithm. Our experiments are carried out on both the Chinese and English Penn Treebank syntactic parsing task by combining two state-of-the-art parsing models, a head-driven lexicalized model and a latent-annotation-based un-lexicalized model. Experimental results show that our F-Scores of 85.45 on Chinese and 92.62 on English outperform the previously best-reported systems by 1.21 and 0.52, respectively.
Professional Activities
- Conference Reviewer: ACL-IJCNLP-2009, EMNLP-2009, SIGIR-2009, NAACL-HLT-2010, ACL-2010, COLING-2010, EMNLP-2010
- Member of the Association for Computational Linguistics (07/2009 ~ present)
- Member of the Chinese and Oriental Languages Information Processing Society (07/2009 ~ present)
Misc
Hobbies: Long Distance Running, Badminton, Music