I am a PhD student at USC/ISI; my advisor is Prof. David Chiang.
Open Source For NLP Research
- Egret: A set of syntax parsing tools written by me, including inside-outside grammar estimation, pure PCFG parser, and a latent-annotation-based syntax parser (a reimplementation of Berkeley Parser, with F1-Measure: Chinese 84.16%, English 89.02%) which features static and dynamic pruning, and can output n-best trees and packed forests.
- ExtractForest: A modification of Prof. Charniak's parser that outputs packed forests.
- Kneser-Ney Smoothing on Expected Counts
Hui Zhang and David Chiang. ACL-2014. [paper] [slides]
Widely used in speech and language processing, Kneser-Ney (KN) smoothing has consistently been shown to be one of the best-performing smoothing methods. However, KN smoothing assumes integer counts, limiting its potential uses -- for example, inside Expectation-Maximization. In this paper, we propose a generalization of KN smoothing that operates on fractional counts, or, more precisely, on distributions over counts. We rederive all the steps of KN smoothing to operate on count distributions instead of integral counts, and apply it to two tasks where KN smoothing was not applicable before: one in language model adaptation, and the other in word alignment. In both cases, our method improves performance significantly.
- Observational Initialization of Type-Supervised Taggers
Hui Zhang and John DeNero. ACL-2014. [paper] [slides]
Recent work has sparked new interest in type-supervised part-of-speech tagging, a data setting in which no labeled sentences are available, but the set of allowed tags is known for each word type. This paper describes observational initialization, a novel technique for initializing EM when training a type-supervised HMM tagger. Our initializer allocates probability mass to unambiguous transitions in an unlabeled corpus, generating token-level observations from type-level supervision. Experimentally, observational initialization gives state-of-the-art type-supervised tagging accuracy, providing an error reduction of 56% over uniform initialization on the Penn English Treebank.
- Beyond Left-to-Right: Multiple Decomposition Structures for SMT
Hui Zhang, Kristina Toutanova, Chris Quirk and Jianfeng Gao. NAACL-2013. [paper]
Standard phrase-based translation models do not explicitly model context dependence between translation units. As a result, they rely on large phrase pairs and target language models to recover contextual effects in translation. In this work, we explore language models over Minimal Translation Units (MTUs) to explicitly capture contextual dependencies across phrase boundaries in the channel model. As there is no single best direction in which contextual information should flow, we explore multiple decomposition structures as well as dynamic bidirectional decomposition. The resulting models are evaluated in an intrinsic task of lexical selection for MT as well as a full MT system, through n-best reranking. These experiments demonstrate that additional contextual modeling does indeed benefit a phrase-based system and that the direction of conditioning is important. Integrating multiple conditioning orders provides consistent benefit, and the most important directions differ by
- An exploration of forest-to-string translation: Does translation help or hurt parsing?
Hui Zhang and David Chiang. ACL-2012. [paper]
Syntax-based translation models that operate on the output of a source-language parser have been shown to perform better if allowed to choose from a set of possible parses. In this paper, we investigate whether this is because it allows the translation stage to overcome parser errors or to override the syntactic structure itself. We find that it is primarily the latter, but that under the right conditions, the translation stage does correct parser errors, improving parsing accuracy on the
- Non-isomorphic Forest Pair Translation.
Hui Zhang, Min Zhang, Haizhou Li, Eng Siong Chng. EMNLP-2010. [paper] [slides]
This paper studies two issues, non-isomorphic structure translation and target syntactic structure usage, for statistical machine translation in the context of forest-based tree to tree sequence translation. For the first issue, we propose a novel non-isomorphic translation framework to capture more non-isomorphic structure mappings than traditional tree-based and tree-sequence-based translation methods. For the second issue, we propose a parallel space searching method to generate hypotheses using a tree-to-string model and evaluate their syntactic goodness using a tree-to-tree/tree sequence model. This not only reduces the search complexity by merging spurious-ambiguity translation paths and solves the data sparseness issue in training, but also serves as a syntax-based target language model for better grammatical generation. Experimental results on the benchmark data show that our two proposed solutions are very effective, achieving significant performance improvements over baselines when applied to different translation models.
- Convolution Kernel over Packed Parse Forest.
Min Zhang, Hui Zhang, Haizhou Li. ACL-2010. [paper]
This paper proposes a convolution forest kernel to effectively explore rich structured features embedded in a packed parse forest. As opposed to the convolution tree kernel, the proposed forest kernel does not have to commit to a single best parse tree, and is thus able to explore very large object spaces and many more structured features embedded in a forest. This makes the proposed kernel more robust against parsing errors and data sparseness issues than the convolution tree kernel. The paper presents the formal definition of the convolution forest kernel and describes an algorithm for computing it efficiently. Experimental results on two NLP applications, relation extraction and semantic role labelling, show that the proposed forest kernel significantly outperforms the baseline of the convolution tree kernel.
- Forest-based Tree Sequence to String Translation Model.
Hui Zhang, Min Zhang, Haizhou Li, Aiti Aw, Chew Lim Tan. ACL-2009. [paper] [slides]
This paper proposes a forest-based tree sequence to string translation model for syntax-based statistical machine translation, which automatically learns tree sequence to string translation rules from word-aligned, source-side-parsed bilingual texts. The proposed model leverages the strengths of both tree sequence-based and forest-based translation models. Therefore, it can not only utilize a forest structure that compactly encodes an exponential number of parse trees but also capture non-syntactic translation equivalences with linguistically structured information through tree sequences. This makes our model potentially more robust to parse errors and structure divergence. Experimental results on the NIST MT-2003 Chinese-English translation task show that our method statistically significantly outperforms the four baseline systems.
- Fast Translation Rule Matching for Syntax-based Statistical Machine Translation.
Hui Zhang, Min Zhang, Haizhou Li, Chew Lim Tan. EMNLP-2009. [paper] [slides]
In a linguistically-motivated syntax-based translation system, the entire translation process is normally carried out in two steps: translation rule matching, and target sentence decoding using the matched rules. Both steps are very time-consuming due to the tremendous number of translation rules, the exhaustive search in translation rule matching and the complex nature of the translation task itself. In this paper, we propose a hyper-tree-based fast algorithm for translation rule matching. Experimental results on the NIST MT-2003 Chinese-English translation task show that our algorithm is at least 19 times faster in rule matching and saves 57% of overall translation time compared with previous methods when using large fragment translation rules.
- K-Best Combination of Syntactic Parsers.
Hui Zhang, Min Zhang, Chew Lim Tan, Haizhou Li. EMNLP-2009. [paper]
In this paper, we propose a linear model-based general framework to combine k-best parse outputs from multiple parsers. The proposed framework leverages the strengths of previous system combination and re-ranking techniques in parsing by integrating them into a linear model. As a result, it is able to fully utilize both the logarithm of the probability of each k-best parse tree from each individual parser and any additional useful features. For feature weight tuning, we compare the simulated-annealing algorithm and the perceptron algorithm. Our experiments are carried out on both the Chinese and English Penn Treebank syntactic parsing tasks by combining two state-of-the-art parsing models, a head-driven lexicalized model and a latent-annotation-based unlexicalized model. Experimental results show that our F-Scores of 85.45 on Chinese and 92.62 on English outperform the previously best-reported systems by 1.21 and 0.52, respectively.
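To make the k-best combination idea above concrete, here is a minimal sketch of scoring candidate parses with a linear model over per-parser log-probabilities and picking the best one. It is only an illustration under assumptions: the function name `combine_kbest`, the feature names `logprob_parser_a` and `logprob_parser_b`, the toy trees, and the weights are all hypothetical, and the weights are set by hand rather than tuned with simulated annealing or the perceptron as in the paper.

```python
from math import log


def combine_kbest(candidates, weights):
    """Return the candidate with the highest linear-model score.

    candidates: list of dicts with a 'tree' string and a 'features' dict
                mapping feature name -> value (hypothetical format).
    weights:    dict mapping feature name -> weight; missing names count as 0.
    """
    def score(candidate):
        # Linear model: weighted sum of the candidate's feature values.
        return sum(weights.get(name, 0.0) * value
                   for name, value in candidate["features"].items())

    return max(candidates, key=score)


# Toy pool of parses; probabilities are made up for illustration only.
candidates = [
    {"tree": "(S (NP I) (VP run))",
     "features": {"logprob_parser_a": log(0.6), "logprob_parser_b": log(0.2)}},
    {"tree": "(S (NP I) (NP run))",
     "features": {"logprob_parser_a": log(0.3), "logprob_parser_b": log(0.1)}},
]

# Hand-set weights (an assumption; the paper tunes these on held-out data).
weights = {"logprob_parser_a": 1.0, "logprob_parser_b": 0.5}

best = combine_kbest(candidates, weights)
print(best["tree"])
```

Additional features, such as tree depth or rule counts, could be added to each candidate's `features` dict without changing the scoring function.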
- Conference Review: ACL-IJCNLP-2009, EMNLP-2009, SIGIR-2009, NAACL-HLT-2010, ACL-2010, COLING-2010, EMNLP-2010
- Member of the Association for Computational Linguistics (07/2009 ~ present)
- Member of the Chinese and Oriental Languages Information Processing Society (07/2009 ~ present)
Hobbies: Long Distance Running, Badminton, Music