Mining Bitext

Group Members: Bowen Chu, Lingbo Hu, Yizhan Wu

This project introduces a methodology for constructing aligned English-Simplified Chinese corpora from movie subtitles. Bilingual subtitles usually present viewers with sentence pairs that the subtitle author has aligned by hand. Because the common length-based alignment algorithms perform poorly on short spoken sentences, we present a simple method that uses statistical lexical cues to align the subtitles. This approach is also a viable way to improve machine translation systems.
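
To make the idea of a lexical cue concrete, here is a minimal sketch and not the project's actual implementation. It assumes a tiny hand-written English-to-Chinese lexicon (the hypothetical LEXICON below), scores each candidate pair by the fraction of English words whose translation appears in the Chinese line, and searches only a small window of nearby lines. The names lexical_score and align, the window size, and the threshold are all illustrative assumptions. A real system would presumably estimate the lexicon statistically from data.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon (assumption): English word -> possible Chinese renderings.
LEXICON: Dict[str, Set[str]] = {
    "hello": {"你好"},
    "thank": {"谢谢"},
    "thanks": {"谢谢"},
    "you": {"你"},
    "where": {"哪里", "哪儿"},
    "go": {"去"},
}


def lexical_score(en: str, zh: str, lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of English words with a lexicon translation found in the Chinese line."""
    words = [w.strip(".,!?").lower() for w in en.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    # Substring matching avoids the need for Chinese word segmentation (a simplification).
    hits = sum(1 for w in words if any(t in zh for t in lexicon.get(w, ())))
    return hits / len(words)


def align(
    en_lines: List[str],
    zh_lines: List[str],
    lexicon: Dict[str, Set[str]],
    threshold: float = 0.3,
    window: int = 2,
) -> List[Tuple[int, int, float]]:
    """For each English line, pick the best-scoring Chinese line within +/- window positions."""
    pairs = []
    for i, en in enumerate(en_lines):
        lo = max(0, i - window)
        hi = min(len(zh_lines), i + window + 1)
        best_j, best = -1, 0.0
        for j in range(lo, hi):
            score = lexical_score(en, zh_lines[j], lexicon)
            if score > best:
                best_j, best = j, score
        # Keep the pair only if the lexical evidence is strong enough.
        if best_j >= 0 and best >= threshold:
            pairs.append((i, best_j, best))
    return pairs


en_lines = ["Hello!", "Where are you going?", "Thank you."]
zh_lines = ["你好！", "你要去哪里？", "谢谢你。"]
alignment = align(en_lines, zh_lines, LEXICON)
```

Unlike a length-based aligner, this score does not depend on how long the two lines are, which is why lexical evidence can still discriminate between short spoken utterances of similar length.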