This seminar course introduces students to research in machine translation.
- Each student will participate in leading class discussions once as presenter of one paper set.
- Each student will give one "Language in 10 minutes" presentation (alone or in a group of two).
- All students (except for the paper-set presenter) are expected to submit a short write-up (no more than one page per paper) addressing all of the following:
  - Summary of the paper: What is the problem definition? What is the approach? What are the results/conclusions?
  - Critical reading: pick three aspects of the paper that you either really like or really hate. Explain why.
  - Questions for discussion: write up two or more questions that you think would be interesting to discuss in class.
Please bring the write-up to class and hand it in to the instructor afterwards. If you cannot attend the class, send the write-up by email prior to the class. Please type the write-up; handwritten write-ups are not accepted. Grades for the write-ups will be part of class participation. Expect to be asked in class to participate and tell us what your questions/likes/dislikes are.
Students will prepare a term project. This will include submitting a project proposal, an interim report, and a final report, in addition to a demo and a class presentation. Students can work individually or in teams.
All students are required to have a Computer Science account for this class. To sign up for one, go to the CRF website and click on "Apply for an Account".
Each session, except as indicated in the syllabus, will include:
- 10-15 minutes: "Language in 10 minutes" introduction to some language: facts about it, its available resources, how much MT work is done on it, how it is interesting research-wise, etc. 6-8 slides as needed. Examples are good.
- 45 minutes: first paper set (can be one or two topic-related papers) to be presented by a student presenter (~20 minutes) followed by group discussion (~25 minutes).
- 10 minute break
- 45 minutes: second paper set (same format as the first paper set)
Grades in this course will be calculated as follows:
- 10% MT Lab report + 1-minute presentation of what-worked-what-didn't
- 20% Class presentation
- 20% Class participation (readings and discussions)
- 5% Project proposal report (1 page plan + 1-minute presentation)
- 10% Project Midterm report
- 35% Final project report (25%), presentation (10%) and demo (pass/fail)
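As a rough illustration of how the percentages above combine, here is a minimal sketch in Python. The component names are hypothetical labels chosen for this example, and it assumes every component is scored on a 0-100 scale. The pass/fail demo is left out because the syllabus does not say how it enters the numeric grade.

```python
# Weights (percent of the course grade) taken from the grading list above.
# The dictionary keys are hypothetical names chosen for this sketch.
WEIGHTS = {
    "mt_lab": 10,
    "class_presentation": 20,
    "class_participation": 20,
    "project_proposal": 5,
    "project_midterm": 10,
    "final_report": 25,
    "final_presentation": 10,
}


def course_grade(scores):
    """Return the weighted course grade for per-component scores in [0, 100]."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Sanity check: the listed percentages account for the whole grade.
assert sum(WEIGHTS.values()) == 100
```

For example, a student scoring 80 on every component except a 100 on the final report would end up with `course_grade(...) == 85.0` under these assumptions.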
All deadlines are by 11:59pm (ET) of the due date unless otherwise specified.
- MT Lab deliverables due Feb 15, 2013 (midnight)
- MT Lab 1-minute presentation Feb 21, 2013 (in class)
- Project proposal due Feb 28, 2013 (midnight)
- Project proposal 1-minute presentation due Feb 28 (in class)
- Project midterm report due Mar 28, 2013 (midnight)
- Project presentation due May 2, 2013 (in class)
- Project demo should be scheduled with the TA in the week of May 2-May 9.
- Project final report due May 9, 2013 (midnight)
Copying or paraphrasing someone's work (code included), or permitting your own work to be copied or paraphrased, even if only in part, is not allowed, and will result in an automatic grade of 0 for the entire assignment or exam in which the copying or paraphrasing was done. Your grade should reflect your own work. If you believe you are going to have trouble completing an assignment, please talk to the instructor or TA in advance of the due date.
Philipp Koehn's book Statistical Machine Translation is recommended but not required.
All required readings will be available online.
||Topic & Readings
||Reports & HW
Collect language expertise information
||MT Lab assigned: build an SMT system in two weeks (due Feb 15)
Project proposal ideas/description due Feb 28 at class time. Plan to meet with the instructors ahead of time to discuss your proposal.
- Language in 10 minutes (French, Presenter: Nadi)
- Decoding for phrase-based translation models
- Discriminative training of translation models
|MT Lab deliverables due tomorrow Feb 15th at midnight!|
EXTENDED TO SUNDAY Feb 17th midnight!
||MT Lab 1-minute presentation of what worked or didn't
- Ann Clifton & Anoop Sarkar: Combining Morpheme-based Machine Translation with Post-processing Morpheme Prediction. ACL 2011.
- Philip Williams & Philipp Koehn: Agreement constraints for statistical machine translation into German. Workshop on Statistical Machine Translation, 2011.
|Project proposal due today!
Project 1-minute presentations today!
Assigned: Project midterm report (lit review, initial results & analysis; 6-8 pages double spaced) due Mar 14. More details here.
- Language in 10 minutes (Swedish, Presenter: Louis)
- Project Proposal Discussion (all students)
- Project proposal 1-minute presentation
- Mei Yang & Katrin Kirchhoff: Phrase-Based Backoff Models for Machine Translation of Highly Inflected Languages. EACL 2006.
- Nizar Habash: Four techniques for online handling of out-of-vocabulary words in Arabic-English statistical machine translation. ACL 2008.
||Project midterm report due today! DEADLINE EXTENDED TILL AFTER SPRING BREAK (MAR 28)
|Mar 21||Spring Recess - Have fun!||
||NEW DEADLINE: Project midterm report due today!
- A.Cüneyd Tantuğ, Eşref Adali, & Kemal Oflazer: Machine translation between Turkic languages. ACL 2007.
- Nadir Durrani, Hassan Sajjad, Alexander Fraser & Helmut Schmid: Hindi-to-Urdu Machine Translation through Transliteration. ACL 2010.
- Language in 10 minutes (Tagalog, Presenter: Sadegh)
- SET 1 (Presenter: Venkataraman)
- SET 2 (Presenter: Class Discussion)
- John DeNero, Dan Gillick, James Zhang and Dan Klein. Why generative phrase models underperform surface heuristics. NAACL 2006.
- Mohit Bansal, Chris Quirk and Robert C. Moore. Gappy phrasal alignment by agreement. ACL 2011.
- John DeNero, Alexandre Bouchard-Côté and Dan Klein. Sampling alignment structure under a Bayesian translation model. EMNLP 2008.
||Project final report is due May 9
Have a Good Summer!
This is a short presentation of around 10 minutes on a particular language, e.g., Arabic, Chinese, Czech, Hindi, Italian, Ewe, or Maltese.
The student will prepare three to six slides on a language they do not speak natively. The slides must cover (1) language facts (demographics, location, etc.), (2) important linguistic characteristics (orthography, morphology, syntax), and (3) computational efforts such as resources, tools, and papers, e.g., how many entries are in the MT Archive, and what are they generally on? Be creative and have fun with this. Asking for help from native speakers or language experts is OK, but the student is ultimately responsible for the presentation.
Examples from previous presentations are also available here.
Resources that can help your preparation of slides:
Here are some ideas for projects.
- Improve on a baseline SMT system; any of the papers you read can be a base for you to improve on.
- smart OOV handling
- improving word alignment
- learning models of syntax reordering in MT
- using rich resources in English to improve translation into English or from English, e.g. English parsing for translation into Arabic
- Work on Named Entity Transliteration.
(How many ways to spell Qadafi: Kadhafy, Kaddafi, Gadaffy, etc.)
- Smart pivoting through a third language. Can we learn these constraints automatically?
- Using MT to improve monolingual tools by using parallel text to automatically annotate data for morphology and/or syntax.
- Use other language pairs to improve your system:
- Other ideas?
Look up papers in the ACL anthology/MT Archive about topics and languages that interest you.
The midterm report must include the following:
a. Introduction and problem definition
b. Literature review (at least 5 papers)
c. Description of resources used. This may include statistics on the data, OOV rates, and the like.
d. Baseline results (comparable to MT lab but for your language).
e. Analysis of errors in the baseline based on a sample (at least 20 sentences, looking at the English side only); focus on the problem you are targeting.
f. Bibliography of cited papers.
The midterm report should be about half the length of a conference paper (about 3-4 pages single spaced or 6-8 pages double spaced).
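For the OOV rates mentioned in item (c), one possible way to compute them is sketched below. It is an illustration, not a required method. It assumes whitespace tokenization and takes the vocabulary from the source side of the training data. The function name `oov_rates` and the toy sentences are hypothetical.

```python
def oov_rates(train_sentences, test_sentences):
    """Return (token OOV rate, type OOV rate) of the test set.

    A test word is out-of-vocabulary (OOV) if it never occurs in the
    training sentences. Tokenization is plain whitespace splitting,
    which is an assumption of this sketch.
    """
    vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    if not test_tokens:
        raise ValueError("test set contains no tokens")
    test_types = set(test_tokens)
    # Token rate counts every occurrence; type rate counts each distinct word once.
    token_rate = sum(tok not in vocab for tok in test_tokens) / len(test_tokens)
    type_rate = len(test_types - vocab) / len(test_types)
    return token_rate, type_rate


# Toy data for illustration only.
train = ["the cat sat", "the dog ran"]
test = ["the cat ran fast", "the bird sat"]
token_rate, type_rate = oov_rates(train, test)
print(f"token OOV: {token_rate:.1%}, type OOV: {type_rate:.1%}")
```

Reporting both numbers is often useful: the token rate reflects how much running text is affected, while the type rate reflects how many distinct words the system has never seen.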
The final report should be in the style of an ACL publication: 8 pages, double column, plus any number of pages for references. You will see many examples in the class.