COMS E6998: Topics in Computer Science

Machine Translation

Spring 2013

Syllabus

Time: R 4:10pm-6:00pm

Place: SEELEY W. MUDD 327

Instructors: Dr. Nizar Habash and Dr. Nadi Tomeh

Email: habash #åţ# ccls.columbia.edu

Office: 212-870-1289

Office Hours: R 6-8 at the Center for Computational Learning Systems

Teaching Assistant: Wael Salloum


Description

This seminar course introduces students to research in machine translation.

Requirements

  • Each student will participate in leading class discussions once as presenter of one paper set.
  • Each student will give one "Language in 10 Minutes" presentation (alone or in a group of two).
  • All students (except for the paper-set presenter) are expected to submit a short write-up (no more than one page per paper) addressing all of the following:
    1. Summary of the paper: What is the problem definition? What is the approach? What are the results/conclusions?
    2. Critical reading: pick three aspects of the paper that you either really like or really hate, and explain why.
    3. Questions for discussion: write up two or more questions that you think would be interesting to discuss in class.
    4. Please bring the write-up to class and hand it in to the instructor afterwards. If you cannot attend, send the write-up by email before class. Write-ups must be typed; handwritten write-ups are not accepted. Grades for the write-ups count toward class participation. Expect to be asked in class to share your questions, likes, and dislikes.
  • Students will prepare a term project. This includes submitting a project proposal, an interim report, and a final report, in addition to giving a demo and a class presentation. Students can work individually or in teams.
  • All students are required to have a Computer Science Account for this class. To sign up for one, go to the CRF website and then click on "Apply for an Account".

Each session except as indicated in the Syllabus will include:

  • 10-15 minutes: an introduction to some language, covering facts about it, its available resources, how much MT work has been done on it, what makes it interesting for research, etc. Use 6-8 slides as needed. Examples are good.
  • 45 minutes: first paper set (one or two topic-related papers), presented by a student presenter (~20 minutes) and followed by group discussion (~25 minutes).
  • 10 minutes: break
  • 45 minutes: second paper set (same format as the first paper set).

Grades in this course will be calculated as follows:

  • 10% MT Lab report + 1-minute presentation of what-worked-what-didn't
  • 20% Class presentation
  • 20% Class participation (readings and discussions)
  • 5% Project proposal report (1 page plan + 1-minute presentation)
  • 10% Project Midterm report
  • 35% Final project report (25%), presentation (10%) and demo (pass/fail)
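As a rough illustration of how the weights above combine, here is a minimal sketch. It assumes every component is scored on a 0-100 scale and that the pass/fail demo carries no numeric weight; both are assumptions made for the example, not stated course policy, and the names `WEIGHTS` and `course_grade` are hypothetical.

```python
# Weights copied from the grading breakdown; the 35% final-project share
# is split into report (25%) and presentation (10%) as listed.
# Assumption: the pass/fail demo contributes no numeric weight.
WEIGHTS = {
    "mt_lab": 0.10,
    "class_presentation": 0.20,
    "class_participation": 0.20,
    "project_proposal": 0.05,
    "project_midterm": 0.10,
    "final_report": 0.25,
    "final_presentation": 0.10,
}


def course_grade(scores):
    """Return the weighted sum of component scores (each assumed 0-100)."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    example = {
        "mt_lab": 90,
        "class_presentation": 85,
        "class_participation": 95,
        "project_proposal": 100,
        "project_midterm": 80,
        "final_report": 88,
        "final_presentation": 92,
    }
    print(f"Weighted course grade: {course_grade(example):.1f}")
```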

Important Dates

All deadlines are by 11:59pm (ET) of the due date unless otherwise specified.

  1. MT Lab deliverables due Feb 15, 2013 (midnight)
  2. MT Lab 1-minute presentation Feb 21, 2013 (in class)
  3. Project proposal due Feb 28, 2013 (midnight)
  4. Project proposal 1-minute presentation due Feb 28 (in class)
  5. Project midterm report due Mar 28, 2013 (midnight)
  6. Project presentation due May 2, 2013 (in class)
  7. Project demo should be scheduled with the TA in the week of May 2-May 9.
  8. Project final report due May 9, 2013 (midnight)

Academic Integrity

Copying or paraphrasing someone's work (code included), or permitting your own work to be copied or paraphrased, even if only in part, is not allowed, and will result in an automatic grade of 0 for the entire assignment or exam in which the copying or paraphrasing was done. Your grade should reflect your own work. If you believe you are going to have trouble completing an assignment, please talk to the instructor or TA in advance of the due date.

Readings

Philipp Koehn's book Statistical Machine Translation is recommended but not required.

All required readings will be available online.

Resources

Syllabus

Language in 10 Minutes

This is a short presentation of around 10 minutes on a particular language, e.g., Arabic, Chinese, Czech, Hindi, Italian, Ewe, or Maltese.

The student will prepare three to six slides on a language they do not speak natively. The slides must cover (1) language facts (demographics, location, etc.), (2) important linguistic characteristics (orthography, morphology, syntax), and (3) computational efforts such as resources, tools, and papers, e.g., how many entries are in the MT Archive, and what are they generally about? Be creative and have fun with this. Asking native speakers or language experts for help is fine, but the student is ultimately responsible for the presentation.

Examples from previous presentations are also available here.


Resources that can help your preparation of slides:

Project Ideas

  • Improve on a baseline SMT system; any of the papers you read can be a base for you to improve on.
    • smart OOV handling
    • improving word alignment
    • learning models of syntax reordering in MT
    • using rich resources in English to improve translation into English or from English, e.g. English parsing for translation into Arabic
  • Work on Named Entity Transliteration. (How many ways are there to spell Qadafi: Kadhafy, Kaddafi, Gadaffy, etc.?)
  • Smart pivoting through a third language. Can we learn these constraints automatically?
  • Using MT to improve monolingual tools by using parallel text to automatically annotate data for morphology and/or syntax.
  • Use other language pairs to improve your system:
  • Other ideas?
  • Look up papers in the ACL Anthology/MT Archive on topics and languages that interest you.

Midterm Report

The midterm report must include the following:

a. Introduction and problem definition

b. Literature review (at least 5 papers)

c. Description of resources used. This may include statistics on the data, OOV rates, and the like.

d. Baseline results (comparable to MT lab but for your language).

e. Analysis of errors in the baseline based on a sample (at least 20 sentences, looking at the English side only); focus on the problem you are targeting.

f. Bibliography of cited papers.

The midterm report should be about half the length of a conference paper (i.e., 3-4 pages single-spaced or 6-8 pages double-spaced).
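For the OOV rates mentioned under the description of resources, one possible way to compute them is sketched below. It assumes whitespace tokenization and defines an OOV word as a test token that never appears in the training data; the function name `oov_rates` and the toy sentences are hypothetical, and your own tokenization and data will differ.

```python
def oov_rates(train_sentences, test_sentences):
    """Return (token-level, type-level) OOV rates of the test set.

    Assumption: sentences are already tokenized and tokens are separated
    by whitespace. A token is OOV if it never occurs in the training data.
    """
    train_vocab = {tok for sent in train_sentences for tok in sent.split()}
    test_tokens = [tok for sent in test_sentences for tok in sent.split()]
    if not test_tokens:
        return 0.0, 0.0

    # Token-level rate: every occurrence of an unseen word counts.
    oov_token_count = sum(1 for tok in test_tokens if tok not in train_vocab)
    token_rate = oov_token_count / len(test_tokens)

    # Type-level rate: each distinct unseen word counts once.
    test_types = set(test_tokens)
    type_rate = len(test_types - train_vocab) / len(test_types)
    return token_rate, type_rate


if __name__ == "__main__":
    # Toy data, for illustration only.
    train = ["the cat sat", "a dog"]
    test = ["the dog ran", "the cat"]
    token_rate, type_rate = oov_rates(train, test)
    print(f"token OOV rate: {token_rate:.2%}, type OOV rate: {type_rate:.2%}")
```

Reporting both rates is a design choice: the token-level rate shows how often the system meets unknown words in running text, while the type-level rate shows how much of the test vocabulary is unseen.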

Final Report

The final report should be in the style of an ACL publication: 8 pages, double column, plus any number of pages for references. You will see many examples in class.