related reading: Jurafsky and Martin, chapter 25
Machine translation is one of the oldest and most obvious things to do with natural language processing.
Research has been ongoing for just about as long as there have been digital computers at all.
Why is it important? Well, there are lots and lots of languages in the world! Thousands!
It depends on how you divide them up! And that's often a political question more than a linguistic one. It takes some political power to say that the way your group speaks is a language rather than a dialect, or even rather than "they're just uneducated clods who can't speak correctly".
Bosnian, Croatian and Serbian are all considered (loudly!) to be different languages.
There are thirteen languages **in India** that have over ten million native speakers. Six of those have over 50 million native speakers. http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers_in_India
Worldwide, here are the biggest languages:
http://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers
Well, one possible simplest thing is that we could just look up each word in a dictionary, one at a time, and write down the first entry for that word.
Here's a sentence in Spanish! It's about the first lady of France, Valérie Trierweiler. From http://elpais.com/elpais/2012/11/21/gente/1353500917_790166.html
En ella declara que se siente aliviada de no ser ya objeto de polémica por sus declaraciones y que está tan cómoda en la residencia oficial que no le importaría repetir en un segundo mandato.
Here's the translation into English, via that approach.
In she ??? that to him to feel ??? of no to be already object of controversy for ??? declarations and that ??? so bureau in her residence official that no to her ??? to repeat in one second term of office.
I, uh, wouldn't be very proud to write a program that does that well.
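For concreteness, here's a minimal sketch of that word-by-word approach, assuming a hypothetical "dictionary" that maps each Spanish word to a list of English entries:

    def word_by_word(sentence, dictionary):
        # Look up each word independently; take the first entry, or ??? if it's missing.
        out = []
        for word in sentence.lower().split():
            entries = dictionary.get(word.strip(".,"))
            out.append(entries[0] if entries else "???")
        return " ".join(out)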
Let's try it with Google Translate!
It states that he feels relieved not longer subject to controversy for his remarks and is as comfortable in the official residence would not mind repeating for a second term.
That's significantly better! What does Google Translate know? What does it not know?
- Notably, it doesn't choke when it sees inflected forms of Spanish verbs! My dictionary didn't have most of the inflected verbs in it, hence the question marks.
- It knows what's up with "se siente": that's a phrase that hangs together -- it doesn't seem to try to translate those two words separately.
- It doesn't know about the gender of the subject. But that's actually ambiguous, without the context of the article. Words like "se" and "le" here don't carry gender, so Google Translate has to guess, and apparently masculine pronouns fall out of the numbers.
- It doesn't know what "ella" (she) is referring to, but we only gave it the one sentence! It turns out that "ella" here should be more like "it" in English, since it's referring to "una entrevista" (an interview).
- "not longer" is sort of a terrible n-gram. Curious how that came about.
- The long-distance dependency here is screwed up: "tan cómoda ... que ..." should come out as "so comfortable ... that ...", but the output loses the "that" clause.
Languages are just *different* from one another!
They express roughly the same information. It seems to be the case that you can express whatever idea you want in whatever language, but what you *must* include to be grammatical varies quite a lot across languages. Put another way, each language lets you leave different things vague.
Broadly, we call the grammatical differences between languages translation divergences, and there are quite a few different types.
Linguistic typology is the study of the systematic ways that features in languages seem to cluster. We talked before about different kinds of information that can be encoded in morphology, and how that can work: we've got isolating/analytic languages, then different kinds of morphological richness: agglutinative languages, fusional languages, etc.
Then there are syntactic differences between kinds of languages. Even if we were able to sensibly translate each word (and we've shown that this is hard), we still need to have some mechanism to get the output words in the right order.
English, Mandarin, German and Spanish are all SVO languages, meaning that they typically arrange sentences: subject, verb, object. Not all languages are like that!
There are six conceivable orders of subject, verb, and object! SVO and SOV are the most common. Orders that start with O (object-first) are apparently out there, but very rare!
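A toy illustration, assuming we've already chopped the sentence into its subject, verb, and object:

    # Canonical constituent orders for a few languages (Japanese is SOV, Irish is VSO).
    WORD_ORDERS = {"english": "SVO", "spanish": "SVO", "japanese": "SOV", "irish": "VSO"}

    def reorder(subject, verb, obj, language):
        # Arrange the three constituents in the language's canonical order.
        constituents = {"S": subject, "V": verb, "O": obj}
        return " ".join(constituents[slot] for slot in WORD_ORDERS[language])

So reorder("the cat", "ate", "the fish", "japanese") comes out "the cat the fish ate".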
Then there are adpositions and where you put them (prepositions versus postpositions), ergative/absolutive languages (versus nominative/accusative ones), and different ways of framing verbs (especially verbs of motion) ...
And Dorr's seven different divergences!!
- thematic: "I like Mary" vs Spanish "Me gusta María" (literally "Mary pleases me" -- the subject and object swap roles)
- promotional: "John usually goes home" vs Spanish "Juan suele ir a casa" (the adverb "usually" is promoted to the main verb "suele")
- demotional: "I like eating" vs German "Ich esse gern" (the main verb "like" is demoted to the adverb "gern")
- categorial: "I am hungry" vs German "Ich habe Hunger" (literally "I have hunger" -- the adjective becomes a noun)
- structural divergence: "John entered the house" vs Spanish "Juan entró en la casa" (Spanish needs the preposition "en")
- lexical: "broke into the room" versus "forced entry into the room" ...
- conflational: "The meaning of the sentence may be distributed to different words in the other language..." -- e.g., English "stab" packs the action and the instrument into one word, where Spanish says "dar puñaladas" (give knife-wounds)
You can think about different approaches to MT with the Vauquois triangle. This doesn't totally specify algorithms, of course, but it lets you situate them. You can add more analysis, and you can go to ever-more abstract representations.
This brings up the really interesting problem: we have a bunch of good formalisms for representing syntax. How do we represent the meaning of a sentence? Well, here's one possibility.
Break it down into three steps:
- analysis
- transfer
- generation
- That's the transfer approach -- like Apertium!! (there's a minimal sketch of the pipeline after this list)
- Another approach, example-based MT, is basically like case-based reasoning: translate by analogy with examples you've already seen.
- And then there's statistical MT: these days, we've got lots and lots of data -- let's make use of it!
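Here's that analysis/transfer/generation pipeline in miniature, with hypothetical stand-ins for each stage:

    def transfer_mt(sentence, analyze, transfer, generate):
        # analysis: source sentence -> source-language structure (e.g., a parse tree)
        source_structure = analyze(sentence)
        # transfer: map the source structure onto a target-language structure
        target_structure = transfer(source_structure)
        # generation: realize the target structure as a target-language sentence
        return generate(target_structure)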
The point of the statistical approach is that we have lots and lots of text these days. And also lots of compute power!
So just given this little bit of text, even if we don't know Italian, we can learn something about it! Which of these words is most likely to mean "food" in Italian, do you think?
(well, which word shows up in all of the "food" sentences?)
Which word in English shows up every time we see the Italian word "gatti" (cats)?
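We can code up that intuition directly: intersect the foreign-side vocabularies of all the sentence pairs whose English side contains the word we care about. A sketch, assuming a hypothetical "parallel" list of (English sentence, foreign sentence) string pairs:

    def candidates(parallel, english_word):
        # Foreign words that appear in *every* pair whose English side
        # contains english_word -- those are our translation candidates.
        common = None
        for english, foreign in parallel:
            if english_word in english.lower().split():
                words = set(foreign.lower().split())
                common = words if common is None else (common & words)
        return common or set()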
Remember that Noisy Channel Model idea we've been kicking around?
output = argmax_T P(T | S)
What does this one mean? This one means that we want to get the best T (target language sentence) -- it's the one that has the highest probability, conditioned on some source-language sentence S.
Very often, when you see people writing about probability in translation, they use E and F, for English and Foreign, or perhaps F for French. Because the IBM papers talk about French.
But those really just stand in for the source language and the target language.
Well, what did we do every single time we had a conditional probability in this class so far?
Every. Single. Time.
We used Bayes' Theorem!! So let's go ahead and do that. How do we rewrite P(T | S)?
argmax_T P(S | T) * P(T) / P(S)
OK, and P(S) is fixed anyway, so...
argmax_T P(S | T) * P(T)
Let's break this apart...
There are three parts I want to call out.
Starting from the right...
P(T) --> what do we call a function that lets us estimate the probability of a target language sentence? (A language model!)
P(S | T) ? We call that the translation model. There are lots of different ways to do this one, but we'll talk about it a bit more.
... and then argmax. Given that we have some way to estimate how good a translation is, for a given input sentence, all we have to do is try all possible translations!! Easy, right?
Just try all length one translations, then all length two translations...
There's only a countably infinite number of sequences of words in a language, right?
Managing this with some sensible search procedure, that's called the decoder.
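In toy form, if someone handed us a finite candidate set and the two models, the argmax would just be this (tm_prob and lm_prob are hypothetical stand-ins for the translation model and the language model):

    def best_translation(source, candidates, tm_prob, lm_prob):
        # Noisy channel objective: argmax_T P(S | T) * P(T).
        # (P(S) is constant across candidates, so we drop it.)
        return max(candidates, key=lambda t: tm_prob(source, t) * lm_prob(t))

The real decoder's job is generating and pruning that candidate set cleverly, since we can't literally enumerate all sentences.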
- What do you need in order to learn a language model?
We talked about this already! You can do something as simple as an n-gram language model!
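For instance, a minimal bigram language model with maximum-likelihood estimates -- a sketch assuming pre-tokenized training sentences, with no smoothing (so unseen bigrams get probability zero):

    from collections import Counter

    def train_bigram_lm(sentences):
        # Count unigram contexts and bigrams, padding with boundary markers.
        unigrams, bigrams = Counter(), Counter()
        for tokens in sentences:
            padded = ["<s>"] + tokens + ["</s>"]
            unigrams.update(padded[:-1])
            bigrams.update(zip(padded[:-1], padded[1:]))
        return unigrams, bigrams

    def sentence_prob(tokens, unigrams, bigrams):
        # P(sentence) as a product of MLE bigram probabilities.
        p = 1.0
        padded = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(padded[:-1], padded[1:]):
            p *= (bigrams[(prev, word)] / unigrams[prev]) if unigrams[prev] else 0.0
        return p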
Notably, to estimate whether one sentence is a translation of the other sentence, P(S | T), we have to know whether the *words* are translations of one another.
Say you don't have a dictionary -- what could you do?
What's an alignment? An alignment is your idea about which words "caused" which other words. Given two sentences, say one about the cats given earlier, we can write down an alignment.
There are many different approaches to alignment -- this is a fairly active topic in MT research, this particular sub-problem.
And if we had a translation table, then we could get the alignments, now couldn't we? But each needs the other first -- a chicken-and-egg problem, which is exactly what EM (expectation maximization) is for.
Break this down into two parts:
E-step: we end up getting the expected (as in "expected value", from probability) counts for each target word translating to each source word... (improve the alignments)
M-step: get the maximum likelihood for the translation probabilities (improve the translation table).
So basically we just *guess* some alignments, use them to set the translation table.
Then we use our new translation table to get expected alignments...
Then new alignments, we use to get a better translation table... and so on.
Iterate until we feel like we're done!
(for a good description of this, see Jurafsky & Martin, section 25.6 )
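Here's that loop in miniature, in the style of IBM Model 1 -- a sketch under simplifying assumptions (no NULL word, no smoothing), where "pairs" is a hypothetical list of (source tokens, target tokens) sentence pairs:

    from collections import defaultdict

    def ibm_model_1(pairs, iterations=10):
        # Start from a uniform translation table t(s | e).
        source_vocab = {s for src, _ in pairs for s in src}
        t = defaultdict(lambda: 1.0 / len(source_vocab))
        for _ in range(iterations):
            count = defaultdict(float)  # expected counts of s aligning to e
            total = defaultdict(float)
            # E-step: collect expected counts under the current table.
            for src, tgt in pairs:
                for s in src:
                    z = sum(t[(s, e)] for e in tgt)  # normalize over possible alignments
                    for e in tgt:
                        count[(s, e)] += t[(s, e)] / z
                        total[e] += t[(s, e)] / z
            # M-step: maximum-likelihood re-estimate of the translation table.
            t = defaultdict(float, {(s, e): count[(s, e)] / total[e] for (s, e) in count})
        return t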
Simplest possible decoder: search for complete sentences!! (from J&M, page 891)
function stack_decoding(source):
initialize stack with null hypothesis
loop:
pop best hypothesis h off the stack
if h is a complete sentence, return h
for each possible expansion h' of h:
assign a score to h'
push h' onto the stack
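And a runnable version of that pseudocode, assuming caller-supplied functions for expanding a hypothesis, scoring it, and deciding whether it's a complete sentence (hypotheses here are tuples of target words):

    import heapq

    def stack_decoding(source, expand, score, complete):
        # Priority queue keyed by negated score, so the best hypothesis pops first.
        stack = [(0.0, ())]  # the null hypothesis: no target words yet
        while stack:
            _, h = heapq.heappop(stack)  # pop best hypothesis h off the stack
            if complete(h, source):
                return h
            for h_prime in expand(h, source):  # each possible expansion h' of h
                heapq.heappush(stack, (-score(h_prime, source), h_prime))
        return None  # ran out of hypotheses without completing a sentence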
All this says, really:
- In principle, it's purely data-driven. And language-independent.
- Often works pretty well in practice!
- We start out with a linguistic problem, and we turn it into a machine learning problem.
-> And then if you're, for example, Google, you turn machine learning problems into systems problems...
- What if you don't have a lot of data?
-> what are scenarios where you don't have a lot of data? Quechua, Amharic, Guaraní ...
- Unlikely translation pairs? Welsh <-> Mandarin?
- What if your words are complex? In Amharic, for example...
ባይከፈትላቸውም --> "even if it isn't opened for them"
- What if you want *good* output?
We've been talking this whole time about MT, without really discussing what the output is going to be used for. What if you want to publish the output? Like as a book, or as a press release for your organization?
You could have a person look over it afterwards (post-editing)
You could do what's called Computer-Aided Translation (CAT).
You could have somebody build a domain-specific system -- there are a bunch of companies who will do this for you nowadays. Microsoft is getting into it too...
Hybridization. Including more and more syntactic and morphological knowledge.
New ways to get more data: DuoLingo, crowdsourcing...
Joint models? ...
Jurafsky and Martin, chapter 25
About Bernard Vauquois: http://www.mt-archive.info/CL-1986-Vauquois.pdf
Dorr, Hovy & Levin. "Machine Translation: Interlingual Methods." In Encyclopedia of Language and Linguistics.
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.58.5324&rep=rep1&type=pdf
http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm
Statistical Machine Translation by Philipp Koehn: http://statmt.org/book