Juan Daniel Alvarez Becerra

Hi!

You saw my name at the top, but you can just call me Daniel.

I'm 19 years old, and I'm in my last semester of the languages program. Throughout these two years, I have to say the class I've loved the most has been Linguistics: being able to see language from such an elemental perspective and work our way through its basic functioning was surprising, to say the least. After CEGEP, I would love to study either Linguistics or Literature, ideally at McGill.

I love reading, playing video games, and music. I would love to learn to play an instrument well. I also like writing, but I still need to get into the habit of writing regularly.


Conlang


I already have a basis for a possible project, since it's something we developed in the course "A Way with Words" during our second semester. In it, I explained the basic history and functioning of a conlang (constructed language) named Quenya, created by J.R.R. Tolkien for his works set in Middle-earth. I still have a lot of the information I compiled, from a small lexicon to several grammar and alphabet rules, so it could become a solid choice.


Translation

There is, however, another option I thought of that could be fun and really interesting to talk about: translation. More specifically, how information gets distorted and sometimes changed altogether when switching from one language to another. We have seen this phenomenon in many forms, from celebrating Valentine's Day twice as a result of a mistranslated marketing campaign, to a small ambiguity in a word that led to almost 200,000 deaths in the bombings of Hiroshima and Nagasaki. I have researched other cases where translation has been a cause of confusion, but I'm still not sure if there are enough of them to write an entire text about. Further research is required.

If serious cases of mistranslation are not enough, I could also focus on the dangers and problems translators face, especially with local texts: how do you reflect an economic or social situation depicted in a book when you don't have knowledge of the culture, nor the time to get properly acquainted with it? How do you carry the thoughts of an author into an entirely different language while still keeping their essence and the message they wanted to convey?






Video Presentation


[Embedded video: Video Presentation.mp4]




SOURCE ANALYSIS

On Idioms



Cacciari, Cristina, and Patrizia Tabossi. "The Comprehension of Idioms." Journal of Memory and Language, vol. 27, December 1988, pp. 668-683.

This is an academic article by researchers at the University of Bologna. It presents the results of an experiment and acts as a preamble to the next article I'll be analyzing, since it discusses two main hypotheses of idiom recognition: the Lexical Representation Hypothesis, under which idioms are treated as entities separate from our normal recognition process, learned only by memorizing a "mental idiom list"; and the hypothesis that we always analyze the literal meaning of a phrase first, and only search for its idiomatic meaning if the literal one doesn't fit the context.

To test these hypotheses, an experiment was designed using idioms that had no literal equivalent (phrases that didn't make sense if taken literally and therefore could only be understood as idioms), but that wouldn't be recognized as such until the last word (e.g. "he was in the seventh heaven" to mean "extremely happy"). Participants were then asked to associate the ambiguous word (in this case, "heaven") with one of three words: "saint" (semantically related to the word), "happy" (semantically related to the idiom) and "umbrella" (control word). If the previous hypotheses were correct, participants would first associate it with the word "saint".

Three different experiments were conducted, each one ensuring the results were as reliable as possible, and they showed that most people reacted faster to an idiomatic recognition only when the idiom was well-known or predictable. The general conclusion, however, is that both modes of recognition occur in parallel, but the idiomatic one gains relevance only after the appearance of a "key" in the phrase. There is no way of specifying what this key is, as it varies depending on the idiom: it can be a preposition, one specific word, one adjective; sometimes it isn't even clear what the key could be (as in "kick the bucket"). In the study's words: "[...] a pattern of results emerges that suggests differences in the way in which idioms can be recognized depending upon how early in the string they become identifiable."


Thoughts


This article doesn't really stand on its own, since it functions as the introduction to the concepts Cristina Cacciari develops in the next article. However, it does present an interesting concept that isn't mentioned as much there: the presence of a "key" that switches the literal meaning of a phrase to its idiomatic counterpart, and the flexibility this key can have. Not only does it make me see idioms in a totally new light, revealing the thin line that separates literal phrases from idioms, but it could also shed some light on the process we could use to program good artificial translators: having them look for the element (or elements) that make a literal understanding of a phrase impossible in a particular context, so they can learn to differentiate literal from idiomatic expressions.
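Out of curiosity, here is a toy Python sketch of what that "key spotting" could look like. The idiom list and the choice of key word for each idiom are invented examples of mine, not the article's actual stimuli:

    # Toy sketch: report the point at which an idiom's "key" word is read,
    # i.e. the moment its idiomatic meaning becomes available.
    IDIOMS = {
        "in the seventh heaven": "heaven",   # key word assumed for illustration
        "spill the beans": "beans",
    }

    def key_positions(sentence):
        """Return (idiom, index of its key word) for every idiom found."""
        tokens = sentence.lower().split()
        hits = []
        for idiom, key in IDIOMS.items():
            words = idiom.split()
            for start in range(len(tokens) - len(words) + 1):
                if tokens[start:start + len(words)] == words:
                    hits.append((idiom, start + words.index(key)))
        return hits

    print(key_positions("he was in the seventh heaven"))
    # -> [('in the seventh heaven', 5)]: the idiom is only confirmed at token 5

A real recognizer would of course have to find the key without already knowing the full idiom; this only illustrates where in the string the switch happens.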




Cacciari, Cristina, and Sam Glucksberg. "Understanding Idiomatic Expressions: The Contribution of Word Meanings." Advances in Psychology, vol. 77, 1991, pp. 217-240.

This is an academic article from the University of Bologna, published in the Advances in Psychology series. The main author, Cristina Cacciari, cites herself several times throughout the article, which lends credibility and experience to her claims. Given the amount of information it contains, we can safely assume it was meant to add to a body of research; in this case, to the psycholinguistic approach to idioms.

This article can be seen as a continuation of the previous one, where Cacciari refuted some of the prevalent theories about idiom recognition. Here, she tries to place idioms into categories, explaining the different theories concerning their comprehension: some treat idioms as separate from other language processes (non-compositional), while others treat them as part of normal language use (compositional).

Non-Compositional Theories: they try to explain why we understand the figurative sense of an idiom before its literal sense. Different hypotheses exist:

  1. Having a mental "idiom list" to which we refer when we don't understand the literal meaning of a phrase (Lexical Representation Hypothesis)

  2. Treating idioms as "long words", because word recognition is faster than phrase recognition

  3. Bypassing the literal meaning entirely thanks to context.

However, all of these hypotheses have been refuted.


Compositional Theories:

Configuration hypothesis: the literal meanings of words are activated during recognition at the same time as the idiomatic meaning (the phrase is treated as a single configuration, not just as the words it's composed of), and they remain activated (as seen with discourse productivity below).

The literal meanings of words play different roles in these theories, especially:

  1. In immediate idiom recognition,

  2. in the lexical flexibility of idioms: altering certain words of a known idiom without it losing its meaning ("crack the ice" for "break the ice"),

  3. in semantic productivity: including idioms in otherwise literal phrases ("no matter how much they tortured him, he didn't spill a single bean"; here we see how semantic elements of an idiom can be used to create a completely new variant),

  4. and even in discourse productivity: adapting conversation and lexicon to create idiomatic exchanges:

  • A: "Did he kick the bucket last night?" B: "Nah, he barely nudged it."

In this example, "nudged" stands to "kicked" as "came nowhere near dying" stands to "died": B answers the idiomatic verb "kick" with the literal verb "nudge", reusing the idiom's own imagery.


This article also draws a distinction between decomposable and non-decomposable idioms:

Decomposable: those whose parts bear a semantic relation to what they're trying to convey ("pop" (suddenly ask) "the question" (a marriage proposal)).

Non-decomposable: those that don't have this relation between the idiomatic and the literal meaning ("kick the bucket" for dying).

There are also abnormally decomposable idioms (e.g. "spill the beans": there is no clear relation between "beans" and "secrets", but the idiom can still be understood once the mapping is made explicit ("spill the secrets")).

The fact that people understand variations of an idiom (capable of understanding "didn't spill a single bean", for example) proves that idioms are not understood as a unitary string, since the words inside the idiom can be used to generate and interpret idiom variants.

Decomposable idioms are more lexically flexible than non-decomposable idioms (very understandably)
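Since decomposable idioms tolerate so much variation, a translator program couldn't just look them up as fixed strings. Here is a minimal Python sketch of matching an idiom variant through content-word lemmas instead of the exact wording; the idiom list, stopword list and one-line "lemmatizer" are crude assumptions of mine:

    IDIOMS = {"spill the beans": "reveal the secret",
              "break the ice": "ease the tension"}

    STOPWORDS = {"the", "a", "an", "he", "she", "didn't", "single"}

    def crude_lemma(word):
        # Strip a plural "s" so "beans" matches "bean"; stands in for real lemmatization.
        return word[:-1] if word.endswith("s") else word

    def content_lemmas(text):
        return {crude_lemma(w) for w in text.lower().split() if w not in STOPWORDS}

    def match_idiom_variant(phrase):
        """Return the idiom whose content lemmas all appear in the phrase."""
        phrase_lemmas = content_lemmas(phrase)
        for idiom, meaning in IDIOMS.items():
            if content_lemmas(idiom) <= phrase_lemmas:
                return idiom, meaning
        return None

    print(match_idiom_variant("he didn't spill a single bean"))
    # -> ('spill the beans', 'reveal the secret')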

In conclusion, "People cannot isolate or ignore the meanings of words or the meanings of phrases when engaging in discourse. At the same time, people rely on familiar, memorized 'chunks' of speech whose meanings derive not only from the language itself, but from their role in everyday experience." (Cacciari)


Thoughts


As I was reading this, I couldn't help thinking: "this is a really interesting article, and the way it manages to classify and better understand idioms is really good. But will it be useful for my essay in particular?"

It took me a while to reach my conclusion, but in the end I realized that this is very useful information, even if it may not seem like it at first glance. Being able to understand how humans process idioms is a basic requirement if we ever want to know how to translate these expressions properly. Not only that, but understanding this thought process is also fundamental to achieving fluid artificial translation, since we'd be able to program the machine to follow this same recognition process, so it would translate a phrase with its literal meaning when necessary, and with its idiomatic meaning when the situation requires it.




Cabag, Yen. "23 Common Idioms and Their Surprising Origins." TCK Publishing, https://www.tckpublishing.com/common-idioms-and-their-origins/. Consulted 27 April 2022.

This article gives a small overview of the origins of certain idioms that we still use often to this day. Some of the most surprising are:


Armed to the teeth (overly well equipped): apparently, during the 17th century, pirates were always making sure they never ran out of ammunition and were always prepared. For that, they held a gun in each hand, another one in their pocket and a knife between their teeth.

Break the ice (to make a group feel comfortable, so as to cultivate friendship): in the days of maritime trade, cargo ships would sometimes get stuck in the ice during the winter months. In those cases, the receiving country would send smaller boats to break the ice that held the ship and help it reach the harbour, to the point that this became a show of friendship between the two countries.

Cat got your tongue? (when a person is unable to speak or answer a question): this one has two possible origins. It may refer to the cat-o'-nine-tails, a whip used for flogging in the British Navy, which caused pain so sharp it left the person unable to speak. The other theory comes from Ancient Egypt, where liars and blasphemers supposedly had their tongues cut out and fed to the cats.

Riding shotgun (to ride in the front seat of a vehicle): this one comes from the Wild West, where there were always two people at the front of a carriage: the one who held the reins of the horses, and the one who sat beside them with a shotgun in hand, in order to protect themselves and their passengers from robbers.

Straight from the horse's mouth (getting information directly from the source): During the 1900s, it was common to check the teeth of a horse to determine its age, thus confirming or refuting the information the seller was giving (as a fun fact, this is also the origin of the expression "don't look a gift horse in the mouth")

Turn a blind eye (to ignore something deliberately): during a battle between the English and Danish navies, Admiral Horatio Nelson's superiors were sending him signals from another ship, ordering the Admiral (who was blind in one eye) to stop the attack on the Danes. Nelson, however, raised his telescope to his bad eye, claimed he did not see the signals, and continued the attack, which resulted in a British victory.


Thoughts


Even though I may have to cut this information due to a lack of space, I still think knowing the origins of these expressions is fantastic, not only because of their cultural and historical value (they reflect aspects of how people lived at the time), but also because it's further proof of how language evolves alongside history, and of how both are always intertwined: history shapes the way we use language, and language reflects and preserves parts of our history that we could otherwise forget.


On Machine Translation



Carl, Michael. "METIS-II: The German to English MT System." Proceedings of Machine Translation, 2007, pp. 35-47.

This informative article, published in 2007 by Michael Carl, gives an account of one of the programs designed for machine translation. It divides the program's functioning into four simple steps and discusses the inner workings of the system itself.

Architecture of the system: it "uses rule-based devices to generate sets of partial translation hypotheses and a statistical Ranker to evaluate and retrieve the best hypotheses in their context."

The device generates an AND/OR graph showing many different translations, and the Ranker is an algorithm that tries to find logical paths through this graph.

Steps of analysis:

Analyser: analyses the Source Language (SL) sentence and produces a flat grammatical analysis, identifying possible subjects, clauses and phrases.

It classifies every aspect of the phrase: prepositions, finite verbs, articles, subject (or not subject), and in general applies every possible or logical tag to the parts of a sentence.

Dictionary Lookup: looks for matches and equivalents between the sentence in the Source Language and the TL (target language).

After the sentence has been analyzed, the annotated words become the input for the lookup. It first gathers the words into different groups, and then looks for the best translations for each group.

However, the system still has trouble translating discontinuous phrases (e.g. a 5-word phrase spread across a 15-word sentence).

The system may also run into overgeneration of translation hypotheses, so the Ranker has to filter them and choose the best possible options. Sources of difficulty include semantic ambiguity and confusion in the use of prepositions and verbs (when one word can have different translations depending on the context, such as "stark" in German meaning "strong", or "best", or "heavy", etc.).

Expander: "inserts, deletes, moves and permutes items or chunks according to TL syntax." It can also produce alternative partial translations.

Ranker: computes the n most likely translations. It evaluates each proposed phrase against a set of predetermined rules, and only the highest-ranking ones are shown as output.

Specific examples of this process can be found in the article.

It is also worth noting that METIS-II is not the name of this MT system. Rather, it is the name of the project that encompasses all of these MT efforts.
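To make sure I understood the four steps, I tried to shrink them into a tiny Python sketch. Everything here (the dummy tags, the dictionary entries, the "preferred" set standing in for the statistical Ranker) is invented for illustration; the real system is far more sophisticated:

    from itertools import product

    DICTIONARY = {                     # German lemma -> English candidates
        "der": ["the"],
        "kaffee": ["coffee"],
        "ist": ["is"],
        "stark": ["strong", "heavy", "best"],   # the ambiguous word from above
    }

    def analyse(sentence):
        """Step 1: flat analysis; here just lowercased tokens with a dummy tag."""
        return [(tok, "TOK") for tok in sentence.lower().split()]

    def lookup(analysis):
        """Step 2: attach every candidate translation to each token."""
        return [DICTIONARY.get(tok, [tok]) for tok, _tag in analysis]

    def expand(candidates):
        """Step 3: generate full translation hypotheses (all combinations)."""
        return [" ".join(combo) for combo in product(*candidates)]

    def rank(hypotheses):
        """Step 4: order hypotheses; a real Ranker uses TL corpus statistics."""
        preferred = {"the coffee is strong"}     # stand-in for a language model
        return sorted(hypotheses, key=lambda h: h not in preferred)

    print(rank(expand(lookup(analyse("Der Kaffee ist stark"))))[0])
    # -> "the coffee is strong"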


Thoughts


Concerning the article itself, I have to say it was in part very useful, but in part also very confusing. Very useful because it clarified what METIS-II actually is (before this article, I thought METIS was an actual machine and not the name given to the entire machine translation project), and because it presented a very solid method in clear and concise steps. The problem, however, is that whenever we talk about programming a machine to do translation work, we enter the domain of computational linguistics, a field I know nothing about, which made it difficult to understand the examples and the process behind the translation itself. I understand small glimpses and can infer some others, but the general functioning is still out of my grasp. This article will nevertheless complement fairly well the future articles I'll read about the history of METIS-II, and it will be a good point of comparison to see how far machine translation has come since 2007, and which aspects of it can still be improved even now.




Dirix, Peter, et al. "METIS-II: Example-Based Machine Translation Using Monolingual Corpora - System Description." Centre for Computational Linguistics, Leuven University, 2005.

This article, written in 2005, gives a general overview of the METIS-II project and its objective of creating an accessible and simple machine translation system. The final conclusion shows promising results: even though certain irregularities still appear, the correct translation was achieved in almost 60% of attempts. Keeping in mind that this article is almost 20 years old, we can only imagine how much the field has improved since.

METIS-II: example-based machine translation, originally meant to be a cheap translation system (built with basic resources) that makes no use of parallel corpora (that is, large collections of texts paired with their translations). It is also based on the assumption that it will not have an extensive rule set preprogrammed into it.

It would also use a Translation Memory to significantly improve future translations, since it would record previously translated phrases as an "extra bilingual set of preferred translations", improving its handling of light verbs and of the different uses of prepositions.

Hybrid approach: combination of statistics, and linguistic rules.

Since the very point of METIS-II is to be accessible and "cheap", the only things the system needs, besides a bilingual dictionary and data from the target language, are:

  1. a tokeniser

  2. a part-of-speech tagger (I'm guessing something similar to the analyser of the previous example)

  3. a chunker (equivalent of expander?)

  4. a morphological generator

"METIS-II targets the construction of free text translations making use of pattern-matching techniques and target-language retrieval from a large monolingual TL corpus."


The METIS-II system has two particular tasks: finding the right translation for words, and finding the appropriate context.

The sentence is divided and "chunked", and equivalents are then searched for in the TL.


They translate expressions as one single chunk during the SL analysis.


General concepts:

Universal data format: an equal representation for all data that goes through the system. It must be able to represent all information needed by the processing modules, and to handle ambiguities on different levels (words, possible translations, tags, etc.).

Dictionary format: at least four columns, lemma and part-of-speech for both the source and target language, though more can be added.

Weights: an evaluation of all possibilities when faced with tag ambiguity (a chunk can be classified into different groups depending on the context: NP, DO, etc.), used to choose the most reliable translation.

Mapping rules: perform changes between the SL's and the TL's tokens and strings.
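A toy Python illustration of how I picture the dictionary format and the weights working together; the entries, tags and weight values are all invented:

    DICTIONARY = [
        # (SL lemma, SL POS, TL lemma, TL POS, weight)
        ("stark", "ADJ", "strong", "ADJ", 0.7),
        ("stark", "ADJ", "heavy",  "ADJ", 0.2),
        ("stark", "ADV", "really", "ADV", 0.1),
    ]

    def best_translation(sl_lemma, sl_pos):
        """Filter entries by lemma and tag, then pick the highest weight."""
        candidates = [e for e in DICTIONARY if e[0] == sl_lemma and e[1] == sl_pos]
        if not candidates:
            return None
        return max(candidates, key=lambda e: e[4])[2]

    print(best_translation("stark", "ADJ"))   # -> "strong"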


Translation flow:

  1. Tokeniser/tagger/lemmatiser: reads the sentence in the SL, separates words from punctuation, adds tags marking words and sentences, recognizes multiword units ("as far as", "time and again", etc.), and recognizes and relates discontinuous parts of tokens (separated verb particles, e.g. "maakt ... open", from the Dutch verb "openmaken", "to open", which can have the direct object in between its two parts)

  2. Chunker: all the separate parts are identified and will be searched for in the TL corpus, either by making use of grammatical units (NPs, clauses, etc.) or of statistical units

  3. SL-to-TL mapping: travelling between the original information in the SL and all the possibilities and ambiguities (in tags, definitions, clauses, etc.) in the TL

  4. TL generation, through a preprocessing of the corpus, a search engine and postprocessing


Thoughts


Even though the article repeated information I had already read (since the methodology is really similar to the one presented in Carl's text), there is a really useful piece of information here: the METIS-II project originally translated expressions and idioms as one single unit. This may not seem like much at first glance, but it actually complements the information I found in Cacciari's articles, and shows that there is much room for improvement in the translation of idiomatic expressions. After all, if they are translating idioms as a single unit, that means they have to enter manually every possible expression they can think of (and its equivalents) in order for the machine to execute the translation. And even then, it can't account for idioms' flexibility or productivity during a conversation.

This article therefore gives me a sense of direction for my essay, which I can now steer towards a (somewhat basic) proposal for an effective MT system for idioms.




"Neural Machine Translation Tutorial - An Introduction to Neural Machine Translation" Youtube, uploaded by Fullstack Academy, August 8th 2017, https://www.youtube.com/watch?v=B8g-PNT2W2Q

This recording of a lecture given at Fullstack Academy of Code showcases in a very clear and simple way the basics of neural machine translation, first explaining that the method can be applied either at document level or at sentence level. With this method, the computer essentially builds a neural network to teach itself how to translate. This network makes it possible to translate entire chunks of information, entire sentences, without having to break them into parts and analyze them separately (contrary to what we have seen previously with statistical machine translation).

An example presented in the video shows the use of zero-shot translation, defined as "translating phrases for language pairs where no existing training or mapping exists". This is easier to understand with an example:


The program is first trained by showing it the translation of phrases from English to Korean, English to Japanese, and Japanese to English. With the input it has received, the program is then capable of performing the remaining translation directions (such as Korean to Japanese) by itself, with no need for further human input.

This is a big contrast with, and a vast improvement over, the previous translation methods, specifically statistical translation, because it can produce these translations much faster. With the statistical method, every time a phrase had to be analyzed, the program first had to decompose it, evaluate and tag its elements, search for their equivalents in the dictionary, and reorganize the information according to the TL's grammatical and structural rules before presenting a final translation. Neural translation, on the other hand, translates entire chunks of information, grasping the general sense of the phrase rather than its individual components, while at the same time building a network and a database that will be useful for future translations.
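From what I understand of the video, the trick that makes zero-shot translation possible is an artificial token prepended to the input, telling one shared model which language to produce. A minimal Python sketch of that data preparation (the token format and the training pairs are assumptions on my part):

    # One shared model is trained on several directions at once; the "<2xx>"
    # token tells it which target language to emit. Pairs are illustrative.
    TRAINING_PAIRS = [
        ("<2ko> How are you?", "잘 지내요?"),      # English -> Korean
        ("<2ja> How are you?", "お元気ですか?"),    # English -> Japanese
        ("<2en> お元気ですか?", "How are you?"),    # Japanese -> English
    ]

    def prepare(source_text, target_lang):
        """Prepend the target-language token before feeding the model."""
        return f"<2{target_lang}> {source_text}"

    # After training on the pairs above, the same model can be asked for a
    # direction it never saw explicitly, e.g. Korean -> Japanese (zero-shot):
    print(prepare("잘 지내요?", "ja"))   # -> "<2ja> 잘 지내요?"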


Thoughts


This video was honestly a relief, and a much-appreciated change from the heavy, hard-to-understand vocabulary of academic writing. Not only that, but it also presented an entirely new aspect of machine translation that I hadn't seen up to now, helping me better see the difference between the different types of translation and get a better overview of what I am trying to accomplish. With the newfound possibility of the computer building databases by itself, relying on human input only for correction and further improvement, faithfully translating idiomatic expressions seems much closer. In fact, this video gave me an idea of how my proposal could approach the translation of these expressions (always keeping in mind that I don't have enough programming knowledge to know how feasible it would be):

Since this new technology is able to translate entire chunks (generally) faithfully, I wonder how feasible it would be for the translator to analyze the context in which the phrase is said, evaluating (by observing the vocabulary used around it) how likely it is that an expression such as "kicking the bucket" is meant literally. If the computer decides it is being used in its idiomatic sense instead, it could then search for an equivalent in the target language, using a database that it creates and keeps expanding every time a new translation is required.
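To make the idea concrete, here is a toy Python sketch of it. The cue-word lists, the decision rule and the equivalents table are all invented; a real system would rely on learned context representations rather than hand-written word lists:

    # Decide from surrounding vocabulary whether "kick the bucket" is meant
    # literally or idiomatically, then swap in a target-language equivalent.
    LITERAL_CUES = {"mop", "water", "pail", "floor", "foot", "tripped"}
    IDIOMATIC_CUES = {"grandfather", "funeral", "old", "hospital", "sick"}

    EQUIVALENTS = {"kick the bucket": {"fr": "casser sa pipe"}}

    def reading(sentence):
        """Crude context check: count cue words for each interpretation."""
        words = set(sentence.lower().split())
        literal = len(words & LITERAL_CUES)
        idiomatic = len(words & IDIOMATIC_CUES)
        return "idiomatic" if idiomatic > literal else "literal"

    def substitute_idiom(sentence, idiom, lang):
        """Swap in the TL idiom before the rest of the sentence is translated."""
        if idiom in sentence and reading(sentence) == "idiomatic":
            return sentence.replace(idiom, EQUIVALENTS[idiom][lang])
        return sentence     # keep the literal wording for the literal reading

    print(substitute_idiom("my old grandfather will kick the bucket soon",
                           "kick the bucket", "fr"))
    # -> "my old grandfather will casser sa pipe soon"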




Mandal, Some. "Evolution of Machine Translation." Towards Data Science, 4 June 2019, https://towardsdatascience.com/evolution-of-machine-translation-5524f1c88b25.

An informative article published on the website Towards Data Science, it gives a small overview of the evolution of machine translation, starting from its inception in 1949.

1949: Warren Weaver presents the first proposal for machine-based translations.

There followed a period of research and improvement extending all the way to the 1990s, but a very limited one: most machine translation systems were limited to a single pair of languages and used rule-based engines.

1990: Peter Brown proposes for the first time the use of statistics in machine translation. He presented the concept of fixed elements in a phrase: a sentence is translated by dividing it into words or smaller chunks, translating each of them with the use of a glossary, and rearranging the translations to obtain a "target sentence" (this approach is usually summarized by the formula given after this timeline).

1993: Brown also proposed an algorithm that took into account the different possible translations, trying to determine the most reliable and probable follow-ups to a word.

For the next 20 years, research focused on improving this statistical method: adding elements that allowed chunks to be translated based on their relative alignment rather than their absolute one, implementing syntax-based models that took word order and case marking into account, and even suggesting the use of hierarchical phrases (phrases that contain sub-phrases).

2013: Nal Kalchbrenner and Phil Blunsom propose an encoder-decoder structure for machine translation, based on continuous representations of words and phrases rather than rigid phrase units or alignments.

2014: proposal of deep neural networks with an encoder that converted the input into a vector of fixed dimensions and a decoder that generated a target sequence from that vector, based on Recurrent Neural Networks (RNNs).

In the same year, Dzmitry Bahdanau proposed a model that automatically soft-searches the parts of a phrase that help predict a target word, without having to form these parts as explicit fragments.
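As a side note for my own understanding: Brown's statistical approach (the 1990 entry above) is usually summarized in standard SMT textbooks, rather than in this article, by the "noisy channel" formula:

    \hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\, P(f \mid e)

where f is the source ("foreign") sentence, e a candidate target sentence, P(e) a language model that judges fluency, and P(f | e) a translation model that judges faithfulness; the system outputs the e that maximizes the product.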


Thoughts


Even though it is simple and relatively short, this article confirms and expands on the information I already had concerning the history of machine translation: how the first systems were rule-based and really rudimentary, how they evolved into statistical ones, and how the most recent ones try to imitate the neural networks formed by our brain. The information is not new to me, but it certainly makes the timeline a lot clearer. It will also help me present this evolution more clearly in my essay, and give my readers a better sense of how the field has changed throughout the years.

Since it was a short read, and not an academic article, it was way easier to follow than my previous sources. Even so, it still used a lot of technical terms and programming concepts that were new to me, so I still fail to understand the specific functioning of the machines. Until I take a class in computational linguistics, my knowledge of the field will probably remain superficial.




Farkas, Anna, and Renata Nemeth. "How to Measure Gender Bias in Machine Translation: Real World-Oriented Machine Translations, Multiple Reference Points." Social Sciences & Humanities Open, 2022.

Written and published in Hungary, this article evaluates hidden biases in AI translation, in this case concerning gender. The authors compared sentences in Hungarian, a language with gender-neutral pronouns, with their English translations, looking for biased pronoun choices for certain professions.


One aspect they really emphasize in this article is that machine translation (MT for short) doesn't generate these biases by itself. Indeed, all the data it uses is first compiled, organized and programmed by human beings, which makes inherited biases very likely. The analysis was made on several translation applications, most notably Google Translate, measuring bias against U.S. statistics on the number of men and women in different occupations.

Results: with Hungarian census data, 36% of occupations were mistranslated, and 76% of those mistranslations were disadvantageous to women. The U.S. statistics showed a similar pattern, with 44% mistranslated, and in 71% of those cases a female pronoun should have been used.
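To make sure I understood the methodology, I sketched it in Python. The occupation statistics are invented, mt_translate is a hypothetical stand-in for a real translation service, and the Hungarian sentences are my own attempt, so treat all of it as an assumption:

    OCCUPATIONS = {        # occupation -> share of women (invented numbers)
        "nurse": 0.88,
        "engineer": 0.15,
    }

    def mt_translate(hungarian_sentence):
        """Stand-in for a real MT call; returns canned English translations."""
        canned = {"ő egy nővér": "she is a nurse",
                  "ő egy mérnök": "he is an engineer"}
        return canned[hungarian_sentence]

    def pronoun_matches_statistics(hu_sentence, occupation):
        """Does the MT's pronoun choice match the real-world majority?"""
        english = mt_translate(hu_sentence)
        predicted_female = english.startswith("she")
        majority_female = OCCUPATIONS[occupation] > 0.5
        return predicted_female == majority_female

    print(pronoun_matches_statistics("ő egy nővér", "nurse"))      # True
    print(pronoun_matches_statistics("ő egy mérnök", "engineer"))  # True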


Thoughts


The writing of this article made it slightly difficult to understand, but the overall message was that, as far as machine translation has come, it still has weak points that prevent a truly fluid and faithful translation from occurring; in this case, being forced to produce a gendered translation from a gender-neutral language, and basing its decision on inherited biases and stereotypes instead of factual statistics.

Even though this article certainly broadens my essay's field of study, since I can now include all the possible weak points of MT and their improvements, I can't say this particular aspect interested me greatly.




Anastasiou, Dimitra. Idiom Treatment Experiments in Machine Translation. Cambridge Scholars Publishing, 2010.

This is an academic book that brings together the most important studies on machine translation and idioms, ranging from the 1950s all the way to the beginning of the 2000s. It covers subjects like the evolution of machine translation, the functioning and translation of idioms, and the commercial translators and most prominent companies in the field.


Compositional vs. non-compositional idioms: compositional ones have their meaning implied in the meanings of the words they use, whereas non-compositional ones have a figurative meaning completely different from the words they use.

"Idioms" is merely a term that's part of a broader group of expressions with figurative meaning, where there are also metaphors, proverbs, collocations, etc. It is also really hard to coin a "good" definition for the word "idiom", since it is normally applied to very fuzzy and different phrases.

Grammatical vs. extragrammatical idioms: those that respect grammatical rules ("kick the bucket", "spill the beans"), and those with an anomalous structure ("by and large", "so far so good", etc.)

Many idioms can be considered "dead metaphors", because nowadays people use them and know what they're supposed to mean without really knowing why (e.g. "break a leg").

Idioms can also be divided into those with only an idiomatic meaning ("be all thumbs") and those that have both a literal and a figurative one ("all dogs have fleas").

Concerning the translation of idioms, there are different levels of equivalence (a small data sketch follows this list):

Semantically/syntactically equal: when the idiom can be translated word for word, since it has a parallel equivalent in the target language.

Semantically unequal, syntactically equal: when the phrase order and word positioning are the same in both idioms, but the words aren't exactly the same ("kick the bucket" vs. "casser sa pipe").

Semantically equal, syntactically unequal: same vocabulary, but different positioning.

Semantically/syntactically unequal: when there is no direct translation from one language to the other.
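As promised above, here is a small data sketch of how a bilingual idiom lexicon might record these levels, with the translation strategy each one allows. The classifications of the examples are my own guesses, not the book's:

    IDIOM_EQUIVALENCE = [
        {"en": "play with fire", "fr": "jouer avec le feu",
         "semantics": "equal", "syntax": "equal"},       # word for word works
        {"en": "kick the bucket", "fr": "casser sa pipe",
         "semantics": "unequal", "syntax": "equal"},     # same shape, other words
        {"en": "it's raining cats and dogs", "fr": "il pleut des cordes",
         "semantics": "unequal", "syntax": "unequal"},   # no direct mapping
    ]

    def translation_strategy(entry):
        """The bigger the mismatch, the less literal the strategy can be."""
        if entry["semantics"] == "equal" and entry["syntax"] == "equal":
            return "translate word for word"
        if entry["syntax"] == "equal":
            return "swap in the target idiom with the same structure"
        return "replace with a non-parallel equivalent or a paraphrase"

    for entry in IDIOM_EQUIVALENCE:
        print(entry["en"], "->", entry["fr"], ":", translation_strategy(entry))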


Thoughts


Rather than a primary source, this book was really useful as a source of other articles, which helped me focus and orient my research from the very beginning, making it an immense well of information. The studies presented are sometimes dated, going back as far as 1952, but the overall ideas are well complemented throughout the chapters.

Since I took the book as a source of sources, rather than a source itself, the information it presents wasn't entirely new to me, since I had already read the articles the book was referring to. However, it did add some aspects, specifically to the division and understanding of idioms, that I hadn't read before and that complemented my initial research well. Unfortunately, I'm not sure how much of this new information I'll be able to add to the essay, since I'm so limited by the word count and by the absolutely crucial information I already have to include.



First Approaches Towards the Final Essay


Preliminary Thesis Statement:

"Machine Translation of idiomatic expressions has always been problematic. By understanding their nature and the way MT has adressed it through time, we can propose an improvement of idiom translation."


Based on all the information I've gathered, I consider that the best option right now is to write a proposal for an integrated machine translation system, focused primarily on the translation of idioms. Rather than an argumentative essay, it will be an explanatory essay, divided into three main parts:

The first part will be a short explanation of the history and evolution of machine translation, from its inception all the way to the present, explaining the different types of MT and the differences between them.

The second part, probably slightly shorter, will talk about the functioning of idioms: their composition and use, as well as some of the reasons why it's so difficult for machines to translate idiomatic expressions faithfully.

And finally, the third part will be a summary of all the information explained above, concluding with the proposal of an improved idiomatic machine translation system.

First Draft Essay Outline


Thesis statement: By understanding the nature of Machine Translation and the way it has addressed the translation of idiomatic expressions through time, as well as by understanding the functioning of said expressions, we can propose an improvement of idiom translation.

Sub-Points:

  1. Evolution of Machine Translation throughout time

    • Rule-based Machine translations

    • Statistical Machine Translations

    • Neural Machine Translations

    • Problems with Idiom translations, and introduction to idioms

  2. Idioms

    • Idiom Recognition Hypotheses

      1. Compositional vs. non-compositional theories

    • Decomposable vs Non-Decomposable idioms

    • Roles of words in these theories

      1. Lexical flexibility

      2. Semantic productivity

      3. Discourse productivity

  3. Proposal





Funny examples of mistranslation

1. "Mama used to say that machines were like a box of chocolates: you never know what you're gonna get"

While doing my research, I found some really funny examples that showcase not only how machine translation can go wrong, but how it can make mistakes even when context is taken into account!

I found one of these examples in a lecture explaining the functioning of neural MT. While talking about how the algorithm analyzes information based on previous data, the presenter gave the example of a Japanese idiom: ichigo ichie (一期一会). This is one of those idioms that a machine would have a really hard time translating, because it needs to be rendered as an entire phrase in English to grasp its meaning. A literal definition would be "for this time only; once in a lifetime", normally understood as treating each moment, and each person you meet, as if this were the only time you will ever live it or see them. So, what happens when we put this idiom into one of the translators?

Well, surprisingly, it translates it as "Forrest Gump"!

While it may feel really arbitrary, it makes sense once you realize this idiom was used as the Japanese title of the movie (which actually fits perfectly with its plot and theme). The problem is that, in most cases, the MT system cannot tell which translation the user is looking for, so it goes with the most popular usage (which here happens to be the movie).

Source: "Neural Machine Translation Tutorial - An Introduction to Neural Machine Translation" Youtube, uploaded by Fullstack Academy, August 8th 2017, https://www.youtube.com/watch?v=B8g-PNT2W2Q



2. "Let's pack more food, just in case where"

We know that our current machine translation models have a really hard time translating any kind of expression, and even more so when its use is restricted to a particular region. The example in the title is obviously an exaggeration (even now, machines can translate "juste au cas où" without doing it word for word), but something similar did happen to me once with an expression we use often in Colombia, and it gave me one of the best laughs I've ever had with a computer.

We have an expression that is used fairly often (at least in the capital): "por si las moscas". Now, a human translator, with a bit of research, would be able to figure out that this expression means, and can be translated as, "just in case". One time, however, the translator wasn't able to recognize the expression and was forced to translate it literally, giving us the beautiful phrase "for if the flies". It was so funny at the time that it actually became a phrase we use in our house, along with other expressions we translate literally just for the fun of it.

Source: Personal experience


Final thoughts


In the hypothetical case that you decided to skip all the way to the conclusion, scared off by the huge blocks of text (and I wouldn't blame you, honestly), I'm leaving here a small summary of my entire project:

Originally, my plan was simply to research the differences between natural and artificial translation, discussing the pros and cons of each. But after reading the first couple of sources, I realized not only that the subject was way too vague, but also that the angle I was taking didn't interest me at all. This generality made the entire subject feel dull and boring. Because of this, I knew I had to steer it towards a subject easier to approach, but one with enough information and variety to appeal to more people than just a bunch of students.

And that's when it hit me: what is something that we always use in our daily lives, but that we normally don't pay attention to? Something so ingrained in our language that it's weird to think about a reality where we don't have it? Something that had always fascinated me, not only because of its origins but also because of how easy it is to make fun of it in other languages?

Idioms.

I had always loved the historical aspect of idioms, as well as making jokes about them in different languages, so what better excuse to learn more about them than this project?

That's how I ended up talking about the place of idioms in machine translation. It was only a matter of time and thought before I found the main thesis for my project: proposing an improvement to the translation of idioms with our current methods.

After this, the research went pretty smoothly. I first learned about the current state of machine translation: I learned a lot about statistical machine translation and its analysis process, before realizing it wasn't the most recent model, and then focused on neural MT and its advantages over the previous model (such as being able to learn from past experience and produce translations of its own, instead of having everything hand-coded). Similarly, I learned a lot about the construction of idioms, the ability they have to affect words and their roles in a normal phrase, and how our brain processes them. In fact, as I'm writing this, I've realized that a good analogy for this process is the "Schrödinger's idiom": our brain analyzes the literal and figurative meanings of a phrase at the same time, making the phrase both an idiom and not an idiom at once, and we won't know which one it is until we find the key word that defines it.


I have to admit the academic documents were hard to understand, not only because of the language they used, but also because I was treating a subject closely related to computational linguistics, and therefore to programming. In practice, this means I often found myself reading algorithms, mathematical equations and code that I didn't understand, forcing me to specify that I could only speak from what I knew (namely, a purely linguistic perspective). Overall, though, I still think the project ended really well. I feel like I understood the core concepts of what I was reading, and I want to believe I explained them well enough to make them easier to understand. I don't think my proposal could have any practical value in the real world, precisely because of my lack of analysis from a computational perspective, but it was still a really good research exercise, one I will surely benefit from in the future.