This blog dates back to my experience in doing the assignment for the Natural Language Understanding, Natural Language Generation, and Machine Translation (NLU+/ INFR11157) course at the University of Edinburgh.
In the assignment, I worked on English - German translation. However, there was a self-discovery question concerning greedy decoding, which we were asked to pick any language(s) to work on to answer it. Naturally, I chose Vietnamese, my other native language, and Japanese, a language I have been learning (and trying to improve through formal learning, rather than absorbing through reading manga, watching anime and NHK dodrama). In this blog, I will mention Vietnamese - English translations only.
🤔🤔🤔I prompted Google Translate in March 2024 for the sentence:
"Để chạy trường cho con, tôi tốn rất nhiều tiền vào năm ngoái."
March 2024 Google Translate's answer:
🙃“To run a school for my child, I spent a lot of money last year.” 🚨🚨🚨
The correct translation (and I just checked that in December 2024 Google got this correctly):
😌"To get my child into a good school, I spent so much money last year.”
The problem was with the translation for "chạy". Greedy decoding (in March 2024) favored “run” instead of “get” because the most common meaning of the Vietnamese word “chạy” is the action of running. Hence, greedy decoding decided that the translated sentence needed a verb after “to” and “run” is the most probable because its very high probability of being the translation of the input token “chạy”. However, to translate the aforementioned Vietnamese sentence correctly, the decoder must look at the larger context—specifically the later part of the sentence, which is “tôi tốn rất nhiều tiền”. In Vietnamese, the word “chạy” also describes the action of spending money to secure a favorable position. Therefore, if the decoder did not look at the later context and continues to generate, the final translation was “To run a school for my child, I spent a lot of money last year”.
As of December 2024, I believe that as Google takes more Vietnamese data, and the word "chạy" with the aforementioned meaning related to money has been used more and more in Vietnamese, there is no doubt that Google Translate can translate the Vietnamese sentence above correctly.
However, there are still Vietnamese sentences that Google Translate cannot translate correctly until now:
"Sao phải xoắn."
Correct translation: why do you have to be afraid
Google Translate's output: why twist
"Máu chảy ruột mềm."
Correct translation: When a family member is in danger, you will feel pain.
Google Translate's output: soft bowel bleeding
These Vietnamese sentences are used frequently in a Vietnamese person's life. The first sentence is slang. The second sentence is an idiom. Basically, concerning these two sentences, Google Translated translated word-by-word.
With the introduction of newer and better Large Language Models trained on all data can be found on the Internet and even synthetic data, will they give better translation than Google Translate - an application built solely for translation?
In this blog, I will prompt GPT-4o through ChatGPT's API to uncover if it can give correct translations for Vietnamese idioms and slang, which are considered to be some of the hardest types of text to be translated correctly (Naveen & Trojovský, 2024).
✨Vietnamese idiomatic translations ✨
With this toy example concerning some common Vietnamese idioms, there are some trends here.
Most English translations were incorrect.
GPT-4o can give different translations based on the time of prompting. This is expected because of the probabilistic nature of LLMs. However, it is noticeable that for different translations based on different prompting times, one translation could be correct while the other was not.
In most cases, GPT-4o produced word-by-word translations, which is not desirable. In an ideal situation, the translations need to have the same meaning as the source text as well as be demonstrated stylistically the same. For example, "Ngưu tầm ngưu, mã tầm mã." should be translated to its equivalent English proverb "Birds of a feather flock together."
GPT-4o seemed to strongly link "phá gia chi tử" with "unchecked spending" only. However, in Vietnamese, this idiom has the overarching meaning of "ruining the family and tarnishing the family's reputation". "unchecked spending" is just an aspect of the meaning of "phá gia chi tử".
There was also a sign of gender bias in the translation, demonstrated through the translation of "Phá gia chi tử." This Vietnamese idiom means that a child in the family, be it a son or a daughter, ruins the family and tarnishes the family's reputation, by squandering the family fortunes, committing a sin, having no ambition for education and work, and/or hanging out with bad people. However, GPT-4o gave one translation, "A son who destroys his family fortunes". This bias might have been caused by gender-imbalanced dataset, where males are represented more in the pre-training and/or fine-tuning dataset. This is not surprising concerning Vietnamese because Vietnam is a heavily patriarchal society.
🧑💻Can prompt engineering better the translation for Vietnamese idioms? 🤔
To get the translations for the 5 aforementioned Vietnamese idioms, I used the prompt template:
"Translate this Vietnamese idiom into English: [insert Vietnamese idiom]"
To be an ideal translation, it must convey the closest meaning to the source text as well as have the same stylistic (for example, an idiom from the source language should be translated into a meaning-equivalent idiomatic expression in the target language).
I expanded the above prompt to give more directions to GPT-4o. I expanded them in two ways.
Method 1: Telling the LLM that I want an equivalent English idiom, proverb, or proverb:
The prompt I used for this method is:
"Translate this Vietnamese idiom in an equivalent English idiom or proverb or simile: [insert Vietnamese idiom."
The results were better:
Máu chảy ruột mềm ---> "torn apart"
The translation was quite close. GPT-4o admitted that there is no English equivalence and it thinks "torn apart" might demonstrate the feeling most closely to the source idiom.
Khẩu Phật tâm xà ---> A honey tongue, a heart of gall
This translation was also quite close since the translation could not fully demonstrate the evil extent of the person alluded to in the Vietnamese idiom.
I also translated some other Vietnamese idioms with this expanded prompt:
Khoẻ như voi ---> As strong as an ox (correct 🙂↕️)
Yếu như sên ---> As weak as a kitten (correct 🙂↕️)
Dĩ hoà vĩ quý ---> Live and let live (correct 🙂↕️)
Một công đôi việc ---> Kill two birds with one stone (correct🙂↕️)
Ngựa quen đường cũ ---> Old habits die hard (correct 🙂↕️)
Method 2: Asking the LLM to translate a short chunk of Vietnamese text where the meaning of the Vietnamese idioms can be guessed or understood by the surrounding context:
The prompt I used for this method is:
"Translate this Vietnamese text into English: [insert the Vietnamese text]."
For example:
(1) Translate this Vietnamese text into English: "Tống Nghi Xuân thét lớn: Tống Mặc, ngươi là đồ phá gia chi tử. Ngươi dám buộc tội ta trước triều đình. Hỗn xược. Cả cái nhà này bị ngươi phá tan. Cậu ngươi, mẹ ngươi, ngươi, người của Nghi đều là đồ quỷ ám. "
---> GPT-4o's translation: "Tống Nghi Xuân shouted loudly: Tống Mặc, you are a prodigal son. You dare to accuse me before the court. Insolent. This whole family has been ruined by you. Your uncle, your mother, you, and everyone related to Nghi are all possessed by demons."
😓 🚨 Again, GPT-4o gave the incorrect translation for the Vietnamese idiom "phá gia chi tử". With this example, it seems that GPT-4o demonstrated a strong favor to link "phá gia chi tử" with "a prodigal son". At this point, I start to wonder if GPT-4o was trained on all instances where "phá gia chi tử" is strongly linked with "a prodigal son."
(2) Translate this Vietnamese text into English: "Ngoài mặt, Vương Ánh Tuyết luôn nói những lời nhẹ nhàng và yêu thương với Đậu Chiêu, đứa con riêng của chồng. Nhưng trong thâm tâm, cô ta hận đứa bé đó đến tận xương tuỷ. Cô ta coi đứa bé đó là trở ngại còn sót lại duy nhất trong việc thống trị nhà họ Đậu. Nhưng cuối cùng, loại người khẩu phật tâm xà như cô ta cũng bị đưa ra ánh sáng. "
---> GPT-4o's translation: "On the surface, Vương Ánh Tuyết always spoke gently and lovingly to Đậu Chiêu, her husband's illegitimate child. But deep down, she hated that child to the bone. She saw the child as the last remaining obstacle to her domination of the Đậu family. However, in the end, someone as deceitful and malicious as her was also brought to light."
☺️ This translation is correct as "deceitful and malicious" fully conveys the meaning of "khẩu phật tâm xà". Moreover, the translation needs to adhere to grammatical rules in English. Hence, even though "A honey tongue, a heart of gall" is equivalent to "khẩu phật tâm xà", using it in the translation, in this case, might not conform to English grammatical rules. GPT-4o here chose to translate by explaining the meaning of the Vietnamese idiom.
✋✋Minor conclusion:
Crafting a more specific prompt to give the LLMs better direction can result in correct or close translations from Vietnamese idioms to English. However, the LLM can still give wrong translations, because the LLM was pre-trained on data demonstrating only one meaning of the Vietnamese idiom.
It is also worth mentioning that there are cases where there is not an equivalent idiom or phrase in English. In these cases, it is up to the translator to decide whether to use the suggested translation given by the LLMs, or to explain the Vietnamese idioms as the translation method. Hence, LLMs can be a powerful tool, experienced translators are still needed, especially in the post-translation stage to provide the final translated text.
Phan Đình Phùng Street, Hanoi - In all of the glory of autumn sunlight
✨ Vietnamese slang translation ✨
GPT-4o gave correct translations to most slangs, even for slangs appearing just recently! When I saw the (5) - (10) translations, I thought GPT-4o is so Gen Z 🤣
I just prompted:
"Translate from Vietnamese to English: [insert Vietnamese slang/insert a sentence containing a Vietnamese slang]"
And for all of the examples, GPT-4o was able to identify all of the slangs correctly without being asked to identify them! However, GPT-4o still gave incorrect translations for two recent slangs, "Jack" and "mận", both appearing in 2022. Simultaneously, GPT-4o could correctly translate a mid-2023 slang (khum) correctly. I suppose that there are plenty of Vietnamese texts with these slangs that were included in GPT-4o's pre-trained training dataset. After all, LLMs are trained on Common Crawl, and these slangs are used heavily on social media platforms by Vietnameses.
😯🤔🧐 At this point, it is a bit weird that GPT-4o could translate Vietnamese slangs better than Vietnamese idioms, even concerning cases where I gave more direction in the prompts to translate the idioms. One explanation I could think of is that there is less digital texts containing Vietnamese idioms than digital texts including Vietnamese slangs, leading to imbalances in GPT-4o's pre-trained training dataset.
Now that I think a bit more deeply, I know Vietnamese idioms through my mum's telling me bedtime stories based on the old folktales. When I was a teacher, it seemed that not many of my students know the folktales I was told as a child. At least to me, this is concerning because folktales convey many valuable and timeless life lessons, which teach a person how to behave in the society.
🌟What I would do next?🌟
This has been a fun topic to write about. If I had time, I would make a dataset of Vietnamese-English translations for more idioms and slangs, prompt GPT-40, and calculate the COMET score of the machine translation output. That way, I would make a more solid claim about GPT-4o's ability to translate through prompting.
Other than that, I would like to dive into prompting for Vietnamese-English translation of domain-specific texts, such as medical text (a topic I have been interested in since I was 9). Translating domain-specific text is definitely a harder task than translating idioms and slangs (I had some gigs translating economic texts, and I found translating them difficult).
Work mentioned:
N. Palanichamy and P. Trojovský, “Overview and challenges of machine translation for contextually appropriate translations,” iScience, vol. 27, no. 10, p. 110878, 2024. doi: 10.1016/j.isci.2024.110878. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S2589004224021035.