Reinhard Rapp
How to Build Your Own High Quality Neural Machine Translation System Using Marian NMT
If you are interested in neural machine translation (NMT) and wish to create your own system, this webpage is meant to give you a step-by-step introduction. It is based on a tutorial which took place at the tcworld conference on November 12, 2021 (see the announcement below) and on an article in the tcworld magazine whose online version can be found here: Neural Machine Translation – Explained - tcworld magazine. Both activities were follow-ups of my participation in the Similar Language Translation Shared Task at the 6th Conference on Machine Translation (Nov. 2021); see the reference below.
My recommendation is that you first read the tcworld article to get an overview of the topic. If this motivates you to give it a try, you will find below further details in nine steps, together with the necessary scripts. The description occasionally refers to the language pair German to English, but the scripts are of a more general nature and can easily be adapted to many language pairs.
If you encounter problems, please look at the Marian NMT examples (see step 4 below) or at Adam Geitgey's NMT tutorial. Both documents are very helpful, and much of what you see here is a variation of this previous work. But if you need further help or have comments on this webpage, feel free to send me an e-mail at reinhardrapp (at) gmx (dot) de
Should this website be useful for your own writings, please consider citing one or both of the following underlying publications:
Reinhard Rapp (2021). Similar language translation for Catalan, Portuguese and Spanish using Marian NMT. Proceedings of the 6th Conference on Machine Translation, 292-298. https://aclanthology.org/2021.wmt-1.31/
Reinhard Rapp (2022). Neural machine translation - explained. tcworld magazine February 2022. https://tcworld.info/e-magazine/translation-and-localization/neural-machine-translation-explained-1167/
Step 1: Check if your hardware is suitable
A standard Windows PC or laptop can be used. It should not be an outdated machine, as the following is recommended: at least 16 GB of RAM, a reasonably fast CPU (e.g. Intel i5, i7 or i9; AMD CPUs are also possible), at least 200 GB of hard disk space (conventional or SSD), and an nVidia GPU with at least 8 GB of graphics memory (i.e. memory on the GPU itself). GPUs from other manufacturers (e.g. AMD) are not supported. If you do not have an appropriate GPU in your computer, you can do without, but should expect the training to take several weeks. Fortunately, training is a one-time process per language pair (unless you wish to experiment with different parameters). If the system crashes during training (e.g. due to a power failure), Marian NMT will recover, as it stores the training status at regular intervals.
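If you are unsure about your machine's specifications, a quick check from a Linux terminal might look as follows. This is only a sketch: nvidia-smi is available only once an nVidia driver is installed, so the call is guarded.

```shell
# Quick hardware check (sketch): RAM, disk space, CPU cores, GPU
free -h | grep Mem                    # total RAM (at least 16 GB recommended)
df -h "$HOME"                         # free disk space (at least 200 GB recommended)
nproc                                 # number of CPU cores
if command -v nvidia-smi >/dev/null 2>&1; then
    # list GPU name and graphics memory (at least 8 GB recommended)
    nvidia-smi --query-gpu=name,memory.total --format=csv
else
    echo "No nVidia driver found - training would run on the (much slower) CPU"
fi
```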
Step 2: Install Ubuntu
Assuming you have a Windows PC, you need to install a Linux operating system. We recommend Ubuntu 20.04 (LTS, i.e. providing long-term support), which has been tested extensively with Marian NMT. Installation instructions are given at https://ubuntu.com/download/desktop. In principle it is possible to test Ubuntu without changing anything on your hard disk by installing it on a USB stick. As NMT requires a high-performance system, however, this does not work here. It is no problem, though, to install Ubuntu in addition to Windows. This way you will have a dual-boot system: when starting the PC after installing Ubuntu, a boot manager shows up asking which OS you wish to use.
During installation, Ubuntu asks how much hard disk space you wish to allocate for the system folder and for swap space. We recommend at least 50 GB for the system folder, and for the swap space at least half of your RAM size. A standard installation (rather than a minimal installation) helps to avoid having to install standard Linux tools manually later on. Note that the Windows NTFS file system can be accessed from Ubuntu, but Windows cannot access the Ubuntu file system.
Below you see a screenshot of the Ubuntu desktop where, on the left side, a column was added which gives an overview of the standard tools. The handling of Ubuntu is rather similar to Microsoft Windows or Apple's macOS. An advantage of Ubuntu is that it comes with loads of free software including, for example, LibreOffice, which is a free equivalent of Microsoft Office. So in most cases you won't have to worry about buying software, and there are also fewer issues with malware and privacy protection.
Step 3: Install CUDA
CUDA (Compute Unified Device Architecture) is nVidia‘s software allowing you to use their GPUs for purposes other than the screen graphics they were originally designed for. These other purposes include running neural network software such as Marian NMT. If you don’t have an nVidia GPU, you cannot install CUDA. In this case Marian NMT will still run but, as described above, the training will be very slow. Instructions on how to install CUDA can be found on nVidia’s website at https://developer.nvidia.com/cuda-downloads. The current CUDA version 11.5 should be fine.
Step 4: Install Marian NMT
For installing Marian NMT, its source code has to be downloaded from the internet and then compiled, i.e. executable code has to be generated from the source code. A few Linux commands are necessary to do this. For your convenience, we provide these in a so-called shell script, so that you don't have to invoke many commands but only have to start the script in an Ubuntu terminal window (the equivalent of a DOS box, i.e. an MS-DOS command window, under Windows).
Using the text editor (see screenshot in step 3), copy the script below into a file which you name, for example, "installmarian.sh". The extension "sh" indicates that it is a shell script, i.e. a script that can be run in an Ubuntu command window by entering "./installmarian.sh". Don't forget to type "./" before the name of the script, as this tells Ubuntu that the script is located in the current directory. To save on keystrokes, you can type just the first few characters of the command, e.g. "./inst", and then press the "tab" key on your keyboard. Ubuntu will then complete the command based on the filenames it finds in the current directory. Note, however, that this shortcut does not work if there are other files starting with the same character sequence.
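Should running the script fail with "Permission denied", the file probably lacks execute permission. The following toy example (with a made-up script name) shows how to create a script, make it executable with chmod, and run it:

```shell
# Create a minimal shell script (hypothetical name "hello.sh")
printf '#!/bin/sh\necho "Hello from my script"\n' > hello.sh
# Grant execute permission - without this, ./hello.sh is refused
chmod +x hello.sh
# Run it from the current directory
./hello.sh                            # prints: Hello from my script
```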
The script creates a directory "marian" in your home directory (which is the directory you start in when logging in under Ubuntu) and installs Marian NMT there. Within the directory "marian", it creates a subdirectory "marian-examples" where a number of examples on how to use Marian NMT are provided. You don't need these examples by default, but should you get stuck with our instructions here or wish to go beyond them, you may have a look.
# Download Marian NMT from its github repository and compile it
cd ~
git clone https://github.com/marian-nmt/marian
cd marian
mkdir build
cd build
# Point CUDA_TOOLKIT_ROOT_DIR at your installed CUDA version (see step 3);
# /usr/local/cuda is usually a symlink to it (e.g. to /usr/local/cuda-11.5)
cmake -DCOMPILE_SERVER=on -DCUDA_TOOLKIT_ROOT_DIR=/usr/local/cuda/ ..
make -j4
Step 5: Install tools for pre-processing and evaluation
To pre-process the corpus of human translations to be used for NMT training, we need a number of tools which can be downloaded from the Marian NMT website. For downloading and installing them, we again provide a script below.
Also, within "marian-examples" a number of support tools are installed in a subdirectory "tools". These will be used later, for example, to do corpus pre-processing and to evaluate translation quality.
These are the tools that we need:
1) Tokenizer. It splits sentences into words.
2) Cleaning tool. It discards sentence pairs where either the sentence on the source or the target language side is empty, or where both sentences are of very different lengths. Such cases are likely to be noise in the data.
3) True-casing: Sentence-initial uppercase characters are lowercased if the respective word is more common in its lowercase version. E.g. “He went fast” → “he went fast”
4) Subword segmentation: Words are split into high frequency substrings. E.g. nonsensical might be split into non, sens, and ical. As explained above, this helps NMT systems to deal with rare words not occurring in the training data.
5) BLEU evaluation (Papineni et al., 2002): This is a tool that, by computing string similarities, compares the machine translation of a sentence to its human translation which is assumed to be perfect. This produces values between 0 and 1, with 0 indicating a very bad machine translation which has no similarity to the human translation, and 1 indicating a perfect machine translation which is identical to the human translation. Figure 3 gives some idea how these BLEU scores are computed.
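As a rough intuition for the string-similarity idea behind BLEU, one can count how many words of a machine translation also occur in the human reference. The toy example below (with made-up sentences) deliberately simplifies: real BLEU combines 1- to 4-gram precision with a brevity penalty.

```shell
# Toy illustration only - NOT the real BLEU formula
hyp="he went very fast"               # hypothetical machine translation
ref="he went fast"                    # hypothetical human reference
match=0; total=0
for w in $hyp; do
    total=$((total+1))
    # count hypothesis words that also occur in the reference
    case " $ref " in *" $w "*) match=$((match+1)) ;; esac
done
echo "unigram precision: $match/$total"   # 3 of 4 words occur in the reference
```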
# Download and compile the examples and support tools that come with Marian NMT
cd ~/marian/
git clone https://github.com/marian-nmt/marian-examples.git
cd marian-examples/
cd tools/
make
Step 6: Download and split parallel corpus
For our first system we suggest using, for example, the German/English portion of the Europarl corpus (Koehn, 2005). This corpus consists of translations of speeches held in the European Parliament and covers more than 20 official EU languages. Below you find a script that downloads the corpus and splits it into a large part used for training and two small parts for development and testing. The development set will be used for monitoring progress during training, and the test set is used for evaluating the translation quality of the trained system. If we later want to further improve our system, we can add parallel corpora from other sources: we can either search the internet for "parallel corpus" plus the names of the languages we are interested in, or go to the OPUS website at https://opus.nlpl.eu/ (Tiedemann, 2012). This is the largest website of its kind and provides huge amounts of parallel corpora in many languages, readily prepared for MT processing.
In the script below, please replace "SOURCE" by the international two-letter code of your source language of choice, and "TARGET" by the two-letter code of the target language (e.g. "en" for English and "de" for German). Note, however, that only the combinations provided on the Europarl website at http://www.statmt.org/europarl/v7/ are possible.
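Instead of editing the script by hand, the placeholders can also be filled in with a sed one-liner. The sketch below uses a hypothetical file name "myscript.sh" and substitutes German as source and English as target:

```shell
# Create a one-line stand-in for the download script (hypothetical content)
printf 'wget http://www.statmt.org/europarl/v7/SOURCE-TARGET.tgz\n' > myscript.sh
# Replace the placeholders in place: SOURCE -> de, TARGET -> en
sed -i 's/SOURCE/de/g; s/TARGET/en/g' myscript.sh
cat myscript.sh                       # the URL now refers to de-en.tgz
```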
# replace SOURCE and TARGET by language abbreviations such as en, de, fr etc.
cd data
# get SOURCE-TARGET training data from Europarl website
wget -nc http://www.statmt.org/europarl/v7/SOURCE-TARGET.tgz -O europarl-SOURCE-TARGET.tgz
# extract data from archive
tar -xf europarl-SOURCE-TARGET.tgz
# shuffle files, i.e. bring the lines of the source and target files in random order
# It needs to be the same random order for both files. This is ensured by providing
# pseudo-random bytes (for simplicity we use here source file for this purpose)
shuf --random-source=europarl-v7.SOURCE-TARGET.SOURCE europarl-v7.SOURCE-TARGET.SOURCE > corpus-full.SOURCE
shuf --random-source=europarl-v7.SOURCE-TARGET.SOURCE europarl-v7.SOURCE-TARGET.TARGET > corpus-full.TARGET
# Make data splits for training, development and test sets
head -n -4000 corpus-full.SOURCE > corpus.SOURCE
head -n -4000 corpus-full.TARGET > corpus.TARGET
tail -n 4000 corpus-full.SOURCE > corpus-dev-test.SOURCE
tail -n 4000 corpus-full.TARGET > corpus-dev-test.TARGET
head -n 2000 corpus-dev-test.SOURCE > corpus-dev.SOURCE
head -n 2000 corpus-dev-test.TARGET > corpus-dev.TARGET
tail -n 2000 corpus-dev-test.SOURCE > corpus-test.SOURCE
tail -n 2000 corpus-dev-test.TARGET > corpus-test.TARGET
cd ..
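The crucial point in the script above is that both shuf calls use the same --random-source, so the source and target files are permuted identically. A toy check of this behaviour (with made-up file names) might look like this:

```shell
# Build a tiny fake parallel corpus: line i of toy.src pairs with line i of toy.trg
printf 'a1\na2\na3\na4\n' > toy.src
printf 'b1\nb2\nb3\nb4\n' > toy.trg
# A byte source for shuf that is comfortably long enough
seq 1000 > toy.rnd
# Shuffle both sides with the SAME random source -> identical permutation
shuf --random-source=toy.rnd toy.src > toy.shuf.src
shuf --random-source=toy.rnd toy.trg > toy.shuf.trg
# Each output line should still pair aX with bX
paste toy.shuf.src toy.shuf.trg
```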
Step 7: Apply pre-processing on parallel corpus
For this step, a script is provided which applies the pre-processing tools installed in step 5 to the training, development and test sets generated in step 6. Please again replace "SOURCE" and "TARGET" by the respective two-letter codes.
# replace SOURCE and TARGET by language abbreviations such as en, de, fr etc.
# Suffix of source language files
SRC=SOURCE
# Suffix of target language files
TRG=TARGET
# Number of merge operations for byte pair encoding
bpe_operations=40000
# path to moses decoder: https://github.com/moses-smt/mosesdecoder
mosesdecoder=../tools/moses-scripts
# path to subword segmentation scripts: https://github.com/rsennrich/subword-nmt
subword_nmt=../tools/subword-nmt
# make sure the output directory for the truecasing and BPE models exists
mkdir -p model
# tokenize
for prefix in corpus corpus-dev corpus-test
do
cat data/$prefix.$SRC \
| $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $SRC \
| $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $SRC > data/$prefix.tok.$SRC
cat data/$prefix.$TRG \
| $mosesdecoder/scripts/tokenizer/normalize-punctuation.perl -l $TRG \
| $mosesdecoder/scripts/tokenizer/tokenizer.perl -a -l $TRG > data/$prefix.tok.$TRG
done
# clean empty and long sentences, and sentences with high source-target ratio (training corpus only)
$mosesdecoder/scripts/training/clean-corpus-n.perl data/corpus.tok $SRC $TRG data/corpus.tok.clean 1 80
# train truecaser
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/corpus.tok.clean.$SRC -model model/tc.$SRC
$mosesdecoder/scripts/recaser/train-truecaser.perl -corpus data/corpus.tok.clean.$TRG -model model/tc.$TRG
# apply truecaser (cleaned training corpus)
for prefix in corpus
do
$mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$SRC < data/$prefix.tok.clean.$SRC > data/$prefix.tc.$SRC
$mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$TRG < data/$prefix.tok.clean.$TRG > data/$prefix.tc.$TRG
done
# apply truecaser (dev/test files)
for prefix in corpus-dev corpus-test
do
$mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$SRC < data/$prefix.tok.$SRC > data/$prefix.tc.$SRC
$mosesdecoder/scripts/recaser/truecase.perl -model model/tc.$TRG < data/$prefix.tok.$TRG > data/$prefix.tc.$TRG
done
# train BPE
cat data/corpus.tc.$SRC data/corpus.tc.$TRG | $subword_nmt/learn_bpe.py -s $bpe_operations > model/$SRC$TRG.bpe
# apply BPE
for prefix in corpus corpus-dev corpus-test
do
python3 $subword_nmt/apply_bpe.py -c model/$SRC$TRG.bpe < data/$prefix.tc.$SRC > data/$prefix.bpe.$SRC
python3 $subword_nmt/apply_bpe.py -c model/$SRC$TRG.bpe < data/$prefix.tc.$TRG > data/$prefix.bpe.$TRG
done
Step 8: NMT training
After pre-processing, the corpora are ready to be used for training. Training is conducted by a script that starts Marian NMT in training mode and provides settings for all required parameters. These parameters include, for example, ones that tell Marian NMT which architecture to use (in our case the transformer), or how many GPUs are available for training.
The training script below calls another small script called "validate.sh" which Marian NMT requires to conduct BLEU evaluations from time to time during training. This helps the user to see whether the training is running correctly by checking whether or not the BLEU scores improve as expected. The validate.sh script is located in the "scripts" directory and looks as follows:
cat $1 \
| sed 's/\@\@ //g' \
| ../tools/moses-scripts/scripts/recaser/detruecase.perl 2> /dev/null \
| ../tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l TARGET 2>/dev/null \
| ../tools/moses-scripts/scripts/generic/multi-bleu-detok.perl data/corpus-dev.TARGET \
| sed -r 's/BLEU = ([0-9.]+),.*/\1/'
In this script, the string "TARGET" has to be replaced by the two-letter code of the target language.
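The final sed line of validate.sh simply extracts the numeric score from the summary line printed by the evaluation tool. A small demonstration on a made-up summary line in the usual multi-bleu format (all values are invented):

```shell
# A fabricated example of a multi-bleu summary line (values are made up)
line='BLEU = 38.0, 65.1/43.9/32.1/24.2 (BP=1.000, ratio=1.012, hyp_len=1012, ref_len=1000)'
# Keep only the score: match "BLEU = <number>," and replace the line by <number>
score=$(echo "$line" | sed -r 's/BLEU = ([0-9.]+),.*/\1/')
echo "$score"                         # prints: 38.0
```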
# replace SOURCE and TARGET by language abbreviations such as en, de, fr etc.
MARIAN=../../build
MARIAN_TRAIN=$MARIAN/marian$EXT
MARIAN_DECODER=$MARIAN/marian-decoder$EXT
MARIAN_VOCAB=$MARIAN/marian-vocab$EXT
MARIAN_SCORER=$MARIAN/marian-scorer$EXT
# set chosen gpus
GPUS=0
if [ $# -ne 0 ]
then
GPUS=$@
fi
echo Using GPUs: $GPUS
if [ ! -e $MARIAN_TRAIN ]
then
echo "Marian NMT is not installed in $MARIAN, you need to compile the toolkit first"
exit 1
fi
mkdir -p model
# CAUTION: No comments ("# ...") within Marian parameter list
# CAUTION: No blanks after line separator "\" in Marian parameter list
# train model
if [ ! -e "model/model.npz.best-translation.npz" ]
then
$MARIAN_TRAIN \
--devices $GPUS --sync-sgd --seed 1111 \
--model model/model.npz --type transformer \
--train-sets data/corpus.bpe.SOURCE data/corpus.bpe.TARGET \
--max-length 100 \
--vocabs model/vocab.SOURCETARGET.yml model/vocab.SOURCETARGET.yml \
--mini-batch-fit -w 10000 --maxi-batch 1000 \
--early-stopping 10 --cost-type=ce-mean-words \
--valid-freq 5000 --save-freq 5000 --disp-freq 500 \
--valid-metrics translation ce-mean-words perplexity cross-entropy \
--valid-sets data/corpus-dev.bpe.SOURCE data/corpus-dev.bpe.TARGET \
--valid-script-path "bash ./scripts/validate.sh" \
--valid-translation-output data/valid.bpe.TARGET.output --quiet-translation \
--valid-mini-batch 64 \
--beam-size 6 --normalize 0.6 \
--log model/train.log --valid-log model/valid.log \
--enc-depth 6 --dec-depth 6 \
--transformer-heads 8 \
--transformer-postprocess-emb d \
--transformer-postprocess dan \
--transformer-dropout 0.1 --label-smoothing 0.1 \
--learn-rate 0.0003 --lr-warmup 16000 --lr-decay-inv-sqrt 16000 --lr-report \
--optimizer-params 0.9 0.98 1e-09 --clip-norm 5 \
--tied-embeddings-all \
--exponential-smoothing \
--overwrite --keep-best
fi
Step 9: Translation and evaluation
Once training is completed, the respective script starts marian-decoder, which means that Marian NMT is now used in the mode for translating sentences. The sentences of the test set are translated, and the results are automatically evaluated using the BLEU evaluation tool. Alternatively, any other source language text can be translated. In this case an automatic evaluation will usually not be possible, as it would require a human translation for comparison. Fortunately, this is normally not a major problem: we know the quality of the translations from the test set, and we can assume that similar types of texts will typically be translated with similar quality.
# replace SOURCE and TARGET by language abbreviations such as en, de, fr etc.
MARIAN=../../build
MARIAN_TRAIN=$MARIAN/marian$EXT
MARIAN_DECODER=$MARIAN/marian-decoder$EXT
MARIAN_VOCAB=$MARIAN/marian-vocab$EXT
MARIAN_SCORER=$MARIAN/marian-scorer$EXT
# set chosen gpus
GPUS=0
if [ $# -ne 0 ]
then
GPUS=$@
fi
echo Using GPUs: $GPUS
if [ ! -e $MARIAN_TRAIN ]
then
echo "marian is not installed in $MARIAN, you need to compile the toolkit first"
exit 1
fi
# translate dev set
cat data/corpus-dev.bpe.SOURCE \
| $MARIAN_DECODER -c model/model.npz.best-translation.npz.decoder.yml -d $GPUS -b 12 -n1 \
--mini-batch 64 --maxi-batch 10 --maxi-batch-sort src \
| sed 's/\@\@ //g' \
| ../tools/moses-scripts/scripts/recaser/detruecase.perl \
| ../tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l TARGET \
> data/corpus-dev.SOURCE.output
# translate test set
cat data/corpus-test.bpe.SOURCE \
| $MARIAN_DECODER -c model/model.npz.best-translation.npz.decoder.yml -d $GPUS -b 12 -n1 \
--mini-batch 64 --maxi-batch 10 --maxi-batch-sort src \
| sed 's/\@\@ //g' \
| ../tools/moses-scripts/scripts/recaser/detruecase.perl \
| ../tools/moses-scripts/scripts/tokenizer/detokenizer.perl -l TARGET \
> data/corpus-test.SOURCE.output
# calculate bleu scores on dev and test set
../tools/moses-scripts/scripts/generic/multi-bleu-detok.perl data/corpus-dev.TARGET < data/corpus-dev.SOURCE.output
../tools/moses-scripts/scripts/generic/multi-bleu-detok.perl data/corpus-test.TARGET < data/corpus-test.SOURCE.output
Translation results
To give an impression of the results, below you find a few German sentences which were randomly selected from the test set, together with their English translations as produced by the system. It can be seen that the translations are fairly good, though not perfect. The automatic evaluation gave a BLEU score of 38.0 for German to English. When comparing to commercial systems such as Google Translate you will see that they do even better. However, this comparison is not fair, as our system was trained on a corpus which is tiny in comparison to what the big players are assumed to use. Fortunately, if we want to improve further, nothing prevents us from training our system on much larger corpora as well, as can be downloaded e.g. from the OPUS website at https://opus.nlpl.eu/. Only to make our start easier and the training less time-consuming did we limit ourselves to the Europarl corpus here.
German original sentences and their machine translations as produced by the system
DE: Es ist noch zu früh, die Folgen dieser Krise für die Realwirtschaft zu quantifizieren.
EN: It is still too early to quantify the impact of this crisis on the real economy.
DE: Der von der Regierungskoalition gewählte Name ist sicherlich kein Zufall.
EN: The name chosen by the coalition government is certainly not a coincidence.
DE: In einem Rechtsstaat - und das sind wir doch wohl - kann es vor dem Gesetz keine Ausnahmen geben.
EN: In a state governed by the rule of law - and we are surely - there can be no exceptions before the law.
DE: Die Besteuerung von Personenkraftwagen lässt sich nicht von der allgemeinen Steuerregelung trennen, zu der die Mehrwertsteuer, die Einkommensteuer, die Verbrauchssteuern gehören, die im Übrigen ein Instrument für Haushaltseinnahmen darstellt und als solches der Souveränität der Staaten unterliegen muss.
EN: The taxation of passenger cars cannot be separated from the general tax regime, which includes VAT, income tax and excise duties, which are, moreover, an instrument of budgetary sovereignty and, as such, must be subject to the sovereignty of the Member States.