Documentation


Our GitHub Repository - you can find instructions to install AistrighNLP here


Academic Paper

Below is the first draft of an academic paper that we are working towards publishing in collaboration with researchers.

IRISH INITIAL MUTATIONS IN NEURAL MACHINE TRANSLATION

Below you can find links to all the models, datasets and binarised corpora associated with the project. If you can't find what you're looking for, please feel free to email us at aistrightranslation@gmail.com. All files are hosted on Google Drive, and we assure you they are safe. Some files are compressed into tar archives so that we can host everything on our Drives. If you would like to receive the files another way, please contact us and we will do our best to accommodate you.


Machine Translation

These are our pretrained translation models. An NVIDIA CUDA-compatible GPU is preferred for best performance, but the models also work perfectly on a CPU (provided you have the RAM)! All of these models run on Fairseq. For demonstration purposes, we recommend installing our fork of Fairseq. You'll also need to install subword-nmt.
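Before running a model, it can help to confirm that the tools mentioned above are installed. The sketch below is only a convenience check: it assumes that our Fairseq fork provides the same `fairseq-interactive` command as upstream Fairseq, and that subword-nmt was installed with pip, which provides a `subword-nmt` command. The helper name `missing_commands` is made up for this example.

```shell
# Print the name of every given command that is not found on PATH.
# Returns non-zero if at least one command is missing.
missing_commands() {
  local rc=0 cmd
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
      rc=1
    fi
  done
  return "$rc"
}

# Example (assumed command names, see the note above):
#   missing_commands fairseq-interactive subword-nmt
```

If the example call prints nothing, both tools are available in the current environment.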

  1. Extract the tar file (double-click the file on Mac; see here for Linux).

  2. The resulting folder should contain everything needed to use, test or continue training the models.

  3. To use these models interactively (as you would use Google Translate), open a terminal, go to the extracted folder and run (without the $):

$ ./translate.sh
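On Linux, or from any terminal, the three steps above can be sketched as a few shell helpers. This is a minimal sketch, not part of the packages: the function names are invented for illustration, and the archive name in the usage line is hypothetical, so substitute the file you actually downloaded.

```shell
# Unpack a downloaded model archive into a destination folder
# (the current folder by default).
extract_model() {
  local archive="$1" dest="${2:-.}"
  mkdir -p "$dest" || return 1
  # -x extract, -f read from the given file, -C unpack into $dest
  tar -xf "$archive" -C "$dest"
}

# Print the path of the first translate.sh found under a folder.
find_translate_script() {
  find "$1" -type f -name translate.sh | head -n 1
}

# Start the interactive translator from the folder that holds translate.sh.
run_translator() {
  local script
  script=$(find_translate_script "$1")
  if [ -z "$script" ]; then
    echo "translate.sh not found under $1" >&2
    return 1
  fi
  # The subshell keeps the caller's working directory unchanged.
  (cd "$(dirname "$script")" && ./translate.sh)
}
```

For example, `extract_model model_package.tar models && run_translator models` unpacks the (hypothetically named) archive into `models` and launches the script from inside the extracted folder.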

All of our Demutated Models come with the CPU version of our top neural network for reapplying mutations (for demonstration purposes). If you wish to use a GPU version, download the GPU neural network from the 'Mutation Neural Networks' section and replace the CPU model folder with the GPU model folder (you'll also need to update the name referenced in 'translate.sh').
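Updating the name in 'translate.sh' can be done in any text editor; the sketch below shows one way to do it from the terminal. It assumes, purely for illustration, that the script refers to the bundled network by a folder name such as `nn_100k_cpu` and that the GPU download unpacks to `nn_100k`. Check the actual names in your package before running it. The helper name is also made up for this example.

```shell
# Replace every occurrence of one model folder name with another inside a
# script, keeping a .bak copy of the original. The names are assumed to be
# plain folder names (letters, digits, '_' and '-'), so they are safe to
# place directly in the sed expression.
rename_model_in_script() {
  local script="$1" old="$2" new="$3"
  cp "$script" "$script.bak" || return 1
  # Redirecting into the existing file keeps its executable permission.
  sed "s|$old|$new|g" "$script.bak" > "$script"
}

# Example (assumed folder names, see the note above):
#   rename_model_in_script translate.sh nn_100k_cpu nn_100k
```

The GPU model folder still has to be copied into the package folder by hand; the helper only edits the script and leaves the original as `translate.sh.bak` in case the change needs to be undone.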

Baseline Transformer English-Irish

**Our project report says this model has a 1024 feed-forward network (FF NN). This is a mistake: it has a 2048 FF NN.

base_tran_enga.pt

Demutated Transformer English-Irish (Recommended)

**Our project report says this model has a 1024 feed-forward network (FF NN). This is a mistake: it has a 2048 FF NN.

The package below includes our best neural network for reapplying mutations in translation mode.

dem_tran_enga.pt

NOTE: To use this model to generate translations interactively, install our Fairseq fork:

$ git clone https://github.com/JustCunn/fairseq.git

$ cd fairseq

$ pip install --editable ./

$ # Or, on Mac, use this instead of the previous command:

$ CFLAGS="-stdlib=libc++" pip install --editable ./
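The two install variants above can be folded into one helper that picks the right command for the current machine. This is a sketch under the assumption that `git` and `pip` are already on the PATH; the function names are invented for this example, and `uname -s` is used to detect a Mac (it prints `Darwin` there).

```shell
# Succeeds when the given OS name (default: the output of `uname -s`) is
# macOS, the case where the steps above add CFLAGS="-stdlib=libc++".
is_macos() {
  [ "${1:-$(uname -s)}" = "Darwin" ]
}

# Clone the fork if it is not already present, then install it in
# editable mode using the variant that matches the current OS.
install_fairseq_fork() {
  [ -d fairseq ] || git clone https://github.com/JustCunn/fairseq.git || return 1
  (
    cd fairseq || exit 1
    if is_macos; then
      CFLAGS="-stdlib=libc++" pip install --editable ./
    else
      pip install --editable ./
    fi
  )
}
```

Calling `install_fairseq_fork` from an empty working folder runs the same commands as the manual steps, in the same order.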

Convolutional English-Irish

Demutated Convolutional English-Irish

The package below includes our best neural network for reapplying mutations in translation mode.

dem_cnn_enga.pt

NOTE: To use this model to generate translations interactively, install our Fairseq fork:

$ git clone https://github.com/JustCunn/fairseq.git

$ cd fairseq

$ pip install --editable ./

$ # Or, on Mac, use this instead of the previous command:

$ CFLAGS="-stdlib=libc++" pip install --editable ./

IWSLT (DE-EN) Transformer English-Irish**

**Our project report says this model has a 2048 feed-forward network (FF NN). This is a mistake: it has a 1024 FF NN.

iwslt_tran_enga.pt

Transformer Irish-English

Mutation Neural Networks

These are our pretrained neural networks that remutate 'demutated' Irish text (such as the output of our demutated NMT models). nn_100k.pt achieves 99.5% accuracy and is the model we recommend. If your device isn't CUDA-compatible (for example, if you'll be running this directly on a laptop without an NVIDIA GPU), download the ..._cpu version of the model.
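If you are unsure which version to download, the sketch below suggests one based on whether an NVIDIA GPU is visible. It is a heuristic only: it checks for the `nvidia-smi` driver tool and does not prove that PyTorch can actually use the GPU. The function names are made up for this example; the model names follow the `..._cpu` naming used on this page.

```shell
# Succeeds only if the NVIDIA driver tool is installed and lists a GPU.
has_nvidia_gpu() {
  command -v nvidia-smi >/dev/null 2>&1 &&
    nvidia-smi -L 2>/dev/null | grep -q 'GPU'
}

# Print the model name to download: the base name for "gpu",
# or the base name with the _cpu suffix for anything else.
model_variant() {
  local base="$1" device="$2"
  if [ "$device" = "gpu" ]; then
    printf '%s\n' "$base"
  else
    printf '%s_cpu\n' "$base"
  fi
}

# Suggest a variant of the recommended nn_100k model for this machine.
recommended_model() {
  if has_nvidia_gpu; then
    model_variant nn_100k gpu
  else
    model_variant nn_100k cpu
  fi
}
```

Running `recommended_model` prints either `nn_100k` or `nn_100k_cpu`, matching the entries listed below.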

nn_100k - 99.5% Accuracy (Recommended)

nn_100k

nn_100k_cpu (ONLY IF YOUR DEVICE IS NOT CUDA COMPATIBLE)

dataset

nn_125k - 98.1% Accuracy

nn_125k

nn_125k_cpu (ONLY IF YOUR DEVICE IS NOT CUDA COMPATIBLE)

dataset

nn_40k - 97.6% Accuracy

nn_40k

nn_40k_cpu (ONLY IF YOUR DEVICE IS NOT CUDA COMPATIBLE)

dataset

nn_50k - 97.6% Accuracy

nn_50k

nn_50k_cpu (ONLY IF YOUR DEVICE IS NOT CUDA COMPATIBLE)

dataset

nn_all - 97.5% Accuracy

nn_all

nn_all_cpu (ONLY IF YOUR DEVICE IS NOT CUDA COMPATIBLE)

dataset

Corpora

Full Corpus + DGT Dataset (Broken Down Individually Before Filtering)

Full Corpus + DGT Dataset (As is in our models)

Survey on Linguistic Olympiad Participants

Results (Google Sheet and CSV)

Question Sheet (For more responses)

Gated-Graph Neural Network (MT)

This is the OpenNMT-py model that we used when experimenting with Tree-to-String (EN-GA) translation. Unfortunately, after many hours of debugging, we could not get it to converge correctly, most likely because of the small amount of clean data available for the EN-GA pair. We do not recommend using this model for translation at all!

GGNN Model

Project Book:

aistrigh_book.pdf

Project Diary:

aistrigh_diary.pdf

PowerPoint:

aistrigh_display.pptx

Video: