The project aimed to build a pre-training step, run before translation, that would complement the source-sentence information.
The overall idea was to learn an encoder that projects the embeddings of different languages into a shared semantic space capturing their meaning.
This was achieved using a contrastive learning strategy, without any explicit augmentations of the parallel-sentence dataset (a minimal sketch of such an objective is given below).
This work was accepted at the ML Pre-registration workshop at NeurIPS '22.
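For concreteness, here is a minimal sketch of the kind of contrastive objective described above, assuming each side of a parallel pair has already been encoded into a fixed-size embedding and that the other pairs in the batch act as negatives; the function name and temperature are illustrative, not the exact implementation from the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(src_emb, tgt_emb, temperature=0.07):
    """InfoNCE-style objective over parallel pairs: the i-th source/target
    pair is the positive, and every other sentence in the batch acts as a
    negative, so no explicit augmentations are needed."""
    src = F.normalize(src_emb, dim=-1)    # (B, d) source-language embeddings
    tgt = F.normalize(tgt_emb, dim=-1)    # (B, d) target-language embeddings
    logits = src @ tgt.t() / temperature  # (B, B) scaled cosine similarities
    labels = torch.arange(src.size(0), device=src.device)
    # Symmetric: align source -> target and target -> source
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2
```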
Studied the basics of multi-modal learning, specifically Vision-Language (V-L) learning, by reviewing and compiling notes on several papers focused on Visual Question Answering (VQA) and other V-L tasks.
Implemented a basic VQA model using PyTorch on the synthetic easy-VQA dataset.
Learnt the MMF framework from Meta AI and used it to experiment with several baseline VQA models, including SOTA models like MMBT, for radiological VQA.
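A minimal sketch of what such a basic VQA baseline can look like, assuming a small CNN image encoder, an LSTM question encoder, and elementwise-product fusion; the architecture and names are illustrative rather than the exact model used.

```python
import torch.nn as nn

class SimpleVQA(nn.Module):
    """Minimal VQA baseline: encode the image with a small CNN and the
    question with an LSTM, fuse by elementwise product, then classify
    over a fixed answer vocabulary."""
    def __init__(self, vocab_size, num_answers, hidden=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden),
        )
        self.embed = nn.Embedding(vocab_size, 128)
        self.lstm = nn.LSTM(128, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_answers)

    def forward(self, image, question):
        img = self.cnn(image)                        # (B, hidden)
        _, (h, _) = self.lstm(self.embed(question))  # question: (B, T) token ids
        return self.classifier(img * h[-1])          # fused answer logits
```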
Studied and implemented several pillars of Neural Machine Translation from scratch on the Multi30k dataset:
"Neural Machine Translation by Jointly Learning to Align and Translate"
"Sequence to Sequence Learning with Neural Networks"
"Convolutional Sequence to Sequence Learning"
Implemented state-of-the-art models like the Transformer from scratch and compared their performance to that of the previously mentioned models.
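As an illustration of the mechanism at the heart of the first paper above, here is a minimal sketch of Bahdanau-style additive attention; the class and parameter names are illustrative.

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style additive attention: score every encoder state against
    the current decoder state, then return a weighted context vector."""
    def __init__(self, enc_dim, dec_dim, attn_dim):
        super().__init__()
        self.W_enc = nn.Linear(enc_dim, attn_dim)
        self.W_dec = nn.Linear(dec_dim, attn_dim)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, dec_state, enc_outputs):
        # dec_state: (B, dec_dim); enc_outputs: (B, T, enc_dim)
        energies = self.v(torch.tanh(
            self.W_enc(enc_outputs) + self.W_dec(dec_state).unsqueeze(1)
        )).squeeze(-1)                             # (B, T) alignment scores
        weights = torch.softmax(energies, dim=-1)  # attention distribution
        context = (weights.unsqueeze(-1) * enc_outputs).sum(dim=1)
        return context, weights                    # (B, enc_dim), (B, T)
```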
Implemented several character-level language-generation models such as RNNs, LSTMs, and GRUs using PyTorch for generating dinosaur names.
Extended this to models, trained on a custom-curated Harry Potter book corpus, that could generate whole paragraphs.
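A minimal sketch of such a character-level generator and its sampling loop, assuming an LSTM backbone and dedicated start/end token indices; all names are illustrative.

```python
import torch
import torch.nn as nn

class CharLSTM(nn.Module):
    """Character-level language model: predicts the next character from the
    preceding ones; repeatedly sampling from it yields new names."""
    def __init__(self, n_chars, embed_dim=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(n_chars, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_chars)

    def forward(self, x, state=None):
        out, state = self.lstm(self.embed(x), state)
        return self.head(out), state  # logits over the next character

@torch.no_grad()
def sample(model, start_idx, end_idx, max_len=20):
    """Feed sampled characters back in until the end token (or max_len)."""
    x, state, name = torch.tensor([[start_idx]]), None, []
    for _ in range(max_len):
        logits, state = model(x, state)
        idx = torch.multinomial(logits[0, -1].softmax(-1), 1).item()
        if idx == end_idx:
            break
        name.append(idx)
        x = torch.tensor([[idx]])
    return name  # character indices, to be mapped back to characters
```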
Explored the application of transformers to the visual modality by reviewing several transformer-based vision models.
Implemented "Visual Transformers: Token-based Image Representation and Processing for Computer Vision" from scratch for image classification
Learnt about the basics of deep learning, including models like Dense Neural Nets (DNNs) and Convolutional Neural Networks (CNNs).
Implemented several image classifiers on a personally created dataset to differentiate between memes and notes.
Compared the performance of CNNs and DNNs on this and other datasets like CIFAR-10 and MNIST.
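A minimal sketch of the two model families being compared, sized for CIFAR-10's 3x32x32 inputs; the layer widths are illustrative. The CNN exploits spatial locality and weight sharing, which is what drives the performance gap on image data.

```python
import torch.nn as nn

# Fully connected (dense) classifier: flattens away all spatial structure.
dnn = nn.Sequential(
    nn.Flatten(),
    nn.Linear(3 * 32 * 32, 512), nn.ReLU(),
    nn.Linear(512, 10),
)

# Convolutional classifier: local filters + pooling keep spatial structure.
cnn = nn.Sequential(
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x16x16
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 64x8x8
    nn.Flatten(),
    nn.Linear(64 * 8 * 8, 10),
)
```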