Sequence Correction by Denoising Autoencoder

An introduction to the Denoising Autoencoder

The Vanilla Autoencoder has shown good results at regenerating images from its latent space. An Autoencoder maps an input feature vector into a latent space and then regenerates the original feature vector by minimizing the reconstruction loss. In doing so, it captures the essential structure of the input and discards the noisy parts. A Denoising Autoencoder is similar to the Vanilla Autoencoder; the only differences are in the input data representation and in the training process of the model. The Denoising Autoencoder tries to capture the joint distribution of the inputs. Because English sentences follow a finite set of grammatical rules, an Autoencoder can in principle learn those rules. We therefore use the Denoising Autoencoder to solve the problem of sentence completion.

Problems and Motivation for using Sequence Correction

It is common to come across sentences with missing articles (a, an, the). We try to fill these gaps by training a model on a corpus of complete sentences. To test the model, we feed it a corpus of sentences corrupted by removing the articles, and we want it to fill in the missing articles accurately. The sentence representation is very important here, because the training of the model depends entirely on it.
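The corruption step described above can be sketched as follows. This is a minimal illustration rather than the project's own code: the function name `corrupt` and the literal `<blank>` token are assumptions, the latter borrowed from the notation used in the results.

```python
ARTICLES = {"a", "an", "the"}
BLANK = "<blank>"  # assumed placeholder for a removed article


def corrupt(sentence):
    """Replace every article in a whitespace-tokenized sentence with BLANK."""
    tokens = sentence.split()
    return " ".join(BLANK if tok in ARTICLES else tok for tok in tokens)


print(corrupt("john has the books ."))  # john has <blank> books .
```

Keeping a placeholder in the article's position, instead of deleting the token, preserves the fixed sentence length that the matrix representation relies on.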

Dataset

Using the NLTK Python library and a set of grammar rules, 2108 sentences were generated.
Each sentence has a fixed length of 5. A sentence in this set may be meaningless, but it is grammatically correct.
Number of unique words: 118
Ratio (unique words):(number of sentences) = 118/2108 ≈ 0.0560
The 2108 sentences are divided into two sets: 1686 training sentences and 422 test sentences.
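A split with these sizes could be produced along the following lines. This is only a sketch: the 80/20 fraction is inferred from the counts (1686/2108 ≈ 0.80), the shuffling and the seed are assumptions, and placeholder strings stand in for the generated sentences.

```python
import random


def split_dataset(sentences, train_fraction=0.8, seed=0):
    """Shuffle a copy of the sentences and split it into train and test lists."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train_fraction)
    return shuffled[:n_train], shuffled[n_train:]


# Placeholder items stand in for the 2108 generated sentences.
sentences = [f"sentence {i}" for i in range(2108)]
train, test = split_dataset(sentences)
print(len(train), len(test))  # 1686 422
```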

Matrix Representation of sentences

Each 5-word sentence is represented by 5 one-hot vectors, each of length 118 (the vocabulary size), which together form a 5 × 118 matrix.
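As an illustration of this representation, the sketch below builds a vocabulary and encodes one sentence as a list of one-hot rows. The helper names `build_vocab` and `one_hot_matrix` are hypothetical, and the two-sentence corpus is a toy stand-in, so the rows here have length 8 rather than 118.

```python
def build_vocab(sentences):
    """Map each distinct token to an integer index (sorted for reproducibility)."""
    tokens = sorted({tok for s in sentences for tok in s.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def one_hot_matrix(sentence, vocab):
    """Encode a sentence as a list of one-hot rows, one row per token."""
    matrix = []
    for tok in sentence.split():
        row = [0] * len(vocab)
        row[vocab[tok]] = 1
        matrix.append(row)
    return matrix


corpus = ["john has the books .", "i have the sticks ."]
vocab = build_vocab(corpus)
matrix = one_hot_matrix("john has the books .", vocab)
print(len(matrix), len(matrix[0]))  # 5 8
```

With the full 118-word vocabulary, the same call would yield a 5 × 118 matrix.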

Denoising Autoencoder Architecture

The input to the Autoencoder is a corrupted sentence in matrix form. The Autoencoder tries to fill the holes and produce a "corrected" matrix with the same dimensions as the input. The error function is the Mean Squared Error between the input matrix and the corrected matrix. The error is backpropagated through the Autoencoder in order to update the weights.
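The following NumPy sketch shows the general shape of such a network: a forward pass, a mean-squared-error loss, and backpropagation. It is not the TensorFlow implementation described below. The single hidden layer, its size, the sigmoid activations, the learning rate, the random toy data, and the use of an all-zero row for a removed word are all assumptions. The sketch also compares the output with the clean matrix, which is the usual denoising setup and likewise an assumption here.

```python
import numpy as np

rng = np.random.default_rng(0)

SEQ_LEN, VOCAB = 5, 118   # sentence length and vocabulary size from the text
HIDDEN = 64               # assumed hidden-layer size
LR = 0.1                  # assumed learning rate
N = 20                    # number of toy sentences
D = SEQ_LEN * VOCAB       # flattened input size

# Toy data: random one-hot sentences standing in for the real corpus.
ids = rng.integers(0, VOCAB, size=(N, SEQ_LEN))
clean = np.eye(VOCAB)[ids].reshape(N, D)

# Corrupt by zeroing one word slot per sentence (assumed encoding of a blank).
noisy = clean.reshape(N, SEQ_LEN, VOCAB).copy()
noisy[np.arange(N), rng.integers(0, SEQ_LEN, size=N)] = 0.0
noisy = noisy.reshape(N, D)

W1 = rng.normal(0, 0.1, (D, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.1, (HIDDEN, D))
b2 = np.zeros(D)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x):
    """Encode x into the hidden layer, then decode back to input size."""
    h = sigmoid(x @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    return h, y


def mse(y, t):
    return float(np.mean((y - t) ** 2))


losses = []
for step in range(200):
    h, y = forward(noisy)
    losses.append(mse(y, clean))
    # Gradient of the squared error summed per sentence and averaged over
    # the batch (the MSE gradient up to the constant factor D).
    dy = 2.0 * (y - clean) / N
    dz2 = dy * y * (1.0 - y)          # through the output sigmoid
    dh = dz2 @ W2.T
    dz1 = dh * h * (1.0 - h)          # through the hidden sigmoid
    W2 -= LR * (h.T @ dz2)
    b2 -= LR * dz2.sum(axis=0)
    W1 -= LR * (noisy.T @ dz1)
    b1 -= LR * dz1.sum(axis=0)

print(losses[0], losses[-1])
```

Flattening each 5 × 118 matrix into a 590-element vector keeps the example short; the output can be reshaped back to 5 × 118 before decoding.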

TensorFlow Implementation

Hyper-parameters

Batch size: 10
Training epochs: 10000
Every 1000th iteration, the model generates an output
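These settings suggest a loop of roughly the following shape. This is a hedged sketch only: it assumes that one "iteration" means one pass over the training set, and `step_fn` is a hypothetical callback standing in for the actual weight update.

```python
def minibatches(items, batch_size=10):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def train(train_set, n_iterations=10000, report_every=1000, step_fn=None):
    """Run the loop and return the iterations at which output is reported."""
    reported = []
    for iteration in range(1, n_iterations + 1):
        for batch in minibatches(train_set, 10):
            if step_fn is not None:
                step_fn(batch)  # hypothetical weight-update callback
        if iteration % report_every == 0:
            reported.append(iteration)
    return reported


# 1686 training sentences in batches of 10: 168 full batches plus one of 6.
print(len(list(minibatches(list(range(1686))))))  # 169
```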

Output

The whole sentence is regenerated from the vector representation. The output shows two things:

Which article the Autoencoder has placed
The Autoencoder's predictions for the other words in the sentence
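One plausible way to turn an output matrix back into a sentence is to take the highest-scoring vocabulary entry in each row. The sketch below assumes this argmax decoding; the six-word vocabulary and the scores are made up purely for illustration.

```python
def decode(matrix, index_to_word):
    """Pick the highest-scoring vocabulary entry in each row and join the words."""
    words = []
    for row in matrix:
        best = max(range(len(row)), key=row.__getitem__)
        words.append(index_to_word[best])
    return " ".join(words)


index_to_word = ["a", "books", "has", "john", "the", "."]  # toy vocabulary
output = [
    [0.0, 0.1, 0.1, 0.9, 0.0, 0.0],  # john
    [0.1, 0.0, 0.8, 0.0, 0.1, 0.0],  # has
    [0.6, 0.0, 0.0, 0.0, 0.4, 0.0],  # "a" scores higher than "the"
    [0.0, 0.7, 0.1, 0.0, 0.0, 0.1],  # books
    [0.0, 0.0, 0.0, 0.0, 0.1, 0.9],  # .
]
print(decode(output, index_to_word))  # john has a books .
```

Because every row is decoded, the result shows both the article chosen for the blank and whatever the network predicted for the remaining positions.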

Results

* <blank> marks the position where an article was removed from the original sentence

original sentence :- john has the books .

corrupted sentence :- john has <blank> books .

predicted sentence :- john has a books .

original sentence :- emmy has an angry .

corrupted sentence :- emmy has <blank> angry .

predicted sentence :- emmy has the angry .

original sentence :- i have the sticks .

corrupted sentence :- i have <blank> sticks .

predicted sentence :- i have the sticks .

original sentence :- a water is gray .

corrupted sentence :- <blank> water is gray .

predicted sentence :- the water is gray .

original sentence :- it has a water .

corrupted sentence :- it has <blank> water .

predicted sentence :- it has the water .

original sentence :- i have a sky .

corrupted sentence :- i have <blank> sky .

predicted sentence :- i have the sky .

original sentence :- the boards are black .

corrupted sentence :- <blank> boards are black .

predicted sentence :- the boards are black .

original sentence :- a stamp is black .

corrupted sentence :- <blank> stamp is black .

predicted sentence :- the coin is black .

original sentence :- serena has an acid .

corrupted sentence :- serena has <blank> acid .

predicted sentence :- serena has a acid .

original sentence :- a feather is blue .

corrupted sentence :- <blank> feather is blue .

predicted sentence :- the feather is blue .

Results - Confusion Matrix for Articles

Table 1: Confusion matrix showing the accuracy of the model
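For reference, a confusion matrix of this kind can be tallied from (original article, predicted article) pairs. The sketch below uses only the ten sample sentences listed in the results above. It does not reproduce Table 1, and ten samples are not representative of the full test set.

```python
from collections import Counter

ARTICLES = ("a", "an", "the")

# (original article, predicted article) for the ten sample sentences above.
pairs = [
    ("the", "a"), ("an", "the"), ("the", "the"), ("a", "the"), ("a", "the"),
    ("a", "the"), ("the", "the"), ("a", "the"), ("an", "a"), ("a", "the"),
]

counts = Counter(pairs)

# Rows are the original articles, columns are the predicted ones.
print("actual/pred " + " ".join(f"{p:>4}" for p in ARTICLES))
for actual in ARTICLES:
    row = " ".join(f"{counts[(actual, p)]:>4}" for p in ARTICLES)
    print(f"{actual:>11} {row}")

# Fraction of samples where the predicted article matches the original.
accuracy = sum(counts[(a, a)] for a in ARTICLES) / len(pairs)
print(accuracy)  # 0.2
```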