Handwritten-text-GANS

Aim

To refine synthetic handwriting data using limited real data, so as to improve the word-level classification accuracy of a model trained on the synthetic data.
To study how various GAN models can be used for handwritten text generation.

Using DCGAN

Link to the tutorial => https://medium.com/@ekss1121/generative-adversarial-networks-b9f80e6d7679

Architecture

Layer (type)                 Output Shape              Param #

=================================================================

input_1 (InputLayer)         (None, 100)               0

_________________________________________________________________

dense_1 (Dense)              (None, 6272)              633472

_________________________________________________________________

leaky_re_lu_1 (LeakyReLU)    (None, 6272)              0

_________________________________________________________________

reshape_1 (Reshape)          (None, 7, 7, 128)         0

_________________________________________________________________

up_sampling2d_1 (UpSampling2 (None, 14, 14, 128)       0

_________________________________________________________________

conv2d_1 (Conv2D)            (None, 14, 14, 64)        204864

_________________________________________________________________

leaky_re_lu_2 (LeakyReLU)    (None, 14, 14, 64)        0

_________________________________________________________________

up_sampling2d_2 (UpSampling2 (None, 28, 28, 64)        0

_________________________________________________________________

conv2d_2 (Conv2D)            (None, 28, 28, 1)         1601

_________________________________________________________________

activation_1 (Activation)    (None, 28, 28, 1)         0

=================================================================

_________________________________________________________________

_________________________________________________________________

Layer (type)                 Output Shape              Param #

=================================================================

input_2 (InputLayer)         (None, 28, 28, 1)         0

_________________________________________________________________

conv2d_3 (Conv2D)            (None, 14, 14, 64)        1664

_________________________________________________________________

leaky_re_lu_3 (LeakyReLU)    (None, 14, 14, 64)        0

_________________________________________________________________

dropout_1 (Dropout)          (None, 14, 14, 64)        0

_________________________________________________________________

conv2d_4 (Conv2D)            (None, 7, 7, 128)         204928

_________________________________________________________________

leaky_re_lu_4 (LeakyReLU)    (None, 7, 7, 128)         0

_________________________________________________________________

dropout_2 (Dropout)          (None, 7, 7, 128)         0

_________________________________________________________________

flatten_1 (Flatten)          (None, 6272)              0

_________________________________________________________________

dense_2 (Dense)              (None, 2)                 12546

=================================================================

_________________________________________________________________
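The parameter counts in the two summaries above can be checked by hand. The sketch below assumes 5x5 convolution kernels; the summary does not state the kernel size, but 5x5 is the square size that reproduces the listed counts. The helper names are illustrative only.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


def conv2d_params(kernel, c_in, c_out):
    """kernel x kernel weights per (input, output) channel pair, plus biases."""
    return kernel * kernel * c_in * c_out + c_out


# Assumption: every Conv2D uses a 5x5 kernel (not stated in the summary).
K = 5

generator = {
    "dense_1": dense_params(100, 6272),      # noise vector -> 7 * 7 * 128
    "conv2d_1": conv2d_params(K, 128, 64),
    "conv2d_2": conv2d_params(K, 64, 1),
}
discriminator = {
    "conv2d_3": conv2d_params(K, 1, 64),
    "conv2d_4": conv2d_params(K, 64, 128),
    "dense_2": dense_params(6272, 2),        # flattened 7 * 7 * 128 -> 2 classes
}

for name, count in {**generator, **discriminator}.items():
    print(f"{name:10s} {count}")
```

The first summary maps a 100-dimensional noise vector to a 28x28x1 image (the generator); the second maps a 28x28x1 image to 2 outputs (the discriminator).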

Results (after 2400 epochs)

Loss | Original Data | Generated Data

Conclusion

Some letters, such as B, e, and U, were easy to generate.
Overall, the results for single-letter generation are not satisfactory: the generated characters are still blurry and some are unrecognizable.
The generator received only random noise; we did not provide any image for it to create a synthetic image from.
We were not able to put any condition on the generated images, such as the font or which character to produce.

Using Pix2Pix

Link => https://arxiv.org/pdf/1611.07004.pdf

Requirements

1. Put conditions on the output; hence, use an input image that is to be converted from one font to another.

2. Convert whole words from one font to another.

Hence we move to the pix2pix model, which uses an encoder-decoder network to take the input image and produce the output in a different font.

Why use GANs (according to the paper)?

Minimizing the Euclidean distance between the prediction and the ground truth produces blurry images, because the distance is minimized by averaging all plausible outputs. We need sharp and realistic images.

Hence we use GANs, whose goal is to "make the output indistinguishable from reality" (blurry images will be considered fake).
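A toy illustration of the averaging argument, assuming just two equally plausible sharp outputs for the same input. The 4-pixel "images" are hypothetical and only serve to show where the blur comes from.

```python
# Two equally plausible, sharp ground-truth outputs for the same input.
stroke_left = [1.0, 1.0, 0.0, 0.0]
stroke_right = [0.0, 0.0, 1.0, 1.0]
targets = [stroke_left, stroke_right]


def expected_sq_error(pred):
    """Mean squared Euclidean distance from pred to all plausible targets."""
    return sum(
        sum((p - t) ** 2 for p, t in zip(pred, target)) for target in targets
    ) / len(targets)


# The pixel-wise mean of the targets minimizes the expected squared error,
# even though it looks like neither of them.
mean_pred = [sum(px) / len(targets) for px in zip(*targets)]

print(mean_pred)                        # every pixel is a grey 0.5
print(expected_sq_error(mean_pred))     # lower than for either sharp image
print(expected_sq_error(stroke_left))
```

A discriminator, by contrast, can reject the grey average because it resembles no real sample.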

Architecture

Generator Discriminator

(0): Conv2d(6, 64, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1))

    (1): LeakyReLU(0.2, inplace)

    (2): Conv2d(64, 128, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)

    (3): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True)

    (4): LeakyReLU(0.2, inplace)

    (5): Conv2d(128, 256, kernel_size=(4, 4), stride=(2, 2), padding=(1, 1), bias=False)

    (6): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True)

    (7): LeakyReLU(0.2, inplace)

    (8): Conv2d(256, 512, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1), bias=False)

    (9): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True)

    (10): LeakyReLU(0.2, inplace)

    (11): Conv2d(512, 1, kernel_size=(4, 4), stride=(1, 1), padding=(1, 1))

    (12): Sigmoid() ))

Results (64 epochs)

Input | Output | Expected | Input | Output | Expected

Conclusion

The model learns a good representation of the desired font.
It makes spelling errors: the output is not always consistent with the input image.
The images produced are sharp, with no blurring.

About the architecture

The generator uses a U-Net (which, like ResNet, has skip connections) for efficient learning, instead of a simple encoder-decoder network.
The U-Net is very deep.
The paper proposes using a PatchGAN (where the discriminator checks NxN patches to predict whether the image is real or fake), but currently we use the basic architecture.
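To make the NxN patch idea concrete, the sketch below applies standard convolution arithmetic to the (kernel, stride, padding) values of the five Conv2d layers in the discriminator listing above. The 256x256 input size is an assumption; the image size is not stated here.

```python
# (kernel, stride, padding) of the five Conv2d layers in the listing above.
LAYERS = [(4, 2, 1), (4, 2, 1), (4, 2, 1), (4, 1, 1), (4, 1, 1)]


def output_size(size, layers):
    """Spatial size after applying each conv layer in order."""
    for kernel, stride, pad in layers:
        size = (size + 2 * pad - kernel) // stride + 1
    return size


def receptive_field(layers):
    """Width of the input patch seen by a single output unit."""
    rf = 1
    for kernel, stride, _ in reversed(layers):
        rf = (rf - 1) * stride + kernel
    return rf


# Assumption: 256x256 inputs (the image size is not given in this document).
print(output_size(256, LAYERS))   # side length of the real/fake score map
print(receptive_field(LAYERS))    # N for the NxN patch behind each score
```

Under these assumptions the stack yields a 30x30 map of scores, each covering a 70x70 input patch. The 6 input channels of the first layer are consistent with an input image and an output image (3 channels each) being concatenated before they are scored.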

Using CycleGAN

Link => https://arxiv.org/abs/1703.10593

Why use CycleGAN?

While using pix2pix we encountered many spelling errors.
CycleGAN does not require paired images, unlike pix2pix.

What is CycleGAN?

1. CycleGAN is a modification of the pix2pix architecture; the generator and the discriminator use the same architecture as before.
2. Instead of training one GAN, we train two. The first generator, G, learns the mapping X -> Y, and the second, F, learns the inverse mapping Y -> X. We want F(G(x)) ≈ x (and likewise G(F(y)) ≈ y), which means the model is cycle-consistent. Hence we have 2 GANs and a new loss function.

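Cycle consistency loss: $L_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[\lVert F(G(x)) - x \rVert_1\right] + \mathbb{E}_{y \sim p_{data}(y)}\left[\lVert G(F(y)) - y \rVert_1\right]$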

3. Unlike pix2pix, which requires paired images, CycleGAN trains on unpaired images from the two sets (words in different fonts).
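A minimal sketch of the cycle consistency loss, with G mapping X -> Y and F mapping Y -> X as in the formula above. The generators here are hypothetical arithmetic stand-ins (the real ones are convolutional networks), and the loss uses the mean absolute difference, which is the L1 norm up to a constant factor.

```python
def l1(a, b):
    """Mean absolute difference between two equal-length vectors."""
    return sum(abs(u - v) for u, v in zip(a, b)) / len(a)


def cycle_loss(G, F, xs, ys):
    """Average |F(G(x)) - x| over xs plus average |G(F(y)) - y| over ys."""
    forward = sum(l1(F(G(x)), x) for x in xs) / len(xs)
    backward = sum(l1(G(F(y)), y) for y in ys) / len(ys)
    return forward + backward


# Hypothetical stand-ins for the two generators.
def G(x):            # X -> Y
    return [2 * v + 1 for v in x]


def F_exact(y):      # Y -> X, exact inverse of G
    return [(v - 1) / 2 for v in y]


def F_sloppy(y):     # Y -> X, only roughly inverts G
    return [v / 2 for v in y]


xs = [[0.0, 1.0], [2.0, 3.0]]   # unpaired samples from domain X
ys = [[1.0, 5.0], [3.0, 7.0]]   # unpaired samples from domain Y

print(cycle_loss(G, F_exact, xs, ys))    # zero: the pair is cycle-consistent
print(cycle_loss(G, F_sloppy, xs, ys))   # positive: the mismatch is penalized
```

Note that xs and ys are never matched element by element, which is why unpaired data is sufficient for this term.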

Network Model

Results

Real Font A | Fake Font A | Real Font B | Fake Font B

Conclusion

The model learns a good representation of the desired font.
Spelling errors are reduced, and the output is consistent with the input image.
Capital letters are not learned correctly even with increased training time, most likely because the dataset contains few capital letters.
We get the font conversion and also its inverse.

About the architecture

Very similar to pix2pix.
The loss now has 3 parts, 2 for the GANs and 1 for cycle consistency: $L(G, F, D_X, D_Y) = L_{GAN}(G, D_Y, X, Y) + L_{GAN}(F, D_X, Y, X) + \lambda L_{cyc}(G, F)$
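The three-part loss can be sketched as follows. The log-likelihood form of L_GAN and the default lam=10 follow the CycleGAN paper; the value of lambda used in these experiments is not stated. All networks are hypothetical scalar stand-ins.

```python
import math


def gan_loss(gen, disc, sources, targets):
    """E[log disc(target)] + E[log(1 - disc(gen(source)))]."""
    real = sum(math.log(disc(t)) for t in targets) / len(targets)
    fake = sum(math.log(1.0 - disc(gen(s))) for s in sources) / len(sources)
    return real + fake


def cycle_loss(G, F, xs, ys):
    """L_cyc(G, F) with an absolute-difference penalty on scalars."""
    forward = sum(abs(F(G(x)) - x) for x in xs) / len(xs)
    backward = sum(abs(G(F(y)) - y) for y in ys) / len(ys)
    return forward + backward


def full_objective(G, F, D_X, D_Y, xs, ys, lam=10.0):
    """L_GAN(G, D_Y, X, Y) + L_GAN(F, D_X, Y, X) + lam * L_cyc(G, F)."""
    return (
        gan_loss(G, D_Y, xs, ys)
        + gan_loss(F, D_X, ys, xs)
        + lam * cycle_loss(G, F, xs, ys)
    )


# Hypothetical scalar stand-ins; the real networks are CNNs.
def G(x):        # X -> Y
    return x + 1.0


def F(y):        # Y -> X
    return y - 1.0


def D_X(x):      # an undecided discriminator: always outputs 0.5
    return 0.5


def D_Y(y):
    return 0.5


xs, ys = [0.0, 1.0], [1.0, 2.0]
print(full_objective(G, F, D_X, D_Y, xs, ys))
```

In training, the discriminators try to increase the two GAN terms while the generators try to decrease the whole expression.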

TODO:

simGAN

New Results (40 epochs, 2 days)

More results can be found at this link -> https://web.iiit.ac.in/~shubh.maheshwari/test_12/index.html

We ran the same CycleGAN experiment on a much larger dataset.

1. We used a training set of 6000 samples and a validation set of 500.
2. The results were very similar to the previous experiments: we got good results for words in lowercase letters.
3. However, the model could not generate capital letters. I increased the number of capital letters in the dataset, but instead of learning them, the model either left them out or produced straight lines, spelling errors, etc.

One possible reason, as noted in the blog post linked below:

We also think that this model is not good fit to change the shape of object. We tried to run the model for converting a men's face to a look alike women's face. For that we used celebA dataset but the results are not good and images produced are quite distorted.

https://hardikbansal.github.io/CycleGANBlog/

4. The training and test datasets were disjoint, and no words were repeated.
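One way to obtain such disjoint splits with no repeated words is sketched below. The vocabulary and the function name are hypothetical; only the 6000 / 500 sizes come from the experiment described above.

```python
import random


def split_unique(words, n_train, n_val, seed=0):
    """Deduplicate words, shuffle, and return disjoint train/validation lists."""
    unique = sorted(set(words))          # sorted so the shuffle is reproducible
    if len(unique) < n_train + n_val:
        raise ValueError("not enough unique words for the requested split")
    random.Random(seed).shuffle(unique)
    return unique[:n_train], unique[n_train:n_train + n_val]


# Hypothetical vocabulary with two repeats; the real split was 6000 / 500.
vocab = [f"word{i}" for i in range(20)] + ["word3", "word7"]
train, val = split_unique(vocab, n_train=12, n_val=4)

print(len(train), len(val), set(train) & set(val))
```

Deduplicating before splitting ensures that a word rendered for training never reappears at evaluation time.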

Real Font A | Fake Font A | Real Font B | Fake Font B

Capital Letter Conversion

To understand the issue with capital letter conversion, we ran the same experiment, but only for converting capital letters from one font to another.

The results are not as promising as for lowercase letters, but it's a start.

Results (Best)

More results => https://web.iiit.ac.in/~shubh.maheshwari/test_latest/index.html

Real Font A | Fake Font A | Real Font B | Fake Font B

Using BicycleGAN

Why switch from CycleGAN?

- It gave better results on capital letters.

- It can account for variation in the output.

What is BicycleGAN?

- It is based on the pix2pix architecture.

- CycleGAN uses the inverse mapping to get back to the original image; BicycleGAN does the same with the noise.

- It uses an encoder to recover the noise that was used to generate the image, and a KL-divergence term keeps the encoded noise distribution close to the Gaussian distribution the noise is sampled from.

Understanding BicycleGAN

We need a way to force the generator to actually use the random noise z, so that different z values give different outputs and mode collapse is reduced.

Method 1

1. One way to do this is to generate B^ from A and z, and then use an encoder to recover z^ from B^ (z -> B^ -> z^).

2. An L1 loss between z and z^, together with the losses for G and D, then enforces the use of the noise. See figure (d).
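The effect of the L1 loss on the noise in Method 1 can be seen in a toy setting. The generator and encoder below are hypothetical scalar functions, not the real networks; the point is only that the loss stays high when the generator ignores z.

```python
import random


def latent_l1(G, E, A, zs):
    """Average |z - E(G(A, z))| over the sampled noise values."""
    return sum(abs(z - E(G(A, z))) for z in zs) / len(zs)


A = 2.0                  # stand-in for the conditioning input image


def G_uses_z(A, z):      # the output depends on the noise
    return A + 0.5 * z


def G_ignores_z(A, z):   # mode collapse: the same output for every z
    return A


def E(B):                # encoder that inverts G_uses_z for this A
    return (B - A) / 0.5


rng = random.Random(0)
zs = [rng.gauss(0.0, 1.0) for _ in range(1000)]

print(latent_l1(G_uses_z, E, A, zs))     # close to zero: z is recoverable
print(latent_l1(G_ignores_z, E, A, zs))  # large: z was thrown away
```

Minimizing this term therefore pushes the generator towards outputs from which z can be read back.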

Method 2

1. We can reverse this by applying the encoder E to B to produce Q(z|B), and then using a z drawn from it together with G to generate B^ (B -> z -> B^).

2. The authors noted that this method alone did not give good results at test time. Hence they enforced a KL-divergence between Q(z|B) and the random Gaussian z, which produced much better results.

3. See figure (c) for reference.
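The KL-divergence term in Method 2 has a closed form when Q(z|B) is a diagonal Gaussian and the prior is a standard normal. The sketch below assumes the encoder outputs a mean and a log-variance per latent dimension, a common parameterization that is not confirmed here; the numbers are made up.

```python
import math


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, log_var)
    )


# Hypothetical encoder outputs for one image B: mean and log-variance of Q(z|B).
matches_prior = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
far_from_prior = kl_to_standard_normal([3.0, -2.0], [-4.0, -4.0])

print(matches_prior)    # zero: Q(z|B) is exactly N(0, I)
print(far_from_prior)   # large: a z sampled from N(0, I) rarely lands in Q(z|B)
```

Keeping this term small means that noise drawn from the Gaussian prior at test time resembles the encodings the generator saw during training.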