Natural Language Generation (NLG) focuses on building systems that generate text close to our understanding of language. The advent of neural networks has led to new text generation methods that surpass previously used methods in terms of accuracy and the variety of the text generated. In our work, we focus on a hybrid approach that treats the generation of grammar as a precursor to context generation. The context is generated conditioned on the generated grammar, which ensures that the text produced is grammatically correct. We explain the model and provide some preliminary results of the system.
In practical terms, NLG can be thought of as automating everyday document creation in settings such as courtrooms, government offices, and news channels. A detailed survey of NLG techniques is given in [1]. The increasing demand for diversity and variety in such reports calls for more sophisticated NLG techniques, and with neural networks the scope for research in this area has broadened, leading to works such as SeqGAN [2] and Long Short-Term Memory (LSTM) Recurrent Neural Networks [3] for language generation.
We explored LSTM-RNNs and Generative Adversarial Networks (GANs) [4] as possible techniques for generating grammatically correct and contextually sound sentences.
We learn from these experiments that neural networks and other machine learning techniques do not explicitly account for the grammar underlying the training text. We propose our model, GrammarGAN, as an attempt to generate sentences while respecting the underlying grammatical structure.
A GAN comprises two components, a generator and a discriminator, which play a minimax game to out-learn each other.
The update equations for the generator G and the discriminator D, as given in [4], are:
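$$
\nabla_{\theta_d}\,\frac{1}{m}\sum_{i=1}^{m}\Big[\log D\big(x^{(i)}\big)+\log\Big(1-D\big(G\big(z^{(i)}\big)\big)\Big)\Big] \tag{1}
$$

$$
\nabla_{\theta_g}\,\frac{1}{m}\sum_{i=1}^{m}\log\Big(1-D\big(G\big(z^{(i)}\big)\big)\Big) \tag{2}
$$

where D is updated by ascending its stochastic gradient (1) and G by descending its stochastic gradient (2); m is the minibatch size, the x^{(i)} are samples from the training data, and the z^{(i)} are noise samples drawn from the prior p_z(z).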
LSTM networks address the problem of long-term dependencies associated with traditional recurrent neural networks. The LSTM network tries to learn the probability distribution of the training corpus. Each word in the corpus is treated as a category, and the categorical cross-entropy loss to be minimized is given by:
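$$
L = -\sum_{i=1}^{n} y_i \log(p_i) \tag{3}
$$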
where n is the number of categories, y_i is a binary value indicating the presence or absence of category i, and p_i is the predicted probability of category i.
Our model combines the generation capacity of GANs with the ability of LSTM RNNs to maintain context. We pre-process the data from our corpus and assign POS tags to sentences of fixed length. These tag sequences are used as training data for the GAN. Once trained, the generator produces a sequence of tags t' = [t'_0, t'_1, ..., t'_{N-1}] of specified length N. Next, we randomly select from the corpus a seed word pattern w' = [w'_0, w'_1, ..., w'_{l-1}] corresponding to the first l tags.
In our model, the word to be predicted depends on a seed word pattern of length l and the part-of-speech tag of the next word. Hence, given a seed word pattern [w_0, w_1, w_2, ..., w_{l-1}], where the w_i are words from the corpus, and the corresponding part-of-speech tags [t_0, t_1, t_2, ..., t_{l-1}, ..., t_m] with m > l, we predict [w_l, ..., w_m]. The prediction of the j-th word is made in the following manner:
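$$
w_j = \operatorname*{arg\,max}_{w \in V}\; P\big(w \mid w_{j-l}, \ldots, w_{j-1},\, t_j\big), \qquad j = l, \ldots, m \tag{4}
$$

where V denotes the vocabulary of the corpus.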
Note that l is the moving window size; the window refers to the number of previous words on which the next word is conditioned. The subsequent words are predicted by the trained LSTM network, which predicts the word category based on the previous words and the tag at the corresponding position in the generated tag sequence t', according to Equation (4).
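As a minimal sketch of this combined generation loop, assuming a trained tag generator and a trained POS-conditioned word model wrapped in the hypothetical callables generate_tags and predict_word_probs:

```python
import numpy as np

def generate_sentence(generate_tags, predict_word_probs, corpus_windows, vocab, N, l):
    # 1. Generate a tag sequence t' of length N with the trained generator.
    tags = generate_tags(N)

    # 2. Randomly pick a seed word pattern from the corpus whose tags match the first l tags.
    candidates = [words for words, tag_window in corpus_windows
                  if list(tag_window) == list(tags[:l])]
    words = list(candidates[np.random.randint(len(candidates))])

    # 3. Predict the remaining words, conditioning each prediction on the previous
    #    l words (the moving window) and the tag at the same position, as in Equation (4).
    for j in range(l, N):
        probs = predict_word_probs(words[j - l:j], tags[j])
        words.append(vocab[int(np.argmax(probs))])
    return words
```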
At the pre-processing stage, the corpus is first split into a set of sentences, which are then broken down into lists of words. Based on these lists of words, we prepare a list of part-of-speech tags corresponding to the words using the default NLTK tagger, which has an accuracy of 89.56%. We maintain a corresponding integer mapping for each of the 32 tags.
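A minimal sketch of this pre-processing step using NLTK; the integer mapping scheme shown here, with 0 reserved for padding, is an assumption consistent with the zero-padding described below:

```python
import nltk  # requires the 'punkt' and 'averaged_perceptron_tagger' data packages

def preprocess(corpus_text):
    # Split the corpus into sentences and each sentence into a list of words.
    sentences = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(corpus_text)]

    # POS-tag every sentence with the default NLTK tagger.
    tagged = [nltk.pos_tag(words) for words in sentences]

    # Integer mapping for the tags (32 tags in our corpus); 0 is reserved for padding.
    tag_set = sorted({tag for sent in tagged for _, tag in sent})
    tag_to_int = {tag: i + 1 for i, tag in enumerate(tag_set)}

    tag_sequences = [[tag_to_int[tag] for _, tag in sent] for sent in tagged]
    return sentences, tagged, tag_sequences, tag_to_int
```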
Three experiments were performed with the GAN, varying the length of the generated tag sequence:
We select the tag lists of sentences whose number of words is at most the sequence length of the tags to be generated. Shorter sentences are padded with zeros so that the training data dimensions are consistent.
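A sketch of this filtering and zero-padding step, assuming the integer-coded tag sequences produced at the pre-processing stage:

```python
def make_gan_training_data(tag_sequences, sequence_length):
    # Keep only sentences whose tag list fits within the chosen sequence length ...
    kept = [seq for seq in tag_sequences if len(seq) <= sequence_length]
    # ... and right-pad the shorter ones with zeros so every row has the same width.
    return [seq + [0] * (sequence_length - len(seq)) for seq in kept]
```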
We use a binary match evaluation metric to assess the generated tags. Before evaluation, each generated tag sequence is assigned an error value equal to the length of the sequence. The error value is decremented by one for every match in the corpus, and the accuracy is calculated by subtracting the average error from the sequence length. The training time for all three models was set to 50 hours. A few results from the three experiments and the accuracy of the models are given in Table 1 below.
Table 1: POS sequences, Generator samples
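One possible reading of the binary match metric described above, sketched under the assumption that a match means the generated tag at a given position also occurs at that position in some corpus tag sequence:

```python
def binary_match_accuracy(generated_seqs, corpus_seqs, sequence_length):
    errors = []
    for gen in generated_seqs:
        error = sequence_length                       # initial error value = sequence length
        for pos, tag in enumerate(gen):
            # Assumption: a positional match against any corpus sequence counts once.
            if any(pos < len(ref) and ref[pos] == tag for ref in corpus_seqs):
                error -= 1                            # one match -> error decremented by one
        errors.append(error)
    avg_error = sum(errors) / len(errors)
    return sequence_length - avg_error                # accuracy = sequence length - average error
```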
The last word of the sequence is replaced with the tag of the actual next word. Appropriate integer encodings are created for both the tags and the words. The models are fitted with the categorical loss. We gradually decrease the length of the seed pattern and increase the length of the prediction pattern. Some results are shown in Table 2 below:
Table 2: POS-conditioned tokens generated by the LSTM
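A sketch of how these training samples might be assembled, under the assumption that the replaced word is also the prediction target; word_to_int and tag_to_int are the integer mappings mentioned above, and l is the seed length:

```python
def make_lstm_training_samples(sentences, tagged_sentences, word_to_int, tag_to_int, l):
    X, y = [], []
    for words, tagged in zip(sentences, tagged_sentences):
        for j in range(l, len(words)):
            window = [word_to_int[w] for w in words[j - l:j + 1]]  # l seed words + the actual next word
            window[-1] = tag_to_int[tagged[j][1]]                  # last word replaced by its POS tag
            X.append(window)
            y.append(word_to_int[words[j]])                        # the replaced word is the target category
    return X, y
```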
Note: The training time for the LSTM network on an Nvidia Quadro P5000 machine was 141 hours (6 days) for the Trump dataset and 319 hours (14 days) for the Wikipedia dataset. Training the model on the Trump dataset takes 24 days on the same machine without the Nvidia GPU, which demonstrates the speed-up offered by the Nvidia Quadro GPU.
The results of the combined system of GAN and LSTM are shown in Table 3 below:
Table 3: The table shows the tag sequence generated by the generator. A seed word corresponding to the first tag in the sequence is chosen from the corpus, and the predicted words are generated based on the previously generated words and the corresponding tag in the tag sequence.