Chatbot and Related Research Paper Notes with Images

Original Link:

I gathered here a good number of useful papers related to chatbot models in chronological order spanning about 4 years from 2014. There are a couple of papers in deep learning that are not related to conversational agents at all, but I deemed them useful as they may provide insights into creating new and different conversation models. Some papers are concerned with neural machine translation, but I added these because the techniques described can usually be adapted to chatbot models. The rest of the papers are either focused on chatbots or more generally on the seq2seq model.
For each paper I provided a link, the names of the authors, and GitHub implementations of the paper (noting the deep learning framework) if I happened to find any. Since I tried to make these notes as concise as possible, they in no way summarize the papers, but are merely a starting point to get the hang of what each paper is about, and to mention the main concepts with the help of pictures.
Check my paper for an organized, in-depth research survey based on most of the papers listed here, up until 2017.08.
Check this github readme post for several neural chatbot implementations.
I also divided the papers into 3 categories described earlier, placing them after the paper title.
  • [n-c] means this is a paper that is neither related to chatbots nor to nmt
  • [s2s] means that this paper is not specifically about chatbots but it is related to the seq2seq architecture or to other sequence to sequence NLP transduction tasks (like nmt)
  • [chat] means that this paper is concerned with some aspect of chatbots


Teaching Machines to Converse [chat]

Jiwei Li

Before starting the list of publications that I have read and made notes on, I want to highlight an amazing work that I came upon from Jiwei Li. His PhD thesis summarizes all of his most notable publications in the field of neural conversational agents, providing in my opinion a number of very interesting papers that experiment with diverse approaches to make open-domain dialog agents better. Almost all of the publications mentioned here will appear later on this page, as I have read and enjoyed them thoroughly. Furthermore, the GitHub link provided contains most of his works in Torch. List of publications discussed in the PhD thesis:

Generative Adversarial Nets [n-c]

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Yoshua Bengio
Tensorflow, Tensorflow, Pytorch, Theano (official), Keras
  • Generator and discriminator networks
  • Generator tries to mimic the data distribution as closely as possible, so that the discriminator can't decide which sample comes from the data and which from the generator
  • This is analogous to a 2 player minimax game
  • Both can be trained together with backpropagation
  • Alternate between k steps of optimizing D and one step of optimizing G
  • D optimizes the whole value function, while only its second term is relevant for optimizing G
  • Optimal: D(x)=0.5 and p_g=p_data
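The value function these bullets refer to is V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))]. Below is a minimal sketch in plain Python that estimates it from discriminator outputs. The function name `gan_value` and the sample outputs are made up for illustration; no networks are trained here.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples
    d_fake: discriminator outputs D(G(z)) on generated samples
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term

# D maximizes the whole value; G can only influence the second term.
confident = gan_value(d_real=[0.9, 0.8], d_fake=[0.1, 0.2])
fooled = gan_value(d_real=[0.5, 0.5], d_fake=[0.5, 0.5])
print(confident, fooled)
```

When the discriminator outputs 0.5 everywhere (the optimum listed above), the value equals -log 4, which is lower than what a discriminator that still tells the samples apart achieves.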

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation [s2s]

Kyunghyun Cho, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, Yoshua Bengio
Theano (official), Pytorch, Tensorflow
  • Two RNNs for encoding and decoding of sequences, jointly trained
  • The equations in the paper use the gated recurrent unit (GRU)
  • They only looked at rescoring translation phrases, not generating

Sequence to Sequence Learning with Neural Networks [s2s]

Ilya Sutskever, Oriol Vinyals, Quoc V. Le
Keras, Numpy, Tensorflow (official), Numpy, Tensorflow, Tensorflow, Tensorflow
  • Encoder-decoder with LSTM (pretty big architecture)
  • Words are reversed in source sequence for better performance
  • Left to right beam search decoder
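A left-to-right beam search can be sketched as follows. This is a generic, simplified version, not the decoder from the paper: `next_log_probs` stands in for the LSTM decoder, and `toy_model` is a made-up lookup table used only to show that a wider beam can find a better sequence than greedy decoding.

```python
import math

def beam_search(next_log_probs, beam_size, max_len, eos="</s>"):
    """Left-to-right beam search.

    next_log_probs(prefix) returns {token: log_prob} for the next token.
    Returns the best (tokens, score) hypothesis found.
    """
    beams = [((), 0.0)]  # partial hypotheses: (tokens, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            for tok, lp in next_log_probs(tokens).items():
                candidates.append((tokens + (tok,), score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        # Keep only the top `beam_size` extensions; completed ones leave the beam.
        for tokens, score in candidates[:beam_size]:
            if tokens[-1] == eos:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])

def toy_model(prefix):
    """Hypothetical next-token distribution, for illustration only."""
    table = {
        (): {"a": 0.6, "b": 0.4},
        ("a",): {"x": 0.5, "y": 0.5},
        ("b",): {"z": 0.9, "</s>": 0.1},
    }
    probs = table.get(prefix, {"</s>": 1.0})
    return {tok: math.log(p) for tok, p in probs.items()}

greedy = beam_search(toy_model, beam_size=1, max_len=5)
wider = beam_search(toy_model, beam_size=2, max_len=5)
print(greedy, wider)
```

With a beam of 1 the search commits to "a" early, while a beam of 2 keeps "b" alive and ends with the more probable sequence.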

Neural Machine Translation by Jointly Learning to Align and Translate [s2s]

Dzmitry Bahdanau, Kyunghyun Cho, Yoshua Bengio
Theano (official), Keras, Tensorflow, Tensorflow
  • It's hard to compress all information into the context vector, especially for long sequences
  • To solve this, we use soft search to allow the decoder to peek at the relevant source words, and we don't encode into fixed length vector
  • Distinct context vector for each target word
  • Annotation for each source word with strong focus on the parts surrounding it, then the i-th context vector is the weighted sum of these annotations
  • These weights computed by alignment model which scores how well the inputs around position j and the output at position i match
  • Alignment model as a feedforward network trained jointly with translation model
  • Encoder is bidirectional RNN, thus the hidden states represent words both before and after the source word
  • Target word probability computed with a multilayer network with a single maxout
  • BLEU 28.45, probably because it's a shallow network
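The alignment model and context vector described above can be sketched in plain Python. This assumes the additive form e_ij = v · tanh(W s_{i-1} + U h_j) with a softmax over source positions; the matrices and vectors in the usage lines are arbitrary toy values, not trained parameters.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def additive_attention(s_prev, annotations, W, U, v):
    """Attention weights and context vector for one decoder step.

    s_prev: previous decoder state s_{i-1}
    annotations: encoder annotations h_1..h_T (one vector per source word)
    """
    Ws = matvec(W, s_prev)
    scores = []
    for h in annotations:
        Uh = matvec(U, h)
        hidden = [math.tanh(a + b) for a, b in zip(Ws, Uh)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    alphas = softmax(scores)  # one weight per source position
    dim = len(annotations[0])
    # The context vector is the weighted sum of the annotations.
    context = [sum(a * h[k] for a, h in zip(alphas, annotations)) for k in range(dim)]
    return alphas, context

identity = [[1.0, 0.0], [0.0, 1.0]]
alphas, context = additive_attention(
    s_prev=[1.0, 0.0],
    annotations=[[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    W=identity, U=identity, v=[1.0, 1.0],
)
print(alphas, context)
```

A distinct set of weights, and therefore a distinct context vector, is computed at every target position.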

Neural Turing Machines [n-c]

Alex Graves, Greg Wayne, Ivo Danihelka
  • RNNs with memory, differentiable end-to-end -> trainable with gradient descent
  • Neural network controller interacts with memory bank (matrix) with read and write heads
  • NxM memory matrix, N addresses with M long vectors at each address
  • Reads and writes are blurry, "focus" determines how specific the addressing is
  • Writing operation composed of erase and add vectors.
  • Weighting vector is also used for reading and writing, constructed by using controller outputs, memory matrix, and previous weighting vector; operations to get new vector: content addressing -> interpolation -> convolutional shift -> sharpening
  • Content- and location-based addressing are combined in the above flow to get the weighting, which allows to only use the previous weighting, to interpolate it with the content-based address, or to shift it to the next address. The shift can be blurry, so sharpening is needed
  • Copy experiment trained on sequences up to length 20, able to generalize with minor errors up to length 50 (much better than LSTM)
  • Repeated copy can generalize as well to longer sequences and more copy steps
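The blurry read, the erase/add write and the sharpening step can be sketched in a few lines of plain Python. This is a minimal illustration that takes the weighting vector as given; content addressing, interpolation and the convolutional shift are left out, and the function names are my own.

```python
def ntm_read(memory, w):
    """r = sum_i w[i] * M[i]: a weighted (blurry) read over all N rows."""
    width = len(memory[0])
    return [sum(w[i] * memory[i][k] for i in range(len(memory))) for k in range(width)]

def ntm_write(memory, w, erase, add):
    """Erase then add: M[i] <- M[i] * (1 - w[i] * erase) + w[i] * add."""
    return [
        [m * (1.0 - w[i] * e) + w[i] * a for m, e, a in zip(row, erase, add)]
        for i, row in enumerate(memory)
    ]

def sharpen(w, gamma):
    """Raise each weight to the power gamma >= 1 and renormalize."""
    powered = [wi ** gamma for wi in w]
    total = sum(powered)
    return [p / total for p in powered]

memory = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
focused = ntm_read(memory, [0.0, 1.0, 0.0])   # reads row 1 exactly
blurry = ntm_read(memory, [0.5, 0.5, 0.0])    # blends rows 0 and 1
updated = ntm_write(memory, [0.0, 1.0, 0.0], erase=[1.0, 1.0], add=[5.0, 6.0])
print(focused, blurry, updated, sharpen([0.6, 0.4], 2.0))
```

Because every operation is a weighted sum or product, the whole read/write path stays differentiable with respect to the weighting.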

Neural Responding Machine for Short-Text Conversation [chat]

Lifeng Shang, Zhengdong Lu, Hang Li
  • Encoder-decoder model applied to Twitter-style 2-turn conversations, with Bahdanau attention, GRU, and beam search for decoding
  • Combines the Bahdanau attention model with the original global context vector representation
  • Evaluation done with human judgment

A Neural Conversational Model [chat]

Oriol Vinyals, Quoc V. Le
  • IT helpdesk dataset and movie subtitles; Big architectures and big vocabs
  • Input sequence is what has been conversed so far (context), output sequence is the reply
  • Objective function optimized is not the actual objective achieved through human communication
  • Problem mentioned is with the inconsistent answers (there is no personality) and with not being able to evaluate correctly :(

A Neural Network Approach to Context-Sensitive Generation of Conversational Responses [chat]

Alessandro Sordoni, Michel Galley, Michael Auli, Chris Brockett, Yangfeng Ji, Margaret Mitchell, Jian-Yun Nie, Jianfeng Gao, Bill Dolan
  • Encode past information, which is then decoded to promote responses
  • Separate context from last message
  • They use IR to generate more responses to a (c,m,r) triple based on bag of words
  • They use a ton of features together with the neural network models to generate likely responses

Learning to Transduce with Unbounded Memory [n-c]

Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, Phil Blunsom
  • Neural stacks, queues and deques -> effective hierarchical structure for NLP transduction problems
  • In contrast to LSTM, these can generalize to much longer sequences than seen at training
  • Push and pop operations are continuous: their strength represents the degree of certainty of pushing or popping
  • RNN is controlling the stack
  • Read, pop, push and other vectors are concatenated as the input to the RNN
  • Shown to work really well for sequence copying, reversal and other transduction tasks
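A continuous stack with fractional push and pop strengths can be sketched as below. This is my reading of the mechanism, written in plain Python without a controller RNN: the class name `NeuralStack` and the strengths in the usage lines are illustrative assumptions.

```python
class NeuralStack:
    """Continuous stack: each stored vector has a strength in [0, 1]."""

    def __init__(self):
        self.values = []     # stored vectors, bottom first
        self.strengths = []  # strength of each stored vector

    def step(self, value, push, pop):
        """Pop with strength `pop`, then push `value` with strength `push`."""
        above = 0.0  # total strength stored above the current entry
        new_strengths = []
        for s in reversed(self.strengths):
            # An entry only loses strength once everything above it is used up.
            new_strengths.append(max(0.0, s - max(0.0, pop - above)))
            above += s
        self.strengths = new_strengths[::-1] + [push]
        self.values.append(value)
        return self.read()

    def read(self):
        """Blend vectors from the top down until a total strength of 1 is read."""
        out = [0.0] * len(self.values[-1])
        above = 0.0
        for v, s in zip(reversed(self.values), reversed(self.strengths)):
            weight = min(s, max(0.0, 1.0 - above))
            out = [o + weight * x for o, x in zip(out, v)]
            above += s
        return out

stack = NeuralStack()
r1 = stack.step([1.0, 0.0, 0.0], push=0.8, pop=0.0)
r2 = stack.step([0.0, 1.0, 0.0], push=0.5, pop=0.1)
r3 = stack.step([0.0, 0.0, 1.0], push=0.9, pop=0.9)
print(r1, r2, r3)
```

After the third step the second vector has been popped completely, so the read mixes mostly the newest vector with a small remainder of the first one.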

Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models [chat]

Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, Joelle Pineau
Theano (official), Theano, Theano, Tensorflow, Tensorflow
  • Non-goal driven dialog systems, incorporating NL understanding, reasoning, decision making and generation
  • HRED (introduced by Sordoni 2015a) -> the encoder RNN encodes the tokens appearing in an utterance, and the context RNN takes this vector as input and encodes the temporal structure of the utterances appearing so far in the dialogue with a GRU (good diagram in the paper). The decoder takes the output of the context RNN at the current timestep and generates the response with beam search
  • Speech acts, pause and end of turns included as separate tokens
  • Bidirectional RNN to summarize the information in forward and backward chain of the tokens
  • Pretrained word embeddings on huge google corpus are used to capture more info, and pretraining of the HRED model is done on a Q&A dataset.
  • Training done on movie dialog triples, 10k vocab stripped of person names and numbers
  • Perplexity and word error rate are used for evaluation, although I'm not sure how suitable they are
  • A lot of generic "i don't know" answers, because there are too many punctuation and pronoun tokens (maybe semantic structure should be separated from syntactic structure). Also, the usual metrics don't capture similar semantic content, thus they do not correlate with the objective

A Diversity-Promoting Objective Function for Neural Conversation Models [chat]

Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, Bill Dolan
Torch (official)
  • Maximum Mutual Information instead of likelihood of output used for objective function
  • Conventional neural models assign high probability to safe responses
  • Propose to capture the intuition that likelihood of message for a given response should be taken into account
  • To maximize this, N-best lists are generated with beam search and then reranked using the second term log(p(S|T))
  • Trained maximum likelihood models and used the MMI criterion above only during testing; a parameter that takes the sequence length into account is also used
  • Another approach is to use log(p(T)) for MMI, but this can lead to ungrammatical outputs; the solution is to multiply the LM term by weights, so that the first words are more diverse and then it gets closer to a LM
  • Multireference BLEU used (better for dialog evaluation), with references extracted with IR methods
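The N-best reranking step can be sketched as follows. The scoring form (1 - lam) * log p(T|S) + lam * log p(S|T) + gamma * len(T) is an assumption based on the bullets above, the exact weighting in the paper may differ, and the log-probabilities in the usage lines are invented rather than produced by trained models.

```python
def mmi_rerank(candidates, lam=0.5, gamma=0.0):
    """Rerank an N-best list with a bidirectional MMI-style score.

    candidates: list of (tokens, log_p_t_given_s, log_p_s_given_t)
    lam: weight of the backward term log p(S|T)
    gamma: reward per generated token (length term)
    """
    def score(candidate):
        tokens, forward, backward = candidate
        return (1.0 - lam) * forward + lam * backward + gamma * len(tokens)
    return sorted(candidates, key=score, reverse=True)

n_best = [
    (["i", "don't", "know"], -1.0, -8.0),          # safe reply, says little about S
    (["it", "starts", "at", "nine"], -2.5, -2.0),  # specific reply
]
likelihood_only = mmi_rerank(n_best, lam=0.0)
with_mmi = mmi_rerank(n_best, lam=0.5)
print(likelihood_only[0][0], with_mmi[0][0])
```

The generic reply wins under pure likelihood, but drops once the message has to be predictable from the response.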

Attention with Intention for a Neural Network Conversation Model [chat]

Kaisheng Yao, Geoffrey Zweig, Baolin Peng
  • Encoder, intention, decoder RNN structure
  • Similar to HRED, but the output of decoder is also fed directly to the encoder RNN, and encoder RNN output is also directly fed to decoder network
  • Basically a HRED with bidirectional attention

Neural GPUs Learn Algorithms [n-c]

Łukasz Kaiser, Ilya Sutskever
Tensorflow (official), Tensorflow, Theano, Torch
  • Similar to NTM, but it's parallel and shallow
  • Using convolutional GRUs (architecture described in the paper)
  • Can do long binary addition and multiplication much better than stack RNN or LSTM with attention
  • Grid search to train 729 models, curriculum learning to go to longer inputs only if accuracy is good
  • Gradient noise, hard gate cutoff
  • Small dropout on recurrent connections helps generalization
  • 6 identical sets of non-shared parameters are used at different time steps, thus it can perform different operations at different time steps
  • This is called relaxation; as the model converges, the 6 sets are forced to unify
  • This relaxation has the potential to improve any RNN training

A Survey of Available Corpora for Building Data-Driven Dialogue Systems [chat]

Iulian Vlad Serban, Ryan Lowe, Peter Henderson, Laurent Charlin, Joelle Pineau
  • This is an article I didn't fully read!
  • Long paper full of useful corpora classified into categories, also discussion of metrics and of data pre-processing techniques!!!
  • Remove acronyms, slang and misspellings, and apply stemming and lemmatization (depending on the task); also tokenization (defining the smallest unit of input)
  • Speaker segmentation with small gold corpus, and then iteratively segmenting the rest

Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond [s2s]

Ramesh Nallapati, Bowen Zhou, Cicero dos Santos, Çaglar Gulçehre, Bing Xiang
  • Applies the Bahdanau model to summarization, but uses features like POS and TF-IDF together with word embeddings in 1 big vector
  • Switching decoder/pointer (to source) architecture to handle OOV words, by copying them from document to summary
  • Two bi-directional encoders for word level and sentence level, they both have attention, word level attention is affected by sentence level attn.
  • Additional positional info is embedded in the sentence-level RNN

Incorporating Copying Mechanism in Sequence-to-Sequence Learning [chat]

Jiatao Gu, Zhengdong Lu, Hang Li, Victor O.K. Li
Theano (official)
  • Seq2seq model incorporating a copying mechanism, with which it can directly copy parts of input sequences
  • Similar to the Bahdanau attention model, with differences: prediction is based on two modes (generate, copy), where copy-mode picks words from the source
  • In addition to the vocab it uses all the words in the source sentence (even OOV) when using location-based copying.
  • Mixing probabilities of copy-mode and generate-mode (same as Bahdanau) with the same normalization term to make them compete through softmax
  • Selective read from M attention matrix is used, which bears the location of the word in the source.
  • Both semantics and location of source word encoded into hidden states in M, for attentive and selective read

A Persona-Based Neural Conversation Model [chat]

Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, Bill Dolan
Torch (official)
  • Capture background information and speaking style which is a persona
  • Incorporate speaker and addressee vectors into seq2seq
  • We add the speaker embedding vector to each input of the unrolled decoder LSTM together with the words of response
  • Speaker embedding vector is learned through normal backprop together with other params (like word embeddings, but separate)
  • Speaker-addressee model: combine the user vectors -> same speaker will react differently to different addressees
  • The diversity promoting objective function is used, namely an inverse seq2seq is trained without speaker info to get log(p(S|T))
  • They trained on OpenSubtitles and then adapted the model to Friends conversations (also trained another model on Twitter (c,m,r) triples)
  • There are still errors but pretty consistent and diverse answers

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation [chat]

Chia-Wei Liu, Ryan Lowe, Iulian V. Serban, Michael Noseworthy, Laurent Charlin, Joelle Pineau
  • BLEU not good where responses are diverse with no matching words, deltaBLEU is weak and needs human annotation for multiple reference replies
  • BLEU is based on n-grams, METEOR produces alignment between response and ground truth, ROUGE is based on longest common subsequence
  • Greedy matching is based on matching words with closest embedding vectors in response and truth, embedding average: sentence level embedding
  • They all correlate poorly with human judgment on the Twitter dataset and not at all on the Ubuntu dataset
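The two embedding-based metrics mentioned above can be sketched as follows. The embeddings here are tiny made-up vectors, the function names are my own, and details such as the symmetric averaging in greedy matching are assumptions about the usual formulation rather than a copy of the paper's code.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embedding_average(response, truth, emb):
    """Cosine similarity between the mean word vectors of the two sentences."""
    def mean_vec(words):
        vecs = [emb[w] for w in words]
        return [sum(col) / len(vecs) for col in zip(*vecs)]
    return cosine(mean_vec(response), mean_vec(truth))

def greedy_matching(response, truth, emb):
    """Match each word to its closest word in the other sentence, both ways."""
    def one_way(src, tgt):
        return sum(max(cosine(emb[w], emb[t]) for t in tgt) for w in src) / len(src)
    return (one_way(response, truth) + one_way(truth, response)) / 2.0

# Hypothetical 2-d embeddings, for illustration only.
emb = {"hi": [1.0, 0.0], "hello": [0.9, 0.1], "bye": [0.0, 1.0]}
print(embedding_average(["hello"], ["hi"], emb))
print(greedy_matching(["bye"], ["hi"], emb))
```

Unlike BLEU, both metrics give credit to "hello" against "hi" even though the two responses share no words.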

Latent Predictor Networks for Code Generation [s2s]

Wang Ling, Edward Grefenstette, Karl Moritz Hermann, Tomas Kocisky, Andrew Senior, Fumin Wang, Phil Blunsom
  • Attend over structured inputs to generate code implementations of card descriptions (2 types of inputs: text fields and singular fields)
  • Structured attention implemented with character embeddings, Bi-LSTM, linear and tanh projections, ending with softmax for probabilities
  • Probability over multiple predictors that can generate multiple segments of arbitrary length at time step t (ex: "copy name", "generate char")
  • Objective function is marginal log likelihood over a latent variable (representing a sequence of pairs of predictors and generated strings)
  • 3 types of predictors: char generation (softmax over chars), copy singular field (100% copy), copy text field (pointer network learns probability of copying)
  • Decoder with beam search takes best predictor and best string corresponding to the predictor at each time step to generate most likely code
  • Code compression -> replace commonly generated words (public, return) by tokens, to generate fewer characters
  • Only model to achieve non-zero accuracy, and better BLEU scores than MT or seq2seq models

StalemateBreaker: A Proactive Content-Introducing Approach to Automatic Human-Computer Conversation [chat]

Xiang Li, Lili Mou, Rui Yan, Ming Zhang
  • The computer side should also take the initiative and introduce new content when necessary; stalemates are detected with keywords like "…" or "Errr"
  • When a stalemate is detected, backtrack the conversation history to find named entities, then search for related entities in a knowledge graph
  • The system is retrieval and ranking based

A Network-based End-to-End Trainable Task-oriented Dialogue System [chat]

Tsung-Hsien Wen, David Vandyke, Nikola Mrkšic, Milica Gašic, Lina M. Rojas-Barahona, Pei-Hao Su, Stefan Ultes, Steve Young
  • Performs well across several metrics when trained on only a few hundred dialogues
  • Seq2seq model with dialogue history (belief trackers), and current database search outcome + they use MMI with reward function and beam search
  • The input is converted into two representations: a distributed representation by the intent network (which can be a CNN), and a probability distribution over slot-value pairs called the belief state. Then the most probable values from the belief state are taken to form a query to the DB, and the search result together with the intent and belief state is combined by the policy network to form a single vector representing the next system action
  • Belief tracker keeps track of the dialog state, using a weight tying strategy. It maintains a multinomial distribution over values for each informable slot and a binary distribution for each requestable slot
  • Each tracker is an RNN with recurrence from the output to the hidden layer, with a CNN feature extractor over the user input and machine response
  • Summary belief vectors for each slot, the truth vector from the DB (how much the entities match), and the vector from the intent network are used as input to the policy network, to produce the action vector
  • Generation LSTM uses the attentive action vector to generate tokens that are delexicalised with pointers to entities in the DB (3 informable, 7 requestable trackers)

A Hierarchical Latent Variable Encoder-Decoder Model for Generating Dialogues [chat]

Iulian V. Serban, Alessandro Sordoni, Ryan Lowe, Laurent Charlin, Joelle Pineau, Aaron Courville, Yoshua Bengio
Theano (official), Tensorflow
  • Not enough variation in the models, only source of variation is through the conditional output distribution
  • HRED with latent variable (based on prior and posterior parametrization) at the decoder trained by maximizing variational lower-bound on the MLE
  • At test a sample latent variable is drawn from the prior for each sub-sequence, and concatenated with output of context RNN
  • At training this is drawn from the approximate posterior (parameterized by its own one-layer FFN), used to estimate the gradient of the variational lower-bound
  • Similar to Variational Recurrent Autoencoder, but the latent variable is conditioned on all previous sub-sequences (sentences)

Sequence-to-Sequence Learning as Beam-Search Optimization [s2s]

Sam Wiseman, Alexander M. Rush
  • Training loss based on difference from target word is not represented in testing, also locally-normalized scores and exposure bias are bad
  • Proposing a non-probabilistic score for entire sequence and loss function in terms of errors made during beam search
  • !!Scheduled sampling!! = At training seq2seq select the target word at first to be the gold, and later to have higher probability to be the predicted word
  • Beam search is used at training as well to construct sequences, beams are changed when there is a margin violation in the loss of the previous seq
  • Model pretrained with standard word level cross-entropy, the size of the beam is increased gradually during training and dropout is also used

InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets [n-c]

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, Pieter Abbeel
  • GANs are used to learn salient representations in an unsupervised way (like angle/thickness of MNIST digit)
  • Decompose the z noise variable on which the generator is based into z (incompressible noise) and c (<- this targets the salient features)
    • c contains several latent variables/factors
  • Information-theoretic regularization in order to cope with trivial c-s
    • There should be high mutual information (MI) between c and the generator distribution G(z,c)
    • If it's high that means P_generator(c|x) has small entropy
    • P(c|x) is approximated with a lower bound of mutual information
  • The approximator and discriminator share parameters, and there is one final fully-connected layer to output the Q(c|x) distribution
  • It is shown that in a regular GAN the lower bound on MI is 0; however, when trained to maximize it, it reaches the maximal MI
  • Three latent variables are used, a categorical one for digit classifying, and two continuous ones for digit rotation and thickness
    • By varying each latent variable it is shown that it learns meaningful representations

Deep Reinforcement Learning for Dialogue Generation [chat]

Jiwei Li, Will Monroe, Alan Ritter, Michel Galley, Jianfeng Gao, Dan Jurafsky
  • Two virtual agents explore the space of possible actions while learning to maximize expected reward with policy gradient methods
  • An action is a dialog utterance taken according to the policy, a state is the previous two dialog turns, a policy is an enc-dec LSTM
  • Reward types:
    • ease of answering: how unlikely it is that the response to the utterance will be a dull one, based on MLE-based seq2seq probabilities
    • information flow: penalize semantic similarity between consecutive turns from same agent
    • semantic coherence: ensure the mutual information between action and previous turns
  • Curriculum learning is used such that first couple of tokens generated based on MLE, then switch to RL, and gradually reduce impact of MLE
  • Longer and more diverse simulated dialogues

An Attentional Neural Conversation Model with Improved Specificity [chat]

Kaisheng Yao, Baolin Peng, Geoffrey Zweig, Kam-Fai Wong
  • HRED with attention
  • Incorporates IDF in the objective function (with log-likelihood), and reinforcement learning is used based on this to compute gradients
  • Training data is computer helpdesk stuff, model performs pretty well

Topic Aware Neural Response Generation [chat]

Chen Xing, Wei Wu, Yu Wu, Jie Liu, Yalou Huang, Ming Zhou, Wei-Ying Ma
Theano (official)
  • Represent people's prior knowledge about the topic, and embed this into reply of seq2seq model with attention
  • Two encoders with separate attention modules, one is bidirectional RNN, other is for topic words, then their attention is jointly fed into decoder
  • The two encoders can affect each other's attention: topic attn finds relevant info, content attn determines the content focus
  • Topic word list obtained from a Twitter LDA model; the topic words play the role of classification and association in response generation (better first words chosen)

Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation [s2s]

Jie Zhou, Ying Cao, Xuguang Wang, Peng Li, Wei Xu
  • Deep LSTM enc-dec model with linear connections and an interleaved bi-directional architecture to stack the LSTM layers
  • There is a feed-forward network from the input nodes, fed into the current hidden state and the next layer together with previous hidden state
  • Alternate the RNN direction at different layers, two completely different encoders with different starting directions
  • Dropout is used and Attention is used from the vectors generated by the two encoders, and FF is used at decoder as well

Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation [chat]

Iulian Vlad Serban, Tim Klinger, Gerald Tesauro, Kartik Talamadupula, Bowen Zhou, Yoshua Bengio, Aaron Courville
  • Model multiple parallel sequences by factorizing the joint probability over the sequences
  • Hierarchical abstraction, information flows from high level sequences to low level ones
  • One sequence with the words, and another with coarse tokens (nouns for example)
  • Both sub-models are HRED, but the coarse predictor encoder encodes all previously generated tokens to a vector which is concatenated with the context RNN
  • Conditioned on the coarse sequence of higher level tokens the natural language sub-model generates a dialog utterance
  • 2 types of coarse representations: noun and activity-entity (extracting verbs and entities, only used for Ubuntu corpus)

Neural Discourse Modeling of Conversations [chat]

John M. Pierre, Mark Butler, Jacob Portnoff, Luis Aguilar
  • More previous conversational turns -> better models
  • Deixis, anaphora, logical consequence for measuring the relevance of the response to previous utterances

Neural Contextual Conversation Learning with Labeled Question-Answering Pairs [chat]

Kun Xiong, Anqi Cui, Zefeng Zhang, Ming Li
  • CNN and RNN encoder fed into RNN decoder; CNN: to learn topic distribution from sentence matrices, generates a topic vector
  • Context-in model: CNN vector is directly fed to decoder
  • Context-IO model: CNN vector fed to both hidden and output layer of decoder
  • Context-Attention model: attention computed from context at each decoder input
  • Trained on QA pairs with categories, and on twitter style chat
  • Shorter sentences have lower perplexity, but overall results look good

Layer Normalization [n-c]

Jimmy Lei Ba, Jamie Ryan Kiros, Geoffrey E. Hinton
  • Batch norm normalizes the summed inputs at each neuron, leads to faster convergence, and serves as a regularizer as well, but it's hard to apply to RNN
  • Layer normalization: computes statistics over all hidden units in the same layer, all neurons have same mean and variance terms

Sequence to Backward and Forward Sequences: A Content-Introducing Approach to Generative Short-Text Conversation [chat]

Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, Zhi Jin
  • Based on pointwise mutual info, compute a key-word noun, generating reply from this word going backwards and then forwards with 2 different RNNs
  • When computing pointwise mutual info (PMI) we penalize frequent words
  • In the backwards part words are reversed, and the forward RNN depends on the generated backward part
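PMI between a query word and a reply word can be estimated from co-occurrence counts, as in the sketch below. The pair list is invented and the function name `pmi_table` is my own; the paper's exact counting scheme may differ.

```python
import math
from collections import Counter

def pmi_table(pairs):
    """PMI(q, r) = log( p(q, r) / (p(q) * p(r)) ) from (query word, reply word) pairs."""
    total = len(pairs)
    joint = Counter(pairs)
    q_counts = Counter(q for q, _ in pairs)
    r_counts = Counter(r for _, r in pairs)
    return {
        (q, r): math.log((c / total) / ((q_counts[q] / total) * (r_counts[r] / total)))
        for (q, r), c in joint.items()
    }

# Hypothetical co-occurrence data, for illustration only.
pairs = [
    ("weather", "rain"), ("weather", "rain"), ("weather", "the"),
    ("movie", "the"), ("movie", "actor"), ("movie", "the"),
]
table = pmi_table(pairs)
print(table[("weather", "rain")], table[("weather", "the")])
```

Dividing by p(r) is what penalizes frequent reply words: "the" co-occurs with everything and therefore scores lower than "rain" for the query word "weather".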

Temporal Attention Model for Neural Machine Translation [s2s]

Baskaran Sankaran, Haitao Mi, Yaser Al-Onaizan, Abe Ittycheriah
  • Memorize alignments temporally from previous timesteps to modulate the attention in subsequent timesteps (somewhat similar to memory networks)

Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation [s2s]

Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, Jeffrey Dean
  • 8 layer deep enc-dec with residual connections and attention from the bottom layer of decoder to top layer of encoder (which is bidirectional)
  • Low-precision arithmetic (quantized inference using integer operations) for faster inference, plus model and data parallelism for faster training
  • Deep LSTMs improve performance but only if used with residual connections: the input is added to the output of a layer and forms the input to the next layer
  • Wordpiece model cuts up words with a greedy algorithm, thus it has very few OOV (with 8-32k wordpieces), but it's faster than using only characters
  • Maximum likelihood is used together with expected reward RL objective function
  • Length normalization is needed so that beam search doesn't favor shorter results
  • Coverage penalty to favor results that fully cover the source sentence according to attention
  • RL refinement of the trained models barely improves the human evaluation of translation quality
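The length normalization and coverage penalty used to rescore beam search hypotheses can be sketched as below. The formulas follow my reading of the paper, lp(Y) = ((5 + |Y|) / 6)^alpha and cp = beta * sum_i log(min(sum_j p_ij, 1)); the alpha and beta defaults and the attention matrices are illustrative values.

```python
import math

def length_penalty(length, alpha=0.2):
    """lp(Y) = ((5 + |Y|) / 6) ** alpha; equals 1 for a one-token output."""
    return ((5.0 + length) / 6.0) ** alpha

def coverage_penalty(attention, beta=0.2):
    """cp = beta * sum_i log(min(sum_j p_ij, 1)).

    attention[j][i] is the attention of target word j on source word i.
    Assumes every source word receives some attention (otherwise log(0)).
    """
    n_source = len(attention[0])
    total = 0.0
    for i in range(n_source):
        covered = sum(row[i] for row in attention)
        total += math.log(min(covered, 1.0))
    return beta * total

def gnmt_score(log_prob, length, attention, alpha=0.2, beta=0.2):
    """Hypothesis score: length-normalized log-probability plus coverage penalty."""
    return log_prob / length_penalty(length, alpha) + coverage_penalty(attention, beta)

full = [[0.9, 0.1], [0.1, 0.9]]     # both source words fully covered
partial = [[0.9, 0.1], [0.8, 0.2]]  # second source word mostly ignored
print(gnmt_score(-4.0, 2, full), gnmt_score(-4.0, 2, partial))
```

Dividing the negative log-probability by a penalty that grows with length offsets the bias towards short outputs, while the coverage term lowers the score of hypotheses that leave source words unattended.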

HyperNetworks [n-c]

David Ha, Andrew Dai, Quoc V. Le
  • Smaller network to generate the weights for a larger network; both of them trained together with gradient descent
  • Inputs are embedding vectors that describe the entire weights of a given layer (this can also be learned during training)
  • They generate non-shared weights for LSTM, meaning that weights can change between timesteps, which works better than standard LSTM
  • Static hypernetwork for CNN: for each layer input is a layer embedding; hypernetwork is a 2 layer linear network to project embedding to weight matrix
  • Thus it has to learn the projection weights and biases and the embeddings, which are fewer than the original CNN parameters
  • Dynamic hypernetwork for RNN: hypernetwork is an RNN, produces relaxed weight sharing (middle ground between hard and no weight sharing)
  • A linear network is also used in hyperRNN to project embeddings (the network entails similar theory as layer norm)
  • They applied it to a resnet, drastically reducing parameters with relaxed weight sharing
  • They compared hyperLSTM with layer norm LSTM together with recurrent dropout (similar results), and also applied layer norm to the hyper LSTM (best)

Can Active Memory Replace Attention? [s2s]

Łukasz Kaiser, Samy Bengio
Tensorflow (official)
  • Active memory can make parallel computations on the whole memory (like neural GPU), doesn't just focus on local stuff like attention
  • Memory operations with convolutions, and with CGRUs
  • After n-th CGRU there are the decoder attention CGRUs, which accumulate outputs and allow access to all outputs produced in steps before t.

Neural Machine Translation in Linear Time [s2s]

Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, Koray Kavukcuoglu
Tensorflow (official), Tensorflow, Tensorflow, Tensorflow
  • ByteNet is a one-dimensional convolutional enc-dec (with dilation and residual blocks with ReLUs) for character-level language modelling
  • Stack decoder on top of the representation of the encoder preserving the temporal resolution, instead of passing a context vector
  • Dynamic unfolding: process different length sentences, with an estimated target length which is usually bigger than the actual target
  • ByteNet is good because it runs in linear time and preserves source sequence resolution

Two are Better than One: An Ensemble of Retrieval- and Generation-Based Dialog Systems [chat]

Yiping Song, Rui Yan, Xiang Li, Dongyan Zhao, Ming Zhang
  • Retrieve candidate found by IR, feed into generative model (biseq2seq) along with query, then generated reply is post-reranked with the retrieval model
  • Used crowd-sourcing to find out how relevant a query and a reply are instead of negative sampling approach
  • Achieves better results than either sub-model; the 2 sub-components are chosen about equal number of times during post-ranking

Neural Architecture Search with Reinforcement Learning [n-c]

Barret Zoph, Quoc V. Le
  • Structure of a NN can be specified as a string representing the various parameters, thus a controller RNN could generate such strings
  • The generated network can be trained, and its accuracy used as reward to compute the policy gradient to update the controller
  • RNN generates CNN, layer by layer and parameters one after another, then the CNN is trained until convergence and its accuracy is used for the REINFORCE algorithm, a policy gradient method… CNNs produced achieve state of the art on CIFAR-10
  • Add anchor points and set-selection attention to the RNN to propose skip connections (what previous layers to use as input to the current layer)
  • Produce a recurrent cell: as a tree of steps that take x_t and h_t-1 as inputs to produce h_t as output; the nodes can be labeled by functions and methods
  • The awesome recurrent cell produced is implemented in tensorflow as NASCell
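
The controller update can be sketched as a toy REINFORCE loop. This is a heavily simplified illustration, not the paper's setup: the controller RNN is replaced by independent softmax distributions (one per architecture decision), and `toy_accuracy` is a hypothetical stand-in for training the child network and measuring its validation accuracy.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def toy_accuracy(config):
    """Hypothetical reward: pretends kernel width 5 and 64 filters work best."""
    width, filters = config
    return 0.5 + 0.2 * (width == 5) + 0.3 * (filters == 64)


def train_controller(choices, reward_fn, steps=500, lr=0.1, seed=0):
    """REINFORCE with a moving-average baseline over independent decisions."""
    rng = random.Random(seed)
    logits = [[0.0] * len(options) for options in choices]
    baseline = 0.0
    for _ in range(steps):
        probs = [softmax(row) for row in logits]
        picks = [rng.choices(range(len(p)), weights=p)[0] for p in probs]
        reward = reward_fn([choices[i][k] for i, k in enumerate(picks)])
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        for i, k in enumerate(picks):
            for j in range(len(logits[i])):
                # gradient of log softmax w.r.t. logit j for the sampled index k
                grad = (1.0 if j == k else 0.0) - probs[i][j]
                logits[i][j] += lr * advantage * grad
    return [options[max(range(len(row)), key=row.__getitem__)]
            for options, row in zip(choices, logits)]
```

Calling `train_controller([[3, 5, 7], [32, 64]], toy_accuracy)` returns the most probable choice for each decision after training; since sampling is stochastic, the result depends on the seed and step count.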

Dialogue Learning With Human-In-The-Loop [chat]

Jiwei Li, Alexander H. Miller, Sumit Chopra, Marc’Aurelio Ranzato, Jason Weston
Torch (official)
  • Teacher gives feedback through RL to learner (bot trained with SL) in the context of QA
  • Both reward based numerical feedback (only receives it 50% of time when it's doing good) and forward prediction methods using textual feedback
  • Memory network: input is last utterance and memories (dialog context and KB); Memories are compared with query vector to select relevant ones
  • RL: policy is MemN2N model, state is dialog history, action space is set of answers
  • Reward based imitation (RBI): choose the model's own answer with probability 1-ε, otherwise a random answer
  • REINFORCE: maximize expected cumulative reward of the episode
  • Forward prediction (FP): query and memory mapped to vector representation, and that together with an attention hop over all possible answers is combined to predict the teacher's feedback; in the online setting the learner needs to update the model using the teacher's textual feedback
  • RBI and FP work better with random exploration; all methods work a little worse than SL on babi tasks
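
A minimal sketch of the exploration step and of the imitation filter behind RBI, assuming answers are plain strings and rewards are numeric. The function names and the `(context, answer, reward)` tuple layout are my own, not taken from the official Torch code.

```python
import random


def rbi_step(model_answer, candidate_answers, epsilon, rng):
    """Epsilon-greedy: keep the model's answer, or explore with probability epsilon."""
    if rng.random() < epsilon:
        return rng.choice(candidate_answers)
    return model_answer


def imitation_batch(episodes):
    """Keep only (context, answer) pairs that the teacher rewarded."""
    return [(ctx, ans) for ctx, ans, reward in episodes if reward > 0]
```

The learner is then trained with ordinary supervised updates on the output of `imitation_batch`, which is why random exploration helps: without it the model rarely stumbles on rewarded answers it does not already produce.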

Unsupervised Pretraining for Sequence to Sequence Learning [s2s]

Prajit Ramachandran, Peter J. Liu, Quoc V. Le
  • Two language models are trained to initialize weights of an enc-dec model on source and target corpus
  • Only 1 LSTM layer, the softmax of the decoder and the embeddings are pretrained; the model is then initialized with these plus one more randomly initialized LSTM layer
  • Additional losses added from pretraining objective to regularize the model to avoid overfitting on the small dataset
  • Residual connections from output of pretrained LSTM directly to softmax
  • Attention over the top and first layer; attention vector is passed to 2nd layer at each time step
  • Model gives much better results than baseline on low resource datasets
  • Pretraining only the encoder matters more for summarization, while pretraining only the decoder matters more for MT tasks

Deep Active Learning for Dialogue Generation [chat]

Nabiha Asghar, Pascal Poupart, Xin Jiang, Hang Li
  • Offline supervised learning of seq2seq model, followed by online active learning
  • Train sequentially on Cornell then on chatlogs, then comes online AL with real users, and learn incrementally from their feedback at each dialog turn
  • Model generates K responses using hamming-diverse beam search -> user selects the best one or suggests another response, which is then backpropagated using XENT loss and one-shot learning (really high learning rate), to immediately change the weights significantly
  • Diverse beam search penalizes similar beams
  • Trained to mimic different moods from user training (only needs 100 interactions to train)
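
The Hamming diversity penalty can be sketched for a single decoding step as follows. This is a simplification under my own assumptions: each group has beam width 1, every group sees a dict of token log-probabilities, and `diversity_strength` is the penalty weight.

```python
def hamming_diverse_pick(group_logprobs, diversity_strength):
    """Pick one token per group, penalizing tokens already used by earlier groups."""
    chosen = []
    counts = {}
    for logprobs in group_logprobs:
        best = max(
            logprobs,
            key=lambda tok: logprobs[tok] - diversity_strength * counts.get(tok, 0),
        )
        chosen.append(best)
        counts[best] = counts.get(best, 0) + 1
    return chosen
```

With three groups sharing the distribution `{"yes": -0.1, "sure": -0.5, "no": -1.0}`, a strength of 0 picks "yes" three times, while a strength of 1.0 yields "yes", "sure", "no", so the user is offered genuinely different candidates.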

RUBER: An Unsupervised Method for Automatic Evaluation of Open-Domain Dialog Systems [chat]

Chongyang Tao, Lili Mou, Dongyan Zhao, Rui Yan
  • Unsupervised, thus easy to use; referenced metric comparing embedding similarity of ground truth and generated reply combined with unreferenced metric that uses a neural network scorer to measure the relatedness between generated reply and its query
  • Cosine distance between ground truth and reply using max and min word embeddings
  • Query and reply vector computed with BiGRU, and a score assigned to them by a NN which is trained with negative sampling, by showing it bad responses
  • The 2 scores are combined in different ways: choosing the maximum does not work, but choosing the minimum or averaging the scores gives near human correlation
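
A rough sketch of the referenced metric and the score blending, assuming word embeddings are given as lists of floats and both scores are already on a comparable scale. The neural unreferenced scorer is not reproduced here; its output is simply passed in as a number.

```python
import math


def pool_max_min(word_vectors):
    """Concatenate per-dimension max and min over the words of a sentence."""
    dims = list(zip(*word_vectors))
    return [max(d) for d in dims] + [min(d) for d in dims]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def referenced_score(truth_vectors, reply_vectors):
    return cosine(pool_max_min(truth_vectors), pool_max_min(reply_vectors))


def combine(referenced, unreferenced, how="min"):
    """Blend the two scores; the notes report min and mean as working well."""
    if how == "min":
        return min(referenced, unreferenced)
    if how == "max":
        return max(referenced, unreferenced)
    if how == "mean":
        return (referenced + unreferenced) / 2
    raise ValueError(f"unknown strategy: {how}")
```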

Adversarial Learning for Neural Dialogue Generation [chat]

Jiwei Li, Will Monroe, Tianlin Shi, Sebastien Jean, Alan Ritter, Dan Jurafsky
Torch (official), Tensorflow
  • Generator seq2seq model, and discriminator labels dialogs as human or machine generated
  • Quality of machine generated utterances measured by its ability to fool the discriminator
  • Output of discriminator used as reward to the generator using REINFORCE algorithm
  • Discriminative model is a binary classifier: input is dialog encoded into a vector using a HRED
  • Discriminator updated together with generator, using human generated dialog as positive example, and machine-generated dialog as negative example
  • Improve model with reward for every generation step: in order to distinguish word level rewards
  • Monte Carlo search: A partially decoded seq. is finished (sampled) 5 times and fed to discriminator->average score used as reward for the partially dec. seq.
  • Some fraction of responses generated are human so that the generator doesn't get lost, and gets positive rewards sometimes to go the right way
  • Remove short training examples, weighted learning rate based on tf-idf, penalizing word types that have already been generated
  • Adversarial evaluation labels dialogs as machine or human generated, model should achieve 50% accuracy if human and machine dialogs are the same
  • Adversarial success is the fraction of instances in which a model fools the evaluator, the difference between 1 and evaluator accuracy
  • Achieve higher adversarial success than MMI seq2seq models (MC better than vanilla reinforce)
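
The Monte Carlo reward for a partial sequence and the adversarial success measure can be sketched like this. `complete_fn` (samples a full response from a prefix) and `discriminator_fn` (returns the probability that a dialog is human-written) are hypothetical callables standing in for the generator and discriminator.

```python
import random


def rollout_reward(prefix, complete_fn, discriminator_fn, n_rollouts=5, rng=None):
    """Average discriminator score over several sampled completions of a prefix."""
    rng = rng or random.Random(0)
    scores = [discriminator_fn(complete_fn(prefix, rng)) for _ in range(n_rollouts)]
    return sum(scores) / len(scores)


def adversarial_success(evaluator_accuracy):
    """Fraction of cases where the evaluator is fooled."""
    return 1.0 - evaluator_accuracy
```

The averaged score is then used as the REINFORCE reward for the token that ended the prefix, which gives every generation step its own signal instead of a single reward for the whole response.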

Hierarchical Recurrent Attention Network for Response Generation [chat]

Chen Xing, Wei Wu, Yu Wu, Ming Zhou, Yalou Huang, Wei-Ying Ma
Theano (official)
  • BiGRU encodes words, and calculates attention over them, then the utterance vector is used as input to utterance level encoder (backward GRU, because more recent utterance is more important), and utterance level attention is calculated over the utterances to form the context vector
  • Word level attention depends on both the hidden states of the decoder and hidden states of utterance level encoder

A Copy-Augmented Sequence-to-Sequence Architecture Gives Good Performance on Task-Oriented Dialogue [chat]

Mihail Eric, Christopher D. Manning
  • Seq2seq with attention and soft copy, only copy from the source entities of the knowledge base
  • Inputs augmented with entity type features, append one-hot class vectors to word embeddings
  • Really simple network outperforming more complex architectures

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer [n-c]

Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, Jeff Dean
  • Large models trained on huge datasets cost a lot computationally. Conditional computation increases model capacity at a lower cost: parts of the network are active or inactive on a per-example basis; from thousands of networks it chooses only a handful, where the gating vector is not zero
  • MoE consists of many experts, each a feed-forward NN (same architecture) and a trainable gating network that selects a sparse combination of the experts
  • Apply MoE convolutionally between stacked LSTM layers, different experts become highly specialized based on syntax and semantics
  • Gating is based on softmax, but with added tunable gaussian noise and only selecting top k values
  • Problem is that the batch size b shrinks to k*b/n for each expert if k experts are chosen out of n. By distributing the model to separate devices with separate batch updates but keeping the expert parameters shared we can factorize the size of the batch while updating the model synchronously
  • Apply MoE to all time steps of a previous LSTM layer convolutionally -> bigger batch size
  • Additional "importance" loss added to loss function of the model so that experts are equal -> coefficient of variation of the sum of batchwise gate values
  • Trained 2 LSTM layers with MoE between them; also tried hierarchical MoE, where each expert is a MoE as well
  • With same computational budget it achieves lower perplexity than simple LSTM models on language modelling
  • Also on WMT it achieves new state of the art with billions of parameters but similar training time as GNMT
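
A small sketch of noisy top-k gating and the importance loss in plain Python. It assumes the clean gate logits and the per-expert noise scales have already been computed from the input (in the paper these come from learned linear maps), and the loss weight of 0.1 is an arbitrary placeholder.

```python
import math
import random


def noisy_top_k_gate(clean_logits, noise_scales, k, rng):
    """Add scaled Gaussian noise, keep the top k logits, softmax over those only."""
    noisy = [l + rng.gauss(0.0, 1.0) * s for l, s in zip(clean_logits, noise_scales)]
    top = sorted(range(len(noisy)), key=lambda i: noisy[i], reverse=True)[:k]
    m = max(noisy[i] for i in top)
    exps = {i: math.exp(noisy[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(noisy))]


def importance_loss(batch_gates, weight=0.1):
    """Squared coefficient of variation of the per-expert sums of gate values."""
    importance = [sum(col) for col in zip(*batch_gates)]
    mean = sum(importance) / len(importance)
    var = sum((v - mean) ** 2 for v in importance) / len(importance)
    return weight * var / (mean ** 2)
```

Experts whose gate value is exactly zero never need to be evaluated, which is where the compute savings come from; the importance loss is zero only when all experts receive the same total gate mass over the batch.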

Generating High-Quality and Informative Conversation Responses with Sequence-to-Sequence Models [chat]

Louis Shao, Stephan Gouws, Denny Britz, Anna Goldie, Brian Strope, Ray Kurzweil
  • Target-side attention into decoder so it can keep track of what has been output so far, to generate longer coherent responses
  • This is memory-intensive, trade-off is the glimpse model, which interpolates between source-side-only attention on the encoder and source and target-side attention on the encoder and decoder, done with fixed-length glimpses from target side, and source + part of target seq. before the glimpse on encoder
  • Rerank beams segment by segment, injecting diversity early, and integrate sampling into beam-search making it stochastic
  • The model produces longer responses that are also more coherent, but for shorter responses they choose to fall back to the baseline without length norm.

A Knowledge-Grounded Neural Conversation Model [chat]

Marjan Ghazvininejad, Chris Brockett, Ming-Wei Chang, Bill Dolan, Jianfeng Gao, Wen-tau Yih, Michel Galley
  • Condition responses based on conversation history and external facts (amazon, wikipedia) relevant to current context
  • NER is used for example to make a query to retrieve facts; these are fed into a fact encoder -> this summed with conversation encoder are fed into decoder
  • Fact encoder is similar to memory network, retrieves and weights facts based on user input and conversation history
  • Multitask learning: first task is conversational, pure enc-dec model trained; second task exposes the full model to facts as well; third task is similar to autoencoder, it uses facts for both encoders
  • Twitter dataset with mentions of local business, augmented with facts (foursquare tips): many contextually relevant facts -> filter them with tf-idf, retain 10 tips
  • They use beam search and N-best lists reranking based on MMI
  • The results are somewhat more diverse than baseline seq2seq

Batch Policy Gradient Methods for Improving Neural Conversation Models [chat]

Kirthevasan Kandasamy, Yoram Bachrach, Ryota Tomioka, Daniel Tarlow, David Carter
  • Reward only after agent has reached terminal state, aim is to find a policy (with gradient methods) that does well with respect to the data distribution
  • Value function to get the expected reward if we follow a stochastic policy
  • Action-value function: for the expected reward of taking an action at a state following a specific policy
  • Training data are (state, action, reward) triples from customer service conversations: the encoder input is the state, the decoder output is the action, and the reward is a quality score of the conversation
  • Use the convex combination of re-weighted future rewards for estimating the action-value function
  • Estimate the value function based on an LSTM parameterization (hidden state of bottom layer in enc-dec) of state representation, but constant estimation of value function gives almost the same results
  • 2 layer enc-dec, with batch RL; RL only changes top LSTM layer of decoder and softmax
  • Europarl dataset, bootstrap with more unlabeled data with MLE objective, then train on smaller labeled data with RL (works if RL and MLE have some overlap)

Hybrid Code Networks: practical and efficient end-to-end dialog control with supervised and reinforcement learning [chat]

Jason D. Williams, Kavosh Asadi, Geoffrey Zweig
Tensorflow (official)
  • RNN; domain-specific software and action templates (text or API call) and a conventional entity extraction module
  • Utterance is featurized in 1.bag of words, 2.embedding, 3.entity extraction and these are passed to RNN, output is action template
  • Best results on babi dialog tasks 5 and 6, and other toy examples…

Learning Conversational Systems that Interleave Task and Non-Task Content [chat]

Zhou Yu, Alan W Black, Alexander I. Rudnicky
Python (official)
  • Utterance is fed into a non-task response generator and a language understanding module, which encodes it for the task response generator; then a response selection policy (using RL) chooses among all of the candidates from the 2 generators, and the response is fed back into the system
  • Language understanding module: based on simple key-word matching because user responses are usually yes / no
  • Task response generator: 8 pre-defined templates about movie promotion considering the info from language module
  • Non-task response generator: 3 methods used (no RNN), keyword retrieval, skip-thought vector, statistical templates based conversation strategies
  • Q-learning used to optimize towards long-term coherence, consistency, variety and continuity
  • Constraints based on conversational data and expert rules applied to reduce number of states
  • Reward function based on 4 weighted metrics: turn-level appropriateness, conversation depth, information gain, conversation length
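
As a rough illustration of how a weighted reward feeds into Q-learning, here is a tabular sketch. The metric weights, the state and action encodings, and the learning rate and discount are all placeholders of mine; the paper's actual values and state design are not reproduced.

```python
def weighted_reward(metrics, weights):
    """Linear combination of per-turn metrics, e.g. appropriateness or depth."""
    return sum(weights[name] * metrics[name] for name in weights)


def q_update(q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move Q(s, a) toward r + gamma * max Q(s', a')."""
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]
```

Here `actions` would be the candidate response strategies produced by the two generators, and the constraints mentioned above keep the state table small enough for a tabular method to be practical.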

Learning Discourse-level Diversity for Neural Dialog Models using Conditional Variational Autoencoders [chat]

Tiancheng Zhao, Ran Zhao, Maxine Eskenazi
Tensorflow (official)
  • Conversation representation with 3 random variables: dialog context c, response utterance x, latent variable z, which captures latent distr. over valid responses
  • Generative process: sample a latent variable z from the prior network -> generate x through response decoder
  • Training done with stochastic gradient variational bayes that maximizes the variational lower bound of the conditional log likelihood log p(x|c)
  • Utterance encoder is BiRNN, context enc. and response dec. is 1-layer GRU; samples of z obtained by the recognition (training) or the prior network (testing)
  • Easier to train CVAE with explicitly extracted discourse features y (e.g. dialog acts) -> this is the knowledge-guided CVAE, x relying on c, z, y; and y relies on c and z.
  • Tackling vanishing latent variable problem with bag-of-word loss (decoder has to generate a bag of words representation as well through an MLP)
  • Better than a VHRED baseline, more diverse responses; latent variable is correlated with dialog acts and response length
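
The two numerical pieces that CVAE training relies on can be sketched in plain Python: the KL term between the recognition and prior networks, and the reparameterized sample of z. I assume both networks output diagonal Gaussians as mean and log-variance lists; the function names are mine.

```python
import math
import random


def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) for diagonal Gaussians given means and log-variances."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


def sample_z(mu, logvar, rng):
    """Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, logvar)]
```

The training loss is then the reconstruction loss plus this KL term, with the bag-of-words loss added on top so that the decoder cannot simply ignore z and drive the KL to zero.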

Emotional Chatting Machine: Emotional Conversation Generation with Internal and External Memory [chat]

Hao Zhou, Minlie Huang, Tianyang Zhang, Xiaoyan Zhu, Bing Liu
  • Seq2seq framework with emotion category embedding, internal implicit emotion memory, and external explicit memory
  • P(Y|X,e), where e is one of 6 emotion categories, we embed this and feed into decoder
  • Internal memory: capture emotion dynamics, each emotion is decaying during decoding, because it is read and written (by the GRU) at each step to the memory
  • External memory: the model can choose between words from a generic or an emotion vocab (separate softmaxes)
  • Regularization: emotion state in internal memory should decay to zero at the end of decoding; there is another term for constraining the external memory
  • Emotion category annotation obtained with bi-lstm emotion classifier (62.3% acc.)
  • ECM model obtains better perplexity (without external memory) and emotional accuracy and better human rating than base seq2seq

Not All Dialogues are Created Equal: Instance Weighting for Neural Conversational Models [chat]

Pierre Lison, Serge Bibauw
  • Associate each context and response pair with a numerical weight that reflects the quality, then these weights are included in the loss function of a neural model
  • Weights computed via a neural model learned from dialog data, positive (high quality) and negative examples (quality can be coherent and interesting response)
  • Weighting model has 2 sub-networks for context and response tokens, then it produces an embedding together with other quality features, and then a score
  • Tf-idf and dual encoder models are investigated with the new loss (retrieval models), dual encoder with weighting loss produced best results on recall
  • Open subtitles dataset used, lemmatised and pos-tagged, and names replaced by NER with tokens

Chat Detection in an Intelligent Assistant: Combining Task-oriented and Non-task-oriented Spoken Dialogue Systems [chat]

Satoshi Akasaki, Nobuhiro Kaji
  • Decide whether a dialog act is chat or non-chat (task) in order to better integrate chat generators like seq2seq into intelligent assistants
  • They constructed a dataset from yahoo voice with 15k utterances labeled as chat or non-chat (many sentences can be both)
  • Two binary classifiers used
    • SVM using character and word n-gram features and skip-gram word embeddings
    • CNN with word embeddings pre-training
    • Character-based tweet and query GRU enhances these 2 classifiers by training on twitter and yahoo search queries (concatenated as vector)
  • SVM+embed+tweet+queryGRU performs the best, 87.5% F1 score.

Convolutional Sequence to Sequence Learning [s2s]

Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, Yann N. Dauphin
Torch (official), Tensorflow, Chainer
  • Convolutions can be well parallelized, and conv layers create hierarchical representations
  • Embed inputs into matrix, and concatenate them with a position embedding to maintain order; proceed similarly with outputs
  • Convolutional block structure: each layer contains 1D conv and a non-linearity; 6 blocks with kernel width 5 mean that the input field consists of 25 elements
  • Output is twice the size of input, but gated linear units are used to reduce it back to the same size
  • Residual connections from input of each convolution to the output of the block; also pad the input at each layer in encoder network
  • Multi-step attention: for each decoder layer combine the current decoder state with an embedding of previous target element, and then compute dot product between this and each output of the last encoder block
  • Conditional input to current decoder layer is an attention weighted sum of the encoder outputs and input element embeddings; this is added to output of corresponding decoder layer to get final predictions; this considers which words we previously attended to and can be seen as attention with multiple hops
  • Normalization by scaling conditional inputs by the number of vectors, and scale gradient for the encoder layers by the number of attention mechanisms, and apply dropout to embedding, decoder outputs and to the input of the convolutional blocks
  • Datasets are WMT translation -> better BLEU results than GNMT
  • Grid search over kernel width and encoder/decoder layer depth shows that a narrow kernel and a deep network is the best
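
Two of the points above are easy to check in a few lines: the gated linear unit that halves the doubled convolution output, and the receptive field of stacked blocks. This is a plain-Python sketch on lists of floats, not the fairseq implementation.

```python
import math


def glu(values):
    """Gated linear unit: first half of the vector gated by sigmoid of the second half."""
    half = len(values) // 2
    a, b = values[:half], values[half:]
    return [x * (1.0 / (1.0 + math.exp(-g))) for x, g in zip(a, b)]


def receptive_field(num_blocks, kernel_width):
    """Input elements seen by one output of stacked, undilated convolutions."""
    return 1 + num_blocks * (kernel_width - 1)
```

`receptive_field(6, 5)` gives 25, matching the figure quoted above, and `glu` maps a vector of size 2d back to size d.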

A Conditional Variational Framework for Dialog Generation [chat]

Xiaoyu Shen, Hui Su, Yanran Li, Wenjie Li, Shuzi Niu, Yang Zhao, Akiko Aizawa, Guoping Long
  • Generate next response based on dialog context (modeled separately for both speakers), a stochastic latent variable and an external label
  • The model is HRED with separated context models: encoder RNN for tokens, and 2 status RNN for each speaker utterance
  • Variational auto-encoders used conditioned on context concatenation provided by SPHRED and an additional class label (ex: generic or non-generic response)
  • The class label can be unknown, in which case a classifier is implemented to first predict it from the context vector
  • VAE produces the latent variable for the HRED and the posterior distribution of latent variable approximated based on context and class label
  • Dataset used is ubuntu dialog corpus, gradually more and more focus on latent variable as the training goes on; results are similar to VHRED

Adversarial Generation of Natural Language [n-c]

Sai Rajeswar, Sandeep Subramanian, Francis Dutil, Christopher Pal, Aaron Courville
  • With GANs sequence-level training objective can be incorporated together with curriculum learning on the length of the sequence using a generator (G) and discriminator (D) network
  • Regular GAN objective function is hard to train (unstable/vanishing gradients), which Wasserstein GANs (WGAN) alleviate -> better at assuring that the (D) objective won't only exploit the difference between the sparsity of 1-hot vectors and continuous output predictions
  • (G) is provided with a noise matrix at each time step, transforming it into a sequence of probability distributions over the vocab
  • (G) and (D) model variants:
    • (G) LSTM with peephole connection between output and previous hidden state; (D) LSTM uses binary logistic regression on last hidden state
    • Same 1-D convolutional residual blocks for both (G) and (D)
  • For evaluation of GANs the likelihood of the sample under the true data distribution is used. Datasets are toy CFG, PCFG and Chinese poetry and Penn treebank
  • Conditional generation is also explored with question and positive/negative sentiment attributes added as feature vectors to each conv layer (no LSTM)

ParlAI: A Dialog Research Software Platform [chat]

Alexander H. Miller, Will Feng, Adam Fisch, Jiasen Lu, Dhruv Batra, Antoine Bordes, Devi Parikh, Jason Weston
  • Software platform that provides a unified framework for training and testing dialog models with over 20 tasks (datasets) supported and example models
  • World, agent and teacher classes in python to handle the training of a dialog model in some environment
  • 5 Task categories: QA, Sentence Completion, Goal-Oriented Dialog, Chit-Chat, Visual Dialog
  • Models: HRED, IR, Memory NN, seq2seq
  • Seamless integration with mechanical turk for data collection, training and evaluation

Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols [chat]

Serhii Havrylov, Ivan Titov
  • There is a sampled image and some distracting images -> sender agent has to formulate NL message such that it helps the receiver to choose the sampled image
  • Sender (LSTM) only sees sampled image, while receiver (LSTM) sees all images and the message which is a sequence of symbols (strings)
  • Learning by using straight-through Gumbel-softmax estimators which is more efficient than RL methods
  • Sender's inputs are features extracted by a CNN from the target image; it has to sample one token at a time from the categorical distribution
  • The sampling sender agent is non-differentiable thus RL has to be used like the REINFORCE algorithm upgraded with a categorical distribution with continuous relaxation obtained from the Gumbel-softmax distribution; as time goes on samples from the distribution are becoming more one-hot encoded
  • In the forward pass the relaxation is discretized so that it resembles NL, and in the backward pass we use the gradient of continuous relaxation
  • Images from MS COCO dataset, output of the relu7 layer of VGG used; REINFORCE achieves 87% success, while GS-ST achieves 97%
  • Inspecting message symbols shows that a hierarchical language emerged by describing categories, but forcing a language model with KL halves the success rate
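
The straight-through Gumbel-softmax sample can be sketched without a deep learning framework. The function below only produces the two vectors involved; in an autodiff library the one-hot vector would be used in the forward pass while gradients flow through the relaxed one. The temperature value and function name are my own choices.

```python
import math
import random


def gumbel_softmax_st(logits, temperature, rng):
    """Return (one-hot sample, relaxed sample) for one categorical draw."""
    noisy = []
    for logit in logits:
        # clamp to avoid log(0) at either end of the unit interval
        u = min(max(rng.random(), 1e-10), 1.0 - 1e-10)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    m = max(noisy)
    exps = [math.exp(x - m) for x in noisy]
    total = sum(exps)
    relaxed = [e / total for e in exps]
    winner = relaxed.index(max(relaxed))
    hard = [1.0 if i == winner else 0.0 for i in range(len(relaxed))]
    return hard, relaxed
```

Lowering `temperature` pushes the relaxed vector toward the one-hot one, which corresponds to the samples becoming more one-hot as training goes on.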

Depthwise Separable Convolutions for Neural Machine Translation [s2s]

Łukasz Kaiser, Aidan N. Gomez, François Chollet
Tensorflow (official)
  • SliceNet inspired by Xception network based on depthwise separable convolution layers with residual connections applied to MT tasks
  • Depthwise conv is a spatial conv performed independently over every input channel followed by pointwise conv projecting to a new channel space
  • DSCNN uses far fewer parameters than a regular CNN; super-SC uses even fewer parameters by splitting the input into groups along the depth, then applying separable conv to each group separately, and then concatenating the results along the depth
  • With DSCNN we can use larger filter windows, thus we don't have to use filter dilation
  • Autoregressive decoder produces new output prediction given encoded input and encoding of all existing predicted outputs (not just previous!)
  • Both encoders and decoder use convolutional modules composed of stacking conv steps with residual connection; one conv-step consists of a ReLU->SepConv->layer norm.
  • Attending is performed by adding a timing signal to the targets (encoding positional info) then doing 2 conv-steps and then attending to source by computing feature vector similarities between source and target
  • Beats GNMT in WMT english to german by 0.1 BLEU
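
The parameter savings can be checked with simple arithmetic. The sketch assumes 1-D convolutions with kernel width k, the same number of input and output channels c, no bias terms, and g equally sized groups for the super-separable case.

```python
def regular_conv_params(k, c):
    """Every output channel has a width-k filter over every input channel."""
    return k * c * c


def separable_conv_params(k, c):
    """Depthwise (k weights per channel) followed by pointwise (c x c) projection."""
    return k * c + c * c


def super_separable_conv_params(k, c, g):
    """Separable conv applied independently to g groups of c / g channels each."""
    per_group = c // g
    return g * (k * per_group + per_group * per_group)
```

For k = 3 and c = 512 this gives 786432, 263680 and 132608 (with g = 2) parameters respectively, and because the depthwise term grows only linearly in k, wider filter windows stay cheap.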

Generative Encoder-Decoder Models for Task-Oriented Spoken Dialog Systems with Chatting Capability [chat]

Tiancheng Zhao, Allen Lu, Kyusong Lee, Maxine Eskenazi
  • Framework: 1. entity indexing, 2. slot-value independent enc-dec, 3. utterance lexicalization by replacing special tokens with NL
  • NER is used to detect entities and convert them to indexes, then enc-dec predicts next utterance using KB query
  • Each utterance encoded by a CNN, then enc LSTM reads the utterances, and dec LSTM generates output based on attention over enc LSTM states as well
  • Task oriented dialog dataset augmented by inserting utterance-response pairs from chit-chat style dataset; system first answers the chit-chat style question and then repeats its previous task oriented question
  • System tested on bus schedule dataset, achieves around 70-80% success rate for finding a good bus schedule between locations

Personalization in Goal-Oriented Dialog [chat]

Chaitanya K. Joshi, Fei Mi, Boi Faltings
Tensorflow (official)
  • Make a restaurant reservation that is personalized to the user's attributes / preferences with memory networks
  • Simulated dialogs made on restaurant reservation task with api calls, but with added personalization attribute values before first dialog turn
  • Augment bAbI tasks with personalization of the bot's language style based on user's gender and age, adding 6 patterns of same dialog for different styles
  • Other personalization is based on vegetarian / non-vegetarian, adding that to restaurant types KB
  • Rule-based (which should perform 100%), supervised embedding and memory network models investigated in retrieval style dialog
  • Supervised embeddings were very bad, while memory networks were almost 100% for first two tasks, but only 60% for KB tasks
  • Figure: the first 5 original bAbI dialog tasks

Attention Is All You Need [s2s]

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin
Tensorflow (official), Tensorflow, Pytorch, Chainer
  • Encoder makes a continuous representation of symbols, then autoregressive decoder takes these and generates one output at a time, consuming previously generated outputs as inputs for the next one
  • Transformer is based on this using stacked self-attention and position-wise, fully connected layers for both enc and dec
  • Encoder:
    • 6 identical layers, each made up of two sub-layers
    • First is a multi-head self-attention mechanism and second is a feed-forward network
    • Residual connection around each sub-layer followed by layer norm.
  • Decoder:
    • 6 identical layers, with same sub-layers as encoder plus a third one
    • Third sub-layer performs multi-head attention over the output of the encoder stack
    • Self-attention is masked compared to encoder to prevent positions from attending to subsequent positions
  • Scaled (to counteract large vector dimensions) dot-product attention is used over set of queries, keys and values
  • Multi-head attention is used by applying scaled dot-product attention to different linear mappings of queries, keys and values; the outputs from the attention layers are concatenated and once again projected
  • Transformer attention:
    • In enc-dec attention layers (middle) the queries come from previous decoder layer, and the memory keys and values from output of encoder
    • In encoder self-attention layers keys, values and queries all come from output of previous layer in the enc
    • Decoder self-attention layers allow each position in the dec to attend to all positions in the dec up to and including that position
  • Position wise feed forward layers are similar to two convolutions with kernel size 1, parameters are shared between positions, but are different between layers
  • Positional encodings:
    • Added to the input and output embeddings at the bottom of encoder and decoder stacks
    • Sine and cosine functions of different frequencies based on position
  • Trained on WMT english to german and english to french, using word-piece vocab; outperforms previous state-of-the-art ensemble models
  • Dropout is applied to the output of each sub-layer and to the sums of the embeddings and the positional encodings, as well as to the attention weights
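
Scaled dot-product attention with the decoder's causal mask, plus the sinusoidal positional encoding, can be written in a few lines of plain Python. This is a single-head sketch on lists of floats, without the learned projections, multi-head concatenation or dropout.

```python
import math


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, causal=False):
    """softmax(Q K^T / sqrt(d_k)) V; with causal=True position i ignores j > i."""
    d_k = len(keys[0])
    outputs = []
    for i, q in enumerate(queries):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        if causal:
            scores = [s if j <= i else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs


def positional_encoding(position, d_model):
    """Sine on even dimensions, cosine on odd ones, with geometric wavelengths."""
    pe = []
    for i in range(0, d_model, 2):
        angle = position / (10000 ** (i / d_model))
        pe.extend([math.sin(angle), math.cos(angle)])
    return pe[:d_model]
```

With `causal=True` the first position can only attend to itself, so its output equals the first value vector, which is exactly the masking behaviour described for the decoder self-attention.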

One Model To Learn Them All [s2s]

Łukasz Kaiser, Aidan N. Gomez, Noam Shazeer, Ashish Vaswani, Niki Parmar, Llion Jones, Jakob Uszkoreit
Tensorflow (official)
  • MultiModel trained simultaneously on WMT, ImageNet, WSJ speech and parsing corpus and COCO image captioning dataset
  • Modality nets convert images, speech, text and categorical data into joint representation space which is variable size
  • 3 basic types of blocks for encoder and decoder:
    • Convolutional blocks: 4 x (ReLU over inputs -> depthwise separable CNN -> layer norm.) with residual connections and dropout at the end of block
    • Attention blocks: multi-head dot product mechanism with source and target inputs
      • Target (composed with timing signal from sine and cosine curves and mixed using 2 conv blocks) is self-attended
      • Source passed through 2 pointwise convolutions to generate memory keys and values
      • Finally the query keys, memory keys and values are used to apply attention between self-attended target and source
    • Mixture-of-Experts blocks: feed-forward networks (experts) and a trainable gating network selecting a sparse combination of experts
  • Encoder encodes inputs to encoded inputs, which together with previously computed outputs are passed to I/O mixer which computes encoded outputs, which together with encoded inputs are passed to autoregressive (left-padded) decoder to generate outputs
  • Modality nets:
    • Different tasks from same domain share modality nets; a special token embedding is learnt for differentiating between tasks
    • Language mod net: tokenized using same vocab of 8k sub-word units
    • Image mod net: number of residual convolutional steps applied
    • Category mod net: output modality by applying conv steps to get the 1D category
    • Audio mod net: 1D waveform or 2D spectrogram transformed with 8 residual convolution blocks
  • MultiModel performs 10-20% below state-of-the-art on WMT and ImageNet
  • Accuracy increases slightly for all tasks when trained jointly on 8 tasks compared to training separately on each task
  • Excluding any of the 3 types of blocks reduces performance on all tasks

Natural Language Does Not Emerge ‘Naturally’ in Multi-Agent Dialog [chat]

Satwik Kottur, José M.F. Moura, Stefan Lee, Dhruv Batra
PyTorch (official)
  • Task & talk game: between Q-bot and A-bot in a world with 64 unique objects
  • A-bot sees an object that Q-bot doesn't and Q-bot has to discover two attributes of this object through dialog, then Q-bot guesses the attributes and both bots receive reward on how good the guess was (RL)
  • Model Q-bot and A-bot as operating under stochastic policies which are LSTM-based models
  • Q-bot has a listener LSTM encoder, a speaker fully connected layer and a prediction LSTM network
  • Dialog is done through speaker networks and listener LSTMs of Q-bot and A-bot, the final prediction LSTM is based on previous state and the task encoding
  • REINFORCE algorithm used, estimate reward expectation by sample averages (environment, dialog)
  • Agents usually invent a language to solve the game near perfectly, but this language is not compositional, interpretable or natural
    • Overcomplete vocabularies: A-bot learns to convey each attribute with separate symbol, generalizes very badly
    • Attribute-value vocabulary: limiting the vocab leads to better generalization but still doesn't yield compositionality
    • Memoryless A-bot: resetting the state of A-bot at each dialog round and further reducing its vocab leads to a consistent and compositional language

Deal or No Deal? End-to-End Learning for Negotiation Dialogues [chat]

Mike Lewis, Denis Yarats, Yann N. Dauphin, Devi Parikh, Dhruv Batra
PyTorch (official)
  • Dataset constructed (6k dialogs) with Mturk task, where humans are given items to negotiate, who gets what
    • The items' values are not the same for the 2 participants
    • Same task for agents, where they get a reward if they reach an agreement
  • First a seq2seq is trained to generate the dialog given the input items and values (goal)
    • There is an input encoder RNN, a dialog generator RNN
    • After dialog is generated an output RNN predicts the output agreement (who gets what), based on input goal and dialog
  • After pretraining with SL, self-play is used, but one agent is fixed, since training both led to divergence from human language
    • During RL, the dialog generator acts both as encoder for the other agent's utterance and as response generator
  • Rollout is used for a better decoding tactic
    • Agents rollout several utterances until the end of the dialog, and select the utterance that gets highest reward
  • After each RL update an SL update is made
  • Evaluation with humans shows that the simple SL model reaches an agreement more often, but doesn't reach an optimal solution as often as the RL model using rollouts
    • RL+ROLLOUTS negotiates harder, resulting in more turns
  • Evaluation with an SL agent is much better than with humans, meaning that the RL agent overfitted to the SL agent scenario
  • They take inspiration from AlphaGo, and propose to scale tree search to dialog modelling as future work
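
The rollout decoding idea above can be sketched as follows: for each candidate utterance the rest of the dialog is simulated a few times, and the candidate with the best reward is selected. Averaging over several rollouts, the `simulate_rest` callback and the toy reward table are assumptions made for illustration; they are not taken from the official implementation.

```python
import random

def choose_by_rollout(candidates, simulate_rest, n_rollouts=5, seed=0):
    """Pick the candidate utterance with the highest average simulated reward.

    simulate_rest(utterance, rng) is a hypothetical function that plays the
    dialog to its end after `utterance` and returns the final reward.
    """
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for utt in candidates:
        # Average the reward over several sampled continuations of the dialog.
        score = sum(simulate_rest(utt, rng) for _ in range(n_rollouts)) / n_rollouts
        if score > best_score:
            best, best_score = utt, score
    return best, best_score

# Toy simulator: assumed expected rewards plus a little noise.
EXPECTED = {"i want the book": 6.0, "you can have everything": 1.0}

def toy_simulator(utt, rng):
    return EXPECTED[utt] + rng.uniform(-0.5, 0.5)

choice, score = choose_by_rollout(list(EXPECTED), toy_simulator)
```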

DeepProbe: Information Directed Sequence Understanding and Chatbot Design via Recurrent Neural Networks [chat]

Zi Yin, Keng-hao Chang, Ruofei Zhang
  • Apply seq2seq model to rewrite user question into one that a recommendation system understands and use seq2seq model to score and pick best candidates
  • Bidirectional encoder LSTM, and attention at top layer of decoder LSTM (we attend to top layer of encoder)
  • Entropy to measure the confidence of an agent in whether it should recommend an item based on the previous dialog
  • They propose a greedy uncertainty-reduction algorithm to maximize expected information gain at each step based on mutual information and a set of questions that the chatbot can ask the user
  • By estimating the posterior distribution the model can rank returned items by the IR in order of relevance to the query
  • The chatbot either asks a sampled question or makes a recommendation based on confidence from entropy value
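
A minimal sketch of the entropy-based decision and the greedy uncertainty-reduction step described above. It assumes a small discrete set of items, a known table P(answer | item) for every question, and an arbitrary entropy threshold; all of these are illustrative choices rather than details from the paper.

```python
import math

def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def posterior(prior, likelihood, answer):
    """Bayes update of the item distribution after observing `answer`.

    likelihood[i][answer] = P(answer | item i) for the asked question.
    """
    joint = [prior[i] * likelihood[i][answer] for i in range(len(prior))]
    z = sum(joint)
    return [j / z for j in joint]

def expected_info_gain(prior, likelihood):
    """Mutual information between the answer to a question and the item."""
    gain = entropy(prior)
    for a in range(len(likelihood[0])):
        p_a = sum(prior[i] * likelihood[i][a] for i in range(len(prior)))
        if p_a > 0:
            gain -= p_a * entropy(posterior(prior, likelihood, a))
    return gain

def next_action(prior, questions, threshold=0.5):
    """Recommend if confident enough, otherwise ask the most informative question."""
    if entropy(prior) < threshold:
        return ("recommend", max(range(len(prior)), key=lambda i: prior[i]))
    best = max(questions, key=lambda q: expected_info_gain(prior, questions[q]))
    return ("ask", best)

prior = [0.25, 0.25, 0.25, 0.25]
questions = {
    "splits_in_half": [[1, 0], [1, 0], [0, 1], [0, 1]],
    "uninformative": [[0.5, 0.5]] * 4,
}
action = next_action(prior, questions)
```

With a uniform prior the agent is not confident yet, so it asks the question that halves the candidate set.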

Enterprise to Computer: Star Trek chatbot [chat]

Grishma Jena, Mansi Vashisht, Abheek Basu, Lyle Ungar, Joao Sedoc
  • The bot consists of two seq2seq models to handle star trek style (trained on star trek dataset) input and everyday conversations (trained on cornell dataset)
  • When confidence in a response is low rule-based outputs are used
  • Trained on twitter dataset to binary classify to choose which seq2seq model to use
  • Word graph algorithm is used to insert star trek specific words into responses -> this could cause ungrammatical sentences so consequently a bigram LM is used to select between candidate sentences
  • It's compared to the pandora rule-based bot, and achieves better coherence and star trek style scores

Domain Aware Neural Dialog System [chat]

Sajal Choudhary, Prerna Srivastava, Lyle Ungar, Joao Sedoc
  • Domain specific seq2seq followed by a re-ranker to predict the most likely response and domain combination (which is fed back into domain classifier)
  • Utterance is fed into domain classifier as well as into multiple separately trained domain specific seq2seq (with attention)
  • Domain classifier is composed of an SVM with logistic regression or an RNN with one-hot input vector representing subsequent domains in the conversation
  • Reddit dataset with 3 domain categories, and another model trained on twitter dataset for out of domain queries
  • Logistic regression over previous domain categories coupled with SVM performed the best, beating a simple seq2seq model

A Continuous Relaxation of Beam Search for End-to-end Training of Neural Sequence Models [s2s]

Kartik Goyal, Graham Neubig, Chris Dyer, Taylor Berg-Kirkpatrick
  • Models trained with max likelihood objective can't take into account the benefits from beam search, they yield better performance with greedy decoding
  • Hamming loss evaluated on output of beam search, but to make it continuous we approximate the beam search decoding
  • The approximation is achieved by relaxing the objective function with a parameter to become more and more like actual loss function based on beam decoding
  • Make decoding soft by approximating argmax by a temperature controlled softmax
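
The last point can be sketched in a few lines: dividing the scores by a temperature before the softmax gives a distribution that approaches a one-hot argmax as the temperature goes to zero, while staying differentiable. The scores and temperatures below are arbitrary example values.

```python
import math

def soft_argmax(scores, temperature):
    """Temperature-controlled softmax: a differentiable stand-in for argmax."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

scores = [2.0, 1.0, 0.5]
smooth = soft_argmax(scores, temperature=1.0)   # spread over all entries
peaked = soft_argmax(scores, temperature=0.01)  # almost one-hot on the max
```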

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses [chat]

Ryan Lowe, Michael Noseworthy, Iulian V. Serban, Nicolas Angelard-Gontier, Yoshua Bengio, Joelle Pineau
  • Train a hierarchical RNN with a dataset of dialogues and corresponding human scores
  • Several models used to generate varied responses for this dataset
  • Encoder learns vector representations of the context, the model response and the reference response; the score is computed from dot products between these vectors after multiplication with learnt matrices
  • Model is trained to minimize squared error between prediction and human score
  • Model is pre-trained as a dialogue model (VHRED), sub-words and layer normalization used in encoder
  • ADEM (name of the model) correlates somewhat better with human judgment both at response and system levels
  • It can also generalize to new models, even if it was trained on only retrieval based models it can test a generative model

Cold Fusion: Training Seq2Seq Models Together with Language Models [s2s]

Anuroop Sriram, Heewoo Jun, Sanjeev Satheesh, Adam Coates
  • Deep fusion, a similar earlier model, combined seq2seq and LM hidden states to form the output, but the two were trained separately; the seq2seq decoder therefore also had to learn a language model specific to its training data, which makes it hard to transfer to other tasks
  • In cold fusion seq2seq model is trained together with a fixed pre-trained LM
  • A gating mechanism (neural network) chooses how to combine the LM logits and the seq2seq states to get the final prediction
  • They experimented on speech recognition task, and the LM was an RNN
  • Trained on search query database, achieved 12% word error rate, and then applied to movie subtitles achieved 28% word error rate, better than basic seq2seq
  • Further fine tuning / training on 10% of the movie subtitles gets the word error rate close to a basic seq2seq trained on 100% of movie subtitles dataset

Training RNNs as Fast as CNNs [n-c]

Tao Lei, Yu Zhang
PyTorch (official)
  • Simple Recurrent Unit that is 10x faster (same speed as CNN) than LSTM
  • They use skip-connections for computing the final output of an RNN, and dropout on the inputs
  • To make it less recurrent: drop connection between previous state and neural gates at current step (to compute current state we still use previous state, but this is only an element-wise computation)
  • Validated on question answering, language modelling and machine translation achieving similar accuracy as LSTM, but trained much faster
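
A minimal sketch of one Simple Recurrent Unit step, written from my recollection of the paper's equations, so treat the exact form as an assumption. The point it illustrates is the one made above: the gates depend only on the current input, so the previous state enters only through an element-wise update. Input and state are assumed to have the same size, and the weights are toy values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(w, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def sru_step(x, c_prev, w, w_f, b_f, w_r, b_r):
    """One SRU step for plain Python lists.

    The matrix products use only x, so they can be computed for all time
    steps in parallel. Only the element-wise update of c is sequential.
    """
    x_tilde = matvec(w, x)
    f = [sigmoid(a + b) for a, b in zip(matvec(w_f, x), b_f)]  # forget gate
    r = [sigmoid(a + b) for a, b in zip(matvec(w_r, x), b_r)]  # reset gate
    c = [fi * cp + (1 - fi) * xt for fi, cp, xt in zip(f, c_prev, x_tilde)]
    # Skip connection: mix the activated state with the raw input.
    h = [ri * math.tanh(ci) + (1 - ri) * xi for ri, ci, xi in zip(r, c, x)]
    return h, c

identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
h, c = sru_step([1.0, -1.0], [0.0, 0.0], identity, zeros, [0.0, 0.0], zeros, [0.0, 0.0])
```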

A Deep Reinforcement Learning Chatbot [chat]

Iulian V. Serban, Chinnadhurai Sankar, Mathieu Germain, Saizheng Zhang, Zhouhan Lin, Sandeep Subramanian, Taesup Kim, Michael Pieper, Sarath Chandar, Nan Rosemary Ke, Sai Rajeshwar, Alexandre de Brebisson, Jose M. R. Sotelo, Dendi Suhubdy, Vincent Michalski, Alexandre Nguyen, Joelle Pineau, Yoshua Bengio
  • An ensemble model that uses several machine learning models (trained separately on datasets, then jointly with reinforcement learning with user interactions) and few hand-crafted rules
  • Takes the dialogue history as input; each model outputs a candidate response, from which a priority response is chosen if there is one, otherwise all candidates are scored to select the best
  • Template-based models:
    • Alicebot (string-matching)
    • Elizabot (engaging using more questions)
    • Initiatorbot (asks hand-written starting questions)
    • Storybot (triggered by user, returns a story; non-conversational)
  • Knowledge-base based QA:
    • Evibot: forward question to amazon QA
    • BoWMovies: handles questions in the movie domain by recognizing entities and tags via string matching or word embeddings
  • Retrieval-based neural networks:
    • VHRED: candidate responses retrieved based on cosine similarity, then likelihood of each one is computed by VHREDs (trained on separate datasets)
    • SkipThought vector models: handles trigger phrases (with keywords); ensures that the bot follows the alexa prize rules (bot shouldn't state its opinion)
    • Dual encoder models: encoding dialog history and candidate responses and selecting the best
    • BoW: retrieve response with highest cosine similarity from several reddit and twitter topics
  • Retrieval-based logistic regression:
    • BoWEscapePlan: returns from a set of 35 generic responses based on a logistic classifier
  • Search Engine-based neural networks:
    • LSTM classifier: chooses responses from a set of search engine results (trained as binary classifier to choose relevant search snippets)
  • Generation-based neural networks:
    • GRUQuestionGenerator: generates question conditioned on dialog history (start of question is template-based)
  • Model selection policy:
    • Sequential decision making problem to satisfy long-term dialog with reinforcement learning (reward for each response)
    • Action-value function estimates expected return for a candidate response
    • Stochastic policy is a discrete distribution over actions based on a scoring function
    • A lot of input features to the scoring network: word embeddings, similarity metrics, PoS, dialog acts, bigram, generic response, etc…
  • Scoring model architectures:
    • Scoring model (policy) is a simple feed-forward neural network
    • Supervised pre-training with AMT labelers labeling dialogs spit out from the chatbot
  • Supervised learned reward: Predict the alexa user score with linear regression model for a dialog history and response, based on hand selected features
  • Learn the policy with off-policy REINFORCE: reward shaping, by giving 0 reward when a negative user response is detected, and RL reward otherwise
    • Also combine this RL with the learned reward model for automatic rewards
  • Off-policy reinforce has higher variance and lower bias, and supervised learned reward is the opposite
    • New method: trade-off between variance and bias with Q-learning an abstract discourse MDP (second figure below)
    • At each step there is a hierarchical structure, with a discrete random variable at top, based on the sets of dialog act, user sentiment and genericness
    • Given this sample, the MDP samples a dialog history from a set, then the agent chooses an action according to its policy, after which there's a reward
    • Finally a variable representing the AMT score is sampled, and new discrete state is sampled according to the current one, and the action
  • Off-policy REINFORCE, Q-learning and supervised AMT offer the best alexa user score
    • q-learning selects responses from much riskier models (bowfact, reddit), than supervised amt (alicebot,elizabot)
    • Off-policy reinforce can hold the longest dialog, and offers best user score for long dialogs
    • Based on final alexa user scores only q-learning achieves a higher score than the base evibot+alicebot heuristic
  • Q-learning has the highest topical coherency and topic specificity

Challenging Neural Dialogue Models with Natural Data: Memory Networks Fail on Incremental Phenomena [chat]

Igor Shalyminov, Arash Eshghi, Oliver Lemon
  • Augmenting the babi task 1 with incremental dialog phenomena (hesitations, restarts and corrections)
  • Training memn2n on babi and testing it on augmented babi gives very bad performance
    • Training and testing on augmented babi gives better performance, especially with more data
    • Training on augmented babi and testing on normal babi gives 99% accuracy
  • Dynamic syntax and type theory with records framework used by the authors (rule-based)
    • To build word-by-word semantic representations
  • Gives 100% semantic accuracy on both babi and augmented babi

Flexible End-to-End Dialogue System for Knowledge Grounded Conversation [chat]

Wenya Zhu, Kaixiang Mo, Yu Zhang, Zhangbin Zhu, Xuezheng Peng, Qiang Yang
  • KB for music domain conversations, containing triplets of {subject, predicate, object}
  • GenDS:
    • Candidate retriever detects entities and retrieves a set of facts from KB
    • Message encoder encodes input message (transforms entities to their general types)
    • Reply decoder decodes this together with the retrieved facts
    • Knowledge gate is used to determine whether to generate common or knowledge words at each time step
  • Dynamic knowledge enquirer:
    • Generates knowledge words based on 3 scores (computed by MLPs)
      • Message matching score
      • Entity update score
      • Entity type update score
    • They depend on last generated words
  • GenDS achieves significantly better entity accuracy than baseline seq2seq

Edina: Building an Open Domain Socialbot with Self-dialogues [chat]

Ben Krause, Marco Damonte, Mihai Dobre, Daniel Duma, Joachim Fainberg, Federico Fancellu, Emmanuel Kahembwe, Jianpeng Cheng, Bonnie Webber
  • Data is collected as self-dialogues written by AMT workers
  • Edina converses on 3 topics (movies, music, sports), combining rule-based and machine learning methods
    • Rule-based component with templates (backs off to matching score) (16%)
    • Matching score retrieves an answer (with confidence score) (46%)
      • Using the current utterance, and candidate contexts and responses from AMT dialogs
      • Based on bag-of-words with IDF
    • Neural network, is the last option, if the other two fail (20%)
      • Pretrained on opensubtitles and finetuned on AMT self-dialogs (they let it overfit a bit)
    • Proactive component which steers the conversation with questions related to entities (16%)
  • Preprocessing with NER and user-modeling (user preferences caught by rules)
  • Confidence rating is important to know when not to select matching score outputs for example
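
The matching score component can be approximated with a small IDF-weighted bag-of-words retriever that returns a response together with a confidence value. This is a sketch under assumptions: the smoothed IDF formula, cosine similarity as the confidence, and the two example pairs are mine and are not taken from the Edina system.

```python
import math
from collections import Counter

def idf_table(docs):
    """Smoothed inverse document frequency for every word in the corpus."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d.split()))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}

def vectorize(text, idf):
    """Bag-of-words vector with IDF weights; unseen words get weight 0."""
    counts = Counter(text.split())
    return {w: c * idf.get(w, 0.0) for w, c in counts.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def best_response(utterance, pairs, idf):
    """Return (confidence, response) for the best-matching stored context."""
    q = vectorize(utterance, idf)
    return max((cosine(q, vectorize(ctx, idf)), resp) for ctx, resp in pairs)

pairs = [
    ("do you like jazz music", "yes, miles davis is my favourite"),
    ("did you watch the football game", "no, i missed it"),
]
idf = idf_table([ctx for ctx, _ in pairs])
confidence, response = best_response("what jazz music do you like", pairs, idf)
```

A caller could fall back to another component whenever `confidence` is below a chosen threshold.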

Interactive Policy Learning In End-to-End Trainable Task-Oriented Neural Dialog Models [chat]

Bing Liu, Ian Lane
  • Jointly optimize a dialog agent policy and the user simulator policy used to train it
  • Bootstrapping both agents with supervised learning on task-oriented corpora
    • Then further training them with a collaborative task-oriented goal
    • User simulator is given a goal to complete
    • Dialog agent attempts to estimate this goal and fulfill requests
    • Both receive a reward on the level of task completion
  • Dialog agent
    • Bi-directional LSTM to encode the utterance, previous agent output, and retrieved KB result encoding
    • Dialog acts as system actions based on LSTM state, and action sample with an MLP from this state
    • Belief tracker maintains and updates a probability distribution over candidate values for each goal slot
    • Dialog agent has KB component, and can issue API calls
      • API call with slot-type tokens can be replaced by corresponding values from belief tracker
    • Template-based NLG module to convert system action, slot values and KB entities to NL response
  • User simulator
    • State maintained in an LSTM, takes as input a sampled goal encoding, the previous user output, and current agent input
    • Informable (price range) and requestable slots (address)
  • RL policy gradient optimization
    • States are the LSTM user and agent states
    • Action space is finite and discrete for both the dialog agent and user simulator
      • Actions are not words themselves, but rather higher level
    • Turn-level reward based on the progress that the agent and user made in completing the task in that turn
    • Softmax policy is applied during training, and during evaluation only for the user to generate more diverse utterances
  • Dataset is DSTC2 with added API calls and corresponding KB results
  • Training iteratively the agent and the user simulator
  • RL training improves the task success rate significantly compared to supervised learning

Augmenting End-to-End Dialog Systems with Commonsense Knowledge [chat]

Tom Young, Erik Cambria, Iti Chaturvedi, Minlie Huang, Hao Zhou, Subham Biswas
  • Integrate commonsense knowledge into retrieval-based models
  • Dual LSTM used to encode context and response
    • In classical retrieval, compatibility is computed between the created vector representations with a learned weight matrix
  • Commonsense:
    • Made up of assertions, that contain a triplet <c1, r, c2>, where r is a relation between two concepts
    • Concepts are retrieved as n-grams from the message, and all corresponding assertions are searched
    • An LSTM is used to encode all the retrieved assertions
    • Match score between each encoded assertion and response is computed with a learned weight matrix
      • The score of the assertion with biggest score is added to the original compatibility function
  • Comparison with memory networks and a baseline using comparison based on supervised word embeddings instead of LSTM representations
  • Dataset is 2M twitter status response pairs
    • 1M positive responses (ground truths)
    • For each status a negative response is sampled as a random different response from the training set
  • recall@k metric is somewhat better for tri-lstm than dual-lstm without commonsense
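
The scoring rule described above (dual-encoder compatibility plus the score of the best-matching assertion) can be sketched as below. The vectors stand in for LSTM encodings and the identity matrices stand in for the two learned weight matrices; these values are purely illustrative.

```python
def bilinear(x, w, y):
    """x^T W y for plain Python lists."""
    return sum(x[i] * w[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def compatibility(context, response, assertions, w, w_a):
    """Dual-encoder score plus the score of the best-matching assertion.

    All vectors are assumed to be precomputed encodings; w and w_a are
    placeholders for the learned weight matrices.
    """
    base = bilinear(context, w, response)
    if not assertions:
        return base
    # Only the single highest-scoring assertion contributes.
    return base + max(bilinear(a, w_a, response) for a in assertions)

eye = [[1.0, 0.0], [0.0, 1.0]]
context, response = [1.0, 0.0], [0.5, 0.5]
assertions = [[0.0, 1.0], [0.0, -1.0]]
score = compatibility(context, response, assertions, eye, eye)
```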

Dynamic Evaluation of Neural Sequence Models [s2s]

Ben Krause, Emmanuel Kahembwe, Iain Murray, Steve Renals
PyTorch (official)
  • Adapting the parameters of a model during the generation of a sequence at test time
    • To better capture the slightly different probability distribution
  • A long sequence is divided into a sequence of shorter sequences
    • After each short sequence segment a backpropagation step is carried out
    • The next sequence element is evaluated with the new parameters
    • This can be applied to sequence generation as well
    • Previous adaptation updates decay exponentially over time
  • Adaptation of hidden units is not direct but rather we adapt a matrix that is multiplied with the hidden units to get the new hidden units -> fewer parameters to adapt (achieves a little bit lower performance)
  • Achieves better perplexities for word and character level language modelling than state-of-the-art models
  • The longer the sequence the lower perplexity the model achieves
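
A toy sketch of the procedure above, using a unigram model so that the gradient can be written by hand. Each segment is first scored with the current parameters and only then used for a gradient step. Plain SGD with a pull back toward the original parameters is a simplification I am assuming here; the learning rate, decay and segment length are arbitrary.

```python
import math

def softmax(logits):
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    z = sum(e)
    return [x / z for x in e]

def dynamic_eval(tokens, logits0, seg_len=4, lr=0.5, decay=0.01):
    """Average NLL of `tokens` under a unigram model adapted at test time."""
    logits = list(logits0)
    total_nll = 0.0
    for start in range(0, len(tokens), seg_len):
        seg = tokens[start:start + seg_len]
        probs = softmax(logits)
        # Score the segment before adapting on it.
        total_nll += -sum(math.log(probs[t]) for t in seg)
        # Gradient of the segment's mean NLL with respect to the logits.
        grad = [p - seg.count(i) / len(seg) for i, p in enumerate(probs)]
        # SGD step plus a pull toward the original parameters, so that
        # older adaptation fades out over time.
        logits = [l - lr * g + decay * (l0 - l)
                  for l, g, l0 in zip(logits, grad, logits0)]
    return total_nll / len(tokens)

tokens = [0] * 40  # a sequence dominated by one symbol
static_nll = dynamic_eval(tokens, [0.0, 0.0, 0.0], lr=0.0, decay=0.0)
adapted_nll = dynamic_eval(tokens, [0.0, 0.0, 0.0])
```

With `lr=0.0` the model is never adapted, which gives the static baseline to compare against.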

Neural Optimizer Search with Reinforcement Learning [n-c]

Irwan Bello, Barret Zoph, Vijay Vasudevan, Quoc V. Le
  • Domain specific language for optimizer
    • An update rule is built from two operands, a unary function applied to each of them, and a binary function that combines the two results
    • A lot of operands, unary and binary functions are accessible to the controller
    • Each operand can be further used until we get to the optimizer equation
  • The policy is an RNN, that selects the operands and operations sequentially
  • Since as the sequence unrolls new operands are created that can be subsequently selected, the softmax weights at each step are different
  • RNN trained to maximize validation performance of the update rules on a specified model
  • For speed increase the child network is a small convnet, and it is trained only for 5 epochs on CIFAR-10
  • PowerSign (discovered update rule)
    • The signs of the gradient and of its moving average are multiplied together, a constant is raised to this power, and the result is multiplied with the gradient
  • AddSign (discovered update rule)
    • The signs of the gradient and of its moving average are multiplied and added to a constant, and the result is multiplied with the gradient
  • The found update rules offer a small performance advantage for larger networks as well
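
The two discovered update rules can be written down directly for a single scalar parameter. The defaults `base=e` and `alpha=1` are common choices as far as I know, and updating the moving average before applying the rule is an assumption of this sketch.

```python
import math

def sign(x):
    return (x > 0) - (x < 0)

def power_sign_update(grad, avg, base=math.e):
    """base ** (sign(g) * sign(m)) * g"""
    return base ** (sign(grad) * sign(avg)) * grad

def add_sign_update(grad, avg, alpha=1.0):
    """(alpha + sign(g) * sign(m)) * g"""
    return (alpha + sign(grad) * sign(avg)) * grad

def step(w, grad, avg, lr=0.1, beta=0.9, rule=power_sign_update):
    """One parameter update; returns the new weight and the new moving average."""
    avg = beta * avg + (1 - beta) * grad
    return w - lr * rule(grad, avg), avg
```

Both rules take larger steps when the gradient agrees in sign with its moving average and smaller steps when it disagrees.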

Multi-Task Learning for Speaker-Role Adaptation in Neural Conversation Models [chat]

Yi Luan, Chris Brockett, Bill Dolan, Jianfeng Gao, Michel Galley
  • The aim is to use non-conversational data to make a seq2seq model learn speaker roles (doctor, technician, etc.)
  • Multi-task learning approach:
    • Seq2seq learns conversational model based on large general population of speakers
    • Autoencoder utilizes large non-conversational personal data from target speakers
    • The decoder part of the two models is shared and jointly trained, so that the language model for generation is adapted to the target speaker
    • However one model can only be trained with 1 type of target speaker
      • So persona based model is tried as well which learns multiple speaker embeddings
  • Twitter data is used, and for the autoencoder 20 twitter users are selected and their posts without replies used as training data
  • Little bit better correlation of output responses with target speaker style than baseline

Emergent Translation in Multi-Agent Communication [s2s]

Jason Lee, Kyunghyun Cho, Jason Weston, Douwe Kiela
  • One agent sees an image and describes it in its language
    • Goal is to produce a description close to ground truth and to help other agent identify the target image
  • The other agent has to choose the correct image from several
  • Game played in both directions and agents trained jointly
  • Each agent has an image encoder, a native speaker module and a foreign language encoder
    • Image encoder is a CNN
    • Speaker module is an RNN taking image representation as initial state
    • Foreign language encoder is another RNN
  • The model achieves better performance if image encoder or native language encoder is pretrained and fixed during translation training
  • In conclusion, the achieved BLEU scores are promising but far away from NMT baselines

DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset [chat]

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, Shuzi Niu
  • Dialog dataset available
  • Dialogs from english learning examples
  • Short dialogs on specific topics (average dialog is 8 turns)
  • Each utterance is labeled as one of four dialog acts: {inform, questions, directives, commissive}
  • Each utterance is labeled as one of 7 emotion categories: {anger, disgust, fear, happiness, sadness, surprise, other}
  • Dialogs usually follow some pattern along the four dialog acts, like question-inform bi-turn dialog flow
  • 83% of dialogs fall into the "other" emotion category
  • Some baseline dialog models, including retrieval and seq2seq based are evaluated (with emotion and dialog act included)

A Dual Encoder Sequence to Sequence Model for Open-Domain Dialogue Modeling [chat]

Sharath T. S., Shubhangi Tandon, Ryan Bauer
  • Use a history of dialog acts to get a global context for a seq2seq model
  • They also realize the problem of the loss function and try to tackle the incorporation of previous dialog turns
  • Conv-net pre-trained to predict dialog acts given input utterances, and the context encoder's hidden state is fed additionally to the decoder
  • Context encoder CNN pretrained on switchboard corpus
  • The seq2seq part is trained on cornell movie corpus
  • Seq2seq baseline where previous turns are simply concatenated performs worse than single-turn seq2seq
  • Proposed model outperforms baselines on qualitative analysis
    • Automatic evaluation is also given, based on dialog length, diversity and specificity
  • Choosing one among the least probable beams contributed to diversity of responses.

Adversarial Advantage Actor-Critic Model For Task-Completion Dialogue Policy Learning [chat]

Baolin Peng, Xiujun Li, Jianfeng Gao, Jingjing Liu, Yun-Nung Chen, Kam-Fai Wong
  • Train a discriminator to differentiate the dialog agent responses from human ones
    • Use output of discriminator as intrinsic reward (another critic in A2C framework)
    • Similar to Li et al. Adversarial neural dialog generation
  • Applied to movie ticket booking dialog system
    • Binary reward at the end of each dialog
  • As in other RL tasks, the gradients have high variance, so a baseline function is used
  • Alternating optimization is used between the discriminator reward and the A2C reward
  • User simulator is used, that has a goal, and informable and requestable slots
  • Both generator and discriminator are single layer neural networks
  • Adversarial A2C performs better than simple A2C

Customized Nonlinear Bandits for Online Response Selection in Neural Conversation Models [chat]

Bing Liu, Tong Yu, Ian Lane, Ole J. Mengshoel
  • Contextual nonlinear multi-armed retrieval bandit networks in an online setting (feedback from users)
    • 2 BiLSTMs for encoding context (sequence of utterances) and responses
    • The vectors produced serve as input to the contextual bandits
    • Binary reward is collected from user to update the parameters
  • Logistic regression Thompson sampling
    • Apply an approximation of the reward on selected dimensions of a second order polynomial feature space
      • Apply sigmoid function on cMu^T
  • Pretrained on supervised labeled data with cMu^T
  • The nonlinear bandit achieves better performance than a linear one, but the recall@1 is still pretty bad

Plan, Attend, Generate: Planning for Sequence-to-Sequence Models [s2s]

Francis Dutil, Caglar Gulcehre, Adam Trischler, Yoshua Bengio
  • Standard RNN seq2seq augmented with alignment planning and commitment vector
  • At each time-step an alignment plan matrix and a commitment plan vector is computed
    • Matrix holds alignment for current and next k timesteps, conditioned on the previously predicted token and current context from encoder hidden states
    • Decoder receives the previous hidden state and predicted token and the context, which is a weighted sum of encoder annotations
      • The weights are from the first row of the alignment matrix
    • Commitment plan vector is a binary decision whether to follow existing alignment plan or to recompute it
      • Gumbel-softmax trick to be differentiable
      • If it is 1, then update the alignment by interpolating with the previous alignment plan (mixing ratio determined by learned gate)
      • If it is 0, the previous alignment plan is used, by shifting the time-step
  • Penalty added to the loss function, so the model doesn't commit too often (update the alignment plan)
  • Better than a baseline seq2seq with attention on the task of finding eulerian circuits of graphs
    • And converges faster on QA and Char-level NMT

Unsupervised Machine Translation Using Monolingual Corpora Only [s2s]

Guillaume Lample, Ludovic Denoyer, Marc’Aurelio Ranzato
  • Build a common latent space between two languages with a single autoencoder seq2seq (with different vocab)
  • For translation the encoded sentence is decoded from the latent representation with the other language's decoder
  • Pretrain with unsupervised word-by-word monolingual translation
  • Constrain latent representations to have same distribution using an adversarial regularization term
    • Discriminator trained to identify the language of a given latent representation
    • Encoder trained to fool the discriminator
  • Train encoder and decoder by reconstructing a sentence given a (random) noisy version in the same language
    • Or by translating it to the other language (which is a noisy version by itself), and translating it back
  • Final loss function is a weighted sum of auto-encoding, cross-domain and adversarial loss
  • Evaluation done by translating from a language to another and then back to the original, and computing bleu score over original inputs and their reconstruction
  • Proposed model outperforms word-by-word unsupervised baseline
    • Performs on par with a supervised model trained on a smaller number of parallel sentences
    • If trained on the same amount of data, however, the supervised model far outperforms it
  • Ablation study shows that the pretraining together with the cross-domain loss is the most important
    • After that comes the noising of sentences, since without noise the model merely learns to copy the input sentence

Classical Structured Prediction Losses for Sequence to Sequence Learning [s2s]

Sergey Edunov, Myle Ott, Michael Auli, David Grangier, Marc’Aurelio Ranzato
PyTorch (official)
  • Seq2seq models used are 1D convolutional and recurrent as well (with attention)
  • Token-level loss functions:
    • Token negative log likelihood (NLL)
    • Token NLL with label smoothing
  • Sequence-level loss functions:
    • Directly optimize sequence metrics, by computing a set of outputs and scoring them (Each word has the same loss)
      • One approach is to compute this set with beam search
      • Second strategy is sampling over model's output distribution
    • Sequence NLL: sum of token log probabilities, normalized by number of tokens
    • Risk: Minimize a cost function based on BLEU or ROUGE
    • MultiMargin: difference between cost of pseudo-reference and candidate response, based on pre-softmax score
    • SoftmaxMargin: sequence NLL augmented with a cost inside the exponent
  • Combined objectives:
    • Weighted combination of a token and sequence level loss
    • Constrained combination, in which one of the two losses is used at any one time
      • If token loss is better than a baseline model, then train on the sequence loss
  • On NMT task, they achieve the best results with weighted combination of losses
  • Regenerating the candidate set for each input is much slower than pre-computing a set of candidates for each input, but achieves better overall BLEU
  • Beam search performs better than sampling
  • Increasing the candidate set size up to 16 increases the performance (after that the performance decreases)
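
Three of the losses above are short enough to sketch for plain lists of log-probabilities. The exact label smoothing formulation, the use of length-normalized scores inside the risk, and `1 - BLEU` as the cost are assumptions of this example and may differ from the official implementation.

```python
import math

def label_smoothed_nll(log_probs, target, eps=0.1):
    """Token NLL mixed with a uniform distribution over the vocabulary."""
    uniform = -sum(log_probs) / len(log_probs)
    return (1 - eps) * -log_probs[target] + eps * uniform

def sequence_score(token_log_probs):
    """Sum of token log-probabilities, normalized by the number of tokens."""
    return sum(token_log_probs) / len(token_log_probs)

def expected_risk(candidates):
    """Expected cost under the model distribution renormalized over candidates.

    candidates: list of (token_log_probs, cost) pairs, where cost could be
    1 - BLEU against the reference.
    """
    scores = [sequence_score(lp) for lp, _ in candidates]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(w / z * cost for w, (_, cost) in zip(weights, candidates))

candidates = [
    ([math.log(0.5), math.log(0.5)], 0.2),  # likely and close to the reference
    ([math.log(0.5), math.log(0.5)], 0.8),  # equally likely, but a worse output
]
risk = expected_risk(candidates)
```

The candidate set passed to `expected_risk` would come from beam search or from sampling, as described in the notes.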

End-to-end Adversarial Learning for Generative Conversational Agents [chat]

Oswaldo Ludwig
Keras (official) Keras (official)
  • The work is closely related to Li et al.'s adversarial dialog agent
  • Context vector is used as input in the decoder at each time-step
    • This is made up of the entire LSTM encoded dialog history
    • A different LSTM is used to encode the so-far generated response
    • Decoder is a dense layer, predicting the likelihood of current token
    • Greedy decoding is used to generate token
  • Discriminator performs token-level binary classification
    • Whether current token is machine or human generated
    • Takes as input the token, the previous dialog utterances and the incomplete answer
      • These are processed by two different LSTMs from the encoder, and then fed into a dense layer
    • This way backpropagation can be used instead of reinforcement learning
  • Adversarial training starts with a pre-trained model using teacher forcing
  • Since whole dialogs are fed into the model, machine generated dialogs are also generated in each epoch by the model
  • Discriminator and generator are trained alternately
    • Discriminator is trained on the machine and human dialogs to distinguish between them
    • Then, generator is trained on the machine generated dialogs, minimizing the difference between the discriminator output and 1
    • After that, generator is also trained on only the human dialog dataset with standard cross-entropy loss
  • Dataset is from online English courses
  • Human and adversarial evaluation used (as in Li et al.)
    • Jaccard index between human and adversarial evaluation is 0.58
    • Adversarial training achieves a much better evaluation score

Fine Grained Knowledge Transfer for Personalized Task-oriented Dialogue Systems [chat]

Kaixiang Mo, Yu Zhang, Qiang Yang, Pascale Fung
  • Personalized decoder that can transfer phrase-level knowledge between users, while keeping personalized user info intact
    • With the use of a gate to switch between personal and shared phrases
  • The input to the model is the dialog history, where each word is labeled whether it is personal or general
  • First step of decoding is to compute the control gate based on the encoded sentence and the hidden state of shared and personal RNN
    • Then compute the next hidden states based on the gate output
    • Lastly generate the word based on one of the hidden states (given by control gate)
  • Each user is represented by a different decoder RNN
  • Shared and personal component trained together with RL (Reinforce algorithm)
    • Agent takes a combination of general and personal rewards
    • Personal rewards when user confirms suggestion of agent
    • General reward when user provides information about target task
    • Big general reward when system helps user finish target task
    • Negative general reward when user refuses to proceed
    • Shared params updated at each iteration, while personalized params updated based on data collected from corresponding user
  • This decoder is also integrated into the HRED model
  • Model tested in a coffee ordering task setting (very limited dataset)
  • Word-level transfer models perform better than sentence-level transfer
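
A minimal sketch of the gated decoding step described above, in plain Python. The hard 0.5 threshold, the weight shapes and all names (`w_gate`, `w_out`, `gated_word_distribution`) are assumptions made for illustration, and the step that updates the two hidden states is left out.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def gated_word_distribution(context, h_shared, h_personal, w_gate, w_out):
    # Gate computed from the encoded sentence and both hidden states.
    p_personal = sigmoid(dot(w_gate, context + h_shared + h_personal))
    # Hard switch (assumed threshold): the word comes from exactly one state.
    chosen = h_personal if p_personal > 0.5 else h_shared
    return p_personal, softmax([dot(row, chosen) for row in w_out])

# Toy values, purely illustrative.
context = [0.2, -0.1]
h_shared = [0.5, 0.3]
h_personal = [-0.4, 0.9]
w_gate = [0.1, 0.2, -0.3, 0.4, 0.5, 0.6]
w_out = [[1.0, 0.0], [0.0, 1.0], [-1.0, 1.0]]
p_personal, word_probs = gated_word_distribution(
    context, h_shared, h_personal, w_gate, w_out)
```

With these toy numbers the gate favors the personal state, so the word distribution is computed from `h_personal`.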

BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems [chat]

Zachary Lipton, Xiujun Li, Jianfeng Gao, Lihong Li, Faisal Ahmed, Li Deng
  • The problem is that Q-learning methods never experience success because of the huge action space in dialog
  • Dialog acts are utterances, with informed and requested slot-value pairs
  • State tracker contains a representation of the conversation history and database features
  • Domain is movie-booking with 39 actions (each slot has two actions, inform and request)
  • Per-turn penalty is given, so that dialog is as short as possible
  • In Q-learning the optimal, but intractable, Q-policy can be approximated with a learned neural network, for example (DQN)
  • Bayes-by-backprop: the weights of a neural network are sampled from a Gaussian distribution
    • Learn params by minimizing KL-divergence between variational approximation of the distribution and the posterior
  • Bayes-by-backprop Q-network (BBQN), integrates DQN with bayes-by-backprop networks
    • The authors use a simple MLP
    • BBQ network trained with Q-learning, and Monte Carlo sampling is used over the frozen network to generate targets
    • Targets can also be computed with the maximum a posteriori (MAP) estimate
  • Variational Information Maximizing Exploration (VIME) can be used in BBQN to encourage unexplored state-action regions
  • A rule-based agent is used to pre-fill the replay buffer so that BBQN sees some successful dialogs
  • Representing dialog state with a vector:
    • One-hot representations of act and slot corresponding to current user action
    • Act and slot corresponding to last agent action
    • A bag of slots corresponding to all previously filled slots
    • Knowledge base counts
  • BBQNs achieve much better performance than DQNs
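
To make the Bayes-by-backprop idea above more concrete, here is a small sketch of how weights can be sampled from a learned Gaussian and how a KL term can be computed. The softplus parameterization of the standard deviation follows the usual Bayes-by-backprop formulation; the standard normal reference distribution and the function names are assumptions for illustration, not details taken from the BBQ paper.

```python
import math
import random

def sample_weights(mu, rho, rng):
    """Draw w = mu + sigma * eps, with sigma = log(1 + exp(rho)) kept positive."""
    sigma = [math.log1p(math.exp(r)) for r in rho]
    weights = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    return weights, sigma

def kl_to_standard_normal(mu, sigma):
    """KL(N(mu, sigma^2) || N(0, 1)), summed over independent weights."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))

rng = random.Random(0)
mu, rho = [0.5, -0.5], [-3.0, -3.0]   # toy variational parameters
weights, sigma = sample_weights(mu, rho, rng)
kl = kl_to_standard_normal(mu, sigma)
```

Every forward pass draws a fresh set of weights, which is what lets the Q-network express uncertainty about rarely visited state-action pairs.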

End-to-End Optimization of Task-Oriented Dialogue Model with Deep Reinforcement Learning [chat]

Bing Liu, Gokhan Tür, Dilek Hakkani-Tür, Pararth Shah, Larry Heck
  • Dialogue-level LSTM takes in the encoding of current user utterance and encoding of previous system action
    • Produces a probability distribution over candidate values for each of tracked goals
    • Utterance-level LSTM is used to encode utterance
  • System action emitted based on current dialog state and retrieved info from KB (using a separate MLP-based policy network)
    • This is translated to NL using template-based generator
  • The authors first train the system in a supervised manner using task-oriented corpora.
    • Then use REINFORCE to further train the agent (reward at the end of dialog)
    • Penalty is given, to encourage shorter task completion time
  • RL clearly improves task success rate and accomplishes task in fewer turns than SL
    • Updating only the policy network results in less improvement than end-to-end RL
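
A rough sketch of the reward structure described above: a reward at the end of the dialog plus a per-turn penalty, turned into the per-turn returns that REINFORCE would use to weight the log-probability gradients. The reward size, penalty size and discount factor are made-up values, not the ones from the paper.

```python
def turn_returns(num_turns, success, success_reward=1.0,
                 turn_penalty=0.05, gamma=0.95):
    """Discounted return for every turn of one finished dialog."""
    rewards = [-turn_penalty] * num_turns
    if success:
        rewards[-1] += success_reward  # task reward arrives only at the end
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]

short_dialog = turn_returns(2, success=True)
long_dialog = turn_returns(5, success=True)
failed_dialog = turn_returns(3, success=False)
```

Because of the penalty, a successful two-turn dialog gets a higher return at its first turn than a successful five-turn dialog, which is the pressure toward shorter task completion.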

Are You Talking to Me? Reasoned Visual Dialog Generation through Adversarial Learning [chat]

Qi Wu, Peng Wang, Chunhua Shen, Ian Reid, Anton van den Hengel
  • In visual dialog (compared to visual QA), the agent has to provide reasoning to keep the conversation flowing (not just yes-no answers)
  • Propose a visual dialog model trained with adversarial learning
    • Discriminator has access to attention weights which can be regarded as a form of reasoning
    • Monte Carlo search used to compute word-level rewards (as in Li et al.)
  • Sequential co-attention is used to combine the image and dialog encodings (from CNN and LSTMs)
    • First utterance of the dialog is the image caption
    • Discriminator is also conditioned on image, question and dialog attention memories
      • And on the encoded question and generated answer (by generator)
    • Attention is always based on an input, and weighting based on other two features
      • First the question feature is used to attend to the image
      • Attended image features and question feature combined to attend to utterances
      • Then, the attended dialog features and attended image features guide the question attention
      • Finally, image attention is run again, guided by attended question and attended dialog
      • The three attended features are concatenated
      • This is fed to an LSTM to compute the probability of generating each token
  • Teacher forcing, whereby the generator is alternately updated based on discriminator reward and MLE loss
  • While model is mainly generation-based, evaluation is done in retrieval style: rank a set of responses
  • Generator pretrained on dataset, and discriminator pretrained as well
  • The proposed model performs better than previous state-of-the-art (which also used attention)
    • On recall@k as well as human evaluation

RubyStar: A Non-Task-Oriented Mixture Model Dialog System [chat]

Huiting Liu, Tao Lin, Hanfei Sun, Weijian Lin, Chih-Wei Chang, Teng Zhong, Alexander Rudnicky
  • Alexa participant, using an ensemble of rule-based, retrieval and generative models
  • NLU / Preprocessing:
    • Topic detection (6 classes)
    • Intent analysis (42 classes)
    • Entity linking links entities to entries in wikipedia
  • NLU followed by a strategies layer, which selects the reply generator based on preprocessing results
  • Order of priority: rule-based, knowledge-based, retrieval-based, generative (seq2seq)
    • Rule-based: intent templates, backstory, entity-based templates
    • If no match, the system tries to get a response from Evi (KB QA provided by Amazon)
    • If even this fails, then retrieval and generative models are employed
  • Context and topic history is tracked
  • Retrieval is based on recent twitter data
    • Randomly select from twitter posts related to recognized entities
  • Train an SVM classifier to rerank the candidate responses from seq2seq, based on engagement (binary)
  • Alexa user score is used to see how different modules affect quality of bot
    • Evi is used more in higher rated dialogs
  • Neural generative model is used the most
  • Mean score achieved is less than the MILAbot
    • Main problems are that the bot is not engaging or coherent
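
The priority order of the strategies layer can be sketched as a simple fallback chain. All four generator functions below are hypothetical stand-ins (the real system uses templates, Evi, Twitter retrieval and a seq2seq model), so this only illustrates the control flow.

```python
def respond(utterance, generators):
    """Return (name, reply) from the first generator that yields a reply."""
    for name, generate in generators:
        reply = generate(utterance)
        if reply is not None:
            return name, reply
    return "none", ""

# Hypothetical stand-ins for the real modules.
def rule_based(u):
    return "I'm RubyStar." if "your name" in u.lower() else None

def knowledge_based(u):
    return "Paris." if "capital of france" in u.lower() else None

def retrieval(u):
    return None  # pretend no related post was found

def generative(u):
    return "Tell me more."  # a generative model always produces something

PRIORITY = [("rule", rule_based), ("knowledge", knowledge_based),
            ("retrieval", retrieval), ("generative", generative)]

source, reply = respond("What is the capital of France?", PRIORITY)
```

Since the generative model never returns nothing, it acts as the catch-all at the end of the chain.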

Examining Cooperation in Visual Dialog Models [chat]

Mircea Mironenco, Dana Kianfar, Ke Tran, Evangelos Kanoulas, Efstratios Gavves
Torch (official)
  • Q-bot and A-bot trained to guess an image through dialog
  • Intervening by replacing image pixels with random noise and caption words with random words
  • Intervening by replacing each token in Q-bot or A-bot utterance with a random one with some probability
  • Intervening by negating yes/no answers of A-bot, and see if Q-bot cooperates
  • Results:
    • Changing the caption with some probability correlates very well with the final percentile rank
    • Replacing image with random noise or changing the answer has no effect on performance
    • Replacing questions has a slightly bigger effect on performance (still minimal)
  • Basically Q-bot relies on the caption at the beginning of the dialog, so there is no cooperation between the bots
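
The token-level intervention can be sketched in a few lines: each token of an utterance (or caption) is swapped for a random vocabulary item with some probability. The function name and the toy vocabulary are assumptions for illustration.

```python
import random

def corrupt_tokens(tokens, vocab, prob, rng):
    """Replace each token with a random vocabulary item with probability `prob`."""
    return [rng.choice(vocab) if rng.random() < prob else tok
            for tok in tokens]

rng = random.Random(0)
vocab = ["red", "blue", "dog", "cat"]
question = "is the dog brown".split()
noisy_question = corrupt_tokens(question, vocab, 0.5, rng)
```

Sweeping `prob` from 0 to 1 and measuring the final percentile rank at each setting gives the kind of sensitivity curve that the results above summarize.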

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm [n-c]

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, Demis Hassabis
  • AlphaZero learns to play chess, shogi and Go entirely by self-play, at a superhuman level
  • At each step an action vector is outputted representing the probability of each action based on the state, and a scalar value estimating the expected outcome
  • General-purpose Monte-Carlo tree search (MCTS) used
    • Each search consists of a series of simulated self-play games traversing the tree
  • At the end of each game a reward is given to the neural network policy
    • Parameters updated to minimize the error between predicted outcome and actual outcome
    • And to maximize similarity of policy vector to search probabilities
  • AlphaZero outperforms the best programs in each game
    • Trained for less than a day on 5000 TPUs
  • AlphaZero's performance scales better with thinking time per move than Stockfish's
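
The two training objectives mentioned above (match the game outcome, match the search probabilities) combine into one loss. Below is a small sketch for a single position; the weight regularization term used in the paper is left out, and the variable names are my own.

```python
import math

def alphazero_loss(value, outcome, policy, search_probs):
    """Squared value error plus cross-entropy between search probs and policy."""
    value_loss = (outcome - value) ** 2
    policy_loss = -sum(pi * math.log(p)
                       for pi, p in zip(search_probs, policy) if pi > 0)
    return value_loss + policy_loss

# Toy position with two legal moves; the game was eventually won (outcome = 1).
uninformed = alphazero_loss(0.0, 1.0, [0.5, 0.5], [0.9, 0.1])
improved = alphazero_loss(0.8, 1.0, [0.85, 0.15], [0.9, 0.1])
```

The second call has a value closer to the outcome and a policy closer to the search probabilities, so its loss is lower.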

Why Do Neural Dialog Systems Generate Short and Meaningless Replies? A Comparison between Dialog and Translation [chat]

Bolin Wei, Shuai Lu, Lili Mou, Hao Zhou, Pascal Poupart, Ge Li, Zhi Jin
  • Given a source sequence the conditional distribution of the target sequence has multiple plausible points
  • Mimicking this scenario in MT by shuffling source and target sentences
  • As the percentage of shuffled sentences in a dataset grows, the BLEU score, entropy and length of the output go down, reaching values similar to those of a dialog system
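
A minimal sketch of how such a partially shuffled translation dataset could be built, assuming the shuffling means that a chosen fraction of the targets is permuted among themselves while the sources stay in place. The function name and the exact procedure are assumptions; the paper may do this differently.

```python
import random

def shuffle_targets(pairs, fraction, rng):
    """Copy of `pairs` where a fraction of the targets is permuted among themselves."""
    pairs = list(pairs)
    count = int(len(pairs) * fraction)
    chosen = rng.sample(range(len(pairs)), count)
    targets = [pairs[i][1] for i in chosen]
    rng.shuffle(targets)
    for i, tgt in zip(chosen, targets):
        pairs[i] = (pairs[i][0], tgt)
    return pairs

pairs = [("src%d" % i, "tgt%d" % i) for i in range(10)]
mixed = shuffle_targets(pairs, 0.5, random.Random(0))
```

The mismatched pairs make the target distribution for a given source much less peaked, which is the property of dialog data that the experiment tries to imitate.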

End-to-End Offline Goal-Oriented Dialog Policy Learning via Policy Gradient [chat]

Li Zhou, Kevin Small, Oleg Rokhlenko, Charles Elkan
  • Define reply generation as an MDP, parameterized by enc-dec
  • Combine on-policy with off-policy policy gradient
  • Two types of rewards:
    • Utterance-level reward captures quality of generated agent utterance compared with target from training data (with BLEU)
    • Dialog-level reward captures contribution of reply to achieving dialog goals
      • Negative reward if an API call is issued too early or too late, and positive reward for correct API call parameters
  • Reward shaping used to give rewards to intermediate actions
    • Approximate reward based on bleu; last action reward is the true reward
  • Off-policy policy gradient to help with exploration:
    • Maximize probability of actions in dataset weighted by importance sampling ratios
  • bAbI dialog task 6 dataset
    • they fed all KB restaurants and attributes into encoder
    • achieves slightly better performance than baseline

Peephole: Predicting Network Performance Before Training [n-c]

Boyang Deng, Junjie Yan, Dahua Lin
  • Encode layers of networks through an LSTM, and predict on validation data
  • Very few types of layers and architectures permitted (limited in scope)
  • The prediction is also conditioned on the number of epochs
    • Thus the entire learning curve can be predicted
  • CNN layer encoding:
    • A vector representing the type of layer, kernel width and height and number of channels
    • Similar to word embeddings the discrete vector is transformed to a continuous one
  • Final prediction is given by an MLP based on concatenation of last LSTM hidden state and the epoch index embedding
  • Block-based generation to acquire training samples
    • Reasonable architectures constructed based on simple heuristics and skeletons
    • They only used the accuracy of each network from last epoch for training
  • Trained on CIFAR-10 and MNIST
  • Much better results than previous approaches which relied on at least part of the learning curve
  • As the accuracy of a network increases the correlation between predicted and actual accuracy becomes better

Progressive Neural Architecture Search [n-c]

Chenxi Liu, Barret Zoph, Jonathon Shlens, Wei Hua, Li-Jia Li, Li Fei-Fei, Alan Yuille, Jonathan Huang, Kevin Murphy
PyTorch (official) Tensorflow (official) Tensorflow (official)
  • Similar to A*, searching for more and more complex architectures, consisting of cells (these are learned)
    • Cells consist of blocks that do some convolution operation (8 possible choices) and addition over two input tensors
    • Set of possible inputs to a block is the set of all previous blocks in that cell, plus the final block of the previous cell and of the cell before that
  • Also train an RNN that can predict the reward for any model (validation performance)
  • Progressive learning, starting with cells of only 1 block
    • Use the reward predictor to predict the performance of networks with cells consisting of 2 blocks, and pick K most promising ones, compute actual reward and update predictor based on these, then iteratively use more and more blocks
  • The max number of blocks is five, so in total 1280 models are trained
  • The final best network achieves same performance on CIFAR-10 as the best one from NAS, but fewer networks had to be trained, and initial networks are smaller, so less training time
  • The best network on CIFAR-10 is ported to ImageNet and achieves the same performance as the previous state of the art

Deep Neuroevolution: Genetic Algorithms are a Competitive Alternative for Training Deep Neural Networks for Reinforcement Learning [n-c]

Felipe Petroski Such, Vashisht Madhavan, Edoardo Conti, Joel Lehman, Kenneth O. Stanley, Jeff Clune
  • They used the simplest GA, without crossover, and selecting the parent based on elitism
  • They store each parameter vector as a seed, plus the list of random seeds that produce the series of mutations applied to the parameter
    • This is much more memory efficient, needed for the millions of params of deep nets
  • Novelty search: reward given to agents that perform behavior never seen before (this avoids local optima)
    • It doesn't get stuck where normal GA would get stuck in the image hard maze problem
  • Atari and human locomotion tasks used to benchmark
    • GA outperforms DQN and A3C on some games, but performs worse on others
  • Simple random search also outperforms DQN and A3C on a few games
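
The compact seed encoding is easy to illustrate: a network is stored as an initialization seed plus one seed per mutation, and the full parameter vector is rebuilt on demand. The sketch below uses plain Python lists; the mutation strength `sigma` and the function names are assumptions.

```python
import random

def reconstruct(seeds, num_params, sigma=0.1):
    """Rebuild a parameter vector from its init seed and its mutation seeds."""
    init_rng = random.Random(seeds[0])
    params = [init_rng.gauss(0.0, 1.0) for _ in range(num_params)]
    for seed in seeds[1:]:
        mut_rng = random.Random(seed)  # same seed, same Gaussian noise
        params = [p + sigma * mut_rng.gauss(0.0, 1.0) for p in params]
    return params

def mutate(seeds, rng):
    """A child is just the parent's seed list plus one new mutation seed."""
    return seeds + [rng.randrange(2 ** 31)]

parent = [42]
child = mutate(parent, random.Random(0))
child_params = reconstruct(child, num_params=5)
```

The stored size grows with the number of generations rather than with the number of parameters, which is why this helps for networks with millions of weights.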

Toward Continual Learning for Conversational Agents [chat]

Sungjin Lee
  • Hierarchical encoder used (even the word embeddings are constructed with a character RNN)
  • Actions are selected based on a projection from the state embedding to the action embeddings
  • With continual learning the total loss function over all tasks has to be minimized
    • Without access to prior tasks -> leads to catastrophic forgetting
    • To combat this a modified loss function is used, to preserve the weights learned from prior tasks
  • Small, in-house human-human and human-computer datasets used
  • Weight transfer alone is not enough, the performance diminishes when switching between two tasks
    • The elastic loss however performs better
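
One common way to build such a weight-preserving loss is a quadratic penalty that pulls each parameter toward its value after the previous task, scaled by how important that parameter was. The sketch below assumes this elastic-weight-consolidation style form; the exact loss in the paper is not reproduced in these notes, so treat the formula and the names as illustrative.

```python
def elastic_loss(task_loss, params, old_params, importance, strength=1.0):
    """New-task loss plus a penalty for moving weights that mattered before."""
    penalty = sum(f * (p - q) ** 2
                  for p, q, f in zip(params, old_params, importance))
    return task_loss + 0.5 * strength * penalty

old_params = [1.0, 2.0]
importance = [5.0, 0.1]  # hypothetical per-weight importance estimates
moved_important = elastic_loss(1.0, [2.0, 2.0], old_params, importance)
moved_unimportant = elastic_loss(1.0, [1.0, 3.0], old_params, importance)
```

Moving the first weight costs far more than moving the second, so training on the new task is steered toward the weights that the old task did not rely on.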

An Ensemble Model with Ranking for Social Dialogue [chat]

Ioannis Papaioannou, Amanda Cercas Curry, Jose L. Part, Igor Shalyminov, Xinnuo Xu, Yanchao Yu, Ondrej Dušek, Verena Rieser, Oliver Lemon
  • Rule-based bots:
    • Persona, eliza, weatherbot
  • Retrieval bots:
    • Newsbot, factbot, evi
  • Response selection in 3 steps:
    • Bot prio list
    • Contextual prio: newsbot is prioritized if it stays on topic
    • Ranking if no default bots fired
  • They experimented with data-driven bots, which weren't included in final system (lol)
  • Ranker:
    • Hand-engineered: coherence, flow, questions, same topic, dullness, sentiment polarity
    • Linear classifier, based on n-grams, and dialog features

Personalizing Dialogue Agents: I have a dog, do you have pets too? [chat]

Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, Jason Weston
  • Dialog dataset constructed by crowdworkers, containing about 160k utterances
  • Each dialog consists of two specific personas finding out information about each other
  • Revised personas are constructed that are similar to original persona, so that models don't simply learn to copy text from the persona into the dialog
  • 4 training scenarios:
    • Conditioning on no persona, on the model's own persona, on the other speaker's persona, and on both
  • Both ranking and generative models are tested
  • Seq2seq model augmented with a memory network that encodes the profile sentences
    • During decoding the decoder attends to the profile representation (picture below)
  • Results:
    • Most models have better hits@1 if persona info is given during training
    • Revised personas perform worse, since word overlap is more rare, thus it is a harder problem
    • Based on human evaluation ranking models perform better than generative ones, and persona helps a bit
      • Also a ranking system trained on opensubtitles achieves much lower performance
      • Human evaluation has very high variance

Building a Conversational Agent Overnight with Dialogue Self-Play [chat]

Pararth Shah, Dilek Hakkani-Tur, Gokhan Tür, Abhinav Rastogi, Ankur Bapna, Neha Nayak, Larry Heck
  • Build templates/outlines for goal-oriented dialogue by letting two agents converse with discrete actions
  • The task-specific knowledge is left to the developer
    • In this work a database querying task is used
  • Two-step process: map the task specification to a set of dialogue outlines, then map each outline to NL
    • Outlines have annotations, consisting of dialog act and slot-value map
    • To generate outline a scenario is sampled consisting of user profile and user goals (with slots)
      • User profile captures verbosity, and other task independent characteristics
    • Only the annotations are generated which map to a template utterance (rule-based)
  • A user simulator agent and a system agent are used, the latter being a finite state machine
  • To map outline to NL crowd workers are used (multiple paraphrases of the same outline)
  • Second round of crowdsourcing used to validate the written utterances
  • Thus a high-quality annotated goal-oriented dataset is constructed

Topic-based Evaluation for Conversational Bots [chat]

Fenfei Guo, Angeliki Metallinou, Chandra Khatri, Anirudh Raju, Anu Venkatesh, Ashwin Ram
  • Topic-breadth and topic-depth
    • Bot should be able to converse on a variety of topics and it should sustain long and coherent conversations on given topics
  • Deep average networks (DAN) used to train a topic classifier
    • DAN extended with topic attention table, learning topic-word weights across the vocab, to detect topic-specific keywords
    • It is trained on internal (big) data annotated with 20-50 topics (1 special category for chit-chat or non-topical utterances)
  • Topic-based metrics:
    • Depth: the average number of turns on topics
    • Breadth: histogram based on number of turns the bot talked about different topics across all dialogs
      • Topic-specific keywords coverage: number of distinct keywords from DAN, on a given topic (more is better)
  • Topic depth shows almost the same correlation as user rating with response error rate (amazon manual utterance evaluation)
    • Much higher correlation than what avg. dialog length has
  • Topic breadth correlates much worse, however this is expected, it is more complementary to user ratings (to eliminate repetitiveness)
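
A small sketch of how the two topic metrics could be computed from per-turn topic labels. I assume here that depth is the average length of consecutive runs on the same topic and that the special chit-chat category is ignored; the label names and function names are made up.

```python
from collections import Counter
from itertools import groupby

def topic_depth(turn_topics, ignore=("chitchat",)):
    """Average length of consecutive runs of turns that stay on one topic."""
    runs = [len(list(group)) for topic, group in groupby(turn_topics)
            if topic not in ignore]
    return sum(runs) / len(runs) if runs else 0.0

def topic_breadth(dialogs, ignore=("chitchat",)):
    """Histogram of how many turns were spent on each topic across dialogs."""
    return Counter(t for dialog in dialogs for t in dialog if t not in ignore)

dialog = ["movies", "movies", "movies", "chitchat", "sports", "sports"]
depth = topic_depth(dialog)
breadth = topic_breadth([dialog])
```

Here the bot stays on movies for three turns and on sports for two, so the depth is the mean of those two run lengths, and the breadth histogram has two entries.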

Improving Variational Encoder-Decoders in Dialogue Generation [chat]

Xiaoyu Shen, Hui Su, Shuzi Niu, Vera Demberg
  • A separate CVAE and AE is used, where the CVAE generates the input latent variables to the AE decoder (modelled with RNN)
  • RNN encoder extracts the corresponding latent variable target for each turn, based on which a CVAE is trained to reconstruct it through context-dependent Gaussian noise
  • The CVAE replaces an AED (adversarial enc-dec), thus alternating is needed between the AE phase and the CVAE phase (see images)
    • In CVAE phase a sample is obtained from the AE by transforming dialog context into continuous embedding and is used as target for max likelihood training (RNN encoder is fixed during this phase because it is from AE)
    • In AE phase an utterance is encoded to continuous latent variable, and a corresponding one is sampled from CVAE posterior distribution
  • KL divergence constraint added to RNN encoder in AE
  • Scheduled sampling is used to go from ground truth latent variable to noisier one, produced by CVAE
  • Trained on dailydialog, human evaluation gives more fluency than other VAE models

A Hierarchical Latent Structure for Variational Conversation Modeling [chat]

Yookoon Park, Jaemin Cho, Gunhee Kim
  • They address the degeneration problem of VAEs: a powerful model like an RNN learns to ignore the latent variables
  • VHCR model proposed, with a hierarchical latent structure, and an utterance drop regularization technique
  • That the latent variable is ignored can be seen from the KL-divergence term in the loss function, which falls to zero
  • Another problem is the data sparsity: if conditioned on context, there exist very few targets to the same context
    • Therefore hierarchical models can overfit to training data, without using the latent variable
  • In VHCR global latent variable is used along with the utterance level latent variables
    • Context and decoder RNN is conditioned on global latent variable as well
    • Utterance latent variable is conditioned on global l.v.
    • For inference of global latent variable bidirectional RNN is used over the utterance vectors generated by encoder RNN
  • With these latent variables the decoder still learns to ignore them, thus utterance drop is used
    • The utterance encoder vector is randomly replaced by an unknown vector
  • With these additions VHCR achieves much higher KL-divergence
  • Cornell, and Ubuntu corpus used for training (truncated utterances longer than 30 words)
  • With automatic metrics it is shown that VHCR balances better the KL-divergence term and the NLL
  • They show that the global latent variable controls tone, and overall content of the conversation, and the utterance latent variable is a more fine-grained control in response generation (however, this is based only on a few questionable examples)

Unsupervised Discrete Sentence Representation Learning for Interpretable Neural Dialog Generation [chat]

Tiancheng Zhao, Kyusong Lee, Maxine Eskenazi
PyTorch (official)
  • VAE based approach, but latent variable is a set of discrete variables
    • These latent variables should capture salient features of the response, and be independent of the context
  • Recognition network to map a sentence to the latent variable z, and the generator network defines the learning signals used to train z
    • Recognition network does not depend on context!
    • Recognition and generator network form a VAE over the response (DI-VAE)
      • Because of the known issues of VAE, they modify the loss function to also optimize mutual information, which is similar to adversarial auto-encoders
    • Another model is using the skip-thought model: discrete variational skip thought (DI-VST)
      • Recognition is the same, but here two RNNs used to predict previous and next sentence
  • Additionally there is an encoder-decoder network, and a policy network
    • This is used to encode the context and generate the response using samples from the VAE
    • Policy network trained to predict aggregated posterior from context
    • An additional loss is used based on the recognition network to penalize the decoder if its generated responses don't reflect the attributes in the latent variable (LAED)
      • For this a relaxation method is used: weight the word embeddings of the vocab with the probability prediction by the decoder, because otherwise it would be discrete (1 word at each step)
  • Using multiple small latent variables is better than using one large, according to perplexity, and the mutual information metrics
  • DI-VST is better at learning dialog acts and emotions through the latent variables on DailyDialog
    • Although homogeneity is still pretty low (0.34 and 0.12)
  • When LAED is added, the attribute accuracy of the model increases, because the decoder is forced to take into account the latent variable
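
The relaxation used for the attribute loss can be shown with a tiny example: instead of picking one discrete word per step, the decoder's probability distribution is used to take a weighted average of the vocabulary embeddings, which keeps everything differentiable. The function name and the toy embeddings are assumptions.

```python
def expected_embedding(word_probs, embeddings):
    """Probability-weighted average of the vocabulary embeddings."""
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(word_probs, embeddings))
            for d in range(dim)]

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy 3-word vocabulary
soft_input = expected_embedding([0.5, 0.5, 0.0], embeddings)
hard_input = expected_embedding([0.0, 1.0, 0.0], embeddings)
```

A one-hot distribution recovers the ordinary embedding lookup, so the discrete case is just a special case of the relaxed one.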

Sounding Board: A User-Centric and Content-Driven Social Chatbot [chat]

Hao Fang, Hao Cheng, Maarten Sap, Elizabeth Clark, Ari Holtzman, Yejin Choi, Noah A. Smith, Mari Ostendorf
  • Dialog manager stores context and communicates with a knowledge graph
  • NLU extracts the speaker's goals, the potential topic, and sentiment
  • DM is a hierarchical state-based dialog model, with a master that manages the overall conversation, and a collection of miniskills
  • Response generation consists of speech acts from four categories: grounding, inform, request, and instruction
  • The model is adapted to the user personality based on some probing questions
    • More extroverted personalities tend to rate the chatbot higher
  • Longer conversations usually received higher rating (but only slight correlation)

DialogWAE: Multimodal Response Generation with Conditional Wasserstein Auto-Encoder [chat]

Xiaodong Gu, Kyunghyun Cho, Jungwoo Ha, Sunghun Kim
  • DialogWAE models the data distribution by training a GAN within the latent variable space
  • Distribution of latent variable is modeled by GAN, which transforms random noise
    • This random noise is drawn from a normal distribution whose mean and covariance matrix are computed from the context with a feed-forward network
    • Optimization: minimize the wasserstein distance between prior and posterior, and the NLL of a reconstructed response
  • This is wrapped by an encoder-decoder architecture
    • At training the posterior is computed (based on context and response), and the decoder RNN computes the reconstruction loss from this
    • A discriminator (FFN) is trained to tell apart prior and posterior samples
  • Sampling from gaussian distribution doesn't capture the multimodal nature of responses
    • Thus a mixture of gaussian distributions is used
    • Gumbel-softmax is used to sample a gaussian component
  • Training is done by alternating between an AE phase where the reconstruction loss of responses is minimized and a GAN phase during which the aggregated posterior distribution of the latent variable is matched with the prior distribution
  • Evaluation metrics: BLEU, BOW embedding, distinct
    • For each context 10 responses are sampled
    • Distinct measures the diversity of the responses
  • DialogWAE with gaussian mixture prior network outperforms all previous models, and also generates much longer responses
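
The distinct metric mentioned above is simple enough to sketch: it is the ratio of unique n-grams to all n-grams in a set of sampled responses. This version pools the n-grams over all responses given; whether the paper pools per context or over the whole test set is not covered here.

```python
def distinct_n(responses, n):
    """Ratio of unique n-grams to total n-grams over tokenized responses."""
    ngrams = [tuple(r[i:i + n])
              for r in responses for i in range(len(r) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

responses = [["i", "do", "not", "know"], ["i", "do", "not", "care"]]
distinct_1 = distinct_n(responses, 1)
distinct_2 = distinct_n(responses, 2)
```

Two responses that differ only in the last word share most of their n-grams, so both scores stay well below 1.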

Reinforcing Coherence for Sequence to Sequence Model in Dialogue Generation [chat]

Hainan Zhang, Yanyan Lan, Jiafeng Guo, Jun Xu, Xueqi Cheng
  • Problem with generic responses in seq2seq is analyzed
  • The objective of seq2seq is the same as minimizing KL divergence between predicted and true probability
    • However this doesn't penalize enough the common responses where predicted prob. is high and true prob. is low
  • Use coherence between reply and input to estimate true prob.
    • Cosine sim., pretrained matching models
    • And dual-learning agents: 2 seq2seq models
      • First agent generates response, second agent calculates coherence and sends it to first agent
      • Then this is repeated for the second agent as well
  • Coherence model is the reward function in an RL setting
  • For dual learning, the agents get the reward from each other
  • Slightly outperform baseline seq2seq and mmi, and adversarial seq2seq on both quantitative and human evaluation

Learning to Ask Questions in Open-domain Conversational Systems with Typed Decoders [chat]

Yansen Wang, Chenyi Liu, Minlie Huang, Liqiang Nie
Tensorflow (official)
  • 3 types of words are identified: interrogative, topic, ordinary
    • At decoding first a type distribution is estimated
  • Soft typed decoder estimates three type-specific generation distributions over the vocab.
  • Hard typed decoder uses gumbel-softmax to approximate argmax of the predicted types
    • Words are pre-classified in types for each input, and only words of the highest prob. type are generated (not over whole vocab.)
  • Significantly better on the distinct unigrams and bigrams metric than baseline seq2seq
    • Also much higher relevant topic word ratio in responses
  • Also much better according to human evaluation
  • Hard typed outperformed soft typed significantly
  • Error distribution analysis shows that errors fall in 3 categories almost evenly: no topic word, wrong topics, wrong word type
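
As I understand the soft typed decoder, the final word distribution is a mixture: each type-specific distribution over the vocabulary is weighted by the predicted probability of that type. A minimal sketch with a toy three-word vocabulary follows; the numbers and names are made up.

```python
def soft_typed_distribution(type_probs, type_word_probs):
    """Mix type-specific word distributions using the predicted type distribution."""
    vocab_size = len(type_word_probs[0])
    return [sum(t * dist[w] for t, dist in zip(type_probs, type_word_probs))
            for w in range(vocab_size)]

type_probs = [0.5, 0.3, 0.2]      # interrogative, topic, ordinary
type_word_probs = [
    [0.8, 0.1, 0.1],              # words given type = interrogative
    [0.1, 0.8, 0.1],              # words given type = topic
    [0.1, 0.1, 0.8],              # words given type = ordinary
]
word_probs = soft_typed_distribution(type_probs, type_word_probs)
```

The hard typed decoder instead commits to (an approximation of) the single most likely type, which restricts generation to that type's words.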

Multi-turn Dialogue Response Generation in an Adversarial Learning Framework [chat]

Oluwatobi O. Olabiyi, Alan Salimov, Anish Khazane, Erik T. Mueller
  • HRED (with attention) + GAN, trained with teacher forcing
  • MLE loss is also added to the loss function of the generator
  • There is a noise injected to the decoder of HRED, which assures that the model is not deterministic
  • Discriminator is a BiRNN on top of the same context RNN from the HRED generator network
  • At generation a list of responses is generated and ranked by the discriminator
  • Outperforms VHRED in automatic metrics, but no human evaluation is given
  • Depending on the dataset word or utterance level noise results in better performance

Zero-Shot Dialog Generation with Cross-Domain Latent Actions [chat]

Tiancheng Zhao, Maxine Eskenazi
PyTorch (official)
  • Dialog model that can generalize across domains from only a description of the domain
  • Description is made up of seed responses in the domain, and annotations of these (dialog acts)
  • Alternate between two losses during training
    • Optimize to make seed response representation close to its annotation representation
    • Optimize to make context representation close to response representation
  • HRE is used for context encoding, and the utterance encoder part is the same for seed response encoding (reused)
  • Model evaluated on synthetic restaurant data performs much better than standard seq2seq with copy

Towards Explainable and Controllable Open Domain Dialogue Generation with Dialogue Acts [chat]

Can Xu, Wei Wu, Yu Wu
  • Dialogs are annotated with dialog acts
    • 2 high level: context switch and context maintain
    • For each high level, 3 low levels: statement, question, answer
  • From the data it is concluded that context switch and questions are important to make a dialog longer
  • A dialog act classifier is learned based on the manual annotations
    • HRE encodes the dialog, and MLP at the end predicts dialog act probabilities for next utterance
    • Achieves 70% accuracy, it is employed to classify all dialog data
  • For the dialog model, the dialog act classifier is also trained (policy network), together with the response generator
    • Response generator is not hierarchical however, its inputs are the last two utterances and the predicted dialog act
  • After training the dialog model with supervised learning, they only train further the policy network with self-play reinforcement learning
    • Reward is the dialog length, and response relevance
    • Response relevance is a trained (with negative sampling) LSTM model estimating the relevance between response and a context
    • Dialogs are terminated if the utterances are repetitive or a length limit is reached
  • Performance of the SL and RL models is better (than the VHRED and RL-S2S baselines) only according to the distinct metric
    • Also much better according to human evaluation, RL model is either very good or very bad, while SL is more average
    • Also longer average dialog length than RL-S2S, both in machine-machine and human-machine setting
  • The predicted dialog acts give very nice interpretability and controllability over the generated dialog
    • Context switch replies are generally longer than context maintain

Why Do Neural Response Generation Models Prefer Universal Replies? [chat]

Bowen Wu, Nan Jiang, Zhifeng Gao, Suke Li, Wenge Rong, Baoxun Wang
  • The target probability in response generation can be decomposed to 2 probabilities
    • First a set of suitable words has to be found
    • Then this set of words has to be ordered to form a response
  • Analysing the 2 probabilities
    • The set-of-words probability leads to optimizing for high-frequency words from the set of replies to a given input
    • The word ordering probability just acts as a language model (basically independent from input)
  • A new loss is proposed, where there is a term trying to minimise the logprob of a randomly sampled negative response (not given to query)
    • This is considered more as a regularization than standard loss function
  • Slightly better than the simple seq2seq according to human evaluation and distinct metric
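
The regularized loss described above can be sketched roughly as follows. This is only an illustration, not the paper's implementation: the function name, the `neg_weight` hyperparameter and the per-token log-probability inputs are all my own assumptions.

```python
def regularized_nll(pos_token_logprobs, neg_token_logprobs, neg_weight=0.1):
    """Negative log-likelihood of the true reply, plus a penalty on the
    log-probability of a randomly sampled reply to some other query.

    neg_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    # Standard term: push up the log-probability of the true response.
    nll = -sum(pos_token_logprobs)
    # Regularizer: push down the log-probability of the negative response.
    penalty = sum(neg_token_logprobs)
    return nll + neg_weight * penalty
```

A negative sample that the model already finds very likely (a universal reply) contributes a larger penalty than an unlikely one, which is the intended regularizing effect.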

Aiming to Know You Better Perhaps Makes Me a More Engaging Dialogue Partner [chat]

Yury Zemlyanskiy, Fei Sha
  • The chatbot's goal is to choose utterances that elicit responses from the other agent which increase its understanding of it
  • Given a group of personality traits, the chatbot has to narrow these down to the set which characterizes the other agent
  • Maximize mutual information between dialog and revealed personality (discovery score)
  • Rerank beam search samples based on discovery score
  • Discovery score improves the engagingness of the chatbot in human evaluation
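
A minimal sketch of the reranking step, under my own assumptions: `discovery_score` is a hypothetical callable standing in for the paper's mutual-information estimate, and adding it to the beam log-probability with a `weight` is just one plausible way to combine the two.

```python
def rerank_by_discovery(candidates, discovery_score, weight=1.0):
    """Reorder beam search candidates, best first.

    candidates: list of (response, log_probability) pairs.
    discovery_score: hypothetical callable mapping a response to a score.
    weight: assumed trade-off between likelihood and discovery score.
    """
    return sorted(
        candidates,
        key=lambda cand: cand[1] + weight * discovery_score(cand[0]),
        reverse=True,
    )
```

With `weight=0` this reduces to the usual ordering by log-probability.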

Generating More Interesting Responses in Neural Conversation Models with Distributional Constraints [chat]

Ashutosh Baheti, Alan Ritter, Jiwei Li, and Bill Dolan
PyTorch (official)
  • Models trained to maximize conditional likelihood assign low probability to content words compared to function words
  • A topic constraint is added to the training objective
    • A random variable is defined over topics, and the probability of this variable given the source, and the prob. given the output has to be similar (dot product)
    • HMM-LDA model is used to estimate topic probability distribution given a sentence (word-wise, so it works with beam search)
  • A semantic constraint is added, so that the source and output have to be similar (dot product)
    • Arora et al., SIF average word embedding used
  • Adding MMI to these constraints results in the best performance on diversity measuring metrics, and also human evaluation for content richness

Adversarial Over-Sensitivity and Over-Stability Strategies for Dialogue Models [chat]

Tong Niu, Mohit Bansal
  • Should-not-change attack
    • Random swap: swap adjacent words
    • Stopword dropout
    • Data-level paraphrasing: only replace words with their synonyms
    • Generative-level paraphrasing: sentence-level paraphrase using neural networks
    • Grammar errors: introduce real grammar errors based on huge corpus
  • Should-change attack
    • Negate the root verb, change verbs
    • Adjectives or adverbs to their antonyms
    • Turn utterances to random
    • Turn utterances to random but keep entities
    • Turn only entities to random
  • 3 training/evaluation setups: train on normal data and evaluate on adversarial attacks; train on adversarial attacks and evaluate on adversarial attacks; train on adversarial attacks and evaluate on normal data
    • For training on should-change attacks, use max-margin loss together with maximum likelihood
  • VHRED and RL model are generally not robust to the attacks, and training on adversarial data makes them more robust
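
The two simplest should-not-change attacks listed above (random swap and stopword dropout) can be sketched like this. The stopword list, the `drop_prob` parameter and the choice to swap a single pair are my own assumptions for illustration, not details from the paper.

```python
import random

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and"}


def random_swap(tokens, rng):
    """Swap one randomly chosen pair of adjacent words."""
    out = list(tokens)
    if len(out) < 2:
        return out
    i = rng.randrange(len(out) - 1)
    out[i], out[i + 1] = out[i + 1], out[i]
    return out


def stopword_dropout(tokens, rng, drop_prob=0.5):
    """Drop each stopword independently with probability drop_prob."""
    return [t for t in tokens if t not in STOPWORDS or rng.random() >= drop_prob]
```

Usage: `random_swap("i want to go home".split(), random.Random(0))` returns the same words with one adjacent pair exchanged, so the meaning should stay (mostly) the same.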

Training Millions of Personalized Dialogue Agents [chat]

Pierre-Emmanuel Mazare, Samuel Humeau, Martin Raison, Antoine Bordes
  • From the huge reddit dump, they create personas using a profile's sentences that are about themselves
  • Dataset is constructed only as single turn
  • A retrieval model is used, and there is a separate persona and input encoder
  • Conditioning on personas clearly improves the recall metric
  • Transformer model achieves best performance
  • First training on the reddit data and then finetuning on persona-chat is much better than just training on persona-chat

Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization [chat]

Yizhe Zhang, Michel Galley, Jianfeng Gao, Zhe Gan, Xiujun Li, Chris Brockett, Bill Dolan
  • Adversarial training to improve diversity
  • Variational information maximization to regularize the adversarial learning, and boost informativeness
  • A backward model is used to calculate this variational lower bound over the mutual information
  • CNN encoder is used, and its output is fed into LSTM decoder together with a random noise vector
  • Soft-argmax is used to make it differentiable, and to be able to use deterministic policy gradient
  • For the discriminator, the source, the target and the generated response are all projected to the same space (learned embedding)
    • Cosine similarity between projected S,T and S,T' is computed
  • Generator tries to minimize difference between projected S,T and S,T', while discriminator tries to maximize it
  • Evaluation metrics are BLEU, the 3 embedding-based and the two distinct metrics, and an entropy metric
  • On all of them the AIM is better than a seq2seq-MMI baseline and a GAN baseline
  • According to human evaluation the AIM is better in informativeness than MMI, but on par in relevance
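
The quantity that the generator and discriminator fight over can be sketched as below. Plain Python lists stand in for the learned projections of the source S, the target T and the generated response T'; the function names are my own.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_gap(source, target, generated):
    """cos(S, T) - cos(S, T'): the discriminator tries to enlarge this gap,
    while the generator tries to shrink it."""
    return cosine(source, target) - cosine(source, generated)
```

A gap of 0 means the generated response is as close to the source as the true response is, in the projected space.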

Better Conversations by Modeling, Filtering, and Optimizing for Coherence and Diversity [chat]

Xinnuo Xu, Ondrej Dusek, Ioannis Konstas, Verena Rieser
PyTorch (official)
  • Latent variable based on context and based on coherence
  • Context gate to control the reliance on context or already generated response
    • This is dependent on coherence variable
    • But this coherence is computed based on dataset and fixed (better results than using the true coherence for each example)
  • Coherence measure: Cosine distance of source and response with stop word filtering
  • Base model is a cVAE
    • One of the losses is to minimize KL between prior and posterior network
    • This is why z can be sampled from the prior at inference the same way as it would be from the posterior
  • The original opensubs corpus is used, and a filtered version where they filter based on the coherence of source-response pairs
  • The coherence based data filtering improves results across all metrics
  • CVAE generally outperforms baseline seq2seq across metrics (BLEU, and distinct and coherence metrics)
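
A rough sketch of the coherence measure and the data filtering step. Note the assumptions: bag-of-words count vectors are used here as a stand-in for real word embeddings, the stopword list is a toy one, and `threshold` is a hypothetical parameter. The score is a cosine similarity (one minus the cosine distance), so higher means more coherent.

```python
import math
from collections import Counter

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"a", "an", "the", "is", "to", "of", "and", "i", "you"}


def coherence(source, response):
    """Cosine similarity of stopword-filtered bag-of-words vectors."""
    a = Counter(w for w in source.lower().split() if w not in STOPWORDS)
    b = Counter(w for w in response.lower().split() if w not in STOPWORDS)
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def filter_pairs(pairs, threshold):
    """Keep only (source, response) pairs whose coherence reaches the threshold."""
    return [(s, r) for s, r in pairs if coherence(s, r) >= threshold]
```

With count vectors only exact word overlap counts as coherence; embeddings would also reward related but different words.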

Talking to myself: self-dialogues as data for conversational agents [chat]

Joachim Fainberg, Ben Krause, Mihai Dobre, Marco Damonte, Emmanuel Kahembwe, Daniel Duma, Bonnie Webber, Federico Fancellu
  • Dataset available
  • 25000 self-dialogues collected from Mturk on several categories
  • The dialogues are shown to be of high quality

Neural Approaches to Conversational AI [chat]

Jianfeng Gao, Michel Galley, Lihong Li
  • 70 page long paper, which offers a good in-depth introduction to the many aspects of conversational AI as a field.
  • Abstract: The present paper surveys neural approaches to conversational AI that have been developed in the last few years. We group conversational systems into three categories: (1) question answering agents, (2) task-oriented dialogue agents, and (3) chatbots. For each category, we present a review of state-of-the-art neural approaches, draw the connection between them and traditional approaches, and discuss the progress that has been made and challenges still being faced, using specific systems and models as case studies.

Contextual Topic Modeling for Dialog Systems [chat]

Chandra Khatri, Rahul Goel, Behnam Hedayatnia, Angeliki Metallinou, Anushree Venkatesh, Raefer Gabriel, Arindam Mandal
  • Data from 2017 alexa prize is annotated with dialog acts and topics
    • Keywords useful for determining topic are also labeled
  • Dialogs are also rated by humans for coherence and engagement (based on 4 yes-no questions about the dialog)
  • Topical depth (number of consecutive on-topic utterances) highly correlated with coherence and engagement
  • CDAN and CADAN, models extending the originals with context
    • Either average of utterances is used or dialog acts as context
  • BiLSTM performs best (classification accuracy) with added context and dialog acts
  • Context extended ADAN performs best for keyword detection
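
Topical depth can be computed from per-utterance topic labels along these lines. This is only a sketch: how the paper aggregates the run lengths is not in my notes, so taking the maximum in `max_topical_depth` is an assumption.

```python
from itertools import groupby


def topical_depths(topic_labels):
    """Lengths of runs of consecutive utterances that share a topic label."""
    return [(topic, sum(1 for _ in run)) for topic, run in groupby(topic_labels)]


def max_topical_depth(topic_labels):
    """Longest run of consecutive on-topic utterances (0 for an empty dialog)."""
    return max((depth for _, depth in topical_depths(topic_labels)), default=0)
```

For example, `topical_depths(["music", "music", "sports", "music"])` gives `[("music", 2), ("sports", 1), ("music", 1)]`.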

Automatic Evaluation of Neural Personality-based Chatbots [chat]

Yujie Xing, Raquel Fernandez
  • The persona model of Li et al. is modified to work with personality types (OCEAN score for 5 types)
  • The OCEAN score for each speaker is computed based on a number of utterances from that speaker
  • Pretraining on opensubs, because the tv-series dataset used is only 100k samples
  • Sample utterances are computed for each personality on a test set
  • The OCEAN score is able to distinguish somewhat well between personalities (60%)
  • Baseline model achieves only 0.16 F1
  • Original persona model achieves better distinguishability than the personality model (normal)
    • They are both higher than baseline
  • The personalities can be interpolated between the 5 types and if extremes are used 0.53 F1 can be achieved

NEXUS Network: Connecting the Preceding and the Following in Dialogue Generation [chat]

Hui Su, Xiaoyu Shen, Wenjie Li, Dietrich Klakow
  • Response should connect context history and future responses
    • Achieve this by maximizing the mutual information of the current utterance with both past and future contexts
  • Replace utterance with continuous code space learned from the whole dialog flow
    • Follow gaussian distribution
  • Dialog history and future encoded with hierarchical rnn
    • MLP on top to estimate gaussian mean and covariance
  • Based on history and code space the decoder computes output
  • At test time the code space is sampled based on only the history (prior distribution)
    • Use variational inference to maximize variational lower bound
  • Better than VHRED baseline in automatic metrics and human evaluation as well

MultiWOZ - A Large-Scale Multi-Domain Wizard-of-Oz Dataset for Task-Oriented Dialogue Modelling [chat]

Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Inigo Casanueva, Stefan Ultes, Osman Ramadan, Milica Gasic
  • 10k task-oriented annotated dialogs
  • Dataset

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding [s2s]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Tensorflow (official)
  • Model is the encoder part of a normal Transformer
  • Masked language model: mask some words in a sentence and predict these based on the others
  • 3-way masking: sometimes mask the word, sometimes replace it with a random word and sometimes keep the original word
  • Also pre-train the model for next sentence prediction:
    • Half the time a sentence is the true next sentence, and half the time it is random
      • Model has to predict a binary label
  • For classification fine-tuning just a classification layer is added, and all parameters are finetuned
  • For other types of tasks specific finetuning layers are added
  • It beats previous SOTA on all GLUE tasks
  • Extensive ablation study is conducted for pre-training type, number of steps and model size
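
The 3-way masking can be sketched as follows. The 15% selection rate and the 80/10/10 split are the values reported in the BERT paper (they are not in my notes above); the function name and the plain-list token handling are my own simplifications, not the official TensorFlow code.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (corrupted_tokens, labels).

    labels[i] is the original token at positions the model must predict,
    and None at positions that were not selected.
    """
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() >= select_prob:
            # Position not selected: keep the token, nothing to predict.
            corrupted.append(tok)
            labels.append(None)
            continue
        labels.append(tok)
        r = rng.random()
        if r < 0.8:
            corrupted.append("[MASK]")           # 80%: mask the word
        elif r < 0.9:
            corrupted.append(rng.choice(vocab))  # 10%: random word
        else:
            corrupted.append(tok)                # 10%: keep the original
    return corrupted, labels
```

Usage: `mask_tokens("my dog is hairy".split(), vocab, random.Random(0))`, where `vocab` is any list of replacement words.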

The RLLChatbot: a solution to the ConvAI challenge [chat]

Nicolas Gontier, Koustuv Sinha, Peter Henderson, Iulian Serban, Michael Noseworthy, Prasanna Parthasarathi, Joelle Pineau
  • They have some new dataset, but no link yet
  • Ensemble model with a ranker that ranks generated responses, and then select one to output
    • Generative, retrieval and rule-based systems
    • Neural question generator generates a question based on the news article
  • The dataset has a news article at the beginning of a dialog, which the dialog should be about
  • Supervised scoring:
    • Predict human vote (from dataset) based on conversation history
    • There are many different features used for this classifier
    • Classifier achieves 64% accuracy
  • RL based scorer:
    • Estimate the q-value of a response (expected reward after a response)
    • Reward is a weighted version of the vote signal
    • Deep q-network used
  • Since the supervised scorer is not that good, mainly a set of designed rules is used to select a response during the dialog
  • Data was also collected by the user selecting the best response among candidates
    • With this data the supervised scorer proved best with a policy of choosing the response with highest score

Neural Response Ranking for Social Conversation: A Data-Efficient Approach [chat]

Igor Shalyminov, Ondřej Dušek, Oliver Lemon
Tensorflow (official)
  • Dataset from the 2017 Alexa prize
    • Length correlates somewhat more with positive feedback than negative feedback
    • Length correlates poorly with user rating
  • Ranker takes as input previous utterances and other features like sentiment and names
    • MLP at the end outputs rating (or dialog length)
  • Evaluation is done with sentiment analysis, to check goodness of replies against a set of positive replies
    • How well can the ranker distinguish between positive and negative replies
  • Training with dialog length achieves slightly better performance than user rating (at a sufficiently big dataset size)

Generating Multiple Diverse Responses for Short-Text Conversation [chat]

Jun Gao, Wei Bi, Xiaojiang Liu, Junhui Li, Shuming Shi
PyTorch (official)
  • Generate a set of responses for each input (bag of instances)
  • The latent space consists of the vocabulary: a word is sampled from it, and the reply is generated based on this word
  • Model consists of a latent word inference and a response generation network
    • Response generator encodes the input and sampled words to generate a set of responses
    • Use the minimum of individual losses of responses as overall loss
  • Pretrain the latent word inference network on keyword extraction task
  • Pretrain generator network using top 1 inferred latent word
  • Then jointly train them, using RL for the word inference network, and backprop for the generator
  • Only sample from a smaller set of words (specified for each input), because of huge latent space
  • The proposed model is much better according to human evaluation than S2S and CVAE baselines

Wizard of Wikipedia: Knowledge-Powered Conversational agents [chat]

Emily Dinan, Stephen Roller, Kurt Shuster, Angela Fan, Michael Auli, Jason Weston
  • Dataset constructed where there is a wizard and an apprentice
    • They have to talk about some topics, but the wizard has access to relevant wikipedia articles
    • The relevant wikipedia article retrieval model is a simple fixed model
    • The wizard chooses a relevant article and sentence for his/her response (which will be used in the dataset)
  • Input is the initial topic and the utterances so far
    • Plus the retrieved sentences from wikipedia are attended to with a Transformer
    • The top knowledge sentence based on attention is selected, and further encoded together with the dialog context
  • Cross-entropy loss extended with a term to select the sentence from the articles which the annotator also selected
  • Transformer achieves a 25 R@1 for finding the correct knowledge sentence (better than MemNet)
  • The Transformer is both used in a retrieval and generative dialog setting:
    • Using the gold or predicted knowledge greatly improves performance for retrieval and generative models, and using gold knowledge is better
  • The two-stage generative transformer is better with predicted knowledge, while the end-to-end is better with gold knowledge
  • Pretraining on reddit improved the performance everywhere
  • According to human score retrieval transformer is better than generative, but generative gains bigger relative improvement from using wikipedia knowledge
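
The knowledge selection step (attend over the retrieved sentences, keep the top one) can be sketched like this. Plain lists stand in for the Transformer encodings of the dialog context and of each candidate sentence, and the plain dot-product scoring is a simplifying assumption on my part.

```python
import math


def select_knowledge(context_vec, sentence_vecs):
    """Return (index of the top knowledge sentence, attention weights)."""
    # Dot-product score between the context and every candidate sentence.
    scores = [sum(c * s for c, s in zip(context_vec, vec)) for vec in sentence_vecs]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    best = max(range(len(weights)), key=weights.__getitem__)
    return best, weights
```

The extra loss term mentioned above would then compare `weights` with the sentence that the annotator actually selected.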

Importance of a Search Strategy in Neural Dialogue Modelling [chat]

Ilya Kulikov, Alexander H. Miller, Kyunghyun Cho, Jason Weston
  • Comparing greedy, beam, and iterative beam search
  • Iterative beam search:
    • Run more beam searches, but exclude prior hypotheses, by setting their score to negative infinity
    • Thus the candidates are guaranteed to be dissimilar
  • A ranking term is added to the loss function (ranking negative responses lower)
    • Iterative beam search is best according to full length human dialog evaluation
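
A simplified sketch of iterative beam search as I understand it from the description above. Assumptions: `step_logprobs` is a hypothetical callable returning next-token log-probabilities for a prefix, all hypotheses have a fixed length (no end-of-sequence handling), and a prefix is excluded only if it exactly matches one kept by an earlier pass.

```python
import math


def beam_search(step_logprobs, beam_size, max_len, banned):
    """One beam search pass. Prefixes in `banned` get a score of -inf,
    so they can never be kept in the beam."""
    beams = [((), 0.0)]
    explored = set()
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, logprob in step_logprobs(prefix).items():
                new_prefix = prefix + (token,)
                banned_hit = new_prefix in banned
                new_score = -math.inf if banned_hit else score + logprob
                candidates.append((new_prefix, new_score))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = [c for c in candidates[:beam_size] if c[1] > -math.inf]
        explored.update(prefix for prefix, _ in beams)
        if not beams:
            break
    return beams, explored


def iterative_beam_search(step_logprobs, beam_size, max_len, iterations):
    """Run several passes, banning every prefix kept by earlier passes."""
    banned, results = set(), []
    for _ in range(iterations):
        beams, explored = beam_search(step_logprobs, beam_size, max_len, banned)
        results.extend(beams)
        banned |= explored
    return results
```

Because whole prefixes are banned, later passes are forced to diverge early, which is where the dissimilarity of the candidates comes from.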

A Study on Dialogue Reward Prediction for Open-Ended Conversational Agents [chat]

Heriberto Cuayáhuitl, Seonghan Ryu, Donghyeon Lee, Jihie Kim
  • Automatically derive dialog rewards for a dialog dataset
    • Positive reward if the response is in the dataset, negative for randomly sampled responses
    • Reward for the dialog is sum of all rewards
    • Generate dialogs with varying number of randomly sampled responses
      • Thus extend dataset from 20k to 150k dialogs
  • The model is a small 2-layer RNN, with a dense layer at the end
  • They experiment with different dialogue history lengths
    • The longer the dialog history the better, with a max of 0.81 correlation between predicted and true reward when using 25 sentences
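
The automatic reward derivation can be sketched as below. The reward values of +1 and -1 are my assumption (the notes only say positive and negative), and the function and parameter names are made up for the example.

```python
import random


def noisy_dialog_with_reward(dialog, response_pool, num_replaced, rng):
    """Replace num_replaced responses of a dialog with random ones.

    Returns (turns, reward), where the reward sums +1 for every original
    response and -1 for every randomly sampled one (assumed magnitudes).
    """
    replaced = set(rng.sample(range(len(dialog)), num_replaced))
    turns, reward = [], 0
    for i, response in enumerate(dialog):
        if i in replaced:
            turns.append(rng.choice(response_pool))
            reward -= 1
        else:
            turns.append(response)
            reward += 1
    return turns, reward
```

Calling this with different values of `num_replaced` on the same dialog is what extends the dataset: each call yields a new dialog with a different target reward.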