References
[Allen 1987] Several Studies on Natural Language and Back-Propagation. IEEE First International Conference on Neural Networks, vol. 2, 1987. http://boballen.info/RBA/PAPERS/NL-BP/nl-bp.pdf
[Auli & Gao, ACL'14] Decoder Integration and Expected BLEU Training for Recurrent Neural Network Language Models. https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/acl2014_expbleu_rnn.pdf
[Auli, Galley, Quirk, Zweig, EMNLP'13] Joint Language and Translation Modeling with Recurrent Neural Networks. http://research-srv.microsoft.com/en-us/um/people/gzweig/Pubs/EMNLP2013RNNMT.pdf
[Bahdanau et al., ICLR'15] Neural Machine Translation by Jointly Learning to Align and Translate. http://arxiv.org/pdf/1409.0473.pdf
[Bahdanau et al., arXiv 2016] An Actor-Critic Algorithm for Sequence Prediction. https://arxiv.org/abs/1607.07086
[Bengio, Ducharme, Vincent, Jauvin, 2003] A neural probabilistic language model. JMLR. http://www.jmlr.org/papers/v3/bengio03a.html
[Bengio, Simard, Frasconi, 1994] Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks. 1994. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=279181
[Brown et al., 1990] A statistical approach to machine translation. CL. http://dl.acm.org/citation.cfm?id=92860
[Chan, Jaitly, Le, Vinyals, ICASSP 2016] Listen, Attend and Spell. http://arxiv.org/abs/1508.01211
[Cho, arXiv 2016] Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model. https://arxiv.org/abs/1605.03835
[Chrisman 1992] Learning recursive distributed representations for holistic computation. Connection Science 3(4):345–366. http://repository.cmu.edu/cgi/viewcontent.cgi?article=3061&context=compsci
[Chung, Cho, Bengio, ACL'16] A Character-Level Decoder without Explicit Segmentation for Neural Machine Translation. http://arxiv.org/pdf/1603.06147.pdf
[Chung, Gulcehre, Cho, Bengio, DLUFL'15] Empirical evaluation of gated recurrent neural networks on sequence modeling. http://arxiv.org/abs/1412.3555
[Cohn, Hoang, Vymolova, Yao, Dyer, Haffari, NAACL'16] Incorporating Structural Alignment Biases into an Attentional Neural Translation Model. https://arxiv.org/pdf/1601.01085.pdf
[Devlin et al., ACL'14] Fast and Robust Neural Network Joint Models for Statistical Machine Translation. http://acl2014.org/acl2014/P14-1/pdf/P14-1129.pdf
[Dong, Wu, He, Yu, Wang, ACL'15] Multi-task learning for multiple language translation. http://www.aclweb.org/anthology/P15-1166
[Elman 1990] Finding structure in time. Cognitive science, 14(2):179–211, 1990.
[Eriguchi, Hashimoto, Tsuruoka, ACL'16] Tree-to-Sequence Attentional Neural Machine Translation. http://arxiv.org/abs/1603.06075
[Firat, Cho, Bengio, NAACL'16] Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism. https://arxiv.org/pdf/1601.01073.pdf
[Firat, Cho, Bengio, 2016c] Multi-Way, Multilingual Neural Machine Translation with a Shared Attention Mechanism. CSL. (to appear)
[Firat et al., EMNLP'16] Zero-Resource Translation with Multi-Lingual Neural Machine Translation. http://arxiv.org/abs/1606.04164
[Gers, 2001] Long short-term memory in recurrent neural networks. PhD Thesis. http://felixgers.de/papers/phd.pdf
[Gu, Lu, Li, Li, ACL'16] Incorporating Copying Mechanism in Sequence-to-Sequence Learning. https://arxiv.org/pdf/1603.06393.pdf
[Gulcehre, Ahn, Nallapati, Zhou, Bengio, ACL'16] Pointing the Unknown Words. http://arxiv.org/pdf/1603.08148.pdf
[Hochreiter et al., 2001] Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. 2001. http://www.bioinf.jku.at/publications/older/ch7.pdf
[Hochreiter & Schmidhuber, 1997] Long Short-Term Memory. Neural Computation 9(8):1735–1780. http://deeplearning.cs.cmu.edu/pdfs/Hochreiter97_lstm.pdf
[Jean, Cho, Memisevic, Bengio, ACL'15] On Using Very Large Target Vocabulary for Neural Machine Translation. https://arxiv.org/abs/1412.2007
[Jelinek, 1969] Fast sequential decoding algorithm using a stack. IBM Journal of Research and Development. http://dx.doi.org/10.1147/rd.136.0675
[Ji, Haffari, Eisenstein, NAACL'16] A Latent Variable Recurrent Neural Network for Discourse-Driven Language Models. https://arxiv.org/pdf/1603.01913.pdf
[Ji, Vishwanathan, Satish, Anderson, Dubey, ICLR'16] BlackOut: Speeding up Recurrent Neural Network Language Models with very Large Vocabularies. http://arxiv.org/pdf/1511.06909.pdf
[Jia, Liang, ACL'16] Data Recombination for Neural Semantic Parsing. https://arxiv.org/pdf/1606.03622.pdf
[Jordan 1986] Serial order: A parallel distributed processing approach. ICS Report 8604, 1986; reprinted in Advances in Psychology, 121:471–495, 1997.
[Kalchbrenner & Blunsom, EMNLP'13] Recurrent Continuous Translation Models. http://anthology.aclweb.org/D/D13/D13-1176.pdf
[Kim, Jernite, Sontag, Rush, AAAI'16] Character-Aware Neural Language Models. https://arxiv.org/pdf/1508.06615.pdf
[Kim & Rush, arXiv 2016] Sequence-level knowledge distillation. https://arxiv.org/abs/1606.07947
[Kingma & Ba, ICLR'15] Adam: A Method for Stochastic Optimization. http://arxiv.org/abs/1412.6980
[Koehn, Och, Marcu, NAACL'03] Statistical phrase-based translation. http://dl.acm.org/citation.cfm?id=1073462
[Ling, Luís, Marujo, Astudillo, Amir, Dyer, Black, Trancoso, EMNLP'15] Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. http://arxiv.org/pdf/1508.02096.pdf
[Luong et al., ACL'15a] Addressing the Rare Word Problem in Neural Machine Translation. http://www.aclweb.org/anthology/P15-1002
[Luong et al., ACL'15b] Effective Approaches to Attention-based Neural Machine Translation. https://arxiv.org/pdf/1508.04025.pdf
[Luong & Manning, IWSLT'15] Stanford Neural Machine Translation Systems for Spoken Language Domain. http://nlp.stanford.edu/pubs/luong-manning-iwslt15.pdf
[Mikolov, 2012] Statistical Language Models Based on Neural Networks. PhD Thesis.
[Mikolov et al., Interspeech'10] Recurrent neural network based language model. http://www.fit.vutbr.cz/research/groups/speech/servite/2010/rnnlm_mikolov.pdf
[Mnih & Hinton, NIPS'09] A Scalable Hierarchical Distributed Language Model. https://www.cs.toronto.edu/~amnih/papers/hlbl_final.pdf
[Mnih & Teh, ICML'12] A fast and simple algorithm for training neural probabilistic language models. https://www.cs.toronto.edu/~amnih/papers/ncelm.pdf
[Mnih et al., NIPS'14] Recurrent Models of Visual Attention. http://papers.nips.cc/paper/5542-recurrent-models-of-visual-attention.pdf
[Morin & Bengio, AISTATS'05] Hierarchical Probabilistic Neural Network Language Model. http://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf
[Papineni, Roukos, Ward, Zhu, ACL'02] BLEU: a method for automatic evaluation of machine translation. http://dl.acm.org/citation.cfm?id=1073135
[Pascanu, Mikolov, Bengio, ICML'13] On the difficulty of training recurrent neural networks. http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf
[Pollack, 1990] Recursive distributed representations. Artificial Intelligence. http://www.sciencedirect.com/science/article/pii/000437029090005K
[Ranzato, Chopra, Auli, Zaremba, ICLR'16] Sequence level training with recurrent neural networks. http://arxiv.org/abs/1511.06732
[Saxe, McClelland, Ganguli, ICLR'14] Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. http://arxiv.org/abs/1312.6120
[Schwenk, 2007] Continuous space language models. CSL. http://www.sciencedirect.com/science/article/pii/S0885230806000325
[Schwenk, COLING'12] Continuous Space Translation Models for Phrase-Based Statistical Machine Translation. http://www.aclweb.org/old_anthology/C/C12/C12-2.pdf#page=1085
[Schwenk, Costa-Jussa, Fonollosa, IWSLT'06] Continuous space language models for the IWSLT 2006 task. http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.491.4144
[Schwenk, Rousseau, Attik, NAACL'12] Large, pruned or continuous space language models on a gpu for statistical machine translation. http://dl.acm.org/citation.cfm?id=2390942
[See, Luong, Manning, CoNLL'16] Compression of Neural Machine Translation Models via Pruning. http://arxiv.org/abs/1606.09274
[Sennrich, Haddow, Birch, ACL'16a] Improving Neural Machine Translation Models with Monolingual Data. http://arxiv.org/pdf/1511.06709.pdf
[Sennrich, Haddow, Birch, ACL'16b] Neural Machine Translation of Rare Words with Subword Units. http://arxiv.org/pdf/1508.07909.pdf
[Serban et al., AAAI'16] Building end-to-end dialogue systems using generative hierarchical neural network models. https://arxiv.org/abs/1507.04808
[Shen et al., ACL'16] Minimum Risk Training for Neural Machine Translation. http://arxiv.org/abs/1512.02433
[Sutskever, Martens, Hinton, ICML'11] Generating text with recurrent neural networks. http://machinelearning.wustl.edu/mlpapers/paper_files/ICML2011Sutskever_524.pdf
[Sutskever et al., NIPS'14] Sequence to Sequence Learning with Neural Networks. http://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf
[Tu, Lu, Liu, Liu, Li, ACL'16] Modeling Coverage for Neural Machine Translation. http://arxiv.org/pdf/1601.04811.pdf
[Vaswani, Zhao, Fossum, Chiang, EMNLP'13] Decoding with Large-Scale Neural Language Models Improves Translation. http://www.isi.edu/~avaswani/NCE-NPLM.pdf
[Wang, Cho, ACL'16] Larger-Context Language Modelling with Recurrent Neural Network. http://aclweb.org/anthology/P/P16/P16-1125.pdf
[Wiseman & Rush, arXiv 2016] Sequence-to-Sequence Learning as Beam-Search Optimization. http://arxiv.org/abs/1606.02960
[Xu, Ba, Kiros, Cho, Courville, Salakhutdinov, Zemel, Bengio, ICML'15] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. http://jmlr.org/proceedings/papers/v37/xuc15.pdf
[Zeiler, arXiv 2012] ADADELTA: an adaptive learning rate method. http://arxiv.org/abs/1212.5701
[Zoph, Knight, NAACL'16] Multi-source neural translation. http://www.isi.edu/natural-language/mt/multi-source-neural.pdf
[Zoph, Vaswani, May, Knight, NAACL'16] Simple, Fast Noise Contrastive Estimation for Large RNN Vocabularies. http://www.isi.edu/natural-language/mt/simple-fast-noise.pdf