How Transferable are Neural Networks in NLP Applications?

0. Copyright

(C) 2016. All rights reserved.

All material is freely available for non-commercial purposes. If you use it for research, please cite:

Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, Zhi Jin. How Transferable are Neural Networks in NLP Applications? In EMNLP, 2016. 


1. Download

URL: http://pan.baidu.com/s/1mgOv0vA
How? Press the "下载(15.9M)" button ("下载" means "download"; if you cannot find it, press Ctrl-F, paste "下载", and click the highlighted button).


2. Main Findings

RQ1: How transferable are neural networks between two tasks with similar or different semantics in NLP applications?

Whether a neural network is transferable in NLP depends largely on how semantically similar the two tasks are: transfer is helpful when the tasks are semantically similar, but much less so when they differ.


RQ2: How transferable are different layers of NLP neural models?

The output layer is mainly specific to the dataset and not transferable. The performance gain of neural domain adaptation comes mainly from transferring hidden layers. Word embeddings are likely to be transferable even to semantically different tasks, but the boost is small if they have already been pretrained in an unsupervised way on a large corpus.
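To make this concrete, below is a minimal PyTorch-style sketch (illustrative only, not the code released here) of transferring the embedding and hidden layers of a simple classifier while training the dataset-specific output layer from scratch. The model, its sizes, and the shared vocabulary are all assumptions made for the example.

    import torch
    import torch.nn as nn

    class Classifier(nn.Module):
        def __init__(self, vocab_size, emb_dim, hidden_dim, num_classes):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, emb_dim)            # word embeddings
            self.hidden = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # hidden layer
            self.output = nn.Linear(hidden_dim, num_classes)              # dataset-specific output

        def forward(self, x):
            _, (h_n, _) = self.hidden(self.embedding(x))
            return self.output(h_n[-1])

    # Hypothetical models; `source` is assumed already trained on the source task.
    source = Classifier(vocab_size=50000, emb_dim=100, hidden_dim=200, num_classes=3)
    target = Classifier(vocab_size=50000, emb_dim=100, hidden_dim=200, num_classes=2)

    # Transfer the lower layers; the output layer, being specific to the
    # dataset, keeps its fresh random initialization.
    target.embedding.load_state_dict(source.embedding.state_dict())
    target.hidden.load_state_dict(source.hidden.state_dict())

This corresponds to the INIT setting discussed under RQ3 below.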


RQ3: How transferable are INIT (parameter initialization: the target network is initialized with parameters trained on the source task) and MULT (multi-task learning: the source and target tasks are trained simultaneously with shared layers), respectively? What is the effect of combining these two methods?

MULT appears to be slightly better than (but generally comparable to) INIT in our experiments; combining MULT and INIT does not result in a further gain.
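As a rough illustration of MULT, continuing the sketch under RQ2 above: the embedding and hidden layers are shared between the two tasks, each task keeps its own output layer, and each update interpolates the two costs with a weight lambda. The iterators source_batches and target_batches are placeholders for the two tasks' training data.

    # Tie the lower layers so the two tasks train them jointly.
    target.embedding = source.embedding
    target.hidden = source.hidden

    lam = 0.7  # interpolation weight lambda (illustrative value)
    criterion = nn.CrossEntropyLoss()
    # source.parameters() now covers the shared layers plus the source
    # output layer; only the target output layer must be added.
    params = list(source.parameters()) + list(target.output.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-3)

    for (xs, ys), (xt, yt) in zip(source_batches, target_batches):
        optimizer.zero_grad()
        # Overall cost of the form lam * J_target + (1 - lam) * J_source.
        loss = (lam * criterion(target(xt), yt)
                + (1 - lam) * criterion(source(xs), ys))
        loss.backward()
        optimizer.step()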

Additional findings:

Q: How does the learning rate affect transfer?

A: Transferring the learning rate from the source task is not necessarily useful. A large learning rate does not damage the knowledge stored in the pretrained parameters, but it accelerates the training process to a large extent. In all, if computational resources allow, we may need to perform validation to choose the learning rate.
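If one does validate, a simple sweep suffices. In this hypothetical sketch, fine_tune_and_eval is a placeholder for one full fine-tuning run that returns validation accuracy on the target task; the candidate rates are illustrative, not from the paper.

    best_lr, best_acc = None, 0.0
    for lr in (3e-4, 1e-3, 3e-3, 1e-2):
        acc = fine_tune_and_eval(target, learning_rate=lr)  # assumed helper
        if acc > best_acc:
            best_lr, best_acc = lr, acc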


Q: When is the network ready to be transferred?

A: The results are not consistent across datasets.

When we evaluated SNLI-->SICK transfer, we found, surprisingly, that only a few training epochs on the source dataset were sufficient to capture transferable knowledge, even though performance on the source task was not yet optimal. However, in the new version of the paper, we evaluated this hypothesis on another pair (IMDB-->MR), where the learning curves of the source and target tasks align well. We do not have an explanation for this discrepancy.