Paper Summaries

I write summaries of papers I read. The idea is to record the important points briefly so that I can revisit them when needed.

Difference between Domain Adaptation and Transfer Learning?

There is some disagreement about the difference between 'Transfer Learning (TL)' and 'Domain Adaptation (DA).' I am putting my own views here. Disclaimer: this should be taken as a loose guideline rather than a definition. I think of transfer learning as an offline procedure of learning from separate, out-of-domain data for which a set of labelled data is available; the model is "pre-trained" and made available to others without any particular downstream task in mind. In the case of DA, however, I think of scenarios where data from both domains are available simultaneously and the algorithm is designed with both datasets in mind; algorithmic choices are made to learn from both. A streaming setup could be an example, where the algorithm takes both datasets and makes decisions based on their simultaneous presence. I do not mean to say that TL cannot or should not be applied in a streaming setup, but in my mind a streaming setup relates more closely to DA than to TL.

Zero-Shot Knowledge Distillation in Deep Networks (Nayak et al., 2019):

Distillation works better when "Dark Knowledge" is also utilized effectively.

Dark Knowledge (DK): the confidence scores of the teacher (T) for classes other than the true class. DK captures hidden information such as class similarities and helps the student (S) generalize better.

  1. Generate class template vectors using the weight matrix $W$ between the pre-final and final layers of T. Each row $w_i$ of $W$ captures the similarity between class $i$ and all classes.

  2. Calculate Class Similarity Matrix (CSM) between classes $i$ and $j$ using $\frac{w_i^Tw_j}{\|w_i\|\|w_j\|}$.

  3. Use the generated class similarity vector $c_k$ for class $k$ as the parameter of a Dirichlet distribution $Dir(K, c_k)$ to sample multiple logits $y^k_i$ for that class $k$.

  4. Scale $c_k$ with a parameter $\beta$ to better scale the concentration parameters: $Dir(K, c_k) \rightarrow Dir(K, \beta \times c_k)$.

  5. For each generated logit $y_i$, initialize a random input $x$ and pass it through T.

  6. Minimize the cross-entropy loss between the sampled logit $y_i$ and the output of T, i.e. $T(x, \tau)$, to generate Data Impressions (DIs).

  7. Train S using generated DIs.
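
A minimal PyTorch sketch of steps 1-6, assuming the teacher's final linear layer is exposed as `teacher.fc`; the clamp on the similarities (Dirichlet concentrations must be positive) and all hyperparameter values are illustrative assumptions, not the paper's exact settings:

```python
import torch
import torch.nn.functional as F

def generate_data_impressions(teacher, num_classes, n_per_class=10,
                              beta=1.0, tau=20.0, steps=500, lr=0.01,
                              input_shape=(3, 32, 32)):
    # Steps 1-2: class templates from the final layer and their cosine similarities
    W = teacher.fc.weight.detach()            # (K, d), one template row per class
    Wn = F.normalize(W, dim=1)
    csm = Wn @ Wn.T                           # class similarity matrix (CSM)
    impressions = []
    for k in range(num_classes):
        # Steps 3-4: sample soft targets from Dir(K, beta * c_k)
        c_k = csm[k].clamp(min=1e-3)          # keep concentrations positive (simplification)
        dist = torch.distributions.Dirichlet(beta * c_k)
        y = dist.sample((n_per_class,))       # sampled targets y_i^k on the simplex
        # Steps 5-6: optimize random inputs so T(x, tau) matches the targets
        x = torch.randn(n_per_class, *input_shape, requires_grad=True)
        opt = torch.optim.Adam([x], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            log_p = F.log_softmax(teacher(x) / tau, dim=1)
            loss = -(y * log_p).sum(dim=1).mean()   # cross-entropy with soft targets
            loss.backward()
            opt.step()
        impressions.append((x.detach(), y))   # Data Impressions for class k
    return impressions
```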


Questions:

  1. Section 3.3: Why is "K" called a parameter?

  2. Scaling factor: are there challenges with efficiently tuning $B$ hyperparameters?

  3. How easy is it to optimize $x$ in Equation 3?

  4. How is $\tau$ used in Equation 3? Shouldn't it be applied just after generating $y^k_i$?

Zero-shot Knowledge Transfer via Adversarial Belief Matching (Micaelli & Storkey, 2019):

Employs a generator to produce better pseudo data.


Algorithm:

  1. Sample noise data from distribution: $z \sim \mathcal{N}(\textbf{0},\textbf{I})$

  2. Pass $z$ as input to the generator $G(z,\phi)$ to obtain pseudo data $x_p$, maximizing the KL divergence between T and S so that the generator learns to produce data on which S disagrees with T (i.e. data that fools S)

  3. Use the generated data $x_p$ to minimize the KL divergence between T and S

  4. While minimizing the KL divergence, also encourage S to match T's activation values in all layers

  5. Train S using the generated pseudo data.
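
A minimal PyTorch sketch of the adversarial loop (the activation/attention-matching term of step 4 is omitted for brevity); `teacher`, `student`, `generator`, and the step counts are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def kl_ts(teacher_logits, student_logits):
    # KL(T || S) over the softmax outputs
    return F.kl_div(F.log_softmax(student_logits, dim=1),
                    F.softmax(teacher_logits, dim=1),
                    reduction="batchmean")

def zero_shot_step(teacher, student, generator, opt_g, opt_s,
                   z_dim=128, batch=64, n_student_steps=10):
    # Generator step: maximize T/S disagreement (note the minus sign)
    z = torch.randn(batch, z_dim)
    x_p = generator(z)
    loss_g = -kl_ts(teacher(x_p), student(x_p))
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()

    # Student steps: minimize the same divergence on fresh pseudo data
    for _ in range(n_student_steps):
        x_p = generator(torch.randn(batch, z_dim)).detach()
        loss_s = kl_ts(teacher(x_p).detach(), student(x_p))
        opt_s.zero_grad()
        loss_s.backward()
        opt_s.step()
```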


Questions:

  1. Why does high entropy make it harder to fool S?

  2. Activation Block (AB): Is AB something specific to CNNs?

  3. Understanding Equation 1:

    1. Does the 2nd term enforce S to have activation values similar to T's?

    2. Does Equation 1 mandate that T and S have the same number of layers? If not, how is $N_L$ calculated?

DeGAN: Data-Enriching GAN for Retrieving Representative Samples from a Trained Classifier (Addepalli et al., 2019):

Objective: Generate samples from a trained model using proxy data, employing a generator to produce better pseudo data.


Techniques:

  1. Standard GAN training via alternating minimization between the generator and discriminator

  2. The classifier's parameters are always frozen

  3. Measure the discriminator loss between proxy data and generated images

  4. Enforce entropy and diversity constraints by modifying the generator loss:

    • Minimize entropy: peaky per-sample class probabilities

    • Maximize diversity: a balanced number of samples per class


Algorithm:

  1. Generate a random sample from the latent space

  2. Pass the generated sample to the discriminator and compute the adversarial loss against the proxy dataset

  3. Pass the same generated sample to the classifier to compute the entropy and diversity losses
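
A minimal PyTorch sketch of the resulting generator objective, assuming models `generator`, `discriminator` (outputting one logit per sample), and a frozen `classifier`; the loss weights `lam_e` and `lam_d` are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def degan_generator_loss(generator, discriminator, classifier, z,
                         lam_e=1.0, lam_d=1.0):
    x_fake = generator(z)
    # Adversarial term: fool the discriminator trained against proxy data
    adv = F.binary_cross_entropy_with_logits(
        discriminator(x_fake), torch.ones(z.size(0), 1))
    p = F.softmax(classifier(x_fake), dim=1)      # frozen classifier's predictions
    # Entropy term: minimize per-sample entropy -> peaky class probabilities
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=1).mean()
    # Diversity term: maximize entropy of the batch-mean distribution
    # (adding negative entropy to the loss => balanced classes when minimized)
    p_bar = p.mean(dim=0)
    diversity = (p_bar * torch.log(p_bar + 1e-8)).sum()
    return adv + lam_e * entropy + lam_d * diversity
```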


Questions:

  1. How do the generated images solve the diversity problem in the medical example in the paper?

Distilling Transformers into Simple Neural Networks with Unlabelled Transfer Data (Mukherjee & Awadallah, 2019):

Studies task-specific knowledge distillation in the presence of a limited amount of labelled training data and a large amount of unlabelled transfer data.


Explores several aspects of the distillation process, such as the distillation objective, how to harness the teacher's representations, the training schedules, the impact of the amount of labelled training data, the size of the teacher, and soft vs. hard distillation. Shows a BiLSTM model to work on par with attention-based models like BERT.


For hard distillation, S is trained using T's predicted classes on U.

The student is a word-embedding layer followed by 2 BiLSTM layers; max-pooling over all token hidden representations generates the document representation.


NOTE: Table 2 summarises different approaches and their dependencies.


Algorithm:

  1. Fine-tune T with limited task-specific labelled data $D$.

  2. Generate logits (defined as $\mathrm{logit}(p) = \log\frac{p}{1-p}$) and last-layer representations on unlabelled data $U$ using T

  3. Minimize the MSE between the representations and logits generated by T and those generated by S

    • Pass the student-generated embedding through an extra projection layer to match T's embedding dimension

  4. Use $D$ to further fine-tune S in a gradual unfreezing manner
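
A minimal PyTorch sketch of the step-3 losses, assuming precomputed teacher/student representations and logits; the projection sizes and the weighting `alpha` are illustrative assumptions (the paper also explores staged training schedules rather than a fixed weighted sum):

```python
import torch
import torch.nn as nn

mse = nn.MSELoss()
proj = nn.Linear(256, 768)   # student dim -> teacher dim (illustrative sizes)

def distillation_loss(s_repr, s_logits, t_repr, t_logits, alpha=0.5):
    # Representation loss: project the student embedding into the teacher's space
    loss_repr = mse(proj(s_repr), t_repr.detach())
    # Logit loss: match the teacher's logits on the unlabelled transfer data
    loss_logit = mse(s_logits, t_logits.detach())
    return alpha * loss_repr + (1 - alpha) * loss_logit
```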

TextKD-GAN: Text Generation Using Knowledge Distillation and Generative Adversarial Networks (Haidar & Rezagholizadeh, 2019):

Using a GAN directly on text is challenging due to the discrete nature of text. Specifically, the argmax operation between the generator and discriminator blocks the gradient flow.


Traditionally, one-hot representations of tokens were used as the real input to the discriminator, while the generator's softmax output was used as the fake input. However, it is easy for the discriminator to differentiate between one-hot representations and softmax outputs. The solution: reconstruct the one-hot input using an autoencoder, so the discriminator compares the generator's softmax output against the autoencoder's (likewise soft) output.


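A minimal PyTorch sketch of how the discriminator's two inputs are softened, assuming an autoencoder `ae` that maps one-hot token vectors to reconstruction logits over the vocabulary and a `generator` that maps noise to per-token vocabulary logits; all names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def discriminator_inputs(real_ids, generator, ae, z, vocab_size):
    # Real side: pass one-hot tokens through the autoencoder, so the "real"
    # input is a soft distribution rather than an exact one-hot vector
    one_hot = F.one_hot(real_ids, vocab_size).float()
    real_soft = F.softmax(ae(one_hot), dim=-1)
    # Fake side: the generator's softmax output over the vocabulary
    fake_soft = F.softmax(generator(z), dim=-1)
    return real_soft, fake_soft
```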

A Simple Framework for Contrastive Learning of Visual Representations (Chen et al., 2020):


Usecase: Unsupervised representation learning for vision. Generative techniques such as VAEs or GANs are computationally expensive as they generate full-size images. Multiple discriminative learning techniques exist but may not be very helpful for downstream tasks.


Idea: Uses a contrastive task to project similar samples (transformed/augmented samples) closer together by maximizing agreement. Given a single example $x$, augment it to generate $x_i$ and $x_j$. Project the generated examples to a latent space using a NN (ResNet) encoder $f(\cdot)$ and a projection head (a 2-layer non-linear MLP). Identify $x_j$'s projection given $x_i$'s among other randomly selected examples. The contrastive loss is a normalized temperature-scaled cross-entropy loss.
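
A minimal PyTorch sketch of that normalized cross-entropy (NT-Xent) loss, assuming projected embeddings `z` of shape (2N, d) where rows i and i+N are the two augmented views of the same image; `tau` is the temperature:

```python
import torch
import torch.nn.functional as F

def nt_xent(z, tau=0.5):
    n = z.size(0) // 2
    z = F.normalize(z, dim=1)
    sim = z @ z.T / tau                            # pairwise cosine similarities
    mask = torch.eye(2 * n, dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))     # exclude self-similarity
    # The positive for row i is row i+N (and vice versa)
    targets = torch.arange(2 * n).roll(n)
    return F.cross_entropy(sim, targets)
```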

Countering Language Drift with Seeded Iterated Learning (Lu et al., 2020):


Idea: To mitigate language drift, a phenomenon where pretrained models lose their syntactic and semantic knowledge when fine-tuned for a specific task, the authors propose a teacher-student paradigm: the teacher learns through interaction on the specific task and finally generates a filtered dataset for the student to train on, thereby maintaining language-specific information. The teacher is created by copying the student before fine-tuning, and the student mimics the teacher's output. This teacher creation and learning are performed iteratively.

"Learning Bottleneck", a process to restrict learning by favouring structural properties (could not understand this process clearly), is responsible for retaining language properties which are achieved by regularization such as limiting the number of imitation steps. Got "Learning-bottleneck" now: Learning-bottleneck works as a regularizer by not providing the full information to student. As a result, all impurities generated during interactive learning is dropped. (This iterative process somewhat relates to self-learning but instead of learning from predicted labels, teacher chooses the candidate samples from original dataset to train on)


Related work: 1. Use an external labelled dataset for visual grounding, or a reward that incorporates language properties into the loss, or KL minimization. 2. Population-based: enforce social grounding through agents. 3. Alternate training between interactive and supervised phases.

Educating Text Autoencoders: Latent Representation Guidance via Denoising (Shen et al., 2019):


Idea: I felt the idea of this paper is pretty simple, though it focuses heavily on the theoretical explanation: instead of regenerating the original input sentence from itself, first generate augmented sentences by randomly dropping words, then ask the model to regenerate the original sentence from each augmented sentence.
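
A minimal sketch of that word-drop perturbation, with an illustrative drop probability `p_drop`: the encoder sees the corrupted sentence while the decoder's reconstruction target stays the original.

```python
import random

def word_drop(tokens, p_drop=0.3):
    kept = [t for t in tokens if random.random() > p_drop]
    return kept if kept else tokens  # never drop every word

sentence = "the quick brown fox jumps over the lazy dog".split()
noisy = word_drop(sentence)   # input to the encoder
target = sentence             # reconstruction target for the decoder
```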




References

  • Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020.

  • Lu, Y., Singhal, S., Strub, F., Pietquin, O., and Courville, A. Countering language drift with seeded iterated learning. arXiv preprint arXiv:2003.12694, 2020.

  • Shen, T., Mueller, J., Barzilay, R., and Jaakkola, T. Educating text autoencoders: Latent representation guidance via denoising. arXiv preprint arXiv:1905.12777, 2019.

  • Addepalli, S., Nayak, G. K., Chakraborty, A., and Babu, R. V. DeGAN: Data-enriching GAN for retrieving representative samples from a trained classifier. arXiv preprint arXiv:1912.11960, 2019.

  • Chen, H., Wang, Y., Xu, C., Yang, Z., Liu, C., Shi, B., Xu, C., Xu, C., and Tian, Q. Data-free learning of student networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3514–3522, 2019.

  • Ganesh, P., Chen, Y., Lou, X., Khan, M. A., Yang, Y., Chen, D., Winslett, M., Sajjad, H., and Nakov, P. Compressing large-scale transformer-based models: A case study on BERT. arXiv preprint arXiv:2002.11985, 2020.

  • Goldblum, M., Fowl, L., Feizi, S., and Goldstein, T. Adversarially robust distillation. arXiv preprint arXiv:1905.09747, 2019.

  • Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5767–5777, 2017.

  • Haidar, M. A. and Rezagholizadeh, M. TextKD-GAN: Text generation using knowledge distillation and generative adversarial networks. In Canadian Conference on Artificial Intelligence, pp. 107–118. Springer, 2019.

  • Lopes, R. G., Fenu, S., and Starner, T. Data-free knowledge distillation for deep neural networks. arXiv preprint arXiv:1710.07535, 2017.

  • Micaelli, P. and Storkey, A. J. Zero-shot knowledge transfer via adversarial belief matching. In Advances in Neural Information Processing Systems, pp. 9547–9557, 2019.

  • Mukherjee, S. and Awadallah, A. H. Distilling transformers into simple neural networks with unlabeled transfer data. arXiv preprint arXiv:1910.01769, 2019.

  • Nayak, G. K., Mopuri, K. R., Shaj, V., Babu, R. V., and Chakraborty, A. Zero-shot knowledge distillation in deep networks. arXiv preprint arXiv:1905.08114, 2019.

  • Zagoruyko, S. and Komodakis, N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. arXiv preprint arXiv:1612.03928, 2016.