Introduction:
This work was done while I was interning at Deakin University under the supervision of Prof. Jinho Choi. The concept of transferring knowledge between machine learning models came into the limelight in 2015 (Hinton et al.). The distillation scenario typically consists of a teacher (a well-trained and generally more complex model) and a student (a less complex model with limited training); there can be multiple of either. Hinton et al. focused on improving the student by training it on a combination of the ground-truth labels and the teacher's predictions. In practice, however, ground truths (labels) may not be available, and in such cases we must rely solely on the teacher's predictions. To date, limited research has addressed this specific setting.
Initial Setup:
We focus on the scenario where the labels/ground truth for the data are not available during the distillation process. For our experiments we use the readily available Fashion-MNIST dataset.
We set up two models, a teacher and a student. For the sake of simplicity, we limit ourselves to one teacher and one student only. The details of the two models are given below:
Teacher: a CNN with 3 convolutional layers and 2 fully connected layers, with a softmax activation at the output layer
Student: a multilayer perceptron with 2 fully connected layers and a softmax activation at the output layer (a minimal sketch of both models is given below)
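A minimal PyTorch sketch of the two models described above follows; the layer widths, kernel sizes, and the choice of which feature map is exposed for distillation are illustrative assumptions rather than the exact configuration used in the experiments (the softmax is folded into the loss rather than placed in the network).

```python
import torch
import torch.nn as nn

class Teacher(nn.Module):
    # CNN: 3 convolutional layers followed by 2 fully connected layers.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 28x28 -> 14x14
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                              # 14x14 -> 7x7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 128), nn.ReLU(),
            nn.Linear(128, num_classes),                  # softmax applied in the loss
        )

    def forward(self, x, return_features=False):
        f = self.features(x)
        logits = self.classifier(f)
        return (logits, f.flatten(1)) if return_features else logits

class Student(nn.Module):
    # MLP: 2 fully connected layers.
    def __init__(self, num_classes=10):
        super().__init__()
        self.hidden = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU())
        self.out = nn.Linear(128, num_classes)            # softmax applied in the loss

    def forward(self, x, return_features=False):
        f = self.hidden(x)
        logits = self.out(f)
        return (logits, f) if return_features else logits
```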
The teacher is first trained to convergence using the images and labels from the dataset, reaching roughly 90% accuracy on the test set. The student is deliberately constrained to limited training and reaches an accuracy of 39% on the same test set.
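As an illustration, the pre-distillation training could look like the sketch below, continuing the model definitions above. The optimizer, batch size, and epoch counts are assumptions; in particular, the exact training budget that leaves the student at ~39% is not specified here.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

train_set = datasets.FashionMNIST("data", train=True, download=True,
                                  transform=transforms.ToTensor())
train_loader = DataLoader(train_set, batch_size=128, shuffle=True)

def train(model, epochs):
    # Standard supervised training; softmax is folded into the cross-entropy loss.
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    ce = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in train_loader:
            opt.zero_grad()
            ce(model(x), y).backward()
            opt.step()

teacher, student = Teacher(), Student()
train(teacher, epochs=20)  # trained to convergence (~90% test accuracy)
train(student, epochs=1)   # deliberately limited training so the student stays weak
```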
Distillation Idea:
During distillation, we rely completely on the responses of the teacher. A separate, held-out subset of the dataset, containing samples that neither model has seen, is used for this process. We make use of the responses (predicted class distributions) of the teacher and the student along with their intermediate feature maps. The main objective is to reduce the discrepancy between the teacher's and the student's predictions, which can be done by minimizing the KL-divergence between the two predicted distributions. This process is further strengthened by additionally requiring that the intermediate feature representations produced during classification be similar to each other. However, because the two models have different architectures, their features differ in dimension and, more importantly, in what they represent: we do not know what a feature from the teacher's second convolutional layer encodes, or whether it should be similar to the output of the student's first hidden layer.
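Before addressing the feature mismatch, here is a sketch of the response-matching term described above; writing the KL out explicitly keeps its direction consistent with the KL(sp || tp) term in the objective that follows.

```python
import torch.nn.functional as F

def response_kd_loss(student_logits, teacher_logits):
    # KL(sp || tp): discrepancy of the student's predicted distribution from the
    # teacher's. Only the teacher's response is used; no ground-truth labels.
    log_sp = F.log_softmax(student_logits, dim=1)
    log_tp = F.log_softmax(teacher_logits, dim=1)
    return (log_sp.exp() * (log_sp - log_tp)).sum(dim=1).mean()
```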
Hence, we map the features of both models into a common higher-dimensional space using affine transformations, under the assumption that a similarity exists there. The intuition is that a higher-dimensional space introduces more degrees of freedom, so the projected features of the two models can point in roughly the same direction of information, given that both models solve the same task.
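Continuing the sketch, the affine projections could be implemented as below. The shared dimension of 2048 and the specific feature sizes (the teacher's flattened 64*7*7 map and the student's 128-dimensional hidden output, as in the model sketch above) are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class FeatureProjector(nn.Module):
    # Affine maps that lift both models' features into a shared higher-dimensional
    # space, where the projected features are encouraged to point the same way.
    def __init__(self, teacher_dim, student_dim, common_dim=2048):
        super().__init__()
        self.proj_t = nn.Linear(teacher_dim, common_dim)
        self.proj_s = nn.Linear(student_dim, common_dim)

    def forward(self, teacher_feat, student_feat):
        tf = self.proj_t(teacher_feat)
        sf = self.proj_s(student_feat)
        # Mean cosine similarity of the projected features over the batch.
        return F.cosine_similarity(sf, tf, dim=1).mean()
```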
Training Process:
Finally, we break down the objective function to be minimized into two parts. The first part is the KL-divergence between the student's prediction (sp) and the teacher's prediction (tp), and for the second we use the cosine similarity between the high-dimensional feature vectors (sf and tf). Hence, the objective function becomes:
L = a * KL(sp || tp) - (1 - a) * cosine_similarity(sf, tf),   a ∈ (0, 1)
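Putting the two terms together, one distillation pass over the held-out, unlabeled subset might look like the sketch below, reusing the pieces defined above. The weight a = 0.5, the optimizer settings, and the data loader distill_loader over the held-out subset are assumptions; the projection layers are trained jointly with the student while the teacher stays frozen.

```python
import torch

a = 0.5  # trade-off weight, a in (0, 1)
projector = FeatureProjector(teacher_dim=64 * 7 * 7, student_dim=128)
opt = torch.optim.Adam(list(student.parameters()) + list(projector.parameters()),
                       lr=1e-3)

teacher.eval()
student.train()
for x, _ in distill_loader:        # held-out subset; the labels are ignored
    with torch.no_grad():          # the teacher is frozen during distillation
        t_logits, t_feat = teacher(x, return_features=True)
    s_logits, s_feat = student(x, return_features=True)

    kl = response_kd_loss(s_logits, t_logits)
    cos = projector(t_feat, s_feat)
    loss = a * kl - (1 - a) * cos  # L = a * KL(sp || tp) - (1 - a) * cos(sf, tf)

    opt.zero_grad()
    loss.backward()
    opt.step()
```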
Results:
As mentioned, the initial accuracy of the student model was 39%.
Using only the KL-divergence loss between the responses of the two models, the student improves to 83%.
Using the KL-divergence loss together with the cosine similarity between the high-dimensional projections of the models' intermediate feature maps, the student improves to 86%.
Hence, with this hybrid combination of response-based and feature-based knowledge distillation, we can substantially improve the student's performance without access to ground-truth labels.