WHAT IS ZERO-SHOT LEARNING?
The goal of zero-shot learning is to recognize classes whose instances may never have been encountered during training. In other words, zero-shot learning addresses the problem of unseen classes.
There are two main variants of the zero-shot learning problem, distinguished by the classes against which the model is tested:
Generalized Zero-Shot Learning (GZSL): In this variant, the model is trained on a set of seen classes and tested on a mixture of seen and unseen classes. The model is expected to perform well on both.
Zero-Shot Learning (ZSL): In this variant, the model is trained on a set of seen classes and tested only on unseen classes. The model is expected to perform well on the unseen classes alone.
We have adopted the Generalized Zero-Shot Learning setting; a minimal evaluation sketch follows.
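To make the two settings concrete, here is a small Python/NumPy sketch (function and variable names are illustrative, not from our codebase). Per-class accuracy is the usual metric in both settings, and GZSL results are commonly summarized by the harmonic mean of the seen and unseen accuracies:

    import numpy as np

    def per_class_accuracy(y_true, y_pred, classes):
        # Mean of per-class accuracies over the given set of classes.
        return float(np.mean([(y_pred[y_true == c] == c).mean() for c in classes]))

    # ZSL: test labels come only from unseen classes.
    # acc_zsl = per_class_accuracy(y_true, y_pred, unseen_classes)

    # GZSL: test labels come from both seen and unseen classes; the two
    # per-class accuracies are commonly combined via the harmonic mean,
    # which rewards models that do well on both.
    def gzsl_harmonic_mean(acc_seen, acc_unseen):
        return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)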
EXTRACTING MULTIPLE VIEWS
(Figure: the MVCNN pipeline, reproduced from the original MVCNN paper.)
In MVCNN, a 3D shape is rendered from 12 different viewpoints, and each rendering is passed through CNN1 to extract view-based features. These features are then pooled across the views and passed through CNN2 to extract shape-based features, which are used for classification. The view-pooling step is an element-wise max over the view features, analogous to the spatial max-pooling inside a CNN but applied across views rather than across image locations.
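A minimal PyTorch sketch of this pipeline is below. The layer sizes and module definitions are illustrative stand-ins for CNN1 and CNN2, not the paper's exact architecture; the key operation is the element-wise max across the 12 view features:

    import torch
    import torch.nn as nn

    class MVCNNSketch(nn.Module):
        def __init__(self, feat_dim=512, num_classes=40):
            super().__init__()
            # Stand-in for CNN1: per-view feature extractor.
            self.cnn1 = nn.Sequential(
                nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, feat_dim),
            )
            # Stand-in for CNN2: shape-level head used for classification.
            self.cnn2 = nn.Linear(feat_dim, num_classes)

        def forward(self, views):               # views: (batch, 12, 3, H, W)
            b, v = views.shape[:2]
            x = self.cnn1(views.flatten(0, 1))  # (batch * 12, feat_dim)
            x = x.view(b, v, -1)
            pooled = x.max(dim=1).values        # view pooling: element-wise max across views
            return self.cnn2(pooled)            # shape-level class logits

    # Usage: logits = MVCNNSketch()(torch.randn(2, 12, 3, 224, 224))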
PROMPT LEARNING
Conventional approaches to large language models involve a pre-training phase on unlabeled data, followed by fine-tuning on labeled data tailored to a specific task. Prompt-based learning models, by contrast, can adapt to diverse tasks directly, with the required domain knowledge conveyed through the prompt itself.
The effectiveness of a prompt-based model's output is significantly influenced by the prompt's quality. A carefully formulated prompt helps the model produce outputs that are more precise and relevant, whereas an inadequately constructed prompt may result in outputs that are disjointed or irrelevant. A short illustration follows.
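As a small illustration using the Hugging Face transformers library (GPT-2 is used purely as a convenient example model; the prompts themselves are ours), the more specific prompt states the task directly, so the continuation is far more likely to be relevant:

    from transformers import pipeline

    # GPT-2 chosen only as a lightweight example model.
    generator = pipeline("text-generation", model="gpt2")

    vague_prompt = "Chair."
    specific_prompt = ("Describe the distinguishing visual features of a chair "
                       "in one sentence:")

    # The specific prompt conveys the task in the prompt itself.
    print(generator(specific_prompt, max_new_tokens=30)[0]["generated_text"])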
TRIPLET LOSS
Triplet loss operates on sets of three data points, typically referred to as triplets. Each triplet consists of an anchor data point, a positive data point, and a negative data point:
Anchor: The anchor is a data point for which we want to learn a meaningful embedding. For example, in face recognition, the anchor might be a picture of a person's face.
Positive: The positive is another data point that is similar or belongs to the same class as the anchor. In face recognition, this would be another picture of the same person's face.
Negative: The negative is a data point that is dissimilar or belongs to a different class than the anchor. It could be a picture of a different person's face.
The objective of triplet loss is to minimize the distance between the anchor and the positive while pushing the anchor-negative distance apart, typically requiring it to exceed the anchor-positive distance by at least a margin α: the loss is max(d(a, p) − d(a, n) + α, 0). This encourages an embedding space in which the anchor lies closer to similar data points (positives) than to dissimilar data points (negatives).
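PyTorch ships a ready-made implementation of this hinge. Below is a minimal sketch with random embeddings standing in for the outputs of an embedding network (batch size, embedding dimension, and margin are chosen arbitrarily):

    import torch
    import torch.nn as nn

    # Euclidean distance (p=2) with margin alpha = 1.0.
    triplet_loss = nn.TripletMarginLoss(margin=1.0, p=2)

    # Embeddings for a batch of 8 triplets, 128-dim each (random stand-ins
    # for the outputs of an embedding network).
    anchor = torch.randn(8, 128, requires_grad=True)
    positive = torch.randn(8, 128, requires_grad=True)
    negative = torch.randn(8, 128, requires_grad=True)

    # Mean over the batch of max(d(a, p) - d(a, n) + margin, 0).
    loss = triplet_loss(anchor, positive, negative)
    loss.backward()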