The Vision Transformer (ViT) is a transformer-based model for image classification that also serves as a backbone for other computer vision tasks, including object detection. The MATLAB ViT support package provides three variants of the model: 1) Base-16, 2) Small-16, and 3) Tiny-16. Each variant was pretrained on the ImageNet dataset at an input resolution of 384-by-384 pixels, and the corresponding pretrained weights are stored as .mat files.
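A quick sanity check of the token count implied by the model names and input resolution: the "-16" suffix denotes a 16-by-16 patch size, so a 384-by-384 image yields (384/16)^2 patch tokens plus one class token. The variable names below are illustrative:

```matlab
imageSize  = 384;  % input resolution stated above
patchSize  = 16;   % the "-16" in Base-16 / Small-16 / Tiny-16
numPatches = (imageSize / patchSize)^2;  % 576 patch tokens
numTokens  = numPatches + 1;             % plus one class token = 577
```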
MATLAB Code:
clc;
clear;                       % clear workspace variables
close all;
reset(gpuDevice);            % reset GPU memory (requires Parallel Computing Toolbox)
gpuDevice(1);                % select GPU device 1

numClass = 100;              % number of classes in the new classification task

net = visionTransformer("tiny-16-imagenet-384");    % Load Tiny-16 ViT model
% net = visionTransformer("small-16-imagenet-384"); % Load Small-16 ViT model
% net = visionTransformer("base-16-imagenet-384");  % Load Base-16 ViT model

lgraph = layerGraph(net);    % extract the layer graph from the pretrained network
net.Layers(1)                % display first (image input) layer information
inputSize = net.Layers(1).InputSize; % input image size, e.g. [384 384 3]

% Remove the ImageNet classification head and attach a new head for numClass classes
lgraph = removeLayers(lgraph, {'head','softmax'});
newLayers = [
    fullyConnectedLayer(numClass,'Name','fc','WeightLearnRateFactor',10,'BiasLearnRateFactor',10)
    softmaxLayer('Name','softmax')
    classificationLayer('Name','classoutput')];
lgraph = addLayers(lgraph, newLayers);
lgraph = connectLayers(lgraph, 'cls_index', 'fc'); % connect class-token output to the new head
analyzeNetwork(lgraph)       % inspect the modified network
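The modified layer graph can then be fine-tuned on a custom dataset. A minimal sketch, assuming images organized in class subfolders under a hypothetical dataset folder; the folder name and hyperparameters below are illustrative, not from the original listing:

```matlab
% Load a custom dataset; 'dataset' is a hypothetical folder with one
% subfolder per class, used here only for illustration
imds = imageDatastore('dataset', ...
    'IncludeSubfolders', true, 'LabelSource', 'foldernames');
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.8, 'randomized');

% Resize images to the ViT input size (384-by-384-by-3)
augTrain = augmentedImageDatastore(inputSize(1:2), imdsTrain);
augVal   = augmentedImageDatastore(inputSize(1:2), imdsVal);

% Illustrative training options; tune for your data and GPU memory
options = trainingOptions('adam', ...
    'InitialLearnRate', 1e-4, ...
    'MiniBatchSize', 16, ...
    'MaxEpochs', 5, ...
    'ValidationData', augVal, ...
    'ExecutionEnvironment', 'gpu', ...
    'Plots', 'training-progress');

trainedNet = trainNetwork(augTrain, lgraph, options);
```

A small mini-batch size is used because ViT models at 384-by-384 resolution are memory-intensive; increase it if your GPU allows.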