Spatio-temporal attention-based blocks are used to predict depth and velocity. Leveraging the low inductive bias of transformers, patches across space and time are stacked and multiple multi-scale attention heads are used to regress velocity directly. Multiscale Transformers progressively expand the channel capacity while pooling the resolution from the input to the output of the network. As shown in Figure 9, MViT, in contrast to standard multi-head attention, pools the sequence of latent tensors to reduce the sequence length. The pooling operator applies a pooling kernel along each dimension and is applied independently to the intermediate query, key, and value tensors, yielding shortened pre-attention vectors. Attention is then computed on these shortened vectors.
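The following is a minimal sketch of Multi-Head Pooling Attention as described above, assuming a PyTorch implementation. The class name, the (B, T·H·W, C) token layout, the pooling strides, and the choice of strided depth-wise 3D convolutions as the pooling kernel are illustrative assumptions, not the exact instantiation used in the paper.

```python
import torch
import torch.nn as nn


class MultiHeadPoolingAttention(nn.Module):
    """Sketch of MHPA: pool Q, K, V independently, then attend over shorter sequences."""

    def __init__(self, dim, num_heads=8, thw=(4, 14, 14),
                 q_stride=(1, 2, 2), kv_stride=(1, 2, 2)):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.thw = thw                      # space-time shape of the token grid (T, H, W)
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

        # Independent pooling operators for query, key and value:
        # strided depth-wise 3D convolutions shrink the sequence length.
        def pool(stride):
            return nn.Conv3d(self.head_dim, self.head_dim, kernel_size=3,
                             stride=stride, padding=1, groups=self.head_dim)

        self.pool_q = pool(q_stride)
        self.pool_k = pool(kv_stride)
        self.pool_v = pool(kv_stride)

    def _pool(self, x, op):
        # x: (B, heads, N, head_dim) -> reshape into a space-time cube,
        # apply the pooling kernel, then flatten back to a shorter sequence.
        B, H, N, D = x.shape
        T, Hh, W = self.thw
        x = x.reshape(B * H, T, Hh, W, D).permute(0, 4, 1, 2, 3)
        x = op(x)
        t, h, w = x.shape[2:]
        return x.permute(0, 2, 3, 4, 1).reshape(B, H, t * h * w, D)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)     # each (B, heads, N, head_dim)
        q = self._pool(q, self.pool_q)            # shortened query sequence
        k = self._pool(k, self.pool_k)            # shortened key sequence
        v = self._pool(v, self.pool_v)            # shortened value sequence
        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, -1, C)
        return self.proj(out)
```

As a usage example, with `thw=(4, 14, 14)` and a query stride of (1, 2, 2), a sequence of 784 tokens is pooled to 196 query positions, so the attention matrix (and its cost) shrinks accordingly.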
The pooling factor gives an advantage during training; the rest of the process follows the standard ViT model. Our MViT comprises four scale stages, each containing Multi-Head Pooling Attention and MLP layers. The MViT projects the input to a channel dimension of 96 using overlapping space-time cubes of shape 3 x 7 x 7. The resulting sequence is reduced by a further factor of 4 at each additional stage, and we modified the final linear layer to regress depth and velocity for every four and eight frames. The input includes the detections, their confidence scores, and their velocities. The idea behind regressing depth and velocity directly, without keypoints, is to make use of non-autonomous-vehicle depth-prediction datasets to satisfy the data hunger of transformers.
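The sketch below illustrates how this instantiation could be configured, again assuming PyTorch. The stride of the cube embedding, the per-stage channel progression and block counts, and the layout of the regression head are assumptions chosen to be consistent with the text (channel dimension 96, overlapping 3 x 7 x 7 space-time cubes, four scale stages, a factor-of-4 sequence reduction per stage, and a final linear layer regressing depth and velocity).

```python
import torch
import torch.nn as nn


class PatchEmbed3D(nn.Module):
    """Overlapping space-time cube embedding: 3 x 7 x 7 kernel, channel dim 96."""

    def __init__(self, in_chans=3, embed_dim=96):
        super().__init__()
        # Kernel larger than stride -> overlapping cubes (stride values assumed).
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=(3, 7, 7),
                              stride=(2, 4, 4),
                              padding=(1, 3, 3))

    def forward(self, x):                         # x: (B, C, T, H, W)
        x = self.proj(x)                          # (B, 96, T', H', W')
        return x.flatten(2).transpose(1, 2)       # (B, T'*H'*W', 96)


# Four scale stages: each stage pools the sequence by 4 (2 x 2 spatially)
# while expanding the channel capacity (assumed progression and block counts).
stage_dims    = [96, 192, 384, 768]
stage_blocks  = [1, 2, 11, 2]
seq_reduction = 4


class DepthVelocityHead(nn.Module):
    """Final linear layer modified to regress depth and velocity (assumed layout)."""

    def __init__(self, dim=768, num_outputs=2):   # num_outputs: (depth, velocity)
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_outputs)

    def forward(self, tokens):                    # tokens: (B, N, dim) from last stage
        pooled = self.norm(tokens.mean(dim=1))    # global average over the sequence
        return self.fc(pooled)                    # (B, 2): depth, velocity
```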
Figure 9: (a) Network architecture proposed for MViT. (b) Instantiations used in the MViT.