How to Adapt Your Large-Scale Vision-and-Language Model