Transformers: Surveys, etc.
The Illustrated Transformer (tutorial)
Transformers with Learnable Activation Functions (https://www.jmlr.org/papers/volume23/21-0998/21-0998.pdf)
Reinforcement Learning at Scale
Multimodal Models & Multimodal Transfer
Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task (blog)
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks (comparison of adapters)
AdapterFusion: Non-Destructive Task Composition for Transfer Learning
Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks
LoRA: Low-Rank Adaptation of Large Language Models (adapter alternative)
Flamingo: a Visual Language Model for Few-Shot Learning (perceiver resampler)
MAGMA – Multimodal Augmentation of Generative Models through Adapter-based Finetuning (adapters)
Perceiver: General Perception with Iterative Attention (YouTube video summary)
Learning Transferable Visual Models From Natural Language Supervision (CLIP)
GIT: A Generative Image-to-text Transformer for Vision and Language (generative approach; simple architecture, very strong results)
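Several of the adapter-style entries above (VL-Adapter, AdapterFusion, LoRA, MAGMA) share one pattern: freeze the pretrained weights and train only a small inserted module. A minimal numpy sketch of the LoRA-style low-rank update (all names, shapes, and the rank/alpha values here are illustrative, not from any of the papers):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 16, 16, 4, 8.0

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight (not updated)
A = rng.normal(size=(r, d_in)) * 0.01  # trainable low-rank down-projection
B = np.zeros((d_out, r))               # trainable up-projection, zero-initialized

def forward(x):
    # adapted layer: frozen path plus scaled low-rank update (alpha / r) * B A x
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

x = rng.normal(size=(2, d_in))
# zero-init B means the adapted layer starts out identical to the frozen one,
# and only r * (d_in + d_out) = 128 parameters are trained vs. 256 for full W
assert np.allclose(forward(x), x @ W.T)
```

The zero initialization of B is the key trick: fine-tuning starts exactly at the pretrained model and only gradually departs from it.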
Datasets/Tasks to test on:
Captioning:
Localized Narratives (if we want to get really ambitious, we can try training a MAGMA model that does things like pixel-level classification, bounding boxes, etc.)
Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
Another benchmark:
Data Generation:
Cycle-Consistent Counterfactuals by Latent Transformations
Semantic Segmentation with Diffusion: (only relevant for later generative portions)
Language
Teaching language models to support answers with verified quotes
Improving language models by retrieving from trillions of tokens
Chain of Thought Prompting Elicits Reasoning in Large Language Models
Data governance in the age of large-scale data-driven language technology
From Word Embeddings to Pre-Trained Language Models: A State-of-the-Art Walkthrough
Prompt-Augmented Linear Probing: Scaling Beyond The Limit of Few-shot In-Context Learners
GPT-3: Language Models are Few-Shot Learners (plus the "Paper Explained" video summary)
Vision
V-PROM: A Benchmark for Visual Reasoning Using Visual Progressive Matrices
Vision-language pre-training: Basics, recent advances, and future trends
Time-series Transformers
Multivariate Time Series Forecasting with Latent Graph Inference
ETSformer (https://arxiv.org/abs/2202.01381)
Pyraformer (https://openreview.net/pdf?id=0EXmFzUn5I)
Informer (https://arxiv.org/abs/2012.07436)
Reformer (https://arxiv.org/pdf/2001.04451.pdf)
N-HiTS (https://arxiv.org/pdf/2201.12886.pdf)
Autoformer (https://arxiv.org/pdf/2106.13008.pdf)
LogTrans (https://arxiv.org/pdf/1907.00235.pdf)
GLR: local/global time-series representations (https://arxiv.org/pdf/2202.02262.pdf)
TACTiS (https://arxiv.org/pdf/2202.03528.pdf)
MQTransformer (https://arxiv.org/pdf/2009.14799.pdf)
ProTran (https://proceedings.neurips.cc/paper/2021/file/c68bd9055776bf38d8fc43c0ed283Paper.pdf)
Preformer (https://arxiv.org/pdf/2202.11356.pdf)
Spacetimeformer (https://arxiv.org/pdf/2109.12218.pdf)
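All of the forecasters above consume a long multivariate series sliced into (context, horizon) training pairs. A minimal sketch of that shared preprocessing step (the function name and toy shapes are illustrative, not from any of the papers):

```python
import numpy as np

def make_windows(series: np.ndarray, context: int, horizon: int):
    """Slice a (T, n_vars) series into overlapping (context -> horizon) pairs,
    the input/target format used by transformer-based forecasters."""
    X, Y = [], []
    for t in range(len(series) - context - horizon + 1):
        X.append(series[t : t + context])                      # encoder input
        Y.append(series[t + context : t + context + horizon])  # forecast target
    return np.stack(X), np.stack(Y)

series = np.arange(20.0).reshape(10, 2)  # toy series: 10 steps, 2 variables
X, Y = make_windows(series, context=4, horizon=2)
# X has shape (5, 4, 2): five 4-step contexts
# Y has shape (5, 2, 2): the matching 2-step targets
```

Most of the listed models differ in how they attend over the context window (sparse, pyramidal, decomposed, cross-variable), not in this input format.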