Transformers learn in-context by gradient descent
Training Transformers on auto-regressive tasks can be closely related to well-known gradient-based meta-learning formulations. The authors show that trained Transformers implement gradient descent in their forward pass, enabling a mechanistic understanding of optimized Transformers that learn in-context.
This could lead to improved optimization techniques and higher accuracy on regression problems across various domains.
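The paper's core identity can be illustrated in a minimal NumPy sketch (my own illustration, not the authors' code): with a suitable construction of keys, queries, and values, a single linear self-attention layer reproduces exactly one gradient-descent step on a least-squares regression loss over the in-context examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context regression task: n examples (x_i, y_i) plus a test query x_q.
n, d = 8, 3
X = rng.normal(size=(n, d))      # context inputs
w_true = rng.normal(size=d)
y = X @ w_true                   # context targets (one scalar per example)
x_q = rng.normal(size=d)         # query input
eta = 0.1                        # gradient-descent learning rate

# One GD step on L(w) = 0.5 * sum_i (w . x_i - y_i)^2, starting from w0 = 0.
# The updated prediction at the query is w1 . x_q.
w0 = np.zeros(d)
grad = (X @ w0 - y) @ X          # dL/dw = sum_i (w . x_i - y_i) * x_i
w1 = w0 - eta * grad
gd_pred = w1 @ x_q

# Linear self-attention with constructed projections:
# keys k_i = x_i, values v_i = eta * y_i, query q = x_q.
# output = sum_i v_i * (k_i . q) = eta * sum_i y_i * (x_i . x_q)
attn_pred = (eta * y) @ (X @ x_q)

# The two computations coincide.
assert np.allclose(gd_pred, attn_pred)
```

Starting from `w0 = 0` keeps the algebra short; the paper's construction handles the general case with an extra term carried through the residual stream.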
FlexiViT: One Model for All Patch Sizes
Randomizing the patch size during training yields a single model that performs well across a wide range of patch sizes, so it can be tailored to different compute budgets at deployment time.
This simplifies implementation and makes it easy to add compute-adaptive capabilities to most models that rely on a ViT backbone.
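Why patch size acts as a compute knob can be seen from tokenization alone. The sketch below (an illustration, not the FlexiViT code, which additionally resizes the patch-embedding weights themselves) shows how the token count, and hence the roughly quadratic self-attention cost, shrinks as the patch size grows:

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patch tokens.
    Assumes H and W are divisible by p."""
    H, W, C = img.shape
    tokens = img.reshape(H // p, p, W // p, p, C)
    # Gather each p x p x C patch into one flat token vector.
    tokens = tokens.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return tokens

img = np.zeros((240, 240, 3))
for p in (8, 16, 30, 48):       # patch sizes sampled at random during training
    n_tokens = patchify(img, p).shape[0]   # (240 // p) ** 2 tokens
    print(f"patch size {p:2d} -> {n_tokens:3d} tokens")
```

A patch size of 8 produces 900 tokens while 48 produces only 25, so sampling the patch size at train time exposes a large accuracy/compute trade-off at deployment.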
Image-and-Language Understanding from Pixels Only
CLIP-Pixels Only (CLIPPO) performs well on natural language understanding tasks without a word-level loss, and obtains good accuracy on visual question answering simply by rendering the question and image together.
This could increase the efficiency and ease of implementation for multimodal models, reducing the need for task- and modality-specific pieces and training procedures.
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Proposes and studies Attributed QA as a key first step in the development of attributed LLMs.
Provides insights on the development of attributed LLMs and proposes a reproducible evaluation framework for Attributed QA.
MAViL: Masked Audio-Video Learners
Presents a self-supervised audio-visual model that outperforms models trained with external supervision on the AudioSet and VGGSound benchmarks.
Offers a new approach to learning audio-visual representations, setting a new SotA on both benchmarks.