Thu Dec 15 2022

Transformers learn in-context by gradient descent

Gradient-Based Meta-learning
Neural Network Architecture
Machine Learning
Regression

Training Transformers on auto-regressive tasks is closely related to well-known gradient-based meta-learning formulations. The paper shows that trained Transformers implement gradient descent in their forward pass, enabling a mechanistic understanding of optimized Transformers that learn in-context.

This could lead to improved optimization techniques and increased accuracy in regression problems in various domains.
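For the linear-regression case, the correspondence is easy to verify directly: one gradient-descent step from zero weights yields exactly the prediction of an (unnormalized) linear self-attention readout whose values are the in-context targets. A minimal numpy sketch, not the paper's construction (variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 8, 4
X = rng.normal(size=(N, d))       # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                    # in-context targets y_i
x_q = rng.normal(size=d)          # query input
eta = 0.5                         # learning rate

# One GD step on L(w) = 1/(2N) * sum_i (w.x_i - y_i)^2, starting from w = 0:
# gradient at 0 is -1/N * sum_i y_i x_i, so w1 = eta/N * sum_i y_i x_i.
w1 = (eta / N) * (y @ X)
pred_gd = w1 @ x_q

# Linear self-attention view: values y_i, unnormalized scores x_i . x_q.
pred_attn = (eta / N) * np.sum(y * (X @ x_q))

assert np.allclose(pred_gd, pred_attn)
```

The two computations agree term by term, which is the simplest instance of the paper's claim that a forward pass can implement an optimization step.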

FlexiViT: One Model for All Patch Sizes

Vision Transformers
Computer Vision
Machine Learning
Classification
Image-Text Retrieval
Open-World Detection
Panoptic Segmentation
Semantic Segmentation

Randomizing patch size during training leads to a model that performs well across a wide range of patch sizes, tailoring the model to different compute budgets at deployment time.

This can simplify implementation and make it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture.
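The key mechanism is that changing the patch size changes the token count, and hence the compute, while the backbone weights stay fixed. A rough numpy sketch of the patchify step under a randomly sampled patch size (the paper additionally resizes the patch-embedding weights with a pseudo-inverse scheme, omitted here):

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into flattened non-overlapping p x p patches."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    patches = img.reshape(H // p, p, W // p, p, C).swapaxes(1, 2)
    return patches.reshape(-1, p * p * C)   # (num_tokens, patch_dim)

img = np.zeros((224, 224, 3))
for p in (8, 16, 32):   # patch size sampled per training step
    tokens = patchify(img, p)
    print(p, tokens.shape)
```

Smaller patches give more tokens (more compute, typically higher accuracy); larger patches give fewer tokens, so one model can be matched to a deployment budget by picking `p` at inference time.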

Image-and-Language Understanding from Pixels Only

Multimodal Models
Computer Vision
Natural Language Processing
Natural Language Understanding
Multilingual Multimodal Retrieval

CLIP-Pixels Only (CLIPPO) can perform well on natural language understanding tasks without word-level loss, and obtain good accuracy in visual question answering simply by rendering the question and image together.

This could increase the efficiency and ease of implementation for multimodal models, reducing the need for task- and modality-specific pieces and training procedures.

Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models

Large Language Models
Natural Language Processing

Proposes and studies Attributed QA as a key first step in the development of attributed LLMs.

Provides insights on the development of attributed LLMs and proposes a reproducible evaluation framework for Attributed QA.

MAViL: Masked Audio-Video Learners

Self-Supervised Learning
Audio-Visual Representation Learning

Presents a self-supervised audio-visual model that outperforms externally supervised models on the AudioSet and VGGSound benchmarks.

Offers a new approach to training audio-visual representations and sets a new state of the art on AudioSet and VGGSound.
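The pretraining objective builds on masked prediction: most audio and video tokens are dropped and the model learns to reconstruct them. A simplified numpy sketch of MAE-style random masking (the mask ratio and token shapes are illustrative, not the paper's exact configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(tokens, mask_ratio=0.8):
    """Keep a random subset of tokens; the dropped ones become reconstruction targets."""
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1 - mask_ratio))))
    keep = np.sort(rng.permutation(n)[:n_keep])
    return tokens[keep], keep

audio_tokens = rng.normal(size=(196, 768))    # e.g. spectrogram patches
video_tokens = rng.normal(size=(1568, 768))   # e.g. space-time video patches

a_vis, a_idx = random_mask(audio_tokens)
v_vis, v_idx = random_mask(video_tokens)
# Only the visible tokens are encoded; a decoder reconstructs the masked ones.
```

Encoding only the small visible subset is what makes this pretraining cheap relative to processing full audio-video inputs.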
