Transformers learn in-context by gradient descent
Training Transformers on auto-regressive tasks can be closely related to well-known gradient-based meta-learning formulations. The authors show that trained Transformers implement gradient descent in their forward pass, enabling a mechanistic understanding of optimized Transformers that learn in-context.
This could lead to improved optimization techniques and higher accuracy on regression problems across various domains.
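The paper's core identity can be illustrated in a minimal NumPy sketch (my own illustration, not the authors' code): with a suitable construction of keys, queries, and values, a single linear self-attention layer reproduces exactly one gradient-descent step on a least-squares regression loss over the in-context examples.

```python
import numpy as np

rng = np.random.default_rng(0)

# In-context regression task: n examples (x_i, y_i) plus a test query x_q.
n, d = 8, 3
X = rng.normal(size=(n, d))      # context inputs
w_true = rng.normal(size=d)
y = X @ w_true                   # context targets (one scalar per example)
x_q = rng.normal(size=d)         # query input
eta = 0.1                        # gradient-descent learning rate

# One GD step on L(w) = 0.5 * sum_i (w . x_i - y_i)^2, starting from w0 = 0.
# The updated prediction at the query is w1 . x_q.
w0 = np.zeros(d)
grad = (X @ w0 - y) @ X          # dL/dw = sum_i (w . x_i - y_i) * x_i
w1 = w0 - eta * grad
gd_pred = w1 @ x_q

# Linear self-attention with constructed projections:
# keys k_i = x_i, values v_i = eta * y_i, query q = x_q.
# output = sum_i v_i * (k_i . q) = eta * sum_i y_i * (x_i . x_q)
attn_pred = (eta * y) @ (X @ x_q)

# The two computations coincide.
assert np.allclose(gd_pred, attn_pred)
```

Starting from `w0 = 0` keeps the algebra short; the paper's construction handles the general case with an extra term carried through the residual stream.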
FlexiViT: One Model for All Patch Sizes
Randomizing the patch size during training yields a single model that performs well across a wide range of patch sizes, so it can be tailored to different compute budgets at deployment time.
This simplifies implementation and makes it easy to add compute-adaptive capabilities to most models that rely on a ViT backbone.
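Why patch size acts as a compute knob can be seen from tokenization alone. The sketch below (an illustration, not the FlexiViT code, which additionally resizes the patch-embedding weights themselves) shows how the token count, and hence the roughly quadratic self-attention cost, shrinks as the patch size grows:

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patch tokens.
    Assumes H and W are divisible by p."""
    H, W, C = img.shape
    tokens = img.reshape(H // p, p, W // p, p, C)
    # Gather each p x p x C patch into one flat token vector.
    tokens = tokens.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return tokens

img = np.zeros((240, 240, 3))
for p in (8, 16, 30, 48):       # patch sizes sampled at random during training
    n_tokens = patchify(img, p).shape[0]   # (240 // p) ** 2 tokens
    print(f"patch size {p:2d} -> {n_tokens:3d} tokens")
```

A patch size of 8 produces 900 tokens while 48 produces only 25, so sampling the patch size at train time exposes a large accuracy/compute trade-off at deployment.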
Image-and-Language Understanding from Pixels Only
CLIP-Pixels Only (CLIPPO) performs well on natural language understanding tasks without a word-level loss, and obtains good accuracy on visual question answering simply by rendering the question and image together.
This could increase the efficiency and ease of implementation for multimodal models, reducing the need for task- and modality-specific pieces and training procedures.
Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models
Proposes and studies Attributed QA as a key first step in the development of attributed LLMs.
Provides insights on the development of attributed LLMs and proposes a reproducible evaluation framework for Attributed QA.
MAViL: Masked Audio-Video Learners
Presents a self-supervised audio-visual model that outperforms models trained with external supervision on the AudioSet and VGGSound benchmarks.
Offers a new approach to learning audio-visual representations, setting a new SotA on both benchmarks.