Thu Mar 16 2023

ART: Automatic multi-step reasoning and tool-use for large language models

Large Language Models
AI for Natural Language Processing
AI for Automation
Natural Language Processing
Task Automation
Reasoning

The Automatic Reasoning and Tool-use (ART) framework uses a frozen LLM to automatically generate programs as intermediate reasoning steps. ART selects demonstrations of multi-step reasoning and tool use from a task library to solve new tasks at test time, achieving substantial improvements over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks.

Can improve performance on unseen benchmark tasks, matches the performance of hand-crafted CoT prompts on a majority of these tasks, and is extensible: humans can improve performance by incorporating new tools or correcting errors in task-specific programs.
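The demonstration-selection step can be sketched as follows; the lexical similarity measure, library format, and tool names in brackets are illustrative stand-ins, not ART's actual implementation.

```python
def similarity(a, b):
    """Toy lexical (Jaccard) similarity, standing in for ART's
    task-similarity-based demonstration selection."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb)

def build_prompt(new_task, task_library, k=2):
    """Pick the k library tasks most similar to the new task and
    concatenate their multi-step reasoning/tool-use demonstrations
    as the few-shot prompt for the frozen LLM."""
    ranked = sorted(task_library,
                    key=lambda t: similarity(new_task, t["description"]),
                    reverse=True)
    demos = "\n\n".join(t["demo"] for t in ranked[:k])
    return f"{demos}\n\nTask: {new_task}\n"

# Hypothetical two-entry task library for illustration.
library = [
    {"description": "arithmetic word problems",
     "demo": "Q: ... [calculator] ..."},
    {"description": "translate english to french",
     "demo": "Q: ... [translator] ..."},
]
prompt = build_prompt("solve this arithmetic problem", library, k=1)
```

The selected demonstrations show the frozen LLM the format for interleaving reasoning steps with tool calls; parsing those calls and invoking the tools happens outside the model.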

SemDeDup: Data-efficient learning at web-scale through semantic deduplication

Pre-trained Models
Embeddings
AI for Data Quality Assurance
AI for Machine Learning Efficiency
Data Quality Assurance
Data Processing
Machine Learning Efficiency

SemDeDup leverages embeddings from pre-trained models to identify and remove semantic duplicates from large uncurated web-scale datasets, effectively halving training time with minimal performance loss. It provides an example of how quality embeddings can be used to make models learn faster with less data.

Can remove 50% of the data with minimal performance loss, effectively halving training time; improves performance out of distribution; and provides efficiency gains when training language models on partially curated datasets.
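The core idea can be sketched in a few lines: embed each example, then keep only one representative from each group of embeddings whose cosine similarity exceeds a threshold. For scalability the paper first k-means-clusters the embeddings and deduplicates within clusters; this toy version compares all pairs directly.

```python
import numpy as np

def semdedup(embeddings, threshold=0.95):
    """Keep one item per group of embeddings whose pairwise cosine
    similarity exceeds `threshold`; return the kept indices.
    (SemDeDup clusters first for web scale; this is the toy O(n^2) form.)"""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    keep = []
    removed = np.zeros(len(x), dtype=bool)
    for i in range(len(x)):
        if removed[i]:
            continue
        keep.append(i)                    # i represents its duplicate group
        removed |= sim[i] > threshold     # drop everything too similar to i
    return keep

# Three near-duplicates of one direction plus one distinct vector.
emb = np.array([[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.998, 0.02]])
print(semdedup(emb))  # keeps indices [0, 2]
```

The threshold controls the trade-off the summary describes: a looser threshold removes more semantically redundant data for larger training-time savings, at some risk to performance.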

FateZero: Fusing Attentions for Zero-shot Text-based Video Editing

Diffusion Models
Attention Mechanisms
AI for Video Editing
AI for Creative Industries
Video Editing
Content Creation
Creative AI

FateZero is the first framework for zero-shot, text-driven video editing with pre-trained diffusion models, requiring no training. It captures intermediate attention maps during inversion, which effectively retain both structural and motion information, and fuses them into the editing process rather than regenerating them during denoising. It also introduces a spatial-temporal attention mechanism to ensure frame consistency.

Can edit videos consistently without per-prompt training or user-specific masks; demonstrates zero-shot text-driven video style and local attribute editing from a trained text-to-image model; and shows better zero-shot shape-aware editing based on a text-to-video model.
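At a very high level, the cross-attention fusion can be illustrated as below; the array shapes and the `edited_tokens` index list are assumptions for illustration, not FateZero's actual interface.

```python
import numpy as np

def fuse_cross_attention(src_attn, edit_attn, edited_tokens):
    """Sketch of FateZero-style attention fusion: reuse the
    inversion-time (source) cross-attention for tokens the prompt
    edit did not change, preserving source structure and motion,
    and keep the denoising-time attention only for edited tokens.
    src_attn, edit_attn: [n_pixels, n_tokens] attention maps."""
    fused = src_attn.copy()
    fused[:, edited_tokens] = edit_attn[:, edited_tokens]
    return fused

src = np.zeros((4, 3))   # attention captured while inverting the source video
edit = np.ones((4, 3))   # attention produced while denoising the edited prompt
fused = fuse_cross_attention(src, edit, edited_tokens=[2])
```

The unchanged tokens keep the source video's attention (here column values 0), so the original layout and motion carry over, while the edited token (column 2) takes its attention from the denoising pass.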

A Picture is Worth a Thousand Words: Language Models Plan from Pixels

Planning and Decision Making
Artificial Intelligence
Natural Language Processing
Embodied Visual Environments
Plan Sequences
Pre-trained Language Models

Explores the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments, showing that PLMs can plan accurately even when observations are directly encoded as input prompts for the PLM.

Uses pre-trained language models for planning in embodied visual environments to improve the performance and accuracy of the planning process.
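A minimal sketch of the idea, with a hypothetical prompt template and action vocabulary (the paper's actual observation encoding is task-specific, not this literal template):

```python
def plan_prompt(goal, observation):
    """Encode the observation directly into the PLM's input prompt;
    this textual template is an illustrative assumption."""
    return f"Observation: {observation}\nGoal: {goal}\nPlan:"

def parse_plan(model_output):
    """Read a newline-separated action list from the PLM's output."""
    return [line.strip() for line in model_output.splitlines() if line.strip()]

prompt = plan_prompt("put the mug in the sink",
                     "mug on table; sink to the left")
# Hypothetical completion a PLM might return for this prompt:
actions = parse_plan("go left\npick up mug\nplace mug in sink\n")
```

The point of the sketch is only the interface: the environment state enters the frozen PLM as part of its prompt, and the plan comes back as ordinary text to be parsed and executed.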

LERF: Language Embedded Radiance Fields

Vision-Language Models
Artificial Intelligence
Natural Language Processing
Robotics
Language Embeddings
3D CLIP Embeddings
Open-ended Language Queries in 3D

Proposes Language Embedded Radiance Fields (LERF), a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF, which enables open-ended language queries in 3D. LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real time, supporting long-tail open-vocabulary queries hierarchically across the volume.

Enables pixel-aligned, zero-shot queries on the distilled 3D CLIP embeddings without relying on region proposals or masks.
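The relevancy-map scoring can be sketched as follows: a rendered language embedding is scored against the query by taking, over a set of canonical negative phrases (the paper uses phrases like "object" and "things"), the minimum pairwise softmax probability assigned to the query. The vectors and temperature below are illustrative, not CLIP outputs.

```python
import numpy as np

def relevancy(lang_embed, query_embed, canon_embeds, temp=0.1):
    """LERF-style relevancy: min over canonical negatives of the
    pairwise softmax probability assigned to the query embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    q = cos(lang_embed, query_embed) / temp
    scores = []
    for c in canon_embeds:
        n = cos(lang_embed, c) / temp
        m = max(q, n)  # subtract the max for numerical stability
        scores.append(np.exp(q - m) / (np.exp(q - m) + np.exp(n - m)))
    return min(scores)

query = np.array([1.0, 0.0, 0.0])      # stand-in for the prompt's text embedding
negative = np.array([0.0, 1.0, 0.0])   # stand-in for a canonical phrase embedding
on_target = np.array([1.0, 0.1, 0.0])  # rendered embedding near the query
off_target = np.array([0.1, 1.0, 0.0])
```

Evaluating this score at every rendered pixel yields the 3D relevancy map: regions whose distilled embedding aligns with the query score near 1, everything else near 0.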
