ART: Automatic multi-step reasoning and tool-use for large language models
The Automatic Reasoning and Tool-use (ART) framework uses a frozen LLM to automatically generate intermediate reasoning steps as a program. ART selects demonstrations of multi-step reasoning and tool use from a task library to solve new tasks at test time, achieving substantial improvements over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks.
Can improve performance on unseen tasks in these benchmarks, match the performance of hand-crafted CoT prompts on a majority of them, and be extended by humans, who can improve performance by incorporating new tools or correcting errors in task-specific programs.
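The overall loop of this style of approach can be illustrated with a short sketch: retrieve similar demonstrations from a task library, prompt a frozen LLM with them, and pause generation whenever a tool call appears so an external tool can fill in the result. This is a minimal sketch, not ART's actual implementation; the task library contents, the `[code]`/`[out]` tags, the word-overlap retrieval, and the `call_llm` interface are all illustrative assumptions.

```python
# Hypothetical task library: demonstrations of multi-step reasoning with tool calls.
TASK_LIBRARY = [
    {"task": "arithmetic word problems",
     "demo": ("Q: A box holds 12 eggs. How many eggs are in 7 boxes?\n"
              "[code] 12 * 7 [/code]\n[out] 84 [/out]\nA: 84")},
    {"task": "unit conversion",
     "demo": ("Q: How many minutes are in 3.5 hours?\n"
              "[code] 3.5 * 60 [/code]\n[out] 210.0 [/out]\nA: 210")},
]

# Hypothetical tool registry; a toy expression evaluator stands in for a code interpreter.
TOOLS = {"code": lambda expr: str(eval(expr))}


def select_demos(new_task: str, k: int = 2):
    """Toy retrieval: pick demos whose task description shares words with the new task."""
    scored = sorted(TASK_LIBRARY,
                    key=lambda d: -len(set(d["task"].split()) & set(new_task.lower().split())))
    return [d["demo"] for d in scored[:k]]


def run_art_style(new_task: str, question: str, call_llm):
    """Alternate frozen-LLM generation and tool calls until a final answer appears.

    `call_llm(prompt)` is an assumed interface that returns the LLM's continuation,
    stopping after it emits either a closing [/code] tag or a final "A:" line.
    """
    prompt = "\n\n".join(select_demos(new_task)) + f"\n\nQ: {question}\n"
    while True:
        completion = call_llm(prompt)          # frozen LLM continues the program
        prompt += completion
        if "[/code]" not in completion:        # no tool call -> final answer reached
            return prompt
        expr = completion.split("[code]")[1].split("[/code]")[0]
        prompt += f"\n[out] {TOOLS['code'](expr.strip())} [/out]\n"
```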
SemDeDup: Data-efficient learning at web-scale through semantic deduplication
SemDeDup leverages embeddings from pre-trained models to identify and remove semantic duplicates from large uncurated web-scale datasets, effectively halving training time with minimal performance loss. It provides an example of how quality embeddings can be used to make models learn faster with less data.
Can remove 50% of the data with minimal performance loss, effectively halving training time; improve performance out of distribution; and provide efficiency gains when training language models on partially curated datasets.
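A minimal sketch of the underlying idea, assuming embeddings from some pre-trained encoder are already available as a NumPy array: cluster the embeddings with k-means, then, within each cluster, drop all but one member of any group of near-duplicates whose cosine similarity exceeds a threshold. The cluster count, threshold, and keep-the-first policy below are illustrative choices, not the paper's exact settings.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_dedup(embeddings: np.ndarray, n_clusters: int = 20, threshold: float = 0.95):
    """Return indices of examples to keep after removing semantic near-duplicates.

    embeddings: (N, D) array from a pre-trained encoder (e.g. image or text features).
    """
    # Normalize so dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(emb)

    keep = []
    for c in range(n_clusters):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        sims = emb[idx] @ emb[idx].T            # pairwise cosine similarity within the cluster
        removed = np.zeros(len(idx), dtype=bool)
        for i in range(len(idx)):
            if removed[i]:
                continue
            # Mark later points that are near-duplicates of point i.
            dups = np.where(sims[i, i + 1:] > threshold)[0] + i + 1
            removed[dups] = True
        keep.extend(idx[~removed])
    return np.sort(np.array(keep))

# Example: 1,000 random "embeddings" with exact duplicates injected.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 64))
X[500:600] = X[:100]                            # duplicate the first 100 rows
kept = semantic_dedup(X)
print(f"kept {len(kept)} of {len(X)} examples")
```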
FateZero: Fusing Attentions for Zero-shot Text-based Video Editing
FateZero is the first zero-shot framework for text-driven video editing via pre-trained diffusion models without training. It captures intermediate attention maps during inversion, which effectively retains both structural and motion information, and fuses them in the editing process rather than generating them during denoising. It also introduces spatial-temporal attention to ensure frame consistency.
Can edit videos consistently without per-prompt training or a user-specific mask, demonstrate zero-shot text-driven video style and local attribute editing from a trained text-to-image model, and achieve better zero-shot shape-aware editing based on a text-to-video model.
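The fusion step can be illustrated conceptually: attention maps saved during inversion of the source frames are blended with the maps produced while denoising under the edited prompt, with an editing mask deciding where source structure is preserved. The toy NumPy sketch below shows only this blending arithmetic, not the diffusion model; the shapes and the way the mask is obtained are illustrative assumptions rather than FateZero's exact procedure.

```python
import numpy as np

def fuse_attention(src_attn: np.ndarray, edit_attn: np.ndarray, edit_mask: np.ndarray):
    """Blend attention maps: keep source attention outside the edited region.

    src_attn, edit_attn: (heads, tokens, H*W) maps; src_attn captured during inversion
    of the source frame, edit_attn produced while denoising with the edited prompt.
    edit_mask: (H*W,) values in [0, 1], 1 where the edit should take effect.
    """
    return edit_mask * edit_attn + (1.0 - edit_mask) * src_attn

# Toy illustration with random maps and a mask covering half the spatial locations.
heads, tokens, hw = 8, 16, 64
rng = np.random.default_rng(0)
src_attn = rng.random((heads, tokens, hw))
edit_attn = rng.random((heads, tokens, hw))
edit_mask = np.zeros(hw)
edit_mask[: hw // 2] = 1.0                      # edit only the first half of the frame

fused = fuse_attention(src_attn, edit_attn, edit_mask)
assert np.allclose(fused[..., hw // 2:], src_attn[..., hw // 2:])  # source kept outside the mask
```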
A Picture is Worth a Thousand Words: Language Models Plan from Pixels
Explores the use of pre-trained language models (PLMs) to reason about plan sequences from text instructions in embodied visual environments, showing that PLMs can plan accurately even when observations are directly encoded as input prompts for the PLM.
Can use pre-trained language models for planning in embodied visual environments to improve the performance and accuracy of the planning process.
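The prompt-construction idea can be shown with a toy sketch: visual observation features are projected into the language model's embedding space and prepended to the embedded instruction, so the model plans directly from encoded observations. The module below is a self-contained toy transformer, not the paper's model; all names, dimensions, and the action head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ToyPlannerFromPixels(nn.Module):
    """Toy illustration: project visual features into a language model's embedding space."""

    def __init__(self, vocab_size=1000, d_model=128, vision_dim=512, n_actions=10):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)      # instruction tokens
        self.obs_proj = nn.Linear(vision_dim, d_model)            # observations -> "soft tokens"
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, n_actions)          # predict the next plan step

    def forward(self, instruction_ids, obs_features):
        # instruction_ids: (B, T) token ids; obs_features: (B, K, vision_dim) per-frame features.
        obs_tokens = self.obs_proj(obs_features)                   # encode observations as prompt
        text_tokens = self.token_embed(instruction_ids)
        prompt = torch.cat([obs_tokens, text_tokens], dim=1)       # observations prepended to text
        hidden = self.encoder(prompt)
        return self.action_head(hidden[:, -1])                     # logits over the next action

model = ToyPlannerFromPixels()
logits = model(torch.randint(0, 1000, (2, 12)), torch.randn(2, 4, 512))
print(logits.shape)  # torch.Size([2, 10])
```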
LERF: Language Embedded Radiance Fields
Proposes Language Embedded Radiance Fields (LERF), a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF, enabling open-ended language queries in 3D. LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real time.
Can use LERF to enable pixel-aligned, zero-shot queries on the distilled 3D CLIP embeddings without relying on region proposals or masks, supporting long-tail open-vocabulary queries hierarchically across the volume.
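How a query turns into a relevancy map can be sketched with CLIP-style embeddings alone: compare a rendered per-pixel language embedding against the query embedding and a set of generic "canonical" phrases, and take the softmax probability assigned to the query. This simplifies the paper's actual relevancy score, and the embeddings, phrase set, and temperature below are random or illustrative placeholders.

```python
import numpy as np

def relevancy_map(pixel_embeds: np.ndarray, query: np.ndarray,
                  canonicals: np.ndarray, temp: float = 0.1):
    """Per-pixel relevancy of a language query against rendered CLIP-style embeddings.

    pixel_embeds: (H, W, D) language embeddings rendered from the field.
    query: (D,) CLIP text embedding of the prompt.
    canonicals: (C, D) embeddings of generic phrases (e.g. "object", "things", "stuff").
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    p, q, c = norm(pixel_embeds), norm(query), norm(canonicals)
    sim_q = p @ q                                  # (H, W) cosine similarity to the query
    sim_c = p @ c.T                                # (H, W, C) similarity to canonical phrases
    # For each canonical phrase, probability that the pixel matches the query rather than it;
    # take the minimum over phrases as a conservative relevancy score.
    logits_q = sim_q[..., None] / temp
    logits_c = sim_c / temp
    probs = np.exp(logits_q) / (np.exp(logits_q) + np.exp(logits_c))
    return probs.min(axis=-1)                      # (H, W) relevancy in [0, 1]

rng = np.random.default_rng(0)
rel = relevancy_map(rng.normal(size=(32, 32, 64)), rng.normal(size=64), rng.normal(size=(4, 64)))
print(rel.shape, float(rel.min()), float(rel.max()))
```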