Large Language Models Can Be Easily Distracted by Irrelevant Context
Large language models have achieved impressive performance on various natural language processing tasks. However, so far they have been evaluated primarily on benchmarks where all information in the input context is relevant for solving the task. In this work, we investigate the distractibility of large language models, i.e., how a model's problem-solving accuracy can be influenced by irrelevant context.
This research highlights the importance of considering irrelevant context in evaluating the performance of large language models, and proposes approaches for mitigating this deficiency, such as decoding with self-consistency and adding to the prompt an instruction that tells the language model to ignore the irrelevant information.
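To make the two mitigations concrete, here is a minimal sketch of how they are typically combined: the prompt carries an explicit instruction to ignore irrelevant information, several reasoning paths are sampled, and the final answer is chosen by majority vote (self-consistency). The `generate` stub and the exact prompt wording are illustrative assumptions, not the paper's setup.

```python
import random
from collections import Counter

IGNORE_INSTRUCTION = (
    "Feel free to ignore irrelevant information given in the question."
)

def generate(prompt: str, temperature: float = 0.7) -> str:
    """Placeholder for an LLM call; returns one sampled chain-of-thought
    completion that ends in a short answer. A real implementation would
    call a model API here."""
    return f"... therefore the answer is {random.choice(['8', '8', '12'])}"

def extract_answer(completion: str) -> str:
    """Pull the final answer token out of a completion."""
    return completion.rsplit(" ", 1)[-1]

def self_consistent_answer(question: str, n_samples: int = 8) -> str:
    """Sample several reasoning paths and return the majority answer."""
    prompt = f"{IGNORE_INSTRUCTION}\n\nQ: {question}\nA:"
    answers = [extract_answer(generate(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    q = ("Maria has 5 apples and buys 3 more. Her neighbor's dog is 4 years "
         "old. How many apples does Maria have?")
    print(self_consistent_answer(q))
```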
Improving Few-Shot Generalization by Exploring and Exploiting Auxiliary Data
Few-shot learning involves learning an effective model from only a few labeled datapoints. The use of a small training set makes it difficult to avoid overfitting but also makes few-shot learning applicable to many important real-world settings. In this work, we focus on Few-shot Learning with Auxiliary Data (FLAD), a training paradigm that assumes access to auxiliary data during few-shot learning in hopes of improving generalization.
This research proposes automated sampling strategies for few-shot learning with auxiliary data and compares them with prior methods that only explore or only exploit, finding that combining exploration and exploitation is crucial. The proposed algorithms yield significant improvements over existing pre-trained models across 11 datasets.
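To illustrate the explore/exploit framing, here is a minimal sketch of a UCB1-style bandit that decides which auxiliary dataset to draw the next training batch from: datasets whose batches have helped so far are exploited, while under-sampled ones are still explored. The reward signal (`reward_fn`) and the dataset names are illustrative assumptions; the paper's actual algorithms and reward definition may differ.

```python
import math
import random

class UCB1AuxSampler:
    """Treat each auxiliary dataset as a bandit arm and pick the arm with
    the highest upper confidence bound on its running mean reward."""

    def __init__(self, dataset_names):
        self.names = list(dataset_names)
        self.counts = {n: 0 for n in self.names}
        self.values = {n: 0.0 for n in self.names}  # running mean reward
        self.total = 0

    def select(self) -> str:
        # Play every arm once before applying the UCB1 rule.
        for n in self.names:
            if self.counts[n] == 0:
                return n
        def ucb(n):
            bonus = math.sqrt(2.0 * math.log(self.total) / self.counts[n])
            return self.values[n] + bonus
        return max(self.names, key=ucb)

    def update(self, name: str, reward: float) -> None:
        self.counts[name] += 1
        self.total += 1
        self.values[name] += (reward - self.values[name]) / self.counts[name]

def reward_fn(dataset_name: str) -> float:
    """Hypothetical reward, e.g., how much a batch from this auxiliary
    dataset improved the few-shot target loss (random stand-in here)."""
    return random.random()

if __name__ == "__main__":
    sampler = UCB1AuxSampler(["aux_task_a", "aux_task_b", "aux_task_c"])
    for _ in range(100):
        chosen = sampler.select()      # explore/exploit choice
        reward = reward_fn(chosen)     # train on a batch, measure benefit
        sampler.update(chosen, reward)
    print(sampler.counts)
```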
CoderEval: A Benchmark of Pragmatic Code Generation with Generative Pre-trained Models
Code generation models based on the pre-training and fine-tuning paradigm have been increasingly attempted by both academia and industry, resulting in well-known industrial models such as Codex, CodeGen, and PanGu-Coder. To validate the performance of these models, multiple benchmarks (e.g., AiXBench and HumanEval) have been proposed, but they include only cases of generating a standalone function, i.e., a function that invokes or accesses only built-in functions and standard libraries.
This research proposes CoderEval, a benchmark of pragmatic code generation with generative pre-trained models, which can assess model performance on pragmatic code generation beyond standalone functions. The evaluation of three publicly available models on CoderEval provides insights into the progress and future directions of pragmatic code generation with a generative pre-trained model.
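Benchmarks in this family typically report the unbiased pass@k estimator (introduced with HumanEval): generate n samples per task, count the c samples that pass the task's tests, and compute pass@k = 1 - C(n-c, k)/C(n, k). A short sketch follows; the per-category breakdown in the example is an illustrative assumption, not CoderEval's exact reporting format.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn from n generations of which c are correct, passes the tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations per task, varying numbers of passing samples.
results = {"standalone_function": 6, "uses_project_code": 2}
for category, correct in results.items():
    print(category, round(pass_at_k(n=10, c=correct, k=1), 3))
```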
Learning Universal Policies via Text-Guided Video Generation
This work casts the sequential decision-making problem as a text-conditioned video generation problem, in which a planner synthesizes a set of future frames from which actions are extracted.
This research offers a new approach to constructing more general-purpose AI agents that can solve a wide variety of tasks. By leveraging text as the underlying goal specification, the proposed policy-as-video formulation offers combinatorial generalization to novel goals and can represent environments with different state and action spaces in a unified space of images. The approach enables knowledge transfer through predicting highly realistic video plans for real robots.
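A minimal sketch of the policy-as-video control loop this formulation implies: condition a video model on the text goal and the current observation, synthesize a plan of future frames, then recover the action between consecutive frames with an inverse-dynamics model and execute it. All classes and signatures below are placeholders, not the paper's actual interfaces.

```python
import numpy as np

class VideoPlanner:
    """Placeholder text-conditioned video generator: given a goal string
    and the current frame, returns a sequence of predicted future frames."""
    def plan(self, goal: str, frame: np.ndarray, horizon: int = 8):
        # Hypothetical stub; a real model would synthesize frames here.
        return [frame.copy() for _ in range(horizon)]

class InverseDynamics:
    """Placeholder inverse-dynamics model: infers the action that takes
    the agent from one frame to the next."""
    def action(self, frame_t: np.ndarray, frame_t1: np.ndarray) -> np.ndarray:
        return np.zeros(7)  # e.g., a 7-DoF arm command

def rollout(goal: str, env_step, first_frame: np.ndarray, horizon: int = 8):
    """Generate a video plan for the text goal, extract actions between
    consecutive frames, and execute them in the environment."""
    planner, idm = VideoPlanner(), InverseDynamics()
    frames = [first_frame] + planner.plan(goal, first_frame, horizon)
    observation = first_frame
    for f_t, f_t1 in zip(frames[:-1], frames[1:]):
        action = idm.action(f_t, f_t1)
        observation = env_step(action)  # advance the real environment
    return observation

if __name__ == "__main__":
    dummy_env_step = lambda a: np.zeros((64, 64, 3))
    rollout("pick up the red block", dummy_env_step, np.zeros((64, 64, 3)))
```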