Faithful Chain-of-Thought Reasoning
Faithful CoT outperforms standard CoT prompting on 9 of 10 reasoning datasets by decomposing a reasoning task into two stages: translation and deterministic problem solving.
Offers a framework that improves language-model performance on complex reasoning tasks by pairing an LM with a deterministic solver: the LM translates a query into a symbolic reasoning chain, and the solver executes it to produce the answer. The approach is demonstrated on 10 reasoning datasets from 4 diverse domains, achieving new state-of-the-art few-shot performance on 7 of the 10 datasets.
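The two-stage idea above can be sketched in a few lines. This is an illustrative toy, not the paper's actual prompts or pipeline: stage 1 would be an LM call that translates the question into a symbolic reasoning chain (here the LM's hypothetical output is hard-coded), and stage 2 executes that chain with a deterministic solver, so the final answer follows faithfully from the generated program.

```python
question = "Alice has 3 apples and buys 2 bags of 4 apples each. How many apples does she have?"

# Stage 1 (translation): pretend this Python snippet came from the language model.
reasoning_chain = """
apples_initial = 3
bags = 2
apples_per_bag = 4
answer = apples_initial + bags * apples_per_bag
"""

# Stage 2 (problem solving): run the chain with a deterministic executor,
# so the answer is derived from the reasoning chain rather than guessed.
scope = {}
exec(reasoning_chain, scope)
print(scope["answer"])  # → 11
```

Because the solver is deterministic, the reasoning chain is a faithful explanation of the answer: changing the chain necessarily changes the output.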
Mathematical Capabilities of ChatGPT
Presents a new dataset, GHOSTS, the first natural-language dataset covering graduate-level mathematics, which provides a holistic overview of the mathematical capabilities of language models.
Shows that ChatGPT's mathematical abilities are significantly below those of an average mathematics graduate student. Hence it would not pass a graduate-level university exam, but it can still be helpful for some mathematical tasks that come up in the daily professional activities of mathematicians.
The Flan Collection: Designing Data and Methods for Effective Instruction Tuning
Finds that task balancing and enrichment tricks are critical to performance. Makes Flan datasets & templates publicly available.
Provides insights into the design decisions of instruction tuning methods, which can improve the performance of language models in various tasks. It also makes the Flan 2022 collection of datasets, templates, and methods publicly available to accelerate research on instruction tuning.
Grounding Language Models to Images for Multimodal Generation
Proposes an efficient method to ground text-only language models to the visual domain, achieving strong zero-shot performance on grounded tasks such as contextual image retrieval and multimodal dialogue.
Enables the model to process and generate arbitrarily interleaved image-and-text data, providing an effective, general solution for leveraging pretrained language models in visually grounded settings.
Scaling laws for single-agent reinforcement learning
Finds that intrinsic performance scales as a power law in model size and environment interactions, with the optimal model size scaling as a power law in the training compute budget.
Suggests that simply adding training compute does not guarantee better performance: compute is used optimally only when model size is scaled as a power law with the budget, and the optimal model size also depends on the environment and other properties of the training setup.
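The power-law relationship described above can be sketched as a log-log regression. This is a hedged illustration with made-up constants, not the paper's measurements: if intrinsic performance follows P = a * N^b for model size N, then log P is linear in log N, so the exponent b can be recovered with an ordinary least-squares fit in log-log space.

```python
import math

# Synthetic (model_size, performance) points generated from P = 2 * N^0.3.
# The constants 2 and 0.3 are illustrative, not values from the paper.
sizes = [1e6, 1e7, 1e8, 1e9]
perf = [2 * n ** 0.3 for n in sizes]

# Least-squares slope and intercept in log-log space.
xs = [math.log(n) for n in sizes]
ys = [math.log(p) for p in perf]
mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)
b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
a = math.exp(mean_y - b * mean_x)

print(round(b, 3), round(a, 3))  # recovers the exponent 0.3 and prefactor 2.0
```

Fitting in log space is the standard way such scaling exponents are estimated, since a power law appears as a straight line on a log-log plot.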