Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers
Explains language models as meta-optimizers and interprets in-context learning (ICL) as a kind of implicit finetuning.
Offers insights into the working mechanism of ICL and the potential to apply this understanding to future model design.
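The paper's central duality can be reproduced in a few lines: softmax-free (linear) attention over in-context demonstrations is algebraically identical to applying an implicit weight update to the query, and that update is a sum of outer products, the same functional form as accumulated gradient-descent updates on a linear layer. A minimal numpy sketch of this equivalence (dimensions and variable names are illustrative, not the paper's notation):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8    # embedding dimension
n = 5    # number of in-context demonstration tokens

X = rng.normal(size=(d, n))    # demonstration embeddings (one column per token)
q = rng.normal(size=(d,))      # query token embedding
W_K = rng.normal(size=(d, d))
W_V = rng.normal(size=(d, d))

# Relaxed (softmax-free) linear attention over the demonstrations:
K, V = W_K @ X, W_V @ X
attn_out = V @ (K.T @ q)

# Dual view: the identical computation as an implicit weight update applied
# to the query. delta_W is a sum of outer products v_i k_i^T -- the same
# form as gradient-descent updates (error x input^T) on a linear layer.
delta_W = V @ K.T
dual_out = delta_W @ q

assert np.allclose(attn_out, dual_out)
print("linear attention == implicit weight update applied to the query")
```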
Self-Instruct: Aligning Language Models with Self-Generated Instructions
Provides an almost annotation-free method for aligning pre-trained language models with instructions.
Improves the instruction-following capabilities of language models without relying on extensive manual annotation.
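In outline, Self-Instruct bootstraps its own training data: a small pool of seed instructions prompts the model to propose new instructions, near-duplicates are filtered out, and the grown pool (paired with generated inputs/outputs) is used for finetuning. A minimal sketch of that loop, with a hypothetical `generate_instruction` stub standing in for the real LM call and difflib similarity standing in for the paper's ROUGE-L filter:

```python
import difflib
import random

def generate_instruction(prompt: str) -> str:
    """Hypothetical stand-in for an LM call; Self-Instruct prompts the
    model itself, conditioned on sampled pool instructions, to propose
    a new task instruction."""
    return "Rewrite the following paragraph in a formal tone."

def too_similar(candidate: str, pool: list[str], threshold: float = 0.7) -> bool:
    # The paper filters near-duplicates with ROUGE-L; difflib's ratio is
    # used here only as a rough, dependency-free stand-in.
    return any(difflib.SequenceMatcher(None, candidate, s).ratio() > threshold
               for s in pool)

seed_tasks = [
    "Translate the sentence into French.",
    "Summarize the article in one sentence.",
]
pool = list(seed_tasks)

for _ in range(3):  # a few bootstrap rounds; the paper runs this at scale
    few_shot = "\n".join(random.sample(pool, k=min(2, len(pool))))
    candidate = generate_instruction(f"Come up with a new task:\n{few_shot}")
    if not too_similar(candidate, pool):
        pool.append(candidate)

print(pool)  # the grown pool is then used to finetune the original model
```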
Pretraining Without Attention
Proposes the Bidirectional Gated SSM (BiGS), which replicates BERT pretraining results without attention.
Combines recently developed routing layers based on state-space models (SSMs) with model architectures based on multiplicative gating.
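A rough sketch of the attention-free token-mixing pattern: sequence information flows through forward and backward state-space scans, and a sigmoid gate modulates the result multiplicatively. The toy diagonal recurrence below stands in for BiGS's learned SSM kernels, and the exact layer layout differs in the paper:

```python
import numpy as np

def ssm_scan(u: np.ndarray, a: float = 0.9) -> np.ndarray:
    """Toy diagonal state-space recurrence x_t = a * x_{t-1} + u_t, applied
    per feature. A stand-in for the learned SSM kernel in BiGS."""
    x = np.zeros_like(u)
    for t in range(u.shape[0]):
        x[t] = a * (x[t - 1] if t > 0 else 0.0) + u[t]
    return x

def bigs_block(h: np.ndarray, W_g: np.ndarray, W_u: np.ndarray, W_o: np.ndarray):
    """One attention-free token-mixing block in the BiGS spirit:
    bidirectional SSM scans fused by multiplicative gating.
    Shapes: h is (seq_len, d); each W_* projection is (d, d)."""
    gate = 1.0 / (1.0 + np.exp(-(h @ W_g)))   # sigmoid gating branch
    u = h @ W_u
    fwd = ssm_scan(u)                         # left-to-right context
    bwd = ssm_scan(u[::-1])[::-1]             # right-to-left context
    return (gate * (fwd + bwd)) @ W_o         # multiplicative gating, no attention

rng = np.random.default_rng(0)
seq, d = 6, 4
h = rng.normal(size=(seq, d))
out = bigs_block(h, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)  # (6, 4)
```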
SODA: Million-scale Dialogue Distillation with Social Commonsense Contextualization
Distills 1.5M socially-grounded dialogues from InstructGPT by contextualizing social commonsense knowledge from a knowledge graph; human evaluation shows these dialogues are more consistent, specific, and natural than prior human-authored datasets.
Uses SODA to train COSMO, a generalizable conversation agent that outperforms previous best-performing agents on both in- and out-of-domain datasets and is significantly more natural and consistent on unseen datasets than the best-performing dialogue models.
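The distillation recipe, in outline: verbalize a commonsense triple into a plain sentence, expand it into a short narrative, then prompt the model for the conversation that narrative implies. A minimal sketch with a hypothetical `llm` stub in place of the InstructGPT calls; the triple and prompt wording are illustrative:

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for the InstructGPT queries used by SODA."""
    return "<model output>"

# An illustrative social-commonsense triple (head, relation, tail),
# in the style of a knowledge graph such as ATOMIC.
triple = ("PersonX moves to a new city", "xEffect", "PersonX feels lonely at first")

# 1. Verbalize the triple into a plain sentence.
sentence = f"{triple[0]}. As a result, {triple[2]}."

# 2. Expand the sentence into a short narrative with concrete speakers.
narrative = llm(f"Expand into a two-sentence story about two people: {sentence}")

# 3. Distill a socially-grounded dialogue from the narrative.
dialogue = llm(f"Write the conversation the two people have in this story:\n{narrative}")
print(dialogue)
```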
Character-Aware Models Improve Visual Text Rendering
Trains a suite of image generation models and shows that character-aware variants outperform their character-blind counterparts across a range of novel text rendering tasks; investigates how the lack of character-level input features in popular text-to-image models makes it much harder to predict a word's visual makeup as a series of glyphs.
Character-aware models provide large gains on a novel spelling task, set a higher state of the art on visual spelling, and achieve 30+ point accuracy gains over competitors on rare words despite training on fewer examples.
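The input-representation difference at issue is easy to see: a character-blind tokenizer hands the model one opaque subword ID per word, so its spelling is invisible, while a character-aware (e.g., byte-level, ByT5-style) encoder sees every glyph. A small illustration; the subword vocabulary entry is made up:

```python
# A character-blind tokenizer maps "parallelogram" to an opaque subword ID,
# so the model never observes its letters; a character-aware model sees them all.
word = "parallelogram"

subword_vocab = {"parallelogram": 17342}        # illustrative subword entry
character_blind_input = [subword_vocab[word]]   # spelling is invisible: [17342]

character_aware_input = list(word.encode("utf-8"))  # one ID per byte/character
print(character_blind_input)
print(character_aware_input)  # [112, 97, 114, 97, ...] -- spelling is explicit
```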