Binarized Neural Machine Translation
Proposes BMT, a novel binarization technique for Transformers in machine translation that adds extra LayerNorms and residual connections to improve binarization quality. Experiments show that a one-bit, weight-only Transformer can match the quality of a full-precision (float) model while being 16x smaller.
Applying BMT can make machine translation models substantially smaller without compromising translation quality.
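To make the weight-binarization idea concrete, here is a minimal PyTorch sketch, not the authors' code: a weight-only binarized linear layer with a per-row scaling factor and a straight-through estimator, wrapped in a feed-forward block that adds extra LayerNorms and a residual connection in the spirit of the BMT recipe. Module names and the exact placement of the norms are illustrative assumptions.

```python
import torch
import torch.nn as nn


class BinaryLinear(nn.Module):
    """Weight-only binarized linear layer (illustrative sketch).

    Latent float weights are kept for training; at forward time they are
    replaced by sign(W) times a per-output-row scale, with a
    straight-through estimator so gradients still reach the latent weights.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean(dim=1, keepdim=True)   # per-row scaling factor
        w_bin = torch.sign(w) * scale               # 1-bit weights + scale
        w_ste = w + (w_bin - w).detach()            # straight-through estimator
        return nn.functional.linear(x, w_ste)


class BinarizedFFN(nn.Module):
    """Feed-forward sub-layer with an extra LayerNorm and a residual
    connection around the binarized projections (assumed layout)."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.norm_in = nn.LayerNorm(d_model)
        self.up = BinaryLinear(d_model, d_ff)
        self.norm_mid = nn.LayerNorm(d_ff)   # extra LayerNorm to keep activations well-scaled
        self.down = BinaryLinear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.norm_mid(self.up(self.norm_in(x))))
        return x + self.down(h)               # residual connection
```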
Scaling Vision Transformers to 22 Billion Parameters
Presents a recipe for highly efficient and stable training of a 22B-parameter Vision Transformer (ViT-22B) and demonstrates that performance, fairness, robustness, and alignment improve with scale.
Scaling vision models to the size of ViT-22B can benefit a wide range of image and video modelling use cases.
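One of the stabilization techniques reported for ViT-22B is normalizing queries and keys before the attention dot product to prevent diverging attention logits at scale. The sketch below is an illustrative PyTorch rendering of that idea, not the authors' implementation; the bias-free projections follow the reported recipe, while the module name and shapes are assumptions.

```python
import torch
import torch.nn as nn


class QKNormAttention(nn.Module):
    """Self-attention with LayerNorm applied to queries and keys
    (simplified single-module sketch of the ViT-22B stabilization idea)."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)  # bias-free projections
        self.out = nn.Linear(d_model, d_model, bias=False)
        self.q_norm = nn.LayerNorm(self.d_head)
        self.k_norm = nn.LayerNorm(self.d_head)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # reshape to (batch, heads, tokens, head_dim)
        q, k, v = (z.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))
        q, k = self.q_norm(q), self.k_norm(k)   # QK normalization keeps logits bounded
        attn = (q @ k.transpose(-2, -1)) / self.d_head ** 0.5
        y = attn.softmax(dim=-1) @ v
        return self.out(y.transpose(1, 2).reshape(b, t, -1))
```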
The Wisdom of Hindsight Makes Language Models Better Instruction Followers
Proposes a novel algorithm, Hindsight Instruction Relabeling (HIR), for aligning language models with instructions: it converts feedback into instructions by relabeling the original instruction to match what the model's output actually achieved, then trains the model on the relabeled data in a supervised manner. HIR outperforms the baseline algorithms and is comparable to supervised finetuning on 12 challenging BigBench reasoning tasks.
HIR can improve the alignment between language models and instructions without a separate reward model or reinforcement-learning training pipeline.
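To make the HIR loop concrete, here is a schematic Python sketch, not the authors' implementation: the model samples outputs for a given instruction, a feedback-driven relabeler rewrites the instruction so it matches what the output actually achieved, and the resulting pairs feed ordinary supervised finetuning. The `sample`, `relabel`, and `finetune` callables are hypothetical placeholders for the language model, the task's feedback function, and the training step.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    instruction: str
    prompt: str
    output: str


def hindsight_relabel(
    sample: Callable[[str, str], str],        # (instruction, prompt) -> sampled output
    relabel: Callable[[str, str, str], str],  # feedback-based relabeler -> new instruction
    finetune: Callable[[List[Example]], None],
    instructions: List[str],
    prompts: List[str],
    rounds: int = 3,
) -> None:
    """Schematic HIR loop: sample outputs online, relabel instructions in
    hindsight so each (instruction, output) pair is consistent, then train
    on the relabeled data with a plain supervised objective."""
    for _ in range(rounds):
        dataset: List[Example] = []
        for instruction, prompt in zip(instructions, prompts):
            output = sample(instruction, prompt)
            # In hindsight, replace the instruction with one the output
            # actually satisfies (derived from task feedback).
            new_instruction = relabel(instruction, prompt, output)
            dataset.append(Example(new_instruction, prompt, output))
        finetune(dataset)  # standard supervised finetuning, no RL pipeline
```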