Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Introduces VALL-E, a language modeling approach to TTS that outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity.
Can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.
CiT: Curation in Training for Effective Vision-Language Data
Provides a >5x speedup and a +3.4% accuracy gain over LiT by jointly curating the training data and training on it.
Works with broad data sources, including raw image-text pairs from the web, and can speed up training by over an order of magnitude, especially when the raw dataset is large.
PACO: Parts and Attributes of Common Objects
Introduces PACO, a dataset of object masks that contains 641K part masks annotated across 260K object boxes.
Goes beyond traditional object masks with richer annotations such as part masks and attributes, and can be used for part mask segmentation, object and part attribute prediction, and zero-shot instance detection.