Sun Jan 08 2023
Thu Jan 05 2023

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Speech Synthesis
Language Modeling
TTS

Introduces VALL-E, a language modeling (LM) approach to TTS that outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity.

Can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.

CiT: Curation in Training for Effective Vision-Language Data

Data Curation
Training Algorithms
Vision-Text Learning

Achieves a >5x speedup and a +3.4% accuracy gain over LiT by curating the training data and training on it jointly.

Accepts broad data sources, including raw image-text pairs from the web, and can speed up training by more than an order of magnitude, especially when the raw dataset is large.

PACO: Parts and Attributes of Common Objects

Computer Vision
Dataset Creation
Object Annotation
Object Detection

Introduces PACO, a dataset of object masks that contains 641K part masks annotated across 260K object boxes.

Provides richer annotations, such as part masks and attributes, and can be used for part mask segmentation, object and part attribute prediction, and zero-shot instance detection.

Mon Jan 02 2023
Wed Dec 28 2022
Mon Dec 26 2022
Sun Dec 25 2022