Thu Jan 05 2023 - Top Trending AI Papers

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Speech Synthesis

Language Modeling

TTS

Introduces VALL-E, an LM approach to TTS that outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity.

Synthesizes high-quality personalized speech using only a 3-second enrolled recording of an unseen speaker as an acoustic prompt.

https://valle-demo.github.io/

https://arxiv.org/pdf/2301.02111.pdf

https://arxiv.org/abs/2301.02111

https://twitter.com/arankomatsuzaki/status/1611174058699395072/photo/1

CiT: Curation in Training for Effective Vision-Language Data

Data Curation

Training Algorithms

Vision-Text Learning

Achieves a >5x speedup and a +3.4% accuracy gain over LiT by curating the training data and training on it jointly.

Supports broad data sources, including raw image-text pairs from the web, and can speed up training by over an order of magnitude, especially when the raw dataset is large.

https://arxiv.org/pdf/2301.02241.pdf

https://arxiv.org/abs/2301.02241

https://twitter.com/arankomatsuzaki/status/1611172857077460992/photo/1
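
The CiT summary above says only that curation and training happen jointly. As a rough illustration of what that could look like, here is a minimal sketch that filters each raw batch by text similarity to task metadata before every training step. The similarity-threshold rule and all names (`curate`, `train_with_curation`, `embed_text`, `metadata_texts`, `threshold`, `train_step`) are assumptions made for this example, not the paper's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def curate(batch, embed_text, metadata_texts, threshold):
    """Keep image-text pairs whose caption is close to any metadata text.

    `embed_text` is a hypothetical callable standing in for the text
    encoder being trained, so the selection can change as training proceeds.
    """
    metadata_embs = [embed_text(t) for t in metadata_texts]
    kept = []
    for pair in batch:
        caption_emb = embed_text(pair["text"])
        score = max(cosine(caption_emb, m) for m in metadata_embs)
        if score >= threshold:
            kept.append(pair)
    return kept


def train_with_curation(raw_batches, embed_text, metadata_texts, threshold, train_step):
    """Alternate curation and training; return (pairs seen, pairs used)."""
    seen = used = 0
    for batch in raw_batches:
        curated = curate(batch, embed_text, metadata_texts, threshold)
        seen += len(batch)
        used += len(curated)
        if curated:  # skip the update when nothing passes the filter
            train_step(curated)
    return seen, used
```

In this sketch, any speedup would come from running `train_step` only on the pairs that pass the filter, which is consistent with the summary's note that the gain grows with the size of the raw data.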

PACO: Parts and Attributes of Common Objects

Computer Vision

Dataset Creation

Object Annotation

Object Detection

Introduces PACO, a dataset of 641K part masks annotated across 260K object boxes.

Provides richer annotations, such as part masks and attributes, and can be used for part mask segmentation, object and part attribute prediction, and zero-shot instance detection.

https://paco.metademolab.com/

https://github.com/facebookresearch/paco

https://arxiv.org/pdf/2301.01795.pdf

https://arxiv.org/abs/2301.01795

https://twitter.com/arankomatsuzaki/status/1611176018039177217/photo/1
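
For a sense of PACO's annotation density, the two counts quoted above give an average number of part masks per object box. This is a back-of-the-envelope figure only: both inputs are rounded to the nearest thousand in the summary, and the average says nothing about how masks are distributed across boxes.

```python
# Counts as quoted in the summary (rounded to the nearest thousand).
part_masks = 641_000
object_boxes = 260_000

# Average part masks per annotated object box.
masks_per_box = part_masks / object_boxes
print(f"{masks_per_box:.2f} part masks per object box on average")
```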