Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Visual ChatGPT allows the user to interact with ChatGPT by sending and receiving not only text but also images, to pose complex visual questions or visual editing instructions that require multiple AI models collaborating over multiple steps, and to provide feedback and ask for corrected results.
Visual ChatGPT can improve the customer experience by letting users communicate with a company through a co-creative process that combines visual elements and AI models, potentially improving the effectiveness and accuracy of the interaction.
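At its core this is a tool-dispatch pattern: a prompt manager decides which visual foundation model to invoke and chains their outputs. Below is a minimal sketch of that idea, assuming hypothetical stand-in tools; none of the functions here are the paper's actual models.

```python
# Minimal sketch of the tool-dispatch idea behind Visual ChatGPT: a prompt
# manager routes each parsed request to one of several visual foundation
# models ("tools"), and multi-step edits chain one tool's output into the
# next tool's input. All tools below are hypothetical stand-ins.

from typing import Callable, Dict

def edit_image(path: str, instruction: str) -> str:
    # Stand-in for an instruction-guided image-editing model.
    return f"edited({path}, '{instruction}')"

def caption_image(path: str) -> str:
    # Stand-in for an image-captioning model.
    return f"caption of {path}"

TOOLS: Dict[str, Callable[..., str]] = {
    "edit": edit_image,
    "caption": caption_image,
}

def prompt_manager(request: dict) -> str:
    """Dispatch a parsed request to a visual tool."""
    tool = TOOLS[request["tool"]]
    return tool(*request["args"])

if __name__ == "__main__":
    # Two-step chain: edit an image, then caption the result.
    edited = prompt_manager({"tool": "edit", "args": ("cat.png", "make it blue")})
    print(prompt_manager({"tool": "caption", "args": (edited,)}))
```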
Magnushammer: A Transformer-based Approach to Premise Selection
Magnushammer is a transformer-based approach to premise selection, a fundamental problem in automated theorem proving, that can outperform traditional symbolic systems, achieving a 59.5% proof rate compared to the 38.3% proof rate of Sledgehammer, the most mature and popular symbolic-based solver.
Magnushammer can increase the efficiency and accuracy of automated theorem proving tasks, potentially saving time and resources in fields such as mathematics and computer science.
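The underlying idea is retrieval-style premise selection: embed the current proof state and every candidate premise, then rank premises by similarity. Here is a minimal sketch under that assumption; the encoder below is a toy hash-based stand-in, not the paper's contrastively trained transformer.

```python
# Minimal sketch of retrieval-style premise selection: score each candidate
# premise by the similarity of its embedding to the goal's embedding.

import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Hypothetical encoder: hash characters into a fixed-size unit vector.
    v = np.zeros(dim)
    for i, ch in enumerate(text.encode()):
        v[(ch + i) % dim] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def select_premises(goal: str, premises: list[str], k: int = 2) -> list[str]:
    """Rank premises by cosine similarity to the goal embedding."""
    g = embed(goal)
    return sorted(premises, key=lambda p: -float(embed(p) @ g))[:k]

print(select_premises("n + 0 = n", ["add_zero", "mul_comm", "add_comm"]))
```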
Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models
ODISE is a system that unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation, achieving significant improvements over the previous state of the art in both open-vocabulary panoptic and semantic segmentation tasks.
ODISE can improve the accuracy and quality of image segmentation tasks, potentially benefiting industries such as e-commerce and advertising that rely on visual content.
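The open-vocabulary part reduces to scoring each predicted mask's embedding against text embeddings of arbitrary category names. The sketch below illustrates just that classification step, with random placeholder embeddings; in ODISE the mask features come from a diffusion backbone and the category embeddings from a text encoder.

```python
# Minimal sketch of open-vocabulary mask classification: each mask embedding
# is matched to the nearest category-name embedding by cosine similarity.

import numpy as np

rng = np.random.default_rng(0)
categories = ["cat", "sofa", "lamp"]          # open vocabulary: any names
text_emb = rng.normal(size=(len(categories), 128))
mask_emb = rng.normal(size=(5, 128))          # one embedding per mask

# Normalize, then take cosine similarity between masks and category names.
text_emb /= np.linalg.norm(text_emb, axis=1, keepdims=True)
mask_emb /= np.linalg.norm(mask_emb, axis=1, keepdims=True)
logits = mask_emb @ text_emb.T

for i, c in enumerate(logits.argmax(axis=1)):
    print(f"mask {i} -> {categories[c]}")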
TRACT: Denoising Diffusion Models with Transitive Closure Time-Distillation
Improves the FID of single-step diffusion by up to 2.4x and achieves a new single-step DDIM SotA FID (7.4 on ImageNet64).
TRAnsitive Closure Time-distillation (TRACT) is a new method that improves generative sampling with denoising diffusion models. It extends binary time-distillation (BTD) and improves the FID of single-step diffusion by up to 2.4x on the same architecture, allowing good samples to be generated in far fewer iterations. The PyTorch implementation will be released soon.
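The general distillation idea is to train a student to jump in one step to where the teacher lands after several small denoising steps. Below is a minimal sketch of that multi-step-to-one-step objective, with tiny placeholder networks; it omits TRACT's specific transitive-closure time schedule and self-teacher details.

```python
# Minimal sketch of multi-step-to-one-step diffusion distillation: the
# student's single step is regressed onto the frozen teacher's multi-step
# result. Both networks are toy stand-ins, not real diffusion models.

import torch
import torch.nn as nn

teacher = nn.Linear(8, 8)   # stand-in for a pretrained denoiser
student = nn.Linear(8, 8)   # same architecture as the teacher
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

def teacher_multistep(x: torch.Tensor, n_steps: int) -> torch.Tensor:
    # Run the frozen teacher for several small denoising steps.
    with torch.no_grad():
        for _ in range(n_steps):
            x = teacher(x)
    return x

for _ in range(100):
    x_t = torch.randn(32, 8)                      # noisy samples at level t
    target = teacher_multistep(x_t, n_steps=4)    # teacher's multi-step result
    loss = ((student(x_t) - target) ** 2).mean()  # student jumps in one step
    opt.zero_grad()
    loss.backward()
    opt.step()
```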
X-Avatar: Expressive Human Avatars
A new avatar model that captures the full expressiveness of digital humans to bring about life-like experiences in telepresence, AR/VR and beyond.
X-Avatar is a new avatar model that can capture the full expressiveness of digital humans to create life-like experiences in telepresence, AR/VR, and other areas. It can be learned from either full 3D scans or RGB-D data and models bodies, hands, facial expressions, and appearance in a holistic fashion. X-Avatar outperforms strong baselines in both data domains, quantitatively and qualitatively, on the animation task. To facilitate research on expressive avatars, the authors also contribute a new dataset called X-Humans, containing 233 sequences of high-quality textured scans from 20 participants, totalling 35,500 data frames.
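For context, parametric human models of the kind X-Avatar builds on deform a rest-pose mesh with linear blend skinning (LBS). The sketch below shows plain LBS on a toy mesh; the shapes and weights are placeholders, not the paper's learned, part-aware skinning.

```python
# Minimal sketch of linear blend skinning: each vertex is moved by a
# per-vertex weighted blend of rigid per-joint transforms.

import numpy as np

V, J = 4, 2                                           # toy mesh: 4 verts, 2 joints
rest = np.random.default_rng(0).normal(size=(V, 3))   # rest-pose vertices
weights = np.array([[1.0, 0.0], [0.7, 0.3],
                    [0.3, 0.7], [0.0, 1.0]])          # V x J skinning weights

def skin(rest_verts, joint_transforms, w):
    """Blend per-joint 4x4 rigid transforms with per-vertex weights."""
    homo = np.concatenate([rest_verts, np.ones((len(rest_verts), 1))], axis=1)
    T = np.einsum("vj,jab->vab", w, joint_transforms)  # per-vertex transform
    return np.einsum("vab,vb->va", T, homo)[:, :3]

identity = np.eye(4)
shift = np.eye(4); shift[:3, 3] = [0.0, 1.0, 0.0]      # move joint 1 upward
posed = skin(rest, np.stack([identity, shift]), weights)
print(posed.round(2))
```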