Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
This paper shows how Latent Diffusion Models (LDMs) can be extended to high-resolution video generation: an LDM is first pre-trained on images only, a temporal dimension is then introduced into the latent-space diffusion model, and the added temporal layers are fine-tuned on encoded image sequences to generate videos. The approach was validated on real driving videos at 512 x 1024 resolution and achieved state-of-the-art performance. The trained temporal layers also generalize to different fine-tuned text-to-image LDMs, enabling personalized text-to-video generation.
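The core idea of reusing a frozen image model while adding temporal layers can be sketched as follows. This is a minimal, hypothetical illustration in PyTorch, not the paper's actual architecture: a temporal attention layer mixes information across the frame axis at each spatial location, with a zero-initialized residual gate so the pretrained image model's behavior is reproduced exactly before fine-tuning.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative temporal mixing layer (not the paper's exact design):
    attends across the frame axis at each spatial position, leaving the
    frozen spatial layers of the image LDM untouched."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        # Residual gate starts at zero, so the layer is initially an identity
        # and the pretrained image model is recovered before fine-tuning.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, num_frames):
        # x: (batch * frames, C, H, W), the layout spatial layers already use.
        bt, c, h, w = x.shape
        b = bt // num_frames
        # Regroup so each spatial location becomes a length-T sequence.
        seq = x.view(b, num_frames, c, h, w).permute(0, 3, 4, 1, 2)
        seq = seq.reshape(b * h * w, num_frames, c)        # (B*H*W, T, C)
        q = self.norm(seq)
        mixed, _ = self.attn(q, q, q)
        seq = seq + self.alpha * mixed                     # gated residual
        # Restore the (batch * frames, C, H, W) layout.
        seq = seq.reshape(b, h, w, num_frames, c).permute(0, 3, 4, 1, 2)
        return seq.reshape(bt, c, h, w)

layer = TemporalAttention(channels=8)
frames = torch.randn(2 * 4, 8, 3, 3)        # 2 videos of 4 frames each
out = layer(frames, num_frames=4)
```

Because the gate is zero at initialization, the layer passes frames through unchanged, which is what allows fine-tuning to start from the image model's outputs.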
Businesses can use this research to produce high-resolution, personalized video content at lower cost, since only the temporal layers need training on top of an existing image model.
Generative Disco: Text-to-Video Generation for Music Visualization
This paper discusses Generative Disco, a generative AI system that creates music visualizations using large language models and text-to-image models. Users select intervals of music to visualize and parameterize each visualization by defining start and end prompts; frames are then generated in time with the music's beat to produce audio-reactive video. The paper also introduces design patterns for improving generated videos. A study with professionals found the system enjoyable, easy to explore, and highly expressive.
Businesses in the music industry can use this research to create audio-reactive visuals for tracks, live performances, and promotional content, engaging audiences with visually appealing material.
Text2Performer: Text-Driven Human Video Generation
This paper presents Text2Performer, a system that generates human videos with articulated motion from text describing the appearance and movements of a target performer. The system uses a decomposed human representation and a diffusion-based motion sampler to preserve appearance and produce continuous pose embeddings for better motion modeling. The paper also introduces the Fashion-Text2Video dataset, with manually annotated action labels and text descriptions. Results show that Text2Performer generates high-quality human videos with diverse appearances and flexible motions.
Businesses, particularly in fashion and e-commerce, can use this research to generate customized human performance videos directly from text descriptions.
Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model
This paper presents AGRoL, a novel conditional diffusion model designed to track the full body given only sparse upper-body tracking signals. It predicts accurate and smooth full-body motion, including the challenging lower-body movement for which no tracking signal is available. The model outperforms state-of-the-art methods in both the accuracy and the smoothness of the generated motion.
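The conditional-diffusion idea can be illustrated with a schematic DDPM-style sampling loop. This is a generic sketch, not AGRoL's architecture: the function names, pose dimension, and noise schedule are all illustrative assumptions, and the learned denoising network is replaced by a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_full_body(denoise_fn, sparse_cond, steps=50, pose_dim=66):
    """Schematic DDPM sampling loop (not AGRoL's exact method): start from
    Gaussian noise and iteratively denoise a full-body pose vector,
    conditioning every step on the sparse upper-body signal."""
    betas = np.linspace(1e-4, 0.02, steps)      # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(pose_dim)           # pure noise to start
    for t in reversed(range(steps)):
        eps = denoise_fn(x, sparse_cond, t)     # network predicts the noise
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                               # re-noise except at the end
            x += np.sqrt(betas[t]) * rng.standard_normal(pose_dim)
    return x

# Dummy stand-in for the learned network: predicts zero noise everywhere.
motion = sample_full_body(lambda x, c, t: np.zeros_like(x),
                          sparse_cond=np.zeros(18))
```

The key point is that the condition `sparse_cond` is fed to the denoiser at every step, so the generated full-body pose stays consistent with the observed upper-body signal throughout sampling.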
The AGRoL model can improve the realism and accuracy of 3D full-body avatars, a capability in high demand for AR/VR applications, making it useful for businesses that develop AR/VR solutions.
Hyperbolic Image-Text Representations
MERU is a contrastive model that yields hyperbolic representations of images and text, capturing the underlying hierarchy in image-text data. The model learns a highly interpretable representation space while remaining competitive with CLIP on multi-modal tasks such as image classification and image-text retrieval.
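To make the hyperbolic geometry concrete, here is a minimal NumPy sketch of distance computation in the Lorentz (hyperboloid) model of hyperbolic space, where tree-like hierarchies embed with low distortion. The function names are illustrative, not MERU's API.

```python
import numpy as np

def lift(v):
    """Lift a Euclidean embedding onto the unit hyperboloid (Lorentz model,
    curvature -1): the time coordinate is sqrt(1 + ||v||^2)."""
    v = np.asarray(v, dtype=float)
    return np.concatenate([[np.sqrt(1.0 + v @ v)], v])

def hyperbolic_distance(x, y):
    """Geodesic distance on the hyperboloid, from the Lorentzian inner
    product <x, y>_L = -x0*y0 + <x_space, y_space>."""
    inner = -x[0] * y[0] + x[1:] @ y[1:]
    # Clip guards against arccosh arguments dipping below 1 numerically.
    return float(np.arccosh(np.clip(-inner, 1.0, None)))

origin = lift([0.0, 0.0])
d_self = hyperbolic_distance(origin, origin)               # zero
d_pair = hyperbolic_distance(lift([1.0, 0.0]), lift([0.0, 1.0]))
```

Distances grow rapidly away from the origin in this geometry, which is what lets generic concepts (near the origin) sit close to many specific ones (further out), mirroring the image-text hierarchy the paper targets.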
MERU can improve the interpretability of visual and linguistic concepts by explicitly capturing the hierarchy in image-text data, making it useful for businesses that work with large-scale vision and language models.