Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models
This paper shows how Latent Diffusion Models (LDMs) can be extended to high-resolution video generation: an LDM is first pre-trained on images only, a temporal dimension is then introduced into the latent-space diffusion model, and the added temporal layers are fine-tuned on encoded image sequences to generate videos. The approach was validated on real driving videos at 512 x 1024 resolution and achieved state-of-the-art performance. The trained temporal layers also generalize to different fine-tuned text-to-image LDMs, enabling personalized text-to-video generation.
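The core idea of reusing a frozen image model while adding temporal layers can be sketched as follows. This is a minimal, hypothetical illustration in PyTorch, not the paper's actual architecture: a temporal attention layer mixes information across the frame axis at each spatial location, with a zero-initialized residual gate so the pretrained image model's behavior is reproduced exactly before fine-tuning.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative temporal mixing layer (not the paper's exact design):
    attends across the frame axis at each spatial position, leaving the
    frozen spatial layers of the image LDM untouched."""

    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)
        # Residual gate starts at zero, so the layer is initially an identity
        # and the pretrained image model is recovered before fine-tuning.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, x, num_frames):
        # x: (batch * frames, C, H, W), the layout spatial layers already use.
        bt, c, h, w = x.shape
        b = bt // num_frames
        # Regroup so each spatial location becomes a length-T sequence.
        seq = x.view(b, num_frames, c, h, w).permute(0, 3, 4, 1, 2)
        seq = seq.reshape(b * h * w, num_frames, c)        # (B*H*W, T, C)
        q = self.norm(seq)
        mixed, _ = self.attn(q, q, q)
        seq = seq + self.alpha * mixed                     # gated residual
        # Restore the (batch * frames, C, H, W) layout.
        seq = seq.reshape(b, h, w, num_frames, c).permute(0, 3, 4, 1, 2)
        return seq.reshape(bt, c, h, w)

layer = TemporalAttention(channels=8)
frames = torch.randn(2 * 4, 8, 3, 3)        # 2 videos of 4 frames each
out = layer(frames, num_frames=4)
```

Because the gate is zero at initialization, the layer passes frames through unchanged, which is what allows fine-tuning to start from the image model's outputs.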
Businesses can use this research to produce high-resolution, personalized video content at lower cost, since only the temporal layers need training on top of an existing image model.
Generative Disco: Text-to-Video Generation for Music Visualization
This paper discusses Generative Disco, a generative AI system that creates music visualizations using large language models and text-to-image models. Users select intervals of music to visualize and parameterize each visualization by defining start and end prompts; frames are then generated in time with the music's beat to produce audio-reactive video. The paper also introduces design patterns for improving generated videos. A study with professionals found the system enjoyable, easy to explore, and highly expressive.
Businesses in the music industry can use this research to create audio-reactive visuals for tracks, live performances, and promotional content, engaging audiences with visually appealing material.
Text2Performer: Text-Driven Human Video Generation
This paper presents Text2Performer, a system that generates human videos with articulated motion from text describing the appearance and movements of a target performer. The system uses a decomposed human representation and a diffusion-based motion sampler to preserve appearance and produce continuous pose embeddings for better motion modeling. The paper also introduces the Fashion-Text2Video dataset, with manually annotated action labels and text descriptions. Results show that Text2Performer generates high-quality human videos with diverse appearances and flexible motions.
Businesses, particularly in fashion and e-commerce, can use this research to generate customized human performance videos directly from text descriptions.
Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model
This paper presents AGRoL, a novel conditional diffusion model designed to track the full body given only sparse upper-body tracking signals. It predicts accurate and smooth full-body motion, including the challenging lower-body movement for which no tracking signal is available. The model outperforms state-of-the-art methods in both the accuracy and the smoothness of the generated motion.
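The conditional-diffusion idea can be illustrated with a schematic DDPM-style sampling loop. This is a generic sketch, not AGRoL's architecture: the function names, pose dimension, and noise schedule are all illustrative assumptions, and the learned denoising network is replaced by a stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_full_body(denoise_fn, sparse_cond, steps=50, pose_dim=66):
    """Schematic DDPM sampling loop (not AGRoL's exact method): start from
    Gaussian noise and iteratively denoise a full-body pose vector,
    conditioning every step on the sparse upper-body signal."""
    betas = np.linspace(1e-4, 0.02, steps)      # linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)
    x = rng.standard_normal(pose_dim)           # pure noise to start
    for t in reversed(range(steps)):
        eps = denoise_fn(x, sparse_cond, t)     # network predicts the noise
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                               # re-noise except at the end
            x += np.sqrt(betas[t]) * rng.standard_normal(pose_dim)
    return x

# Dummy stand-in for the learned network: predicts zero noise everywhere.
motion = sample_full_body(lambda x, c, t: np.zeros_like(x),
                          sparse_cond=np.zeros(18))
```

The key point is that the condition `sparse_cond` is fed to the denoiser at every step, so the generated full-body pose stays consistent with the observed upper-body signal throughout sampling.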
The AGRoL model can improve the realism and accuracy of 3D full-body avatars, a capability in high demand for AR/VR applications, making it useful for businesses that develop AR/VR solutions.
Hyperbolic Image-Text Representations
MERU is a contrastive model that yields hyperbolic representations of images and text, capturing the underlying hierarchy in image-text data. The model learns a highly interpretable representation space while remaining competitive with CLIP on multi-modal tasks such as image classification and image-text retrieval.
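To make the hyperbolic geometry concrete, here is a minimal NumPy sketch of distance computation in the Lorentz (hyperboloid) model of hyperbolic space, where tree-like hierarchies embed with low distortion. The function names are illustrative, not MERU's API.

```python
import numpy as np

def lift(v):
    """Lift a Euclidean embedding onto the unit hyperboloid (Lorentz model,
    curvature -1): the time coordinate is sqrt(1 + ||v||^2)."""
    v = np.asarray(v, dtype=float)
    return np.concatenate([[np.sqrt(1.0 + v @ v)], v])

def hyperbolic_distance(x, y):
    """Geodesic distance on the hyperboloid, from the Lorentzian inner
    product <x, y>_L = -x0*y0 + <x_space, y_space>."""
    inner = -x[0] * y[0] + x[1:] @ y[1:]
    # Clip guards against arccosh arguments dipping below 1 numerically.
    return float(np.arccosh(np.clip(-inner, 1.0, None)))

origin = lift([0.0, 0.0])
d_self = hyperbolic_distance(origin, origin)               # zero
d_pair = hyperbolic_distance(lift([1.0, 0.0]), lift([0.0, 1.0]))
```

Distances grow rapidly away from the origin in this geometry, which is what lets generic concepts (near the origin) sit close to many specific ones (further out), mirroring the image-text hierarchy the paper targets.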
MERU can improve the interpretability of visual and linguistic concepts by explicitly capturing the hierarchy in image-text data, making it useful for businesses that work with large-scale vision and language models.