Generative Novel View Synthesis with 3D-Aware Diffusion Models
A diffusion-based model is presented for 3D-aware generative novel view synthesis from a single input image. It uses a 3D feature volume as a latent feature field, improving the generation of view-consistent novel renderings and enabling the synthesis of 3D-consistent view sequences.
Offers potential for generating diverse and plausible novel views from limited input, and can aid in tasks such as virtual product prototyping.
Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
This paper presents a framework in which an encoder captures high-level, identifiable semantics of objects and produces an object-specific embedding for image generation; the embedding is passed to a text-to-image synthesis model trained jointly with the encoder to preserve object identity. The proposed method generates diverse content and styles conditioned on both text and objects, without the need for test-time optimization.
Has potential to improve product customization and personalization in e-commerce and retail.
HNeRV: A Hybrid Neural Representation for Videos
HNeRV is a hybrid neural representation for videos that uses a learnable encoder to generate content-adaptive embeddings for decoding, yielding better internal generalization and regression capacity. The model outperforms implicit methods in video regression tasks and offers decoding advantages in speed, flexibility, and deployment over traditional codecs and learning-based compression methods. The effectiveness of HNeRV is also demonstrated on downstream tasks such as video compression and video inpainting.
Has potential to improve video processing tasks for businesses such as video compression and editing.
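The hybrid idea can be illustrated with a toy example: instead of decoding a frame from a content-agnostic index, as purely implicit representations do, the decoder starts from a tiny content-adaptive embedding of the frame itself. The average-pool encoder and nearest-neighbor decoder below are illustrative stand-ins, not the paper's actual learned encoder and decoder blocks.

```python
import numpy as np

def encode(frame, k=4):
    # Content-adaptive embedding: a k x k average-pooled thumbnail of the frame.
    h, w = frame.shape
    return frame.reshape(k, h // k, k, w // k).mean(axis=(1, 3))

def decode(emb, h, w):
    # Nearest-neighbor upsampling back to full resolution.
    k = emb.shape[0]
    return np.repeat(np.repeat(emb, h // k, axis=0), w // k, axis=1)

# A 16x16 gradient "frame": the tiny embedding still tracks its content,
# so reconstruction beats any content-agnostic constant prediction.
frame = np.arange(256, dtype=float).reshape(16, 16)
rec = decode(encode(frame), 16, 16)
mse_hybrid = ((rec - frame) ** 2).mean()
mse_const = ((frame.mean() - frame) ** 2).mean()
```

Even this crude encoder shows why a content-adaptive embedding gives the decoder a head start over a fixed per-frame index.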
Evaluation of Large Language Models in Arithmetic Tasks
Large language models have shown ability in solving math word problems, but little prior work focuses on evaluating their arithmetic ability specifically. In this paper, the authors propose an arithmetic dataset to test the latest large language models and provide a detailed analysis of their arithmetic ability.
The paper provides insights into the ability of large language models to solve arithmetic problems, which can be useful in developing AI-powered solutions for automated arithmetic tasks.
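A minimal sketch of the kind of evaluation harness the paper describes: generate arithmetic problems, query a model, and score exact-match accuracy. The problem format, operand range, and `model_fn` interface are illustrative assumptions, not the paper's actual dataset or protocol.

```python
import random

def make_problem(rng):
    # Hypothetical problem format: two random integers and a basic operator.
    a, b = rng.randint(0, 999), rng.randint(0, 999)
    op = rng.choice(["+", "-", "*"])
    expr = f"{a} {op} {b}"
    return expr, str(eval(expr))  # ground truth via Python arithmetic

def exact_match_accuracy(model_fn, n=100, seed=0):
    # model_fn maps a prompt string to an answer string (stands in for an LLM call).
    rng = random.Random(seed)
    problems = [make_problem(rng) for _ in range(n)]
    correct = sum(model_fn(p).strip() == ans for p, ans in problems)
    return correct / n
```

Seeding the generator means every model is scored on the same problem set, so the resulting accuracies are directly comparable.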
Bimodality Driven 3D Dance Generation via Music-Text Integration
The paper proposes a novel task: generating 3D dance movements conditioned on both text and music modalities. The authors utilize a 3D human motion VQ-VAE to project the motions of the two datasets into a latent space of quantized vectors, and a cross-modal transformer to integrate text instructions into the motion generation architecture. The paper also introduces two novel metrics that measure the coherence and the freezing percentage of the generated motion.
The paper presents a new approach to generating 3D dance movements that integrates both text and music modalities, which can be useful in developing AI-powered solutions for the entertainment industry.
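The quantization step of a motion VQ-VAE can be sketched with NumPy: each continuous latent vector is replaced by its nearest codebook entry, and the resulting discrete indices are what a cross-modal transformer would then model. The shapes and the codebook values below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def quantize(latents, codebook):
    # latents: (N, D) encoder outputs; codebook: (K, D) learned vectors.
    # Squared Euclidean distance from every latent to every codebook entry.
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)        # index of the nearest codebook entry per latent
    return codebook[idx], idx     # quantized vectors and their discrete tokens
```

The discrete token sequence `idx` is what makes the latent space amenable to transformer-style autoregressive generation.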