What Do Self-Supervised Vision Transformers Learn?
Comparative study of how and why contrastive learning and masked image modeling differ in their representations and their performance on downstream tasks.
Insights on how self-supervised Vision Transformers (ViTs) learn and how contrastive learning and masked image modeling can complement each other.
Learning to Reason and Memorize with Self-Notes
A simple method that addresses limited context memory and multi-step reasoning in large language models by allowing the model to take Self-Notes.
Recommendations for improving the memory and reasoning capabilities of large language models.
Causal Reasoning and Large Language Models: Opening a New Frontier for Causality
Investigation of the causal capabilities of large language models and their implications for societally impactful domains such as medicine, science, law, and policy.
Insights into the causal reasoning capabilities of large language models and their potential impact on various domains.
ArK: Augmented Reality with Knowledge Interactive Emergent Ability
This study presents an infinite agent that learns to transfer knowledge memory from general foundation models to novel domains or scenarios for scene understanding and generation in the physical or virtual world. The approach leverages knowledge memory to generate scenes in unseen physical-world and virtual-reality environments, and is validated on scene generation and editing tasks, demonstrating the potential benefit of incorporating ArK in generative AI for applications such as the metaverse and gaming simulation.
ArK can significantly improve the quality of generated 2D/3D scenes, making it a valuable addition to applications such as the metaverse and gaming simulation.
GeneFace++: Generalized and Stable Real-Time Audio-Driven 3D Talking Face Generation
The paper presents GeneFace++, a NeRF-based method that achieves stable, real-time talking face generation with generalized audio-lip synchronization. GeneFace++ improves lip synchronization, video quality, and system efficiency through the use of pitch contour as an auxiliary feature, a landmark locally linear embedding method to regulate outliers in the predicted motion sequence, and an efficient NeRF-based motion-to-video renderer. The method outperforms state-of-the-art baselines in both subjective and objective evaluations.
GeneFace++ enables real-time talking face generation with generalized audio-lip synchronization, improving on previous methods in lip synchronization, video quality, and system efficiency.