VideoDex: Learning Dexterity from Internet Videos
Proposes using internet videos of humans using their hands as real-world experience to guide robot behavior. Shows strong results on several dexterous manipulation tasks, outperforming state-of-the-art baselines.
Implement the VideoDex algorithm to leverage visual, action, and physical priors from human video datasets to guide robot behavior across a variety of manipulation tasks.
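A core idea is to regularize the robot's policy toward action priors distilled from retargeted human hand trajectories. A minimal sketch of that idea (not the VideoDex implementation; `blend_with_action_prior`, `blend_weight`, and the 3-DoF actions are illustrative assumptions):

```python
# Illustrative sketch: convexly combine a learned policy's action with an
# action prior distilled from retargeted human-video trajectories.
# All names and values here are hypothetical, not from the paper's code.

def blend_with_action_prior(policy_action, prior_action, blend_weight=0.5):
    """Return blend_weight * policy + (1 - blend_weight) * prior, elementwise."""
    assert len(policy_action) == len(prior_action)
    w = blend_weight
    return [w * p + (1.0 - w) * q for p, q in zip(policy_action, prior_action)]

# Hypothetical 3-DoF wrist action from the robot policy and from the prior.
policy_action = [0.2, -0.4, 0.1]
prior_action = [0.0, -0.2, 0.3]
blended = blend_with_action_prior(policy_action, prior_action, blend_weight=0.7)
```

In practice the prior would come from a model fit to human-video trajectories rather than a fixed vector; the blend weight trades off imitation of the prior against the learned policy.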
MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis
Introduces MoFusion, a denoising-diffusion-based framework for high-quality conditional human motion synthesis that generates long, temporally plausible, and semantically accurate motions from a range of conditioning contexts. Supports several interactive motion-editing applications, providing crucial abilities for virtual character animation and robotics.
Use the MoFusion framework to generate high-quality, semantically accurate human motion for virtual character animation and robotics, and apply it to interactive motion-editing applications such as in-betweening, seed conditioning, and text-based editing.
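At the heart of any denoising-diffusion model is the closed-form forward process q(x_t | x_0) = N(sqrt(a_bar_t) x_0, (1 - a_bar_t) I). A minimal 1-D sketch of that forward process (illustrative only, not MoFusion's network or schedule; the linear beta schedule and T = 100 are assumptions):

```python
import math
import random

# Illustrative 1-D diffusion sketch. In MoFusion the denoiser is a learned
# network over motion sequences; here we only show the forward noising.
T = 100
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]  # linear schedule (assumed)
alpha_bars = []
prod = 1.0
for b in betas:
    prod *= (1.0 - b)
    alpha_bars.append(prod)  # cumulative product of (1 - beta)

def q_sample(x0, t, eps):
    """Sample x_t from q(x_t | x_0) = N(sqrt(a_bar_t) * x0, (1 - a_bar_t))."""
    a = alpha_bars[t]
    return math.sqrt(a) * x0 + math.sqrt(1.0 - a) * eps

random.seed(0)
x0 = 1.0  # a single (hypothetical) motion coordinate
x_noisy = q_sample(x0, T - 1, random.gauss(0.0, 1.0))
```

Training would regress the noise eps from x_t (and the conditioning signal); sampling then runs the learned reverse chain from pure noise back to a motion sequence.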
Learning Video Representations from Large Language Models
Introduces LaViLa, an approach to learning video-language representations by leveraging Large Language Models (LLMs). Repurposes pre-trained LLMs to be conditioned on visual input and fine-tunes them into automatic video narrators. Outperforms the previous state of the art on multiple first-person and third-person video tasks.
Use LaViLa to create automatic video narrators that yield stronger video-text embeddings, and apply those embeddings to first-person and third-person video tasks.
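The narrator-generated captions are typically consumed by a dual-encoder model trained with a symmetric InfoNCE contrastive loss over matched video/narration pairs. A minimal sketch of that loss (illustrative, not LaViLa's code; the toy 2-D embeddings and `temperature` value are assumptions):

```python
import math

# Illustrative sketch: symmetric InfoNCE over paired video and narration
# embeddings, the standard objective for dual-encoder video-text training.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(video_embs, text_embs, temperature=0.07):
    """Average of video->text and text->video cross-entropy over matched pairs."""
    n = len(video_embs)
    loss = 0.0
    for i in range(n):
        logits_v = [dot(video_embs[i], t) / temperature for t in text_embs]
        logits_t = [dot(text_embs[i], v) / temperature for v in video_embs]
        for logits in (logits_v, logits_t):
            m = max(logits)  # stabilize the log-sum-exp
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            loss += log_z - logits[i]  # -log softmax at the matched index
    return loss / (2 * n)

# Toy 2-D embeddings: each video matches its same-index narration.
videos = [[1.0, 0.0], [0.0, 1.0]]
texts = [[1.0, 0.0], [0.0, 1.0]]
loss = info_nce(videos, texts)
```

With perfectly aligned pairs the loss is near zero; permuting the narrations relative to the videos drives it up, which is what pushes matched embeddings together during training.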