Inpaint Anything: Segment Anything Meets Image Inpainting
Users select any object in an image by clicking on it. Building on powerful vision models such as SAM, LaMa, and Stable Diffusion (SD), Inpaint Anything can then remove the object smoothly (Remove Anything). Given a user's text prompt, it can also fill the object region with any desired content (Fill Anything) or replace the object's background arbitrarily (Replace Anything).
Inpaint Anything can be used for mask-free image inpainting and provides a user-friendly interface for solving inpainting-related problems. Businesses can leverage this technology to improve their image editing processes and workflows.
Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text
Multimodal C4 is a corpus of 103M documents containing 585M images interleaved with 43B English tokens. It spans everyday topics such as cooking, travel, and technology, and can support in-context vision and language models like Flamingo.
Multimodal C4 provides a large-scale dataset that businesses can use to train and evaluate in-context vision and language models supporting operations such as image and text analysis, natural language processing, and recommendation systems.
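To make "images interleaved with text" concrete, the following is a minimal sketch of how one document might be flattened into a single interleaved sequence. The field names (text_list, image_info, matched_text_index, image_name), the `<image:...>` placeholder, and the choice to place each image just before its matched sentence are assumptions made for illustration, not a confirmed description of the released format.

```python
from typing import Dict, List


def interleave(doc: Dict) -> List[str]:
    """Flatten a document into one sequence of sentences and image placeholders.

    Assumes (hypothetically) that each image record names the sentence it is
    matched to via an integer index into doc["text_list"].
    """
    # Group image names by the index of the sentence they are matched to.
    images_by_sentence: Dict[int, List[str]] = {}
    for info in doc["image_info"]:
        images_by_sentence.setdefault(info["matched_text_index"], []).append(
            info["image_name"]
        )

    sequence: List[str] = []
    for index, sentence in enumerate(doc["text_list"]):
        # Illustrative choice: emit each image just before its matched sentence.
        for name in images_by_sentence.get(index, []):
            sequence.append(f"<image:{name}>")
        sequence.append(sentence)
    return sequence


# Hypothetical toy document, not taken from the corpus.
example_doc = {
    "text_list": ["Preheat the oven.", "Mix the batter.", "Serve warm."],
    "image_info": [{"image_name": "batter.jpg", "matched_text_index": 1}],
}
interleaved = interleave(example_doc)
```

A model in the style of Flamingo would consume such a sequence with the placeholders swapped for image features; that step is outside this sketch.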
Delta Denoising Score
Delta Denoising Score (DDS) is a novel scoring function for text-based image editing that guides minimal modifications of an input image towards the content described in a target prompt. DDS applies the Score Distillation Sampling (SDS) mechanism to image editing and can be used as a loss term in an optimization problem that steers an image in the direction specified by a text prompt. DDS outperforms existing methods in terms of stability and quality, highlighting its potential for real-world applications in text-based image editing.
Because DDS changes only what the target prompt requires, it suits text-based editing tasks where the rest of the image must be preserved. Businesses can leverage this technology to improve their image editing processes and workflows.
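The idea of using a score as an editing direction can be sketched in a few lines. The sketch below assumes the formulation in which the DDS direction is the SDS direction for the edited image and target prompt minus the SDS direction for the source image and source prompt, both computed with the same noise sample and timestep. The function toy_noise_predictor is a hypothetical stand-in for a pretrained diffusion model, and the timestep weighting is omitted, so this illustrates the structure of the computation rather than the authors' implementation.

```python
import numpy as np


def toy_noise_predictor(z_t, prompt_emb, t):
    """Hypothetical stand-in for a text-conditioned diffusion noise predictor."""
    return 0.5 * z_t + 0.1 * t * prompt_emb


def sds_direction(z, prompt_emb, noise, t, alpha_bar_t):
    """SDS-style direction: predicted noise minus the noise that was added."""
    # Forward-diffuse the image to timestep t with the given noise sample.
    z_t = np.sqrt(alpha_bar_t) * z + np.sqrt(1.0 - alpha_bar_t) * noise
    return toy_noise_predictor(z_t, prompt_emb, t) - noise


def dds_direction(z_edit, target_emb, z_src, source_emb, noise, t, alpha_bar_t):
    """Delta of two SDS directions that share the same noise and timestep."""
    return (
        sds_direction(z_edit, target_emb, noise, t, alpha_bar_t)
        - sds_direction(z_src, source_emb, noise, t, alpha_bar_t)
    )


# Toy inputs: random arrays standing in for image latents and prompt embeddings.
rng = np.random.default_rng(0)
z_src = rng.normal(size=(4, 4))
source_emb = rng.normal(size=(4, 4))
target_emb = rng.normal(size=(4, 4))
noise = rng.normal(size=(4, 4))
t, alpha_bar_t = 0.5, 0.7

# Same image and same prompt on both branches: the direction vanishes.
no_edit = dds_direction(z_src, source_emb, z_src, source_emb, noise, t, alpha_bar_t)

# Different target prompt: one small optimization step away from the source.
edit = dds_direction(z_src, target_emb, z_src, source_emb, noise, t, alpha_bar_t)
z_edit = z_src - 0.1 * edit
```

When the edited image and prompt equal the source image and prompt, the two branches cancel and no change is applied, which is one way to see why the edits stay minimal.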
Shall We Pretrain Autoregressive Language Models with Retrieval? A Comprehensive Study
Presents RETRO++, which significantly outperforms retrieval-augmented GPT across different model sizes.
Pretraining autoregressive LMs with retrieval can lead to better text generation quality and downstream task accuracy.
Soundini: Sound-Guided Diffusion for Natural Video Editing
Proposes a method for adding sound-guided visual effects to specific regions of videos in a zero-shot setting.
Sound-guided natural video editing using denoising diffusion probabilistic models and an audio latent representation is a promising direction for creating more realistic visual effects. Optical flow-based guidance ensures temporal consistency between adjacent frames.