CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet
Demonstrates that fine-tuning CLIP directly is better than, or at least competitive with, supervised pre-training approaches and CLIP + MIM pipelines.
Provides insights into the hyper-parameters of CLIP fine-tuning and challenges the conventional conclusion that CLIP isn't suitable for fine-tuning.
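One hyper-parameter commonly tuned when fine-tuning a pre-trained ViT is layer-wise learning-rate decay, where lower transformer blocks train more slowly than the top ones. The sketch below is illustrative only; the function name and the base-lr/decay values are assumptions, not taken from the paper.

```python
# Hypothetical sketch of layer-wise learning-rate decay, a standard
# knob in ViT fine-tuning recipes. Names and values are illustrative.

def layerwise_lrs(num_layers: int, base_lr: float, decay: float) -> list[float]:
    """Return one learning rate per transformer block.

    The top block trains at base_lr; each earlier block is scaled
    down by `decay`, so layers closer to the input change more slowly.
    """
    return [base_lr * decay ** (num_layers - 1 - i) for i in range(num_layers)]

# Example: the 12 blocks of a ViT-B with base lr 1e-4 and decay 0.65.
lrs = layerwise_lrs(12, 1e-4, 0.65)
```

In practice each entry of `lrs` would become the `lr` of one optimizer parameter group covering that block's weights.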
MAGVIT: Masked Generative Video Transformer
Runs inference two orders of magnitude faster than diffusion models and 60x faster than autoregressive models.
Recommends MAGVIT for various video synthesis tasks with a single model and highlights its quality, efficiency, and flexibility.
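The speedup comes from decoding all video tokens in a handful of parallel refinement steps rather than one token at a time. A toy unmasking schedule in the MaskGIT style is sketched below; the cosine schedule and function name are assumptions for illustration, and the transformer itself is omitted.

```python
import math

# Hedged sketch: how many tokens a masked generative transformer might
# reveal at each parallel refinement step, using a MaskGIT-style cosine
# schedule. This is a schematic stand-in, not MAGVIT's exact schedule.

def tokens_to_unmask(total_tokens: int, num_steps: int) -> list[int]:
    """Number of tokens newly revealed at each of `num_steps` steps."""
    # Tokens still masked after each step; cos() shrinks this to zero.
    masked = [math.ceil(total_tokens * math.cos(math.pi / 2 * (t + 1) / num_steps))
              for t in range(num_steps)]
    masked[-1] = 0  # everything is revealed after the final step
    revealed, prev = [], total_tokens
    for m in masked:
        revealed.append(prev - m)
        prev = m
    return revealed

# e.g. a 16x16x4 token video decoded in 12 steps instead of 1024.
schedule = tokens_to_unmask(1024, 12)
```

A 1024-token clip thus needs 12 forward passes instead of 1024, which is where the order-of-magnitude gap to autoregressive decoding comes from.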
The Stable Artist: Steering Semantics in Diffusion Latent Space
Presents the Stable Artist, which enables fine control by steering the diffusion process along a variable number of semantic directions.
Suggests the Stable Artist for image editing and composition and highlights its ability to provide fine-grained control of the image generation process.
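Steering along semantic directions can be pictured as nudging the denoiser's noise estimate by several signed edit terms on top of ordinary classifier-free guidance. The sketch below uses toy arrays in place of U-Net outputs; the function signature and scale values are assumptions for illustration, not the paper's API.

```python
import numpy as np

# Illustrative sketch of latent-space steering: combine classifier-free
# guidance with extra semantic edit directions, each with a signed
# strength (+ pushes toward a concept, - pushes away from it).

def guided_noise(eps_uncond, eps_text, eps_edits, guidance_scale, edit_scales):
    """Blend the unconditional, text, and edit noise estimates."""
    eps = eps_uncond + guidance_scale * (eps_text - eps_uncond)
    for eps_edit, s in zip(eps_edits, edit_scales):
        eps = eps + s * (eps_edit - eps_uncond)
    return eps

# Toy arrays standing in for denoiser outputs at one diffusion step.
rng = np.random.default_rng(0)
shape = (4, 8, 8)
eps_u, eps_t = rng.normal(size=shape), rng.normal(size=shape)
edits = [rng.normal(size=shape), rng.normal(size=shape)]
eps = guided_noise(eps_u, eps_t, edits, guidance_scale=7.5,
                   edit_scales=[0.8, -0.5])
```

With no edit directions and `guidance_scale=1.0` the update reduces to the plain text-conditioned estimate, which is why the edits compose cleanly with normal generation.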