Self Supervision Does Not Help Natural Language Supervision at Scale
Finds that combining CLIP with an MAE objective provides a benefit over CLIP alone when trained on 11.3M image-text pairs, but little to no benefit when trained on 1.4B images.
This paper provides insight into the effectiveness of self-supervision for large-scale image-text training.
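To make the CLIP + MAE combination concrete, below is a minimal sketch of jointly optimizing a CLIP-style contrastive loss and an MAE-style masked-patch reconstruction loss. The toy encoder, patch dimensions, loss weight `mae_weight`, and temperature are illustrative assumptions, not the paper's actual architecture or hyperparameters.

```python
import torch
import torch.nn.functional as F
from torch import nn

class ToyImageEncoder(nn.Module):
    """Stand-in for a ViT image encoder: returns a pooled embedding and patch reconstructions."""
    def __init__(self, patch_dim=768, embed_dim=512):
        super().__init__()
        self.backbone = nn.Linear(patch_dim, embed_dim)
        self.decoder = nn.Linear(embed_dim, patch_dim)  # MAE-style reconstruction head

    def forward(self, patches):                      # patches: (B, N, patch_dim)
        latents = self.backbone(patches)              # (B, N, embed_dim)
        global_emb = latents.mean(dim=1)              # pooled image embedding for the CLIP loss
        recon = self.decoder(latents)                 # per-patch reconstruction for the MAE loss
        return global_emb, recon

def clip_mae_loss(image_emb, text_emb, recon, target_patches, mask,
                  mae_weight=1.0, temperature=0.07):
    # CLIP contrastive loss: match each image to its paired text within the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2
    # MAE reconstruction loss: MSE averaged over masked patches only.
    mse = ((recon - target_patches) ** 2).mean(dim=-1)
    reconstruction = (mse * mask).sum() / mask.sum().clamp(min=1)
    return contrastive + mae_weight * reconstruction

# Usage with random tensors standing in for a batch of image patches and text embeddings.
B, N, patch_dim, embed_dim = 4, 196, 768, 512
encoder = ToyImageEncoder(patch_dim, embed_dim)
patches = torch.randn(B, N, patch_dim)
text_emb = torch.randn(B, embed_dim)                  # pretend output of a text encoder
mask = (torch.rand(B, N) < 0.75).float()              # 75% of patches treated as masked
img_emb, recon = encoder(patches)
loss = clip_mae_loss(img_emb, text_emb, recon, patches, mask)
loss.backward()
```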
Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture
Demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations.
This paper proposes a non-generative approach for self-supervised learning from images that produces highly semantic representations and performs well across a wide range of tasks.
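The sketch below illustrates the joint-embedding predictive idea behind I-JEPA: a context encoder sees visible patches, a predictor guesses the representations that a frozen, EMA-updated target encoder assigns to masked target patches, and the loss lives in latent space rather than pixel space. The tiny MLP encoder, pooled-context predictor, masking scheme, and EMA rate are illustrative assumptions, not the paper's architecture.

```python
import copy
import torch
from torch import nn

class PatchEncoder(nn.Module):
    """Stand-in for a ViT encoder operating on pre-extracted patch features."""
    def __init__(self, patch_dim=768, embed_dim=384):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU(),
                                 nn.Linear(embed_dim, embed_dim))

    def forward(self, patches):              # (B, N, patch_dim) -> (B, N, embed_dim)
        return self.net(patches)

patch_dim, embed_dim = 768, 384
context_encoder = PatchEncoder(patch_dim, embed_dim)
target_encoder = copy.deepcopy(context_encoder)       # updated by EMA, not by gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = nn.Linear(embed_dim, embed_dim)           # predicts target representations

B, N = 4, 196
patches = torch.randn(B, N, patch_dim)
target_idx = torch.randperm(N)[: N // 4]              # patches whose latents must be predicted
context_idx = torch.tensor([i for i in range(N) if i not in set(target_idx.tolist())])

# Encode only the context patches; the target encoder sees the full image.
ctx = context_encoder(patches[:, context_idx])        # (B, |ctx|, embed_dim)
with torch.no_grad():
    targets = target_encoder(patches)[:, target_idx]  # (B, |tgt|, embed_dim)

# Predict each target patch's latent from the pooled context. (The real predictor is a
# transformer conditioned on target positions; pooling keeps this sketch short.)
pred = predictor(ctx.mean(dim=1, keepdim=True)).expand(-1, len(target_idx), -1)
loss = ((pred - targets) ** 2).mean()                 # latent-space regression, no pixel loss
loss.backward()

# EMA update of the target encoder from the context encoder.
with torch.no_grad():
    for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
        p_t.mul_(0.996).add_(p_c, alpha=0.004)
```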
Multiview Compressive Coding for 3D Reconstruction
MCC compresses the input appearance and geometry into a latent representation and predicts 3D structure by querying a 3D-aware decoder, substantially outperforming the state of the art.
This paper proposes a framework for single-view 3D reconstruction that improves upon prior works by learning generalizable representations, resulting in strong generalization to novel objects.
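A minimal sketch of the query-based decoding idea: the encoder compresses the visible appearance and geometry (e.g., points unprojected from a single RGB-D view) into a scene code, and a decoder is queried at arbitrary 3D points to predict occupancy and color. The tiny MLPs and single-vector "compression" below are illustrative stand-ins for the paper's transformer encoder and decoder.

```python
import torch
from torch import nn

class SceneEncoder(nn.Module):
    """Compresses per-point appearance (RGB) and geometry (XYZ) into one latent code."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(6, latent_dim), nn.ReLU(),
                                 nn.Linear(latent_dim, latent_dim))

    def forward(self, rgbxyz):               # (B, N, 6): color + 3D position of seen points
        return self.mlp(rgbxyz).mean(dim=1)  # (B, latent_dim) pooled scene code

class QueryDecoder(nn.Module):
    """Given the scene code and a query 3D point, predicts an occupancy logit and RGB."""
    def __init__(self, latent_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(latent_dim + 3, 256), nn.ReLU(),
                                 nn.Linear(256, 4))   # 1 occupancy logit + 3 color channels

    def forward(self, scene_code, queries):           # queries: (B, Q, 3)
        code = scene_code.unsqueeze(1).expand(-1, queries.size(1), -1)
        out = self.mlp(torch.cat([code, queries], dim=-1))
        return out[..., :1], out[..., 1:]             # occupancy logits, predicted colors

encoder, decoder = SceneEncoder(), QueryDecoder()
rgbxyz = torch.randn(2, 1024, 6)                      # points from one RGB-D view
queries = torch.rand(2, 4096, 3) * 2 - 1              # query points in a normalized 3D box
occ_logits, colors = decoder(encoder(rgbxyz), queries)
print(occ_logits.shape, colors.shape)                 # (2, 4096, 1) and (2, 4096, 3)
```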