Language Is Not All You Need: Aligning Perception with Language Models
Presents KOSMOS-1, a multimodal LLM that can perceive multimodal input, follow instructions, and perform in-context learning on multimodal tasks.
This paper introduces KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). It achieves impressive performance on language understanding and generation (including OCR-free NLP), perception-language tasks, and vision tasks. It also introduces a Raven IQ test dataset, which diagnoses the nonverbal reasoning capability of MLLMs.
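To make the in-context learning setup concrete, here is a minimal sketch of how a multimodal few-shot prompt interleaves image and text segments into one sequence for the language model. All names here (`embed_text`, `embed_image`, `build_prompt`) are hypothetical placeholders for illustration, not the paper's API.

```python
# Sketch: interleaved image/text few-shot prompt, KOSMOS-1 style.
# embed_text / embed_image are stand-ins for a real tokenizer + vision encoder.
import numpy as np

def embed_text(s: str) -> np.ndarray:
    # Placeholder: map text to a sequence of token embeddings.
    return np.random.randn(len(s.split()), 64)

def embed_image(path: str) -> np.ndarray:
    # Placeholder: map an image to a fixed number of visual embeddings.
    return np.random.randn(4, 64)

def build_prompt(examples, query_image):
    """Interleave (image, caption) demonstrations, then the query image."""
    segments = []
    for img, caption in examples:              # few-shot demonstrations
        segments.append(embed_image(img))
        segments.append(embed_text(caption))
    segments.append(embed_image(query_image))  # model continues with text
    return np.concatenate(segments, axis=0)

prompt = build_prompt(
    examples=[("cat.jpg", "a photo of a cat"), ("dog.jpg", "a photo of a dog")],
    query_image="bird.jpg",
)
print(prompt.shape)  # (sequence_length, embed_dim) fed to the decoder
```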
Internet Explorer: Targeted Representation Learning on the Open Web
Internet Explorer explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a target dataset.
Internet Explorer proposes dynamically utilizing the Internet to quickly train a small-scale model that does extremely well on the task at hand. It explores the web in a self-supervised manner, progressively finding relevant examples that improve performance on the desired target dataset. Using just a single GPU desktop to actively query the Internet for 30-40 hours, it outperforms or matches CLIP oracle performance. Results, visualizations, and videos are available.
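The sketch below illustrates the kind of exploration loop the summary describes: sample a concept to query, fetch results, train with self-supervision, and shift future queries toward concepts that looked useful. The helpers (`search_images`, `relevance`, `ssl_train_step`) and the reweighting rule are illustrative assumptions, not the paper's exact algorithm.

```python
# Sketch: self-supervised web exploration loop with hypothetical helpers.
import random

concepts = ["terrier", "sports car", "oak leaf"]  # candidate query terms
weights = {c: 1.0 for c in concepts}              # exploration distribution

def search_images(query, n=8):
    # Placeholder for a web image search returning image identifiers.
    return [f"{query}_{i}.jpg" for i in range(n)]

def relevance(image, target_dataset):
    # Placeholder: how useful the image's representation is for the target data.
    return random.random()

def ssl_train_step(images):
    # Placeholder for one self-supervised training step on the new images.
    pass

target_dataset = "flowers"
for step in range(5):
    # Sample a concept proportionally to its current weight.
    concept = random.choices(concepts, weights=[weights[c] for c in concepts])[0]
    images = search_images(concept)
    ssl_train_step(images)
    # Reward concepts whose downloads look relevant to the target dataset.
    reward = sum(relevance(im, target_dataset) for im in images) / len(images)
    weights[concept] *= (1.0 + reward)
```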
SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks
Competitive with non-spiking models on tested benchmarks, while using 5x less energy on neuromorphic hardware.
SpikeGPT is a generative language model with pure binary, event-driven spiking activation units. Three model variants are trained, with 45M, 125M, and 260M parameters. It remains competitive with non-spiking models on tested benchmarks while using 5x less energy when processed on neuromorphic hardware that can leverage sparse, event-driven activations. The code and a model pre-trained on BookCorpus are available.
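For intuition about what a binary, event-driven activation looks like, here is a minimal sketch of a generic leaky integrate-and-fire (LIF) unit; this is an illustrative stand-in, not the paper's exact neuron model.

```python
# Sketch: a generic leaky integrate-and-fire (LIF) spiking activation.
import numpy as np

def lif_forward(inputs, beta=0.9, threshold=1.0):
    """Run a LIF neuron over T timesteps.

    inputs: array of shape (T, n_units) of input currents.
    Returns a binary spike train of the same shape.
    """
    membrane = np.zeros(inputs.shape[1])
    spikes = np.zeros_like(inputs)
    for t, x in enumerate(inputs):
        membrane = beta * membrane + x     # leaky integration of input
        fired = membrane >= threshold      # binary, event-driven output
        spikes[t] = fired.astype(float)
        membrane = np.where(fired, membrane - threshold, membrane)  # soft reset
    return spikes

spike_train = lif_forward(np.random.rand(16, 8))
print(spike_train.mean())  # fraction of active events (sparsity)
```

The sparsity of such spike trains is what neuromorphic hardware exploits: computation only happens on events, which is the source of the energy savings claimed above.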
Directed Diffusion: Direct Control of Object Placement through Attention Guidance
The paper proposes Directed Diffusion, a method that provides positional control over multiple objects in text-prompted image generation through attention guidance.
Businesses can use this method to generate high-quality images for storytelling and marketing, saving time and resources in image creation.
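As a rough illustration of attention-guided placement, the sketch below strengthens the cross-attention between a target prompt token and a chosen spatial region, then renormalizes. The shapes and the editing rule are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch: boost cross-attention for one prompt token inside a target region.
import numpy as np

def edit_cross_attention(attn, token_idx, region_mask, boost=2.0):
    """attn: (H*W, n_tokens) cross-attention map for one head.
    region_mask: boolean (H*W,) marking where the object should appear."""
    edited = attn.copy()
    # Amplify attention to the object's token inside the desired region...
    edited[region_mask, token_idx] *= boost
    # ...and renormalize so each spatial location still sums to 1.
    edited /= edited.sum(axis=1, keepdims=True)
    return edited

H = W = 8
attn = np.random.dirichlet(np.ones(6), size=H * W)  # 6 prompt tokens
mask = np.zeros(H * W, dtype=bool)
mask[: H * W // 4] = True                           # e.g., top-left band
edited = edit_cross_attention(attn, token_idx=2, region_mask=mask)
```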