Mon Feb 27 2023

Language Is Not All You Need: Aligning Perception with Language Models

large language models
machine learning
neuroscience
natural language processing
computer vision
artificial general intelligence

Presents KOSMOS-1, a multimodal LLM that can perceive general modalities, follow instructions, and perform in-context learning on multimodal tasks.

This paper introduces KOSMOS-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). It achieves impressive performance on language understanding and generation, OCR-free NLP, perception-language tasks, and vision tasks. The paper also introduces a Raven IQ test dataset that diagnoses the nonverbal reasoning capability of MLLMs.

Internet Explorer: Targeted Representation Learning on the Open Web

large datasets
computer vision
natural language processing
image recognition
self-supervised learning
web scraping

Internet Explorer explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a target dataset.

Internet Explorer dynamically uses the Internet to quickly train a small-scale model that does extremely well on the task at hand. It explores the web in a self-supervised manner to progressively find relevant examples that improve performance on a desired target dataset. It outperforms or matches CLIP oracle performance using just a single GPU desktop that actively queries the Internet for 30-40 hours. Results, visualizations, and videos are available.
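A minimal sketch of the targeted-exploration loop follows. The names here (search_images, relevance_score, the concept vocabulary, and the reweighting rule) are illustrative placeholders, not the authors' implementation: the real system queries a web image search engine and scores relevance under a self-supervised model trained on the target dataset.

```python
import random
from collections import defaultdict

CONCEPTS = ["golden retriever", "sports car", "oak leaf", "coffee mug"]  # toy vocabulary

def search_images(query: str, n: int = 8) -> list[str]:
    """Placeholder for querying a web image search engine."""
    return [f"{query.replace(' ', '_')}_{i}.jpg" for i in range(n)]

def relevance_score(image: str) -> float:
    """Placeholder: how much an image resembles the target dataset under the
    current self-supervised representation."""
    return random.random()

def explore(rounds: int = 5, per_round: int = 2) -> dict[str, float]:
    # Start with a uniform preference over concepts, then shift probability mass
    # toward concepts whose downloaded images look most like the target dataset.
    prefs = defaultdict(lambda: 1.0)
    for _ in range(rounds):
        total = sum(prefs[c] for c in CONCEPTS)
        queries = random.choices(
            CONCEPTS, weights=[prefs[c] / total for c in CONCEPTS], k=per_round
        )
        for q in queries:
            images = search_images(q)
            # (The real system would also continue self-supervised training on these images.)
            mean_rel = sum(relevance_score(im) for im in images) / len(images)
            prefs[q] = 0.5 * prefs[q] + 0.5 * (1.0 + mean_rel)  # moving-average update
    return dict(prefs)

if __name__ == "__main__":
    print(explore())
```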

SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks

spiking neural networks
machine learning
neuroscience
natural language generation
neuromorphic computing
energy-efficient deep learning

Remains competitive with non-spiking models on tested benchmarks while consuming 5x less energy on neuromorphic hardware.

SpikeGPT is a generative language model with pure binary, event-driven spiking activation units. Three model variants with 45M, 125M, and 260M parameters are trained. SpikeGPT remains competitive with non-spiking models on tested benchmarks while consuming 5x less energy when run on neuromorphic hardware that can leverage sparse, event-driven activations. The code and a model pre-trained on BookCorpus are available.
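A minimal PyTorch sketch of what a binary, event-driven spiking activation looks like follows. The threshold, surrogate-gradient shape, and leaky integrate-and-fire step are generic illustrations of spiking units under stated assumptions, not SpikeGPT's actual architecture.

```python
import torch

class SpikeFn(torch.autograd.Function):
    """Binary spike: hard threshold forward, smooth surrogate gradient backward."""

    @staticmethod
    def forward(ctx, membrane_potential, threshold=1.0):
        ctx.save_for_backward(membrane_potential)
        ctx.threshold = threshold
        return (membrane_potential >= threshold).float()  # emits 0/1 spikes

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # Surrogate gradient: derivative of a sigmoid centered at the threshold.
        sg = torch.sigmoid(4.0 * (v - ctx.threshold))
        return grad_output * 4.0 * sg * (1.0 - sg), None

def lif_step(v, x, decay=0.9, threshold=1.0):
    """One leaky integrate-and-fire step: decay the membrane potential, add input,
    spike where the threshold is crossed, then reset the units that spiked."""
    v = decay * v + x
    spikes = SpikeFn.apply(v, threshold)
    v = v * (1.0 - spikes)  # hard reset after a spike
    return v, spikes

# Toy usage: run a random input sequence through the spiking unit.
if __name__ == "__main__":
    T, D = 8, 4
    v = torch.zeros(D)
    for t in range(T):
        v, s = lif_step(v, torch.randn(D))
        print(t, s.tolist())
```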

Directed Diffusion: Direct Control of Object Placement through Attention Guidance

image and text generation
computer vision
natural language processing
image generation for storytelling and marketing

The paper proposes Directed Diffusion, a method that provides positional control over multiple objects in text-prompted image generation.

Businesses can use this method to generate high-quality images for storytelling and marketing purposes. It can help them save time and resources in image creation.
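A minimal PyTorch sketch of the underlying idea, biasing cross-attention toward a user-specified region for a chosen prompt token, follows. The function name, bias strength, and square-latent-grid assumption are illustrative only, not the paper's implementation.

```python
import torch

def bias_cross_attention(attn_logits, token_idx, region, strength=4.0):
    """attn_logits: (H*W, num_tokens) pre-softmax cross-attention scores for one head.
    region: (y0, y1, x0, x1) in latent-grid coordinates; token_idx: prompt token to place."""
    hw, _ = attn_logits.shape
    side = int(hw ** 0.5)                     # assume a square latent grid
    mask = torch.zeros(side, side)
    y0, y1, x0, x1 = region
    mask[y0:y1, x0:x1] = 1.0                  # 1 inside the target box, 0 elsewhere
    biased = attn_logits.clone()
    biased[:, token_idx] += strength * mask.flatten()  # push attention into the box
    return biased.softmax(dim=-1)             # normalize over prompt tokens

# Toy usage on a random 16x16 latent grid with 10 prompt tokens.
if __name__ == "__main__":
    logits = torch.randn(16 * 16, 10)
    attn = bias_cross_attention(logits, token_idx=3, region=(2, 8, 2, 8))
    print(attn.shape, attn[:, 3].reshape(16, 16).mean().item())
```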
