Sun Jul 03 2022

DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

Artificial Intelligence
Efficient Inference
Transformer Models
Natural Language Processing
Image Recognition
Speech Recognition

DeepSpeed Inference is a comprehensive system solution for transformer model inference that reduces latency and increases throughput. By leveraging CPU and NVMe memory in addition to GPU memory and compute, it sustains high inference throughput even for large models that do not fit in aggregate GPU memory: it can run inference on models 25x larger than GPU-only solutions allow, while delivering a high throughput of 84 TFLOPS.
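
As a rough illustration of how such a system is typically used, the sketch below wraps an off-the-shelf transformer with DeepSpeed's inference engine; the model choice, parallelism degree, and generation settings are assumptions for the example rather than details from the paper.

```python
# Minimal sketch: serving a Hugging Face transformer with DeepSpeed Inference.
# The model name, mp_size, and dtype below are illustrative assumptions, not
# values taken from the paper.
import torch
import deepspeed
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2-xl"  # placeholder model; any supported transformer works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# init_inference wraps the model, optionally injecting DeepSpeed's fused
# inference kernels and splitting it across GPUs with tensor parallelism.
engine = deepspeed.init_inference(
    model,
    mp_size=1,                        # number of GPUs for tensor parallelism
    dtype=torch.float16,              # half precision to cut memory and latency
    replace_with_kernel_inject=True,  # use DeepSpeed's optimized kernels
)

inputs = tokenizer("DeepSpeed Inference enables", return_tensors="pt").to("cuda")
outputs = engine.module.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```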

Businesses that use transformer models in latency- or throughput-oriented scenarios can benefit from DeepSpeed Inference through reduced latency and increased throughput. By leveraging hundreds of GPUs, it enables trillion-parameter-scale inference under real-time latency constraints, an unprecedented scale for inference.

Conditional Generation with a Question-Answering Blueprint

Artificial Intelligence
Conditional Generation
Text Planning
Summarization
Machine Translation
Content Generation

The work proposes a new conceptualization of text plans as sequences of question-answer (QA) pairs and augments existing datasets for conditional generation with such blueprints. The resulting blueprint models are more factual than alternatives that do not resort to planning, and they allow tighter control over the generated output.
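
A minimal sketch of the underlying idea, assuming a simple serialization format (the delimiters and helper names are illustrative, not the authors' exact data format): a blueprint of QA pairs is flattened into the training target so a standard sequence-to-sequence model learns to emit the plan before the final text.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class QAPair:
    question: str
    answer: str

def serialize_blueprint(qa_pairs: List[QAPair], output_text: str) -> str:
    """Flatten a QA blueprint plus the final text into one training target.

    The delimiters are assumptions for illustration; the key idea is that the
    model generates the plan (QA pairs) before the conditioned output, so the
    plan can be inspected or edited to control what gets generated.
    """
    plan = " ".join(f"Q: {p.question} A: {p.answer}" for p in qa_pairs)
    return f"{plan} ;; {output_text}"

# Example: a tiny blueprint for a one-sentence summary of a news article.
blueprint = [
    QAPair("Who won the match?", "the home team"),
    QAPair("What was the score?", "2-1"),
]
print(serialize_blueprint(blueprint, "The home team edged out a 2-1 victory."))
```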

Businesses that rely on conditional generation can benefit from using QA blueprints to improve the output quality of trained generation models. Using planning as an intermediate representation makes conditional generation less opaque and more grounded, which helps ensure that the generated output is relevant and faithful.
