WhisperX: Time-Accurate Speech Transcription of Long-Form Audio
Improves transcription quality and enables a 12x transcription speedup via batched inference.
WhisperX refines the timestamps of OpenAI's Whisper model via forced alignment with phoneme-based ASR models, enabling more accurate and faster speech transcription. This is useful for businesses that rely on transcripts for tasks such as captioning videos, generating meeting minutes, or analyzing customer feedback.
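As a rough illustration of why batching helps on long-form audio, the sketch below cuts a recording into consecutive windows and groups them into batches that a model could process in parallel. The 30-second window, the batch size of 8, and the function names are assumptions made for this example; they are not taken from the WhisperX code base.

```python
from typing import List, Tuple

Window = Tuple[float, float]  # (start_seconds, end_seconds)


def split_into_windows(duration: float, window: float = 30.0) -> List[Window]:
    """Cut a recording of `duration` seconds into consecutive windows."""
    if duration < 0 or window <= 0:
        raise ValueError("duration must be >= 0 and window must be > 0")
    windows: List[Window] = []
    start = 0.0
    while start < duration:
        end = min(start + window, duration)  # last window may be shorter
        windows.append((start, end))
        start = end
    return windows


def make_batches(windows: List[Window], batch_size: int = 8) -> List[List[Window]]:
    """Group windows so each batch can be sent to the model in one call."""
    if batch_size <= 0:
        raise ValueError("batch_size must be > 0")
    return [windows[i:i + batch_size] for i in range(0, len(windows), batch_size)]
```

With these assumed defaults, a one-hour recording yields 120 windows, which take 15 batched calls instead of 120 sequential ones.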
StraIT: Non-autoregressive Generation with Stratified Image Transformer
Significantly improves non-AR generation and outperforms existing diffusion models and AR methods while being an order of magnitude faster.
StraIT is a generative model that can synthesize high-quality images faster and with fewer constraints than existing autoregressive and diffusion models, making it a potential tool for businesses that rely on image generation, such as graphic design or marketing agencies. Its decoupled modeling process also allows for domain transfer, providing more flexibility in image synthesis.
Unlimited-Size Diffusion Restoration
Can be used not only for image restoration but also for image generation of unlimited sizes, with the potential to be a general tool for diffusion models.
This research proposes a simple approach for using diffusion models to perform zero-shot image restoration and generation at unlimited sizes, which can be useful for businesses that process large amounts of visual data, such as media companies or e-commerce platforms. The proposed Hierarchical Restoration and Mask-Shift Restoration techniques preserve the advantages of the zero-shot setting while alleviating local incoherence and out-of-domain issues.
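To make the "unlimited sizes" idea concrete, the sketch below shows one generic way to cover an arbitrarily large image with overlapping patches, assuming a model that operates on fixed-size patches. It is not the paper's Hierarchical Restoration or Mask-Shift Restoration procedure; the patch size, stride, and function names are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right)


def tile_starts(length: int, patch: int, stride: int) -> List[int]:
    """Start offsets of patches of size `patch` that together cover [0, length)."""
    if length <= 0 or patch <= 0 or stride <= 0 or stride > patch:
        raise ValueError("need length > 0 and 0 < stride <= patch")
    if length <= patch:
        return [0]
    starts = list(range(0, length - patch, stride))
    starts.append(length - patch)  # final patch sits flush with the edge
    return starts


def tile_boxes(height: int, width: int, patch: int = 256, stride: int = 192) -> List[Box]:
    """Overlapping patch boxes covering a height x width image."""
    return [
        (top, left, min(top + patch, height), min(left + patch, width))
        for top in tile_starts(height, patch, stride)
        for left in tile_starts(width, patch, stride)
    ]
```

Choosing a stride smaller than the patch size makes neighboring patches share pixels, so a restoration step can reuse already-processed regions when it moves to the next patch.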