SeedVR: Seeding Infinity in Diffusion Transformer Towards Generic Video Restoration

¹Nanyang Technological University   ²ByteDance
Code may be released upon approval by ByteDance.

    Note: The paired videos in each demo may appear misaligned due to the large video size; refreshing the page usually fixes this.
    Some videos are compressed to speed up loading; the original video files are available for download.
  1. AIGC720P: generates 720P videos (After) from 180P low-quality AIGC inputs (Before).
  2. AIGC2K: generates 2K (2048x1152) videos (After) from 1080P high-quality AIGC inputs produced by SOTA video generation methods (Before).
  3. VideoLQ2K: generates 2K videos (After) from low-quality real-world inputs (Before).

Abstract

Video restoration poses non-trivial challenges in maintaining fidelity while recovering temporally consistent details from unknown degradations in the wild. Despite recent advances in diffusion-based restoration, these methods often face limitations in generation capability and sampling efficiency. In this work, we present SeedVR, a diffusion transformer designed to handle real-world video restoration with arbitrary length and resolution. The core design of SeedVR lies in the shifted window attention that facilitates effective restoration on long video sequences. SeedVR further supports variable-sized windows near the boundaries of both the spatial and temporal dimensions, overcoming the resolution constraints of traditional window attention. Equipped with contemporary practices, including a causal video autoencoder, mixed image and video training, and progressive training, SeedVR achieves highly competitive performance on both synthetic and real-world benchmarks, as well as AI-generated videos. Extensive experiments demonstrate SeedVR's superiority over existing methods for generic video restoration.

Method

In this work, we present SeedVR, a Diffusion Transformer (DiT) designed for generic video restoration (VR) that efficiently overcomes resolution constraints. Our core design is large non-overlapping window attention in the DiT, which we find effective for achieving competitive restoration quality at a lower computational cost. Specifically, SeedVR uses MM-DiT as its backbone and replaces full self-attention with window attention. While various window attention designs have been explored, we keep ours as simple as possible and adopt Swin attention, yielding Swin-MMDiT. Unlike previous methods, our Swin-MMDiT uses a much larger attention window of 64x64 on a latent that is 8x compressed in each spatial dimension, compared with the 8x8-pixel windows commonly used in window attention for low-level vision tasks.

When processing inputs of arbitrary resolution with such a large window, we can no longer assume that the spatial dimensions are multiples of the window size. Moreover, the shifted window mechanism of Swin produces uneven 3D windows near the boundaries of the space-time volume. To handle these cases, we apply a 3D rotary position embedding within each window, which naturally models windows of varying sizes.
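To make the window arithmetic concrete, below is a minimal sketch, not the official SeedVR implementation, of shifted 3D window partitioning that keeps boundary windows at their natural, smaller size rather than padding them, together with a standard rotary (RoPE) phase table evaluated on intra-window coordinates. The function names (spans, partition_3d_windows, rope_angles), the shift convention, the temporal window size, and the example shapes are all illustrative assumptions, not values taken from the paper.

    # A minimal sketch of shifted 3D window partitioning with variable-sized
    # boundary windows, plus standard rotary (RoPE) phase angles per axis.
    # NOT the official SeedVR code; names and sizes are illustrative.
    import torch

    def spans(dim, win, off):
        """1D window spans along an axis of length `dim`.

        With a nonzero shift `off`, the first window covers [0, off) and
        the last window simply ends at `dim`, so boundary windows may be
        smaller than `win` -- no padding or cyclic shifting is applied.
        """
        off = off % win
        edges = ([0] if off else []) + list(range(off, dim, win)) + [dim]
        return list(zip(edges[:-1], edges[1:]))

    def partition_3d_windows(latent, window=(5, 64, 64), shift=(0, 0, 0)):
        """Split a (T, H, W, C) latent into (window_tensor, origin) pairs."""
        T, H, W, _ = latent.shape
        out = []
        for t0, t1 in spans(T, window[0], shift[0]):
            for h0, h1 in spans(H, window[1], shift[1]):
                for w0, w1 in spans(W, window[2], shift[2]):
                    out.append((latent[t0:t1, h0:h1, w0:w1], (t0, h0, w0)))
        return out

    def rope_angles(length, dim, base=10000.0):
        """Standard rotary phase table: angles[p, i] = p * base^(-2i/dim).

        Evaluated on intra-window positions (0..length-1), so every
        window, including smaller boundary ones, is embedded consistently.
        """
        inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
        pos = torch.arange(length, dtype=torch.float32)
        return torch.outer(pos, inv_freq)  # (length, dim // 2)

    # Example: a 720x1280 clip under 8x spatial compression gives a
    # 90x160 latent; 64x64 windows then leave smaller windows at the
    # edges, and shifting by half a window produces uneven windows too.
    latent = torch.randn(5, 90, 160, 16)
    for win, origin in partition_3d_windows(latent, shift=(0, 32, 32))[:4]:
        print(win.shape, origin)
    # In attention, each axis (t, h, w) would get its own rope_angles
    # table over the window's local extent, applied to a disjoint slice
    # of the query/key channels before the per-window attention.

In this sketch, the 90x160 latent partitioned into 64x64 windows naturally yields smaller windows along the right and bottom edges; those are exactly the variable-sized windows that the per-window 3D rotary embedding must accommodate.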

Results