Image / Video Gen - a Norm Collection

Note 1. Introduce v_pred for the DDPM noise scheduler, where x_t = \sqrt{\bar{\alpha}_t} x_0 + \sqrt{1-\bar{\alpha}_t} \epsilon. 1.1 Definition: v = \sqrt{\bar{\alpha}_t} \epsilon - \sqrt{1-\bar{\alpha}_t} x_0 1.2 Conversion between epsilon prediction and velocity prediction: \epsilon_{pred} = \sqrt{\bar{\alpha}_t} v_{pred} + \sqrt{1-\bar{\alpha}_t} x_t
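A minimal sketch of the conversions in the note above, on scalars for readability. The function names are my own, and `x0_from_v` is an extra identity that follows from the same two definitions rather than something stated in the note.

```python
import math


def v_from_eps_x0(eps, x0, alpha_bar):
    # Definition 1.1: v = sqrt(alpha_bar) * eps - sqrt(1 - alpha_bar) * x0
    return math.sqrt(alpha_bar) * eps - math.sqrt(1.0 - alpha_bar) * x0


def eps_from_v(v, x_t, alpha_bar):
    # Conversion 1.2: eps = sqrt(alpha_bar) * v + sqrt(1 - alpha_bar) * x_t
    return math.sqrt(alpha_bar) * v + math.sqrt(1.0 - alpha_bar) * x_t


def x0_from_v(v, x_t, alpha_bar):
    # Companion identity (derived, not in the note):
    # x0 = sqrt(alpha_bar) * x_t - sqrt(1 - alpha_bar) * v
    return math.sqrt(alpha_bar) * x_t - math.sqrt(1.0 - alpha_bar) * v


# Round trip on one sample: noise x0, form v, then recover eps and x0.
x0, eps, alpha_bar = 0.3, -1.2, 0.7
x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * eps
v = v_from_eps_x0(eps, x0, alpha_bar)
```

Both conversions only need x_t and the schedule value, so a v-prediction network can be dropped into an epsilon-based sampler by converting its output at each step.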

Flow Matching for Generative Modeling

Paper • 2210.02747 • Published Oct 6, 2022 • 2

simple diffusion: End-to-end diffusion for high resolution images

Paper • 2301.11093 • Published Jan 26, 2023 • 2

Note 1. Uses v-prediction with an epsilon loss: v_pred = uvit(z_t, logsnr_t), eps_pred = sigma_t * z_t + alpha_t * v_pred
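A rough sketch of the (v-prediction, epsilon loss) combination, assuming a variance-preserving process so that alpha_t^2 = sigmoid(logsnr_t) and sigma_t^2 = sigmoid(-logsnr_t). The network call is left out; `v_pred` is passed in directly, and the function names are hypothetical.

```python
import math


def alpha_sigma(logsnr):
    # Variance-preserving assumption: alpha^2 + sigma^2 = 1.
    alpha = math.sqrt(1.0 / (1.0 + math.exp(-logsnr)))
    sigma = math.sqrt(1.0 / (1.0 + math.exp(logsnr)))
    return alpha, sigma


def eps_loss_from_v(v_pred, z_t, eps, logsnr):
    # Convert the v-prediction to an epsilon prediction, then take the MSE
    # against the true noise: eps_pred = sigma_t * z_t + alpha_t * v_pred.
    alpha, sigma = alpha_sigma(logsnr)
    eps_pred = [sigma * z + alpha * v for z, v in zip(z_t, v_pred)]
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


# A perfect v-prediction should give (numerically) zero epsilon loss.
logsnr = 0.8
alpha, sigma = alpha_sigma(logsnr)
x0 = [0.2, -0.5, 1.0]
eps = [0.7, 0.1, -1.3]
z_t = [alpha * x + sigma * e for x, e in zip(x0, eps)]
v_true = [alpha * e - sigma * x for x, e in zip(x0, eps)]
loss = eps_loss_from_v(v_true, z_t, eps, logsnr)
```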

Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

Paper • 2209.03003 • Published Sep 7, 2022 • 1

Scalable Diffusion Models with Transformers

Paper • 2212.09748 • Published Dec 19, 2022 • 18

Note 1. adaLN-Zero: following the U-Net practice of zero-initializing the final convolutional layer in each block before any residual connection, DiT regresses, in addition to γ and β, dimension-wise scaling parameters α that are applied immediately before each residual connection within the DiT block. With α initialized to zero, every block starts as the identity function.
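A small NumPy sketch of the adaLN-Zero idea, not the DiT code itself. The class name, the single `tanh` linear sublayer (standing in for attention or the MLP), and all shapes are assumptions made for illustration; the point is only that a zero-initialized modulation projection makes the block an identity map at initialization.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    # Parameter-free layer norm over the feature axis.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


class AdaLNZeroBlock:
    def __init__(self, dim, cond_dim, rng):
        # Zero-initialized projection: gamma = beta = alpha = 0 at init.
        self.w_mod = np.zeros((cond_dim, 3 * dim))
        self.b_mod = np.zeros(3 * dim)
        # Hypothetical stand-in sublayer (real DiT uses attention and an MLP).
        self.w_ff = rng.normal(0.0, 0.02, size=(dim, dim))

    def __call__(self, x, c):
        # x: (batch, tokens, dim); c: (batch, cond_dim)
        gamma, beta, alpha = np.split(c @ self.w_mod + self.b_mod, 3, axis=-1)
        h = layer_norm(x) * (1.0 + gamma[:, None, :]) + beta[:, None, :]
        h = np.tanh(h @ self.w_ff)
        # alpha gates the sublayer output right before the residual add.
        return x + alpha[:, None, :] * h


rng = np.random.default_rng(0)
block = AdaLNZeroBlock(dim=8, cond_dim=4, rng=rng)
x = rng.normal(size=(2, 5, 8))
c = rng.normal(size=(2, 4))
y = block(x, c)  # equals x while the modulation weights are still zero
```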

SiT: Exploring Flow and Diffusion-based Generative Models with Scalable Interpolant Transformers

Paper • 2401.08740 • Published Jan 16, 2024 • 13

Note 1. Generation process: (i) the stochastic interpolant framework decouples the formulation of x_t from the forward SDE. 2. Model prediction: (i) learn the velocity field v(x, t) and use it to express the score s(x, t) when using an SDE for sampling. 3. The optimal choice of w_t will always be model-prediction and interpolant dependent. 4. From a DiT model (discrete, score prediction, VP interpolant) to a SiT model (continuous, velocity prediction, linear interpolant).
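A sketch of point 2 for the linear interpolant, assuming the convention x_t = alpha_t x_* + sigma_t eps with alpha_t = 1 - t and sigma_t = t (data at t = 0, noise at t = 1). The score expression is the general velocity-to-score relation s = (alpha_t v - alpha_t' x) / (sigma_t (alpha_t' sigma_t - alpha_t sigma_t')); here it is checked on a single (x_*, eps) pair, where it reduces to -eps / sigma_t. A trained model would supply the marginal velocity instead.

```python
def linear_interpolant(x_star, eps, t):
    # x_t = (1 - t) * x_star + t * eps, with time derivative v = eps - x_star.
    alpha, sigma = 1.0 - t, t
    x_t = alpha * x_star + sigma * eps
    v = eps - x_star
    return x_t, v


def score_from_velocity(v, x_t, t):
    # Velocity-to-score conversion specialised to alpha = 1 - t, sigma = t.
    alpha, sigma = 1.0 - t, t
    d_alpha, d_sigma = -1.0, 1.0
    return (alpha * v - d_alpha * x_t) / (sigma * (d_alpha * sigma - alpha * d_sigma))


x_star, eps, t = 0.5, -0.8, 0.3
x_t, v = linear_interpolant(x_star, eps, t)
s = score_from_velocity(v, x_t, t)  # matches -eps / t for this single pair
```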

xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

Paper • 2408.12590 • Published Aug 22, 2024 • 36

Note 1. Extend the 2D image-based VAE into a 3D VideoVAE with CausalConv3D. 2. Encode a long video with a divide-and-merge strategy. 3. Caption Model: 3.1 The temporal encoder is implemented with [Token Turing Machines](https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing).

Classifier-Free Diffusion Guidance

Paper • 2207.12598 • Published Jul 26, 2022 • 1

Note 1. Follow-up work: APG (https://arxiv.org/pdf/2410.02416) 1.1 Leaning more on the orthogonal component significantly attenuates the oversaturation side effect in generations while maintaining the quality-boosting benefits of CFG. 1.2 APG performs best when applied to the denoised predictions rather than the noise prediction.
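A simplified sketch of the projection idea behind APG next to plain CFG. It splits the guidance direction into a component parallel to the conditional prediction and an orthogonal remainder, then down-weights the parallel part with `eta`. This omits the other ingredients of the method (such as momentum and norm rescaling), projects over the whole flattened array rather than per sample, and the function names are my own. Per point 1.2, the inputs would be denoised predictions.

```python
import numpy as np


def cfg(pred_cond, pred_uncond, w):
    # Standard classifier-free guidance.
    return pred_uncond + w * (pred_cond - pred_uncond)


def apg_like(pred_cond, pred_uncond, w, eta=0.0):
    diff = pred_cond - pred_uncond
    # Unit vector along the conditional prediction (flattened view).
    unit = pred_cond / (np.linalg.norm(pred_cond) + 1e-12)
    parallel = np.sum(diff * unit) * unit
    orthogonal = diff - parallel
    # eta = 1 recovers CFG; eta = 0 keeps only the orthogonal component.
    return pred_cond + (w - 1.0) * (orthogonal + eta * parallel)


rng = np.random.default_rng(0)
cond = rng.normal(size=(3, 4))
uncond = rng.normal(size=(3, 4))
guided = apg_like(cond, uncond, w=5.0, eta=0.0)
```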

PixArt-α: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis

Paper • 2310.00426 • Published Sep 30, 2023 • 60

Note 1. Training Recipe - Initialize the T2I model with a low-cost class-condition model; - Pretrain on text-image pair data rich in information density; - Fine-tune with data of superior aesthetic quality; 2. adaLN-single - one global set of shifts and scales is computed only at the first block and shared across all the blocks, denoted as shared_adaln_cond; - a layer-specific trainable embedding, denoted as adaln_cond, adaptively adjusts the scale and shift parameters in different blocks
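A toy sketch of adaLN-single, assuming the per-block parameters are obtained by summing the shared projection with each block's trainable embedding. The factor of 6 (shift, scale, and gate for two sublayers), the shapes, and the function name are assumptions for illustration only.

```python
import numpy as np


def adaln_single_params(t_emb, w_shared, layer_embeds):
    # shared_adaln_cond: computed once from the timestep embedding, (B, 6*D).
    shared_adaln_cond = t_emb @ w_shared
    # adaln_cond: one trainable (6*D,) vector per block, added to the shared set.
    return [shared_adaln_cond + adaln_cond for adaln_cond in layer_embeds]


rng = np.random.default_rng(0)
batch, cond_dim, dim, num_blocks = 2, 4, 3, 5
t_emb = rng.normal(size=(batch, cond_dim))
w_shared = rng.normal(size=(cond_dim, 6 * dim))
layer_embeds = [rng.normal(size=(6 * dim,)) for _ in range(num_blocks)]
params = adaln_single_params(t_emb, w_shared, layer_embeds)
```

The per-block cost drops from a full conditioning MLP to a single vector of 6*D parameters, which is where the parameter saving comes from.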

FreeInit: Bridging Initialization Gap in Video Diffusion Models

Paper • 2312.07537 • Published Dec 12, 2023 • 26

Note 1. Gap between training and inference: the initial noises corrupted from real videos remain temporally correlated in the low-frequency band. 2. FreeInit procedure 2.1 Initialize an independent Gaussian noise; 2.2 DDIM denoising to generate a clean video latent; 2.3 Obtain a noisy version of the video latent through forward diffusion; 2.4 Combine the low-frequency components of this noisy latent with the high-frequency components of fresh random Gaussian noise; 2.5 Repeat;
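A sketch of step 2.4 only (the frequency mixing), since steps 2.2 and 2.3 need a trained model and scheduler. It assumes a single-channel latent of shape (T, H, W) and an ideal box low-pass filter with a hypothetical `cutoff` in normalized frequency; smoother filters are an equally valid choice.

```python
import numpy as np


def lowpass_mask(shape, cutoff=0.25):
    # Ideal low-pass over (T, H, W): 1 inside the cutoff radius, 0 outside.
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    return (radius <= cutoff).astype(np.float64)


def mix_frequencies(z_noisy, eta, cutoff=0.25):
    # Keep low frequencies of the re-noised latent, high frequencies of eta.
    mask = lowpass_mask(z_noisy.shape, cutoff)
    z_f = np.fft.fftn(z_noisy)
    eta_f = np.fft.fftn(eta)
    mixed = z_f * mask + eta_f * (1.0 - mask)
    return np.fft.ifftn(mixed).real


rng = np.random.default_rng(0)
z_noisy = rng.normal(size=(4, 8, 8))   # stand-in for the latent from step 2.3
eta = rng.normal(size=(4, 8, 8))       # fresh Gaussian noise
z_init = mix_frequencies(z_noisy, eta)  # new initial noise for the next round
```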

black-forest-labs/FLUX.1-schnell

Text-to-Image • Updated Aug 16, 2024 • 876k • 3.33k

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Paper • 2403.03206 • Published Mar 5, 2024 • 61

Note Known as SD-3 1. Change the distribution over t from the uniform distribution to the one giving more weight to intermediate timesteps by sampling them more frequently. 2. Use a ratio of 50 % original and 50 % synthetic captions. 3. MM-DiT
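One way to put more weight on intermediate timesteps, as in point 1, is logit-normal sampling: draw u from a normal distribution and map it through a sigmoid. The sketch below assumes location `m = 0` and scale `s = 1`; both values and the function name are illustrative.

```python
import math
import random


def sample_logit_normal(rng, m=0.0, s=1.0):
    # t = sigmoid(u), u ~ N(m, s^2): density peaks near the middle of (0, 1).
    u = rng.gauss(m, s)
    return 1.0 / (1.0 + math.exp(-u))


rng = random.Random(0)
samples = [sample_logit_normal(rng) for _ in range(10000)]
# Fraction of draws in the middle half of the range; uniform sampling gives 0.5.
middle = sum(0.25 <= t <= 0.75 for t in samples) / len(samples)
```

Shifting `m` moves the emphasis toward noisier or cleaner timesteps, and a larger `s` flattens the distribution back toward uniform coverage.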

On the Importance of Noise Scheduling for Diffusion Models

Paper • 2301.10972 • Published Jan 26, 2023 • 1

Note 1. When increasing the image size, the optimal noise scheduling shifts towards a noisier one (due to increased redundancy in pixels). This is more important in video generation.

Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Paper • 2402.14797 • Published Feb 22, 2024 • 20

Note 1. Argues that treating spatial and temporal modeling in a separable way causes motion artifacts, temporal inconsistencies, or generation of dynamic images rather than videos with vivid motion. 2. Follow-up: Mind the Time: https://mint-video.github.io/src/MinT-paper.pdf 2.1 Uses interval guidance in CFG to mitigate the oversaturation issue

Applying Guidance in a Limited Interval Improves Sample and Distribution Quality in Diffusion Models

Paper • 2404.07724 • Published Apr 11, 2024 • 14

Note 1. Guidance is harmful toward the beginning of the chain (high noise levels), largely unnecessary toward the end (low noise levels), and only beneficial in the middle.
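A sketch of interval guidance, assuming guidance is switched on only for noise levels in (sigma_lo, sigma_hi] and that outside the interval the conditional prediction is used as-is (guidance weight 1). The interval bounds and weight below are made-up values.

```python
def guidance_weight(sigma, w, sigma_lo, sigma_hi):
    # Full guidance inside the interval, none (weight 1) outside it.
    return w if sigma_lo < sigma <= sigma_hi else 1.0


def guided_prediction(pred_cond, pred_uncond, sigma, w, sigma_lo, sigma_hi):
    g = guidance_weight(sigma, w, sigma_lo, sigma_hi)
    # With g == 1 this reduces to pred_cond, so the unconditional
    # network evaluation can be skipped entirely for those steps.
    return pred_uncond + g * (pred_cond - pred_uncond)


# Hypothetical settings: guide only at intermediate noise levels.
w, sigma_lo, sigma_hi = 3.0, 0.3, 5.0
high_noise = guided_prediction(1.0, 0.0, sigma=40.0, w=w, sigma_lo=sigma_lo, sigma_hi=sigma_hi)
mid_noise = guided_prediction(1.0, 0.0, sigma=1.0, w=w, sigma_lo=sigma_lo, sigma_hi=sigma_hi)
low_noise = guided_prediction(1.0, 0.0, sigma=0.1, w=w, sigma_lo=sigma_lo, sigma_hi=sigma_hi)
```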

Representation Alignment for Generation: Training Diffusion Transformers Is Easier Than You Think

Paper • 2410.06940 • Published Oct 9, 2024 • 7

MarDini: Masked Autoregressive Diffusion for Video Generation at Scale

Paper • 2410.20280 • Published Oct 26, 2024 • 23

Note 1. For spatio-temporal attention, use 2D RoPE for spatial and temporal positions. Inserting a learnable [NEXT] token to differentiate image patches across different rows is enough for the spatial dimension, so 3D RoPE is not needed. 2. Dynamic-resolution training is not included in the main training stages. Instead, after convergence, fine-tuning the model for a few steps (10K-20K) with dynamic resolutions enables it.

In-Context LoRA for Diffusion Transformers

Paper • 2410.23775 • Published Oct 31, 2024 • 11

Fluid: Scaling Autoregressive Text-to-image Generative Models with Continuous Tokens

Paper • 2410.13863 • Published Oct 17, 2024 • 37

Note 1. validation loss is a proxy for generation quality.

OminiControl: Minimal and Universal Control for Diffusion Transformer

Paper • 2411.15098 • Published Nov 22, 2024 • 55

Note 1. Processes condition image tokens uniformly with text and noisy image tokens, integrating them into a unified sequence. Direct addition of hidden states is avoided because it constrains token interactions.

Open-Sora Plan: Open-Source Large Video Generation Model

Paper • 2412.00131 • Published Nov 28, 2024 • 33

Note 1. Retain Full 3D Attention in the first and last two layers. 2. First train a Full 3D Attention model on 256 × 256 images; then inherit the model weights and replace Full 3D Attention with Skiparse Attention. 3. Add slight Gaussian noise to the conditional images to enhance generalization during fine-tuning.

SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformers

Paper • 2410.10629 • Published Oct 14, 2024 • 11

Note 1. Remove the positional embedding in DiT and find no quality loss. 2. AE-F32C32; skip the 256px stage; gradually fine-tune the model to 1024px, 2K and 4K. 3. Replace T5 with an LLM as the text encoder. Directly reusing the T5 training setup, with the text embedding as key and value and image tokens as the query for cross-attention, results in extreme instability, with training loss frequently becoming NaN.

genmo/mochi-1-preview

Text-to-Video • Updated Dec 18, 2024 • 36.2k • 1.16k

Playground v3: Improving Text-to-Image Alignment with Deep-Fusion Large Language Models

Paper • 2409.10695 • Published Sep 16, 2024 • 2

Note 1. Token down-sampling at middle layers: reduce the sequence length of the image keys and values by four times in the middle layers, making the whole network resemble a traditional convolutional U-Net with only one level of down-sampling. 2. Improve the captioning conditions by generating multi-level captions to reduce dataset bias and prevent model overfitting. 3. Loop through the gradients of all model parameters and count how many gradients exceed a specific gradient-value threshold.
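Point 3 can be sketched as a plain counting loop. The gradients are represented here as a list of NumPy arrays standing in for per-parameter gradient tensors; the threshold value and function name are hypothetical, and how the resulting count is used afterwards is not covered by the note.

```python
import numpy as np


def count_large_gradients(grads, threshold):
    # Count individual gradient entries whose magnitude exceeds the threshold,
    # summed over every parameter tensor.
    return sum(int(np.count_nonzero(np.abs(g) > threshold)) for g in grads)


grads = [
    np.array([[0.01, -0.5], [2.0, 0.0]]),  # stand-in for one weight matrix
    np.array([-3.0, 0.2, 0.05]),           # stand-in for one bias vector
]
num_large = count_large_gradients(grads, threshold=0.4)
```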

Spatiotemporal Skip Guidance for Enhanced Video Diffusion Sampling

Paper • 2411.18664 • Published Nov 27, 2024 • 24

STIV: Scalable Text and Image Conditioned Video Generation

Paper • 2412.07730 • Published Dec 10, 2024 • 71

Note 1. As the spatial resolution is scaled up, the model produces slow or nearly static motion. 2. Using causal temporal attention also results in a significant drop in both quality and total scores. 3. Interpolating the RoPE embeddings yields better VBench scores than extrapolating them. 4. Staleness appears when the model is scaled to 8B at >= 512 resolutions, probably because the larger model overfits more easily to following the first frame.

Relay Diffusion: Unifying diffusion process across resolutions for image synthesis

Paper • 2309.03350 • Published Sep 4, 2023

Note pixel-wise noise + patch-wise noise

Lightricks/LTX-Video

Image-to-Video • Updated about 23 hours ago • 96.1k • 923

Note https://arxiv.org/pdf/2501.00103 1. Moves the patchifying layer to the beginning of the VAE encoder. 2. Fuses the decoding and denoising steps. 3.1 L2 loss often produces blurry outputs; 3.2 perceptual loss reduces blurriness. 4. RoPE with fractional coordinates normalized by predefined maximum coordinates works best.

RepVideo: Rethinking Cross-Layer Representation for Video Generation

Paper • 2501.08994 • Published 21 days ago • 15

Note 1. As layer depth increases, the attention corresponding to each frame’s token becomes more concentrated on the tokens from the same frame, with relatively weaker attention to tokens from other frames. 2. Enhance the model’s ability to interpret text prompts by employing multiple encoders to capture different layers of information, such as semantic level and character-level understanding, thereby improving the alignment between generated content and textual descriptions

VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models

Paper • 2502.02492 • Published about 19 hours ago • 15