MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks

This is my reading note for MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks. The paper proposes an efficient multimodal model. It unifies a generative (text generation) loss and a contrastive loss via a two-pass training process. One pass computes the generative loss, using causal attention in the text decoder; the other pass, for the contrastive loss, uses bidirectional text decoding. The order of the two passes is shuffled during training.
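To make the two passes concrete, here is a minimal NumPy sketch of the two attention masks a single text decoder could switch between. It is my own illustration, not the paper's code: the function names, the single-head attention helper, and the way the pass order is shuffled are all assumptions.

```python
import random

import numpy as np


def causal_mask(seq_len):
    """Lower-triangular mask: token i may attend to tokens 0..i only."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))


def bidirectional_mask(seq_len):
    """Full mask: every token may attend to every other token."""
    return np.ones((seq_len, seq_len), dtype=bool)


def masked_attention_weights(scores, mask):
    """Row-wise softmax over the positions allowed by the mask."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)  # exp(-inf) is 0, so masked positions vanish
    return weights / weights.sum(axis=-1, keepdims=True)


def pass_order(rng):
    """Hypothetical helper: randomly decide which pass runs first."""
    passes = ["generative", "contrastive"]
    rng.shuffle(passes)
    return passes


scores = np.zeros((4, 4))  # dummy attention scores for a 4-token text
for name in pass_order(random.Random(0)):
    mask = causal_mask(4) if name == "generative" else bidirectional_mask(4)
    print(name)
    print(masked_attention_weights(scores, mask).round(2))
```

The only difference between the two passes in this sketch is the mask, which is what lets one decoder serve both losses.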


FreeU: Free Lunch in Diffusion U-Net

This is my reading note for FreeU: Free Lunch in Diffusion U-Net. The paper analyzes the cause of artifacts in diffusion models. It shows that the backbone (U-Net) captures global, low-frequency information, while the skip connections capture fine, high-frequency detail, and that the high-frequency information causes artifacts. As a result, the paper proposes increasing the weight of half of the U-Net backbone channels and suppressing the low-frequency information from the skip connections.
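The two operations can be sketched in NumPy as below. This is an illustration under my own assumptions, not the paper's implementation: the channel-first `(C, H, W)` layout, the function names, the scaling factors `b` and `s`, and the square low-frequency window of size `radius` are all placeholders.

```python
import numpy as np


def scale_backbone(backbone, b=1.2):
    """Amplify the first half of the backbone channels by a factor b."""
    out = backbone.astype(float)  # astype returns a copy
    half = out.shape[0] // 2
    out[:half] *= b
    return out


def damp_low_freq(skip, s=0.9, radius=1):
    """Scale the low-frequency band of each skip-feature channel by s."""
    h, w = skip.shape[-2:]
    # Move the zero frequency to the centre of the spectrum.
    spec = np.fft.fftshift(np.fft.fft2(skip, axes=(-2, -1)), axes=(-2, -1))
    cy, cx = h // 2, w // 2
    scale = np.ones((h, w))
    scale[cy - radius:cy + radius + 1, cx - radius:cx + radius + 1] = s
    spec = spec * scale
    out = np.fft.ifft2(np.fft.ifftshift(spec, axes=(-2, -1)), axes=(-2, -1))
    return out.real


backbone = np.random.default_rng(0).normal(size=(4, 8, 8))
skip = np.random.default_rng(1).normal(size=(4, 8, 8))
# In a U-Net decoder the two feature maps are concatenated along channels.
fused = np.concatenate([scale_backbone(backbone), damp_low_freq(skip)], axis=0)
print(fused.shape)
```

Both functions only rescale existing features, so no retraining is involved in this sketch.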


Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation

This is my reading note on Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation. The paper proposes a method to edit a video according to a style described in the prompt. The method uses diffusion to edit key frames and then propagates the edited key frames to the other frames using optical flow. For key-frame editing, several attention-based constraints are applied to preserve details and consistency, including shape-aware, style-aware, pixel-aware, and fidelity-aware constraints.
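As a rough illustration of the propagation step only, the sketch below backward-warps an edited key frame to another frame using a dense flow field. It is a simplification under my own assumptions: the function name, the `(dx, dy)` flow convention, and the nearest-neighbour sampling are mine, the flow here is synthetic rather than estimated, and the full method presumably does more than a single warp.

```python
import numpy as np


def warp_with_flow(key_frame, flow):
    """Backward-warp key_frame to a target frame.

    flow[y, x] = (dx, dy) points from target pixel (x, y) to the
    location in the key frame that it should copy from.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Nearest-neighbour lookup, clamped to the image border.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return key_frame[src_y, src_x]


edited_key = np.arange(16).reshape(4, 4)  # stand-in for an edited key frame
flow = np.zeros((4, 4, 2))
flow[..., 0] = 1.0  # every target pixel copies from one pixel to its right
print(warp_with_flow(edited_key, flow))
```

In practice the flow would come from an optical-flow estimator run on the original video, so that the edit follows the original motion.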
