NÜWA: Visual Synthesis Pre-training for Neural visUal World creAtion (ML Research Paper Explained)
#nuwa #microsoft #generative

NÜWA is a unifying architecture that can ingest text, images, and videos and brings all of them into a quantized latent representation to support a multitude of visual generation tasks, such as text-to-image, text-guided video manipulation, or sketch-to-video. This paper details how the encoders for the different modalities are constructed, and how the latent representation is transformed using their novel 3D nearby self-attention layers. Experiments are shown on 8 different visual generation tasks that the model supports.

OUTLINE:
0:00 - Intro & Outline
1:20 - Sponsor: ClearML
3:35 - Tasks & Naming
5:10 - The problem with recurrent image generation
7:35 - Creating a shared latent space w/ Vector Quantization
23:20 - Transforming the latent representation
26:25 - Recap: Self- and Cross-Attention
28:50 - 3D Nearby Self-Attention
41:20 - Pre-Training Objective
46:05 - Experimental Results
50:40 - Conclusion & Comments

Paper:
Github: github
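
For readers who want a concrete picture of the 3D nearby self-attention mentioned above, here is a minimal sketch of the general idea, not the paper's implementation. It assumes the latent representation is a grid of tokens indexed by (height, width, time) and that each token attends only to tokens inside a small local window around its own position. The function names `nearby_positions` and `nearby_mask`, the window parameter `extent`, and the clipping at the grid border are all illustrative assumptions.

```python
from itertools import product


def nearby_positions(pos, grid, extent):
    """Return the grid positions inside a local 3D window centred on pos.

    pos:    (h, w, t) index of the query token
    grid:   (H, W, T) size of the latent token grid
    extent: (eh, ew, et) window size along each axis (assumed odd)
    """
    ranges = []
    for p, size, e in zip(pos, grid, extent):
        half = e // 2
        lo = max(0, p - half)          # clip the window at the grid border
        hi = min(size - 1, p + half)
        ranges.append(range(lo, hi + 1))
    return list(product(*ranges))


def nearby_mask(grid, extent):
    """Build mask[q][k], True when query token q may attend to key token k.

    Tokens are flattened in (h, w, t) order. This dense mask is only for
    illustration; it does not save any memory or compute by itself.
    """
    H, W, T = grid

    def flat(h, w, t):
        return (h * W + w) * T + t

    n = H * W * T
    mask = [[False] * n for _ in range(n)]
    for pos in product(range(H), range(W), range(T)):
        q = flat(*pos)
        for nb in nearby_positions(pos, grid, extent):
            mask[q][flat(*nb)] = True
    return mask


if __name__ == "__main__":
    # A token in the interior of a 4x4x4 grid with a 3x3x3 window
    # sees its full neighbourhood; a corner token sees a clipped one.
    print(len(nearby_positions((2, 2, 2), (4, 4, 4), (3, 3, 3))))
    print(len(nearby_positions((0, 0, 0), (4, 4, 4), (3, 3, 3))))
```

The point of the sketch is that the number of keys per query depends on the window size rather than on the total number of tokens, which is what makes attention over video-sized latent grids tractable.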