VQGAN: Taming Transformers for High-Resolution Image Synthesis (Paper Explained)
The authors introduce VQGAN, which combines the efficiency of convolutional approaches with the expressivity of transformers. VQGAN is essentially a GAN-trained autoencoder that learns a codebook of context-rich visual parts and uses it to quantize the bottleneck representation at every forward pass. A self-attention (transformer) model is then trained to learn a prior distribution over the codewords. Sampling from this prior produces plausible constellations of codewords, which are fed through the decoder to generate realistic images at arbitrary resolutions. A minimal sketch of the codebook lookup step is given after the resource links below.

Paper:
Code (with pretrained models):
Colab notebook:
Colab notebook to compare the first-stage models in VQGAN and in DALL-E:
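To make the bottleneck quantization concrete, here is a minimal PyTorch sketch of the nearest-neighbor codebook lookup that a VQGAN-style encoder output goes through on every forward pass. The class name, codebook size, feature dimensions, and the straight-through gradient trick are illustrative assumptions rather than the authors' code; see the linked repository for the actual implementation.

```python
# Minimal sketch of the codebook quantization step described above.
# All names and sizes are assumptions for illustration only.
import torch
import torch.nn as nn


class CodebookQuantizer(nn.Module):
    """Replace each spatial feature vector with its nearest codebook entry."""

    def __init__(self, num_codes: int = 1024, code_dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z):
        # z: (batch, code_dim, height, width) feature map from the CNN encoder
        b, c, h, w = z.shape
        flat = z.permute(0, 2, 3, 1).reshape(-1, c)  # (b*h*w, code_dim)

        # Squared Euclidean distance to every codebook entry
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        indices = dist.argmin(dim=1)  # nearest codeword per spatial position

        z_q = self.codebook(indices).reshape(b, h, w, c).permute(0, 3, 1, 2)

        # Straight-through estimator: copy gradients from z_q back to z
        z_q = z + (z_q - z).detach()
        return z_q, indices.reshape(b, h, w)


if __name__ == "__main__":
    quantizer = CodebookQuantizer()
    features = torch.randn(1, 256, 16, 16)  # e.g. a 256x256 image downsampled 16x
    quantized, codes = quantizer(features)
    print(quantized.shape, codes.shape)
```

The transformer prior mentioned above then operates on the grid of integer indices returned by this lookup, modeling which constellations of codewords are plausible before the decoder maps a sampled grid back to pixels.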
Abstract: