VQGAN+CLIP — How does it work?

Alexa Steinbrück
Aug 3, 2021 · 5 min read
Early stages of training on the prompt “A high-tech outer circle with a low-tech inner filling trending on art station”

Note: This is a very short high-level introduction. If you’re more interested in code and details, check out my newer blogpost “Explaining the code of the popular text-to-image algorithm (VQGAN+CLIP in PyTorch)”!

The synthetic imagery (“GAN art”) scene has recently seen a kind of productivity explosion: a new kind of neural network architecture capable of generating images from text was quickly popularized through a freely available Google Colab notebook. It enabled thousands of people to create stunning/fantastic/magical images just by typing in a text prompt. Twitter, Reddit and other forums were flooded with these images, often accompanied by the hashtags #vqgan or #clip.

The text-to-image paradigm that VQGAN+CLIP popularized certainly opens up new ways to create synthetic media and maybe even democratizes “creativity”, by shifting the skillset from (graphical) execution or algorithmic instruction (programming) to nifty “prompt engineering”.

I see VQGAN+CLIP as another cool tool in the “Creative AI” toolbox. It’s time to look at this tool from a technical standpoint and explain how it works!

  1. What is VQGAN+CLIP
  2. Who made VQGAN+CLIP
  3. How does it work technically
  4. What is VQGAN
  5. What is CLIP
  6. How do VQGAN and CLIP work together
  7. What about the training data?
  8. Further reading and cool links

1. What is VQGAN+CLIP?

VQGAN+CLIP is a neural network architecture that builds upon the revolutionary CLIP architecture published by OpenAI in January 2021.

VQGAN+CLIP is a text-to-image model that generates images of variable size given a set of text prompts (and some other parameters).

There have been other text-to-image models before (e.g. AttentionGAN), but the VQGAN+CLIP architecture takes it to a whole new level:

“The crisp, coherent, and high-resolution quality of the images that these tools create differentiate them from AI art tools that have come before (…) These systems are the first ones that actually sort of meet ‘the promise of text-to-image.’”(Vice)

VQGAN+CLIP has launched a new wave of AI-generated artworks, which you can follow on Twitter under the hashtags #VQGAN and #CLIP, curated by the Twitter account @images_ai.

2. Who made VQGAN+CLIP

Around April 2021, Katherine Crowson (aka @RiversHaveWings) and Ryan Murdoch (aka @advadnoun) started experimenting with combining the open-source model CLIP (from OpenAI) with various GAN architectures.

Katherine Crowson, artist and mathematician, wrote the Google Colab notebook that combines VQGAN + CLIP. The notebook was shared a thousand times; it was originally in Spanish and was later translated into English. Earlier, Ryan Murdoch had combined BigGAN + CLIP, which was the inspiration for Crowson’s notebook.

3. How does it work technically?

VQGAN+CLIP is a combination of two neural network architectures: VQGAN and CLIP. Let’s examine these two individually before we look at them in combination.

4. What is VQGAN?

  • a type of neural network architecture
  • VQGAN = Vector Quantized Generative Adversarial Network
  • was first proposed in the paper “Taming Transformers” by Heidelberg University (2020)
  • it combines convolutional neural networks (traditionally used for images) with Transformers (traditionally used for language)
  • it’s great for high-resolution images

Although VQGAN involves Transformers, these models are not trained on text but on pure image data. They apply the Transformer architecture, previously used mainly for language, to images, which is an important innovation.
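To make the “Vector Quantized” part more concrete, here is a minimal PyTorch sketch of the quantization step: every continuous latent vector produced by the encoder is snapped to its nearest entry in a learned codebook, and the decoder only ever sees these discrete codes. The codebook size, latent dimensions and random tensors below are purely illustrative, not the values of any real checkpoint.

```python
import torch

# Illustrative sizes, not those of a real VQGAN checkpoint
codebook = torch.randn(1024, 256)      # 1024 learned code vectors, 256 dimensions each
z = torch.randn(16 * 16, 256)          # continuous encoder output for a 16x16 latent grid

distances = torch.cdist(z, codebook)   # distance from every latent vector to every code
indices = distances.argmin(dim=1)      # index of the nearest codebook entry per position
z_quantized = codebook[indices]        # discrete ("quantized") latent fed to the decoder

print(indices.shape, z_quantized.shape)  # torch.Size([256]) torch.Size([256, 256])
```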

5. What is CLIP?

  • a model trained to determine which caption from a set of captions best fits with a given image
  • CLIP = Contrastive Language–Image Pre-training
  • it also uses Transformers
  • proposed by OpenAI in January 2021
  • Paper: “Learning transferable visual models from natural language supervision”
  • Git Repository: https://github.com/openai/CLIP

Unlike VQGAN, CLIP is not a generative model. CLIP is “just” trained to represent both text and images very well in a shared embedding space.

The revolutionary thing about CLIP is that it is capable of zero-shot learning: it performs exceptionally well on previously unseen datasets, often better than models that have been trained exclusively on that particular dataset!
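Because the repository is public, scoring a handful of captions against an image takes only a few lines. The sketch below follows the usage shown in the CLIP repository’s README; the image file name and the candidate captions are of course just placeholders.

```python
import torch
import clip                    # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("dog.jpg")).unsqueeze(0).to(device)   # any image you like
captions = ["a dog playing the violin", "an avocado armchair", "a diagram of a neural network"]
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    logits_per_image, logits_per_text = model(image, text)   # similarity scores
    probs = logits_per_image.softmax(dim=-1)

# The caption with the highest probability is CLIP's best guess for the image
print(list(zip(captions, probs[0].tolist())))
```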

Funfact:

OpenAI published DALL-E (remember the avocado chairs?) at the same time as CLIP. DALL-E is a text-to-image model like VQGAN+CLIP. CLIP was open-sourced completely, whereas DALL-E wasn’t.

“The weights for DALL-E haven’t even been publicly released yet, so you can see this CLIP work as somewhat of a hacker’s attempt at reproducing the promise of DALL-E.” (Source)

6. How do VQGAN and CLIP work together

In one sentence: CLIP guides VQGAN towards an image that is the best match to a given text.

Using the terminology introduced in Katherine Crowson’s notebook, CLIP is the “Perceptor” and VQGAN is the “Generator”.

“CLIP is a model that was originally intended for doing things like searching for the best match to a description like “a dog playing the violin” among a number of images. By pairing a network that can produce images (a “generator” of some sort) with CLIP, it is possible to tweak the generator’s input to try to match a description.” (@advadnoun)

It makes sense to look at the inputs and outputs of both models respectively:

VQGAN: Like all GANs, VQGAN takes in a noise vector and outputs a (realistic) image.

CLIP, on the other hand, takes in either:
- (a) an image, and outputs image features; or
- (b) a text, and outputs text features.
How well an image and a text match can then be measured by the cosine similarity of the two feature vectors.

By leveraging CLIP’s capabilities as a “steering wheel”, we can guide a search through VQGAN’s latent space to find images that match a text prompt very well according to CLIP.
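To make this concrete, here is a heavily simplified sketch of that guided search. The real notebook decodes from VQGAN’s quantized latent space and adds “cutouts” (random crops), CLIP’s normalization and several other tricks; here a toy upsampling network stands in for the pretrained VQGAN decoder (the `generator` below is an assumption of this sketch, not the real model), so that the core idea is visible at a glance: optimize the latent so that the CLIP embedding of the generated image moves closer to the embedding of the prompt.

```python
import torch
import torch.nn.functional as F
import clip                            # the real CLIP model plays the role of the "Perceptor"

device = "cpu"                         # kept on CPU for simplicity; the notebooks run on GPU
perceptor, _ = clip.load("ViT-B/32", device=device)

# Toy stand-in for the pretrained VQGAN decoder ("Generator"): latent grid in, 224x224 RGB out
generator = torch.nn.Sequential(
    torch.nn.Conv2d(256, 3, kernel_size=1),
    torch.nn.Upsample(scale_factor=16, mode="bilinear"),
    torch.nn.Sigmoid(),
).to(device)

z = torch.randn(1, 256, 14, 14, device=device, requires_grad=True)  # the latent we optimize
optimizer = torch.optim.Adam([z], lr=0.1)

text = clip.tokenize(["a dog playing the violin"]).to(device)
with torch.no_grad():
    text_features = F.normalize(perceptor.encode_text(text).float(), dim=-1)

for step in range(50):
    image = generator(z)                                              # latent -> image
    image_features = F.normalize(perceptor.encode_image(image).float(), dim=-1)
    loss = 1 - (image_features * text_features).sum()                 # 1 - cosine similarity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                                  # nudge the latent toward the prompt
```

Every new prompt means running a loop like this again from a fresh latent, which is exactly the difference to “normal” GAN inference described in the sidenote below.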

Sidenote: Difference to “normal” GANs:

Even though both VQGAN and CLIP are pretrained models, you basically run an optimization (“train” again) for every prompt you give to VQGAN+CLIP. That is different from “normal” GANs, where you train the model once (or use a pretrained one) and then just run inference to generate an image.

7. What about the training data?

In the case of VQGAN+CLIP we are dealing with two models: VQGAN is trained on a mostly canonical dataset like ImageNet or COCO (this depends on the concrete model you use, of course; VQGAN is just the architecture). CLIP, on the other hand, was trained on a vast (and undisclosed) dataset of random internet material, which makes it so exciting, but also slightly scary and unpredictable.

8. Further reading and cool links
