Explaining the code of the popular text-to-image algorithm (VQGAN+CLIP in PyTorch)

General facts about this notebook

  • It uses PyTorch, a popular machine learning framework for Python
  • It connects two existing (open-source, pretrained) models: CLIP (OpenAI) and VQGAN (Esser et al. from Heidelberg University)
  • It is structured in the following sections/cells:
    — Setup, Installing libraries
    — Selection of models to download
    — Loading libraries and definitions
    — Implementation tools
    — Execution

A high-level overview of the algorithm

A high-level overview of the VQGAN+CLIP architecture (image licensed under CC-BY 4.0)

A core concept: Inference-by-optimization

  • Training is the optimization process of finding the right weights of your model in order to minimize a loss function.
  • Inference is the process of using a pre-trained model to make predictions
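
Inference-by-optimization combines the two definitions above: the pretrained weights stay frozen, and gradient descent updates the input instead. Below is a minimal sketch in plain Python, not the notebook's PyTorch code. The one-weight "model", the target value, and the names frozen_model, loss_fn, optimize_input, and lr are illustrative assumptions.

```python
# Inference-by-optimization in miniature: the model's weight is frozen,
# and gradient descent updates the input z instead.
# All names and values here are illustrative, not taken from the notebook.

FROZEN_WEIGHT = 3.0  # stands in for pretrained weights that are never updated


def frozen_model(z):
    """A toy 'pretrained model': a single fixed multiplication."""
    return FROZEN_WEIGHT * z


def loss_fn(output, target):
    """Squared error between the model output and the target."""
    return (output - target) ** 2


def optimize_input(z, target, lr=0.01, steps=200):
    """Adjust the input z so that frozen_model(z) approaches target."""
    for _ in range(steps):
        output = frozen_model(z)
        # Chain rule: d(loss)/dz = 2 * (output - target) * FROZEN_WEIGHT
        grad = 2.0 * (output - target) * FROZEN_WEIGHT
        z -= lr * grad
    return z


z_opt = optimize_input(z=0.0, target=6.0)
print(z_opt, loss_fn(frozen_model(z_opt), 6.0))
```

The weight never changes; only z moves until the frozen model's output matches the target. In the notebook the same roles are played by much larger pretrained networks and an image-shaped z.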

Variable naming choices and what they refer to

  • Perceptor → CLIP model
  • Model (also sometimes named the “Generator”) → VQGAN model
  • Prompt → the model we’re going to train when we run the notebook
  • z → A vector as input for VQGAN for synthesizing an image
  • iii → A batch of CLIP-encoded image cutouts

The notebook step by step

STEP 0. Downloading the pre-trained models (CLIP & VQGAN)

STEP 1. Generating the initial z vector (Cell “Execution”, lines 29–36)

STEP 2. Initializing the optimizer with z (Cell “Execution”, line 39)

STEP 3. Instantiating the Prompt models for every text prompt (Cell “Execution”, lines 46–49)

STEP 4. The actual optimization loop (Cell “Execution”, lines 134–144)

The actual optimization procedure

A more detailed view of the inference/optimization process: forward pass + backward pass (image licensed under CC-BY 4.0)

⛰️ What happens in ascend_txt?

🔥 Is it CLIP? Is it VQGAN? What exactly is being trained or optimized in this notebook?

🔥 The Prompt class

✂️ MakeCutouts

Other interesting aspects of this notebook (Steganography, etc.)

🕵️‍♀️ Steganography

A little dictionary




Loss function
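
A loss function maps a prediction and a target to a single number that the optimizer tries to make smaller. As a hedged illustration, the sketch below uses cosine distance, one common way to compare two embedding vectors. The notebook's own loss may be defined differently, and the function name and sample vectors are assumptions chosen for this example.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity: 0 for same direction, 2 for opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical embeddings: one for an image, one for a text prompt.
image_embedding = [1.0, 0.0]
text_embedding = [0.0, 1.0]

# Orthogonal vectors share no direction, so the loss is far from 0.
loss = cosine_distance(image_embedding, text_embedding)
print(loss)
```

The closer the two vectors point in the same direction, the closer the loss gets to 0, which is the signal an optimizer follows.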

One-hot encoding
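
A one-hot vector has a 1 at the position of the chosen class and 0 everywhere else. Multiplying such a vector with a table of rows picks out exactly one row. The plain-Python sketch below shows both steps; the helper names one_hot and select_row and the small table are made up for illustration.

```python
def one_hot(index, num_classes):
    """Return a list of length num_classes with 1.0 at position index."""
    if not 0 <= index < num_classes:
        raise ValueError("index must be in [0, num_classes)")
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


def select_row(one_hot_vector, table):
    """Multiply a one-hot row vector with a table given as a list of rows."""
    num_cols = len(table[0])
    return [
        sum(one_hot_vector[r] * table[r][c] for r in range(len(table)))
        for c in range(num_cols)
    ]


# A hypothetical table with three rows of two values each.
table = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

vector = one_hot(1, 3)
row = select_row(vector, table)
print(vector, row)
```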




Vector Quantization
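
Vector quantization replaces a continuous vector with the nearest entry from a fixed set of vectors, the codebook. The sketch below is a plain-Python illustration under assumed values: the three-entry codebook and the helper names squared_distance and quantize are hypothetical and far smaller than anything a real VQGAN uses.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vector, codebook):
    """Return (index, entry) of the codebook entry nearest to vector."""
    index = min(
        range(len(codebook)),
        key=lambda i: squared_distance(vector, codebook[i]),
    )
    return index, codebook[index]


# A hypothetical codebook with three two-dimensional entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]

index, entry = quantize([0.9, 0.8], codebook)
print(index, entry)
```

After quantization, the vector can be stored as just the integer index, because the codebook entry can always be looked up again.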


Cool Resources



Alexa Steinbrück
