Generating images of AI with AI — Part 1: AttnGAN + COCO

🧪 This is the start of a series of experiments sparked by the question of whether we can create (better) pictures of Artificial Intelligence by using AI techniques to generate them. The question emerged from a larger research project I am involved in, which explores ways to improve the visual representation of AI in the media: “Better Images of AI”.

⚗️ Which AI techniques are best suited to generating these images? How do we deal with the biases ingrained in these systems? Can we expose and unpack the infrastructure and human labour behind the datasets and algorithms that led to these images, and make them transparent in the image-making process, rather than hiding the human labour and pretending AI systems are autonomous?

1. What is AttentionGAN?

The AttentionGAN model I will use is capable of transforming text into images, which is exactly the capability our task calls for.

AttentionGAN for text-to-image generation is a neural network architecture proposed in the paper AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks by Tao Xu et al. (2018).

A GAN (Generative Adversarial Network) is a type of neural network architecture that leverages a machine learning strategy called “adversarial training”. A GAN consists of two networks: a generator that produces candidate images and a discriminator that tries to tell them apart from real ones. Both improve over time by competing against each other. After training, the generator can be used to generate new images (or samples from whatever domain it was trained on).
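To make the adversarial setup concrete, here is a minimal sketch of one GAN training step in PyTorch. This is not AttnGAN’s actual code; the toy fully-connected networks, the 100-dimensional noise vector and the hyperparameters are placeholder assumptions:

```python
import torch
import torch.nn as nn

# Toy stand-ins; real GANs (and AttnGAN in particular) use much deeper networks.
generator = nn.Sequential(nn.Linear(100, 784), nn.Tanh())       # noise -> fake "image"
discriminator = nn.Sequential(nn.Linear(784, 1), nn.Sigmoid())  # image -> real/fake score

bce = nn.BCELoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def training_step(real_images):  # real_images: (batch, 784)
    batch = real_images.size(0)
    noise = torch.randn(batch, 100)

    # 1) Discriminator step: real images should score 1, generated ones 0.
    fakes = generator(noise).detach()  # detach: don't update the generator here
    loss_d = bce(discriminator(real_images), torch.ones(batch, 1)) + \
             bce(discriminator(fakes), torch.zeros(batch, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # 2) Generator step: try to make the discriminator score fakes as real.
    loss_g = bce(discriminator(generator(noise)), torch.ones(batch, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```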

Attention is a technique in machine learning that can be applied to many fields, from computer vision to natural language processing (Transformer networks make extensive use of the attention mechanism). The main idea behind attention is that some parts of the data are more important than others, so the model learns to weight them accordingly.
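As a rough illustration, here is a generic scaled dot-product attention function in PyTorch; this is the textbook formulation, not AttnGAN’s specific attention module. Each query scores all keys for relevance, and the softmax-normalized scores weight the values:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Weight the values by how relevant each key is to the query."""
    scores = query @ key.transpose(-2, -1) / key.size(-1) ** 0.5  # query-key similarity
    weights = F.softmax(scores, dim=-1)                           # attention weights sum to 1
    return weights @ value                                        # weighted sum of values

# Example: 2 image-region queries attending over 4 word vectors of size 8,
# loosely mirroring how AttnGAN lets image subregions attend to caption words.
words = torch.randn(4, 8)
regions = torch.randn(2, 8)
attended = scaled_dot_product_attention(regions, words, words)    # shape: (2, 8)
```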

“AttnGAN can synthesize fine-grained details at different subregions of the image by paying attentions to the relevant words in the natural language description”, the authors write in the abstract.

Good to know: A pretrained AttentionGAN model can be used conveniently inside RunwayML, a platform for creative AI.

2. Choosing prompts and generating images

In order to come up with text input, I searched for “Artificial Intelligence” on common stock photo platforms and collected the captions beneath those images. Here are some examples:

  • “3d rendering robot learning or solving problems”
  • “artificial intelligence”
  • “artificial intelligence digital concept with brain shape”
  • “Electricity flowing through computer printed circuitboard style brain graphic”
  • “Hands of robot and human touching”
  • “mechanical robot”
  • “Robot checking brain testing result with computer interface, futuristic human brain analysis, innovative technology in science and medicine concept”
  • “white humanoid robot”

(I find that the wording of these stock photo captions has a peculiar style of its own…)

Below is a selection of results from feeding the captions into AttentionGAN. I generated multiple images per prompt by appending whitespace to the input text.
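The trick itself is simple; here is a sketch, where `generate` is a hypothetical stand-in for however your AttentionGAN setup is invoked (e.g. through RunwayML):

```python
def generate_variations(generate, prompt, n=4):
    """Produce n images for one prompt by appending increasing amounts of whitespace.

    `generate` is a hypothetical text-to-image function; the trailing spaces
    change the raw input just enough to yield a different image per call.
    """
    return [generate(prompt + " " * i) for i in range(n)]
```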

“Artificial Intelligence”

Input text: “Artificial Intelligence”
Input text: “artificial intelligence digital concept with brain shape”

“Robots”

Input text: “3d rendering robot learning or solving problems”
Input text: “Hands of robot and human touching”
Input text: “Robot checking brain testing result with computer interface, futuristic human brain analysis, innovative technology in science and medicine concept”
Input text: “mechanical robot”
Input text: “white humanoid robot”

3. A closer look at the dataset: COCO

After looking at these generated images, I asked myself what the dataset this model was pre-trained on looks like. Did the words “robot” or “artificial intelligence” even exist in this dataset? That is a crucial question, so I needed to find out what the RunwayML model was trained on.

Did the words “robot” or “AI” even exist in the COCO dataset?

tl;dr: After doing some research, I concluded that the pre-trained AttentionGAN model in RunwayML seems to have been trained on the COCO dataset.

A sidenote about RunwayML and model information: when it comes to pre-trained models, RunwayML unfortunately doesn’t tell you much about which data they were trained on. I asked in the RunwayML Slack and was simply told that “datasets are private by default”. In a technical sense this is true: public models on RunwayML don’t expose or share the dataset they were trained on. But from a user/artist perspective I found this unsatisfying: I am interested in what domain a certain model was trained on so I can reason about its output and whether it is a good fit for an artistic idea I have in mind. I found it a bit confusing that RunwayML doesn’t put more effort into this transparency aspect of machine learning. The ml5 community, for example, takes this aspect very seriously: Ellen Nickles has started the “model and data provenance” project, which collects information about the background of each model and dataset.

COCO is a famous dataset widely used in AI research. It was first introduced in 2014 (paper). COCO stands for “Common Objects in Context”. According to the authors, it consists of “images of complex everyday scenes containing common objects in their natural context”. COCO contains 91 object types and a total of 328,000 images.

Screenshot of the website of the COCO dataset and pictograms of its 91 object categories

It is important to highlight the human labour behind the COCO dataset: its images were gathered from the Flickr photo platform, where (mostly) amateur photographers upload their pictures. The images were then labeled and captioned by Amazon Mechanical Turk workers. The process of collecting the image captions is described in this paper. It contains interesting details about the instructions the AMT workers were given, e.g.: “Describe all the important parts of the scene.”, “Do not describe what a person might say.”, “Do not give people proper names.”

3.1. AI-related content in the COCO dataset

The terms “AI” or “artificial intelligence” were not present in the captions or the 91 object categories of the COCO dataset.

However, the word “robot” appeared in the captions of 35 images (69 total occurrences of the word, since every image has multiple captions), though it did not exist among the 91 object categories of COCO.
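Counts like this can be reproduced with the official captions annotations and the pycocotools library. A minimal sketch, assuming the train2014 captions file (which split the RunwayML model actually corresponds to is my assumption, and naive substring matching will also catch words like “robots”):

```python
from pycocotools.coco import COCO

# Captions annotation file from https://cocodataset.org/#download (path is an assumption)
coco = COCO("annotations/captions_train2014.json")
anns = coco.loadAnns(coco.getAnnIds())  # every annotation holds one caption

robot_anns = [a for a in anns if "robot" in a["caption"].lower()]
robot_images = {a["image_id"] for a in robot_anns}  # captions map back to images
occurrences = sum(a["caption"].lower().count("robot") for a in anns)

print(f"'robot': {occurrences} occurrences, {len(robot_anns)} captions, {len(robot_images)} images")
```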

3.2. Robots in COCO are of a very special kind…

To my surprise, the robots in these 35 images were not the kinds of robots you would expect, as summarized in my tweet here:

A large share of the “robot”-captioned images in COCO show simple DIY toys that look like robots but are made out of cardboard or other materials. Other examples show everyday objects that have been painted or manipulated to resemble robots.

A couple of images show actual (functioning) robots in university labs and on soccer fields as part of the famous RoboCup competition, but they are the minority. This suggests that robots figure in the dataset as a pop-cultural phenomenon rather than a scientific one.

4. Evaluation of COCO for the task of generating images of AI

Since COCO is a dataset focused on real-world objects, it is probably not a good candidate for generating images about an abstract topic such as AI. Although robots and robotics are related to the topic of AI, robotics and AI remain two distinct fields of research (with only some overlap). It is a common myth that AI is primarily about (humanoid) robots.

What struck me was how I tried to recognize “AI-ness” in the generated images, even though the word “AI” was not present in the dataset at all, and the model therefore had no way of learning any relationship between the word/concept “AI” and visual features.

5. Evaluation of AttentionGAN for the task of generating images of AI

A text-to-image model like AttentionGAN proved to be an interesting starting point for creating images of AI. By training this model architecture on a different dataset than COCO, e.g. a dataset of images and captions around more abstract themes, we could potentially achieve more fitting results.

‘Thanks for reading,’ mumbled the humanoid robot, and left the scene to the right ;-)

