These AI bots took your captions and made creepy photos out of them

The images are unsettling, but they mark a significant step forward in artificial intelligence research.

OpenAI's GPT-3 has been unsettling an already-unsettled world for a few months now. It recently "blogged" and "wrote" an article for The Guardian, and although its syntax and grammar sound stilted and unnatural, it told the reader right off the bat: "I taught myself everything I know just by reading the internet, and now I can write this column. My brain is boiling with ideas." The internet, as we all know, is a terrible place to learn about ideas.

The same kind of AI model is now playing around with visuals, mainly in the form of photos. According to the MIT Technology Review, the Allen Institute for Artificial Intelligence, better known as AI2, has been training its text-and-image model (more conventionally known as a visual language model) to generate pictures from captions alone.

It takes millions of text samples for a model to even remotely grasp the "essence" of language. These models have since been tasked with generating pictures, and the results are not exactly what you or I would call successes. In one experiment, here is what the model produced when asked for a picture matching the caption "three people play video games while sitting on a couch." Yikes.


Bots are getting smarter — The approach reverses conventional training strategies, in which researchers give these models both text and images and have them essentially fill in the blanks. Google did this with its BERT model by teaching it "masking": the model reads an incomplete sentence and tries to complete it using textual and visual information.

When this is done millions of times, these models learn to complete sentences such as "She went to the ___ to exercise" shown alongside a photo of a woman inside a gym; a model successfully trained on mountains of data will suggest "gym" for the missing word. In the AI2 project, the bot is given only a caption, so it depends entirely on textual information and must then render its understanding of that caption as an image.
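To make the fill-in-the-blank objective concrete, here is a toy sketch of masked-word prediction. This is emphatically not BERT: it is a bag-of-words co-occurrence model over a tiny hand-made corpus (the corpus, the `___` blank token, and the `fill_mask` helper are all assumptions for illustration), but it shows the same idea of guessing a hidden word from its surrounding context.

```python
from collections import Counter, defaultdict

# Hypothetical miniature training corpus, invented for this sketch.
corpus = [
    "she went to the gym to exercise",
    "he went to the gym to lift weights",
    "they went to the park to run",
    "she went to the store to shop",
    "he went to the gym to train",
]

# Count how often each word appears alongside every other word in a sentence.
cooccur = defaultdict(Counter)
vocab = set()
for sentence in corpus:
    words = sentence.split()
    vocab.update(words)
    for w in words:
        for other in words:
            if other != w:
                cooccur[w][other] += 1

def fill_mask(sentence_with_blank):
    """Score each vocabulary word by its co-occurrence with the context."""
    context = [w for w in sentence_with_blank.split() if w != "___"]
    scores = {
        candidate: sum(cooccur[candidate][w] for w in context)
        for candidate in vocab
        if candidate not in context
    }
    return max(scores, key=scores.get)

print(fill_mask("she went to the ___ to exercise"))  # → gym
```

A real model like BERT replaces these raw co-occurrence counts with learned contextual representations, which is why it can generalize far beyond sentences it has literally seen.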

The results are hilarious and unnerving, but nonetheless proof of the rapid advances underway in artificial intelligence research. This is what the model conjured when I ran the caption: "A man attempting to ski on a flat hill."


At a broader level, this project shows that with consistent and rigorous training, artificial intelligence does have some ability to take in abstract information and produce tangible results. Down the line, if this is applied to robotics, we might see more complex communication and comprehension skills from bots. For now, though, we get nightmare-inducing art. That's the price of progress.