Battle of the Text-to-Image Generators

DALL-E 2 vs Stable Diffusion vs Midjourney

Here's your daily briefing:

  • The folks at Every are hard at work putting themselves (and us) out of work. Their new AI writing app (currently waitlist only) is starting to get into the hands of writers around the web, and the reaction so far seems to consist universally of exploding-head emojis:

  • Elad Gil thinks aloud in a new blog post about why AI startups didn't fare so well against incumbents during prior AI waves, and why that may be beginning to change:

  • If you thought Stable Diffusion going open-source was a big deal (it was), you might be excited to hear that CarperAI and Humanloop are currently working on an open-source alternative to GPT-3:

When it comes to consumer or user-focused AI tools, text-to-image generators like Stable Diffusion, DALL-E 2, and Midjourney are capturing much of the collective attention in the AI space right now.

And for good reason. They're pretty incredible:

The "painting" or image above was made using OpenAI's DALL-E 2.

In case you're not familiar with text-to-image generators, the only input required from the user to generate this was the following text prompt: "A van Gogh style painting of an American football player."

The AI model did the rest! 🤯

"DALL·E 2 has learned the relationship between images and the text used to describe them. It uses a process called “diffusion,” which starts with a pattern of random dots and gradually alters that pattern towards an image when it recognizes specific aspects of that image."

Pretty cool.

But what are the similarities and differences between the different models? And why should you choose one over the other?

Let's break it down a bit:

How they're similar:

  • All are considered "text-to-image generators."

  • All are considered "multimodal models," meaning they can understand multiple types of inputs and outputs, such as text and photos.

  • All use what are known as diffusion models to generate images. The linked video goes into varying levels of detail about how they work. But if you want to sound smart at the next cocktail party, just remember: the main idea behind diffusion models comes from principles of thermodynamic equilibrium whereby molecules diffuse from high density to low density areas 🤓.
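The core idea can be sketched in a few lines of code. Below is a minimal, illustrative NumPy toy of the *forward* half of a diffusion process: repeatedly adding Gaussian noise to an image until it becomes indistinguishable from random dots. (This is a simplification for intuition only; real models use learned noise schedules, and a trained neural network runs this process in reverse, starting from noise and denoising step by step toward an image that matches the text prompt.)

```python
import numpy as np

def forward_diffusion(image, num_steps=50, noise_scale=0.5, seed=0):
    """Progressively add Gaussian noise to an image array.

    Returns the list of intermediate states; the last entries are
    essentially pure noise. A diffusion model is trained to undo
    these steps one at a time (the 'reverse' process).
    """
    rng = np.random.default_rng(seed)
    x = image.astype(float)
    trajectory = [x.copy()]
    for _ in range(num_steps):
        x = x + noise_scale * rng.standard_normal(x.shape)
        trajectory.append(x.copy())
    return trajectory

# A tiny 4x4 stand-in for an image: all-zero "blank canvas."
img = np.zeros((4, 4))
trajectory = forward_diffusion(img)

# The variance of the pixels grows with every noising step --
# structure is destroyed, which is exactly what the trained
# model learns to reverse during generation.
print(trajectory[-1].var() > trajectory[1].var())
```

Function and parameter names here (`forward_diffusion`, `noise_scale`) are made up for this sketch; they are not part of any of the three products' APIs.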

How they're different:

  • The number of images used to train the model:

    • Stable Diffusion uses 2 billion images, DALL-E 2 uses 400 million, and Midjourney uses tens of millions.

  • The type of images used to train the model:

    • DALL-E 2 has far more stock photos in its training set, whereas Stable Diffusion and Midjourney are weighted more heavily toward images of art.

  • Stable Diffusion is open source, whereas Midjourney and DALL-E 2 use proprietary software.

  • Stable Diffusion, being open source, is largely unrestricted, which means it carries a greater risk of being used for nefarious purposes such as deepfakes of celebrities or political figures.

Let's take a look at how these differences lead to different outputs:

"A modern couch designed by Basquiat Realistic Photo, Advertising photography AA"


"A beautiful landscape in a Cubo-Futurism style, digital art"

"The creation of the universe in the style of Leonardo Da Vinci"

(Images courtesy of Reddit)

The following three images were created using the prompt: "The crowds at the Black Friday sales at Walmart, a masterpiece painting by Rembrandt van Rijn"

DALL-E 2:

Midjourney:

Stable Diffusion:

So...which one is the best?

Like most things in art and technology, the answer is: it depends.

Some people or commercial entities might appreciate DALL-E 2 for its more realistic, stock-photo-style outputs or its ability to render multiple characters in crisp detail. Others will gravitate toward Midjourney for its artistic style and otherworldly, ethereal outputs. Still others will opt for Stable Diffusion, both for its unique aesthetic and for its open-source ethos, which is sure to fuel an explosion of innovation as more and more people remix and refine the model for their specific needs.

We'll leave you with this, from the founder of Stability AI (the team behind Stable Diffusion):

"Different types of model resonate differently...DALL-E 2 is an amazing technological achievement, but it does have a lot of stock photos, and that's great for a whole bunch of things and it resonates in that way. Stable Diffusion has a lot more art in it, and so maybe it's a little more artistic. Midjourney has even more art stuff in it, due to the front-end filters and so it's even more artistic. But they all resonate in different ways. So the artistic process is going to be democratized for a lot of people by being able to express themselves in a way that they haven't before."

-Emad Mostaque, Interdependence podcast

See you tomorrow!