DALL-E 2 shows the power of generative deep learning, but raises dispute over AI practices

OpenAI’s new text-to-image generator has raised some controversy

The beauty of DALL-E 2

Like other milestone OpenAI announcements, DALL-E 2 comes with a detailed paper and an interactive blog post that shows how the machine learning model works. There’s also a video that provides an overview of what the technology is capable of doing and what its limitations are.

DALL-E 2 is a “generative model,” a special branch of machine learning that creates complex output instead of performing prediction or classification tasks on input data. You provide DALL-E 2 with a text description, and it generates an image that fits the description.

Generative models are a hot area of research that received much attention with the introduction of generative adversarial networks (GAN) in 2014. The field has seen tremendous improvements in recent years, and generative models have been used for a vast variety of tasks, including creating artificial faces, deepfakes, synthesized voices, and more.

However, what sets DALL-E 2 apart from other generative models is its capability to maintain semantic consistency in the images it creates.

For example, the following images (from the DALL-E 2 blog post) are generated from the description “An astronaut riding a horse.” One of the descriptions ends with “as a pencil drawing” and the other “in photorealistic style.”

The model remains consistent in drawing the astronaut sitting on the back of the horse and holding his/her hands in front. This kind of consistency shows itself in most examples OpenAI has shared.

The following examples (also from OpenAI’s website) show another feature of DALL-E 2, which is to generate variations of an input image. Instead of providing DALL-E 2 with a text description, you provide it with an image, and it tries to generate other forms of the same image. In this case, DALL-E 2 maintains the relations between the elements in the image, including the girl, the laptop, the headphones, the cat, the city lights in the background, and the night sky with moon and clouds.

Other examples suggest that DALL-E 2 seems to understand depth and dimensionality, a great challenge for algorithms that process 2D images.

Even if the examples on OpenAI’s website were cherry-picked, they are impressive. And the examples shared on Twitter show that DALL-E 2 seems to have found a way to represent and reproduce the relationships between the elements that appear in an image, even when it is “dreaming up” something for the first time.

In fact, to prove how good DALL-E 2 is, OpenAI CEO Sam Altman took to Twitter and asked users to suggest prompts to feed to the generative model. The results are fascinating.

The science behind DALL-E 2

DALL-E 2 takes advantage of CLIP and diffusion models, two advanced deep learning techniques created in the past few years. But at its heart, it shares the same concept as all other deep neural networks: representation learning.

Ideally, the machine learning model should be able to learn latent features that remain consistent across different lighting conditions, angles, and background environments. But in practice, deep learning models often learn the wrong representations. For example, a neural network might think that green pixels are a feature of the “sheep” class because all the images of sheep it has seen during training contain a lot of grass. Another model that has been trained on pictures of bats taken during the night might consider darkness a feature of all bat pictures and misclassify pictures of bats taken during the day. Other models might become sensitive to objects being centered in the image and placed in front of a certain type of background.

Learning the wrong representations is partly why neural networks are brittle, sensitive to changes in the environment, and poor at generalizing beyond their training data. It is also why neural networks trained for one application need to be finetuned for other applications: the features of the final layers of the neural network are usually very task-specific and can’t generalize to other applications.

In theory, you could create a huge training dataset that contains all kinds of variations of data that the neural network should be able to handle. But creating and labeling such a dataset would require immense human effort and is practically impossible.

This is the problem that Contrastive Language-Image Pre-training (CLIP) solves. CLIP trains two neural networks in parallel on images and their captions. One of the networks learns the visual representations in the image and the other learns the representations of the corresponding text. During training, the two networks try to adjust their parameters so that an image and its description produce similar embeddings, while mismatched images and descriptions produce dissimilar ones.
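To make the training objective concrete, here is a minimal sketch of a symmetric contrastive loss in plain Python. It is an illustration under stated assumptions, not OpenAI’s implementation: the function names, the tiny two-dimensional embeddings, and the temperature value are all made up for the example. The idea is that each image should be most similar to its own caption, and each caption to its own image.

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed in a numerically stable way."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, caption) pairs.

    Row i of image_embs is assumed to belong with row i of text_embs.
    The temperature is an illustrative assumption.
    """
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    n = len(imgs)
    # sim[i][j]: temperature-scaled cosine similarity between image i and caption j
    sim = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
            for j in range(n)] for i in range(n)]
    # Each image should pick out its own caption (rows) ...
    loss_i2t = sum(cross_entropy(sim[i], i) for i in range(n)) / n
    # ... and each caption should pick out its own image (columns).
    loss_t2i = sum(cross_entropy([sim[i][j] for i in range(n)], j)
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2

# Hypothetical embeddings: captions either line up with their images or are swapped.
images = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[0.9, 0.1], [0.1, 0.9]]
swapped_texts = [[0.1, 0.9], [0.9, 0.1]]

loss_aligned = contrastive_loss(images, aligned_texts)
loss_swapped = contrastive_loss(images, swapped_texts)
```

With these toy values, the loss is low when captions line up with their images and high when they are swapped. That difference is the signal that would push the two encoders toward a shared embedding space during training.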

One of the main benefits of CLIP is that it does not need its training data to be labeled for a specific application. It can be trained on the huge number of images and loose descriptions that can be found on the web. Additionally, without the rigid boundaries of classic categories, CLIP can learn more flexible representations and generalize to a wide variety of tasks. For example, if an image is described as “a boy hugging a puppy” and another described as “a boy riding a pony,” the model will be able to learn a more robust representation of what a “boy” is and how it relates to other elements in images.

CLIP has already proven to be very useful for zero-shot and few-shot learning, where a machine learning model is asked on the fly to perform tasks that it hasn’t been trained for, with few or no examples.
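As a rough illustration of how a shared embedding space enables zero-shot classification, the sketch below picks the caption whose embedding is closest to an image embedding. The three-dimensional vectors are hand-written stand-ins for what trained image and text encoders would produce, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def zero_shot_classify(image_emb, caption_embs):
    """Return the caption whose embedding is most similar to the image embedding."""
    return max(caption_embs, key=lambda caption: cosine(image_emb, caption_embs[caption]))

# Hypothetical embeddings standing in for text-encoder outputs.
caption_embs = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a horse": [0.0, 0.1, 0.9],
}

# Hypothetical embedding standing in for the image-encoder output of a cat picture.
image_emb = [0.2, 0.8, 0.1]

prediction = zero_shot_classify(image_emb, caption_embs)
```

No classifier is trained for the three categories here. The candidate labels are simply written as captions, which is why new categories can be added without retraining.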

The other machine learning technique used in DALL-E 2 is “diffusion,” a kind of generative model that learns to create images by gradually noising and denoising its training examples. Diffusion models are like autoencoders, which transform input data into an embedding representation and then reproduce the original data from the embedding information.
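The noising and denoising idea can be sketched in a few lines. The example below uses one common formulation from denoising diffusion models, in which a clean sample is blended with Gaussian noise according to a signal fraction (called `alpha_bar` here). The article does not say which formulation DALL-E 2 uses, so treat the function names, the parameter, and the toy data as assumptions.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Forward step: blend clean data with Gaussian noise.

    alpha_bar in (0, 1] is the fraction of signal kept; smaller means noisier.
    Returns the noisy sample and the noise that was mixed in.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(alpha_bar) * x + math.sqrt(1.0 - alpha_bar) * e
          for x, e in zip(x0, eps)]
    return xt, eps

def remove_noise(xt, predicted_eps, alpha_bar):
    """Invert the forward step, given an estimate of the noise that was added."""
    return [(x - math.sqrt(1.0 - alpha_bar) * e) / math.sqrt(alpha_bar)
            for x, e in zip(xt, predicted_eps)]

rng = random.Random(0)          # seeded so the run is repeatable
x0 = [0.5, -1.0, 0.25, 2.0]     # toy stand-in for pixel values

xt, eps = add_noise(x0, alpha_bar=0.3, rng=rng)

# A trained network would predict eps from xt; this sketch cheats and
# reuses the true noise to show that the step is invertible.
recovered = remove_noise(xt, eps, alpha_bar=0.3)
```

In a real diffusion model, the hard part is the noise predictor, which is a neural network trained to estimate `eps` from the noisy input. Generation then starts from pure noise and applies many small denoising steps.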

DALL-E 2 trains a CLIP model on images and captions. It then uses the CLIP model to train the diffusion model. Basically, the diffusion model uses the CLIP model to generate the embeddings for the text prompt and its corresponding image. It then tries to generate the image that corresponds to the text.

Disputes over deep learning and AI research

For the moment, DALL-E 2 will only be made available to a limited number of users who have signed up for the waitlist. Since the release of GPT-2, OpenAI has been reluctant to release its AI models to the public. GPT-3, its most advanced language model, is only available through an API. There’s no access to the actual code and parameters of the model.

OpenAI’s policy of not releasing its models to the public has not sat well with the AI community and has attracted criticism from some renowned figures in the field.

DALL-E 2 has also resurfaced some of the longtime disagreements over the preferred approach toward artificial general intelligence. OpenAI’s latest innovation has certainly proven that with the right architecture and inductive biases, you can still squeeze more out of neural networks.

Proponents of pure deep learning approaches jumped on the opportunity to slight their critics, among them cognitive scientist Gary Marcus, who recently published an essay titled “Deep Learning Is Hitting a Wall.” Marcus endorses a hybrid approach that combines neural networks with symbolic systems.

Based on the examples that have been shared by the OpenAI team, DALL-E 2 seems to manifest some of the commonsense capabilities that have so long been missing in deep learning systems. But it remains to be seen how deep this commonsense and semantic stability goes, and how DALL-E 2 and its successors will deal with more complex concepts such as compositionality.

The DALL-E 2 paper mentions some of the limitations of the model in generating text and complex scenes. Responding to the many tweets directed his way, Marcus pointed out that the DALL-E 2 paper in fact proves some of the points he has been making in his papers and essays.

Some scientists have pointed out that despite the fascinating results of DALL-E 2, some of the key challenges of artificial intelligence remain unsolved. Melanie Mitchell, Professor of Complexity at the Santa Fe Institute and author of Artificial Intelligence: A Guide For Thinking Humans, raised some important questions in a Twitter thread.

Mitchell referred to Bongard problems, a set of challenges that test the understanding of concepts such as sameness, adjacency, numerosity, concavity/convexity, and closedness/openness.

“We humans can solve these visual puzzles due to our core knowledge of basic concepts and our abilities of flexible abstraction and analogy,” Mitchell tweeted. “If such an AI system were created, I would be convinced that the field is making real progress on human-level intelligence. Until then, I will admire the impressive products of machine learning and big data, but will not mistake them for progress toward general intelligence.”

The business case for DALL-E 2

Since switching from non-profit to a “capped profit” structure, OpenAI has been trying to find the balance between scientific research and product development. The company’s strategic partnership with Microsoft has given it solid channels to monetize some of its technologies, including GPT-3 and Codex.

In a blog post, Altman suggested a possible DALL-E 2 product launch in the summer. Many analysts are already suggesting applications for DALL-E 2, such as creating graphics for articles (I could certainly use some for mine) and doing basic edits on images. DALL-E 2 will enable more people to express their creativity without the need for special skills with tools.

Altman suggests that advances in AI are taking us toward “a world in which good ideas are the limit for what we can do, not specific skills.”

In any case, the more interesting applications of DALL-E will surface as more and more users tinker with it. For example, the idea for Copilot and Codex emerged as users started using GPT-3 to generate source code for software.

If OpenAI releases a paid API service a la GPT-3, then more and more people will be able to build apps with DALL-E 2 or integrate the technology into existing applications. But as was the case with GPT-3, building a business model around a potential DALL-E 2 product will have its own unique challenges. A lot of it will depend on the costs of training and running DALL-E 2, the details of which have not been published yet.

And as the exclusive license holder to GPT-3’s technology, Microsoft will be the main winner of any innovation built on top of DALL-E 2 because it will be able to do it faster and cheaper. Like GPT-3, DALL-E 2 is a reminder that as the AI community continues to gravitate toward creating larger neural networks trained on ever-larger training datasets, power will continue to be consolidated in a few very wealthy companies that have the financial and technical resources needed for AI research.

This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech, and what we need to look out for. You can read the original article here.

Story by Ben Dickson

Ben Dickson is the founder of TechTalks. He writes regularly about business, technology and politics. Follow him on Twitter and Facebook.
