Don’t mistake OpenAI Codex for a programmer

OpenAI Codex is a powerful tool for programmers but won’t take their jobs

The “no free lunch” theorem

Codex is a descendant of GPT-3, a massive deep learning language model released last year. The complexity of deep learning models is often measured by the number of parameters they have. In general, a model’s learning capacity increases with the number of parameters. GPT-3 came with 175 billion parameters, more than two orders of magnitude more than its predecessor, GPT-2 (1.5 billion parameters). GPT-3 was trained on more than 600 gigabytes of text, a dataset more than 50 times larger than GPT-2’s.

Aside from the huge increase in size, the main innovation of GPT-3 was “few-shot learning,” the capability to perform tasks it wasn’t trained for. The paper that introduced GPT-3 was titled “Language Models are Few-Shot Learners” and stated: “Here we show that scaling up language models greatly improves task-agnostic, few-shot performance [emphasis mine], sometimes even reaching competitiveness with prior state-of-the-art fine-tuning approaches.”

Basically, the premise was that a large enough model trained on a large corpus of text can match or outperform several models that are specialized for specific tasks.

But according to the new paper by OpenAI, none of the various versions of GPT-3 were able to solve any of the coding problems used to evaluate Codex. To be fair, there were no coding samples in GPT-3’s training dataset, so we can’t expect it to be able to code. But the OpenAI scientists also tested GPT-J, a 6-billion-parameter model trained on The Pile, an 800-gigabyte dataset that includes 95 gigabytes of GitHub and 32 gigabytes of StackExchange data. GPT-J solved 11.4 percent of the coding problems. Codex, a 12-billion-parameter version of GPT-3 fine-tuned on 159 gigabytes of code examples from GitHub, solved 28.8 percent of the problems. A separate version of Codex, called Codex-S, which was fine-tuned through supervised learning, boosted the performance to 37.7 percent (other GPT and Codex models are trained through unsupervised learning).
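For readers wondering how a “percent of problems solved” figure can be computed, here is a minimal sketch of the pass@k estimator that, as far as I understand it, the Codex paper uses: generate n samples per problem, count the c samples that pass the problem’s unit tests, and estimate the chance that at least one of k samples would be correct. The function names and the per-problem counts below are my own illustrative assumptions, not OpenAI’s code or data.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples is correct,
    given that c out of n generated samples passed the unit tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the ways to draw k samples that are all wrong.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is an (n, c) pair."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three problems, 10 samples each.
    made_up_results = [(10, 0), (10, 3), (10, 10)]
    print(f"pass@1 = {benchmark_score(made_up_results, k=1):.3f}")
```

With k set to 1, the estimate for a single problem reduces to c / n, the share of samples that pass the tests.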

Codex proves that machine learning is still ruled by the “no free lunch” theorem (NFL), which means that generalization comes at the cost of performance. In other words, machine learning models are more accurate when they are designed to solve one specific problem. On the other hand, when their problem domain is broadened, their performance decreases.

Codex can perform one specialized task (transforming function descriptions and signatures into source code) with high accuracy at the cost of poor natural language processing capabilities. On the other hand, GPT-3 is a general language model that can generate decent text about a lot of topics (including complicated programming concepts) but can’t write a single line of code.
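To make that task concrete, here is a small illustration of what “function descriptions and signatures” turned into source code looks like. The function `count_vowels` is a hypothetical example that I wrote by hand; it is not taken from OpenAI’s evaluation set and is not real Codex output. The signature and docstring play the role of the prompt, and the body is the part a code model would be asked to supply.

```python
def count_vowels(text: str) -> int:
    """Return the number of vowels (a, e, i, o, u) in text, ignoring case."""
    # --- the prompt ends above; the body below is the expected completion ---
    return sum(1 for ch in text.lower() if ch in "aeiou")


if __name__ == "__main__":
    # A few checks a reviewer might run before accepting the completion.
    assert count_vowels("") == 0
    assert count_vowels("GPT") == 0
    assert count_vowels("Codex") == 2
    print("all checks passed")
```

The checks at the bottom matter as much as the body: a completion that merely looks plausible is not the same as one that behaves correctly.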

Size vs cost

The experiments of OpenAI’s researchers show that the performance of Codex improved as they increased the size of the machine learning model. At 300 million parameters, Codex solved 13.2 percent of the evaluation problems against the 28.8 percent performance of the 12-billion-parameter model.

But the full version of GPT-3 is 175 billion parameters, a full order of magnitude larger than the one used to create Codex. Wouldn’t training the larger model on the Codex training data yield better results?

One probable reason for stopping at 12 billion could be the dataset size. A larger Codex model would need a larger dataset. Training it on the 159-gigabyte corpus would probably cause overfitting, where the model becomes very good at memorizing and rehearsing its training examples and very bad at dealing with novel situations. Gathering and maintaining larger datasets is an expensive and time-consuming process.

An equally vexing problem would be the cost of Codex. Aside from a scientific experiment, Codex was supposed to become the backbone of a future product that can turn in profits for a research lab that is quasi-owned by a commercial entity. As I’ve discussed before, the costs of training and running the 175-billion-parameter GPT-3 model would make it very hard to develop a profitable business model around it.

However, a smaller but fine-tuned version of GPT-3 would be much more manageable in terms of profits and losses.

Finally, as OpenAI’s experiments show, Codex’s size/performance ratio follows a logarithmic scale. This means that performance gains gradually reduce as you increase the size of the model. Therefore, the added costs of gathering data and training and running the larger model might not be worth the small performance boost.
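A rough back-of-the-envelope sketch shows what a logarithmic relationship implies. It assumes, purely for illustration, that the share of problems solved grows linearly with the base-10 logarithm of the parameter count, and it fits that curve through the two data points quoted above (300 million parameters at 13.2 percent, 12 billion at 28.8 percent). Two points always define a line, so this demonstrates the assumption rather than verifying it; the function names are mine.

```python
import math

# (parameter count, percent of problems solved), as quoted in the article.
SMALL_MODEL = (300e6, 13.2)
LARGE_MODEL = (12e9, 28.8)


def fit_log_curve(p1, p2):
    """Fit score = intercept + slope * log10(params) through two points."""
    (n1, s1), (n2, s2) = p1, p2
    slope = (s2 - s1) / (math.log10(n2) - math.log10(n1))
    intercept = s1 - slope * math.log10(n1)
    return slope, intercept


def gain_per_doubling(slope: float) -> float:
    """Percentage points gained each time the model size doubles."""
    return slope * math.log10(2)


if __name__ == "__main__":
    slope, intercept = fit_log_curve(SMALL_MODEL, LARGE_MODEL)
    print(f"points per tenfold increase in size: {slope:.2f}")
    print(f"points per doubling of size: {gain_per_doubling(slope):.2f}")
```

Under this assumed curve, every doubling of the model buys the same fixed number of percentage points while the compute and data bill roughly doubles, which is the diminishing return described above.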

And note that code generation is a very lucrative market. Given the high hourly salaries of programmers, even saving a few hours’ worth of coding time per month would be enough to cover the subscription fees of Codex. In other domains where labor is less expensive, automating tasks with large language models will be more challenging from a profit and loss perspective.

Generating vs understanding code

This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech, and what we need to look out for. You can read the original article here.

Responsible use and reporting of AI

As I said after the release of Copilot, “AI Pair Programmer,” the term used on GitHub’s webpage for Copilot, is inaccurate.

Codex is not a programmer. And it’s also not going to take your job (if you’re a programmer). Coding is just part of what programmers do. OpenAI’s scientists observe that in its current state Codex “may somewhat reduce the cost of producing software by increasing programmer productivity,” but it won’t replace the other tasks that software developers regularly do, such as “conferring with colleagues, writing design specifications, and upgrading existing software stacks.”

Mistaking Codex for a programmer can also lead to “over-reliance,” where a programmer blindly approves any code generated by the model without revising it. Given the obvious and subtle mistakes Codex can make, overlooking this threat can entail quality and security risks. “Human oversight and vigilance is required for safe use of code generation systems like Codex,” OpenAI’s researchers warn in their paper.

Overall, the reaction of the programmer community shows that Codex is a very useful tool with a possibly huge impact on the future of the software industry. At the same time, given the hype surrounding the release of Copilot, it is important to understand its unwanted implications. In this regard, it is worth commending the folks at OpenAI for responsibly studying, documenting, and reporting the limits and threats of Codex.

Story by Ben Dickson

Ben Dickson is the founder of TechTalks. He writes regularly about business, technology and politics. Follow him on Twitter and Facebook.
