GPTs, software engineering, and a new age of hacking

Jim Waldo and Angela Wu

ChatGPT and other natural language models have recently sparked considerable intrigue and unease. Governments and businesses are increasingly acknowledging the role of Generative Pre-trained Transformers (GPTs) in shaping the cybersecurity landscape. This article discusses the implications of using GPTs in software development and their potential impact on cybersecurity in the age of artificial intelligence (AI). While GPTs can improve efficiency and productivity for programmers, they will not replace human programmers, because programming involves complex decision-making that goes well beyond simply writing code. And while GPTs may help find shallow bugs and thereby prevent some short-lived vulnerabilities, they are unlikely to change the balance of power between offense and defense in cybersecurity.


Generative Pre-trained Transformers (GPTs) are the technology of the moment. From GPT-based chatbots like ChatGPT and Bard to programming assistants like Copilot, this newest form of machine-learning-based AI has generated excitement, consternation, calls for outlawing or halting development, and societal predictions ranging from utopia to robot apocalypse.

While many still worry that this technology will upend society, better-informed commentators have begun to prevail. As we come to understand how GPTs work and how they are best used, the debate has become more productive and less panic-ridden.

Further discussion must focus on the policy questions that the use of GPTs raises. GPTs are another example of a dual-use technology: beneficial in some applications but concerning in others. As governments exert influence in the global security landscape, many ponder how GPTs will change the balance of power between offense and defense in cybersecurity. In particular, some worry that GPTs will allow vulnerabilities to be discovered and exploited at an increased rate, swinging the delicate balance in cybersecurity even further in favor of attackers.

To begin to understand the issues around the use of GPTs, we should understand how these models work. GPTs are large statistical models trained on vast amounts of text. Such models use the text they have already seen to predict which words are likely to come next. If you ask ChatGPT, for example, to tell you about Paul Revere, the program will generate sentences similar to those you would likely find in the parts of its training set that contain the words “Paul Revere.” The results appear as though they were written by a human because the system was trained on what humans write.
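To make next-word prediction concrete, the toy program below (our own sketch, vastly simpler than any real GPT) generates text from a hand-built table of which word most often follows which. A real GPT learns a far richer statistical model, a transformer network trained on billions of tokens, but the underlying principle of predicting what is likely to come next, given what came before, is the same.

```c
#include <stdio.h>
#include <string.h>

/* Toy next-word predictor: for each word, store the word that most often
 * followed it in a (hypothetical) training corpus, then generate text by
 * repeatedly appending the most likely next word. */
struct bigram {
    const char *word;
    const char *most_likely_next;
};

/* Hand-built table standing in for statistics learned from real text. */
static const struct bigram table[] = {
    { "paul",   "revere"    },
    { "revere", "rode"      },
    { "rode",   "to"        },
    { "to",     "lexington" },
};

static const char *predict_next(const char *word) {
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        if (strcmp(table[i].word, word) == 0)
            return table[i].most_likely_next;
    return NULL; /* word never seen in the training data */
}

int main(void) {
    const char *w = "paul";
    printf("%s", w);
    while ((w = predict_next(w)) != NULL) /* prints: paul revere rode to lexington */
        printf(" %s", w);
    printf("\n");
    return 0;
}
```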

The ability to generate statistically likely phrases also makes GPTs useful as a coding tool. Much of writing code is fairly formulaic, though writing code is only a small part of programming, a distinction we will return to below. For many tasks, a fair amount of boilerplate code must be written, and many examples of this kind of code already exist on the web, either in tutorials or in web-accessible repositories like GitHub (which was used to train Copilot). So ChatGPT can write the boilerplate code.

The generated code can then be examined and altered, and programmers can even go back to ChatGPT to request specific changes to the program. GPT-produced code often omits error handling, but the tools draw on broad knowledge of available libraries, and the output provides a useful starting point for a human programmer.
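As a hypothetical illustration (our own sketch, not the output of any particular model), the fragment below shows the kind of file-reading boilerplate such a tool might produce, along with the error handling it typically leaves out.

```c
#include <stdio.h>
#include <stdlib.h>

/* The kind of file-reading boilerplate a coding assistant might produce:
 * it compiles and works on the happy path, but it silently assumes that
 * fopen() and malloc() never fail and that the whole file fits in memory. */
char *read_whole_file(const char *path) {
    FILE *f = fopen(path, "rb");   /* no check for a NULL return         */
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);
    char *buf = malloc(size + 1);  /* no check for an allocation failure */
    fread(buf, 1, size, f);        /* return value ignored               */
    buf[size] = '\0';
    fclose(f);
    return buf;
}

int main(int argc, char **argv) {
    if (argc > 1)
        fputs(read_whole_file(argv[1]), stdout);
    return 0;
}
```

A human reviewer would add checks for the failure cases the comments flag before relying on anything like this.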

This coding capability has led some to claim that GPTs will soon replace programmers, but that claim mistakenly assumes that programmers only write code. Programmers must also decide what code needs to be written, which variations are necessary, and how the pieces of code fit together, among other big-picture issues. GPT-based tools may make programmers more efficient, but they will not replace programmers altogether.

GPT-based tools can also help programmers debug code. Debugging is the process of uncovering and removing coding errors. Industry estimates (many of which are old but have become part of industry folk wisdom) suggest that there are anywhere between 1 and 25 bugs in every 1,000 lines of code. Given that a program like Microsoft Windows has millions of lines of code, even the low end of that range implies thousands of latent bugs, making debugging a critical function.

Though tools for finding and fixing bugs are constantly being created and refined, debugging remains difficult. For many bugs, a GPT-based discovery tool could prove helpful. Many bugs result from a programmer leaving out some checking code, failing to recognize a boundary condition, or mistaking one kind of data for another. Buffer overflows, for example, are bugs that occur when a program writes beyond the memory allocated for a buffer, allowing an attacker to overwrite adjacent memory with data of their own choosing and then execute arbitrary commands, escalate privileges, or gain unauthorized access to the system. GPT-based tools could recognize these kinds of bugs, as they are common enough that many examples are available to train the models.
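As an illustration, the sketch below (a generic example, not drawn from any particular codebase) shows the unbounded copy at the heart of a typical buffer overflow, along with the bounded version that a reviewer, or a GPT-based tool trained on many such examples, might suggest.

```c
#include <stdio.h>
#include <string.h>

/* Classic buffer overflow: a fixed-size buffer is filled from caller-supplied
 * input with no length check. Input longer than the buffer writes past its
 * end and corrupts adjacent memory. */
void store_name(char *buf, const char *name) {
    strcpy(buf, name);                 /* unbounded copy: the bug */
}

/* The repair is just as mechanical: bound the copy to the buffer's size. */
void store_name_fixed(char *buf, size_t buflen, const char *name) {
    strncpy(buf, name, buflen - 1);
    buf[buflen - 1] = '\0';            /* guarantee termination   */
}

int main(void) {
    char name[16];
    store_name_fixed(name, sizeof name, "Paul Revere");
    puts(name);                        /* safe even for longer input */
    return 0;
}
```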

The security worry is that attackers could use GPT-based tools to find and exploit bugs. While not all bugs are exploitable, most exploits occur because of a bug—a buffer not checked for size, an unprotected network connection, or an unencrypted login credential left in unprotected memory.

The worry that GPT-based tools will shift the balance between offense and defense rests on a misunderstanding about software flaws. Not all bugs are the same. Most bugs are shallow bugs: mistakes that are easily recognized, fairly common, and easy to repair. GPT-based tools can detect such shallow bugs, but not deep bugs, which are rooted in a system's design. Deep bugs are difficult to identify and fix, often requiring extensive investigation and debugging. For example, the security flaw discovered in the Java reflection mechanism in 2016 was a deep bug that took years to fix. It was caused not by a minor flaw in the code but by an unexpected interaction between parts of the system that had each been working as expected. Fixing this bug required rethinking the basic system design and how its components interacted, all while ensuring that the rest of the code would not break because of the changes.

Nevertheless, shallow bugs can also cause serious security flaws. The OpenSSL Heartbleed vulnerability discovered in 2014 was a shallow bug caused by an unchecked buffer size. The bug silently leaked data to an adversary, one of the worst kinds of vulnerability. But once discovered, it was easy to fix, requiring changes to only a few lines of code. The fix did not affect any program that used the repaired code; everything that had worked continued to work afterward.
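A simplified sketch (our own, not the actual OpenSSL code) captures the flavor of that flaw and its fix: the reply trusts the length the request claims rather than the length actually received, and the repair is a short bounds check that leaves well-formed requests untouched.

```c
#include <stdio.h>
#include <string.h>

/* Simplified illustration in the spirit of the Heartbleed flaw, not the
 * actual OpenSSL code: the reply copies as many bytes as the request
 * claims to contain rather than as many as it actually contains, so a
 * short request with a large claimed length leaks adjacent memory. */
void build_reply(const unsigned char *payload, size_t actual_len,
                 size_t claimed_len, unsigned char *reply) {
    (void)actual_len;                   /* actual length never consulted: the bug */
    memcpy(reply, payload, claimed_len);
}

/* The fix is a few lines: refuse any request whose claimed length exceeds
 * what was actually received. */
int build_reply_fixed(const unsigned char *payload, size_t actual_len,
                      size_t claimed_len, unsigned char *reply) {
    if (claimed_len > actual_len)
        return -1;                      /* reject the malformed request */
    memcpy(reply, payload, claimed_len);
    return 0;
}

int main(void) {
    unsigned char request[] = "ping";   /* 4 payload bytes */
    unsigned char reply[64];
    if (build_reply_fixed(request, 4, 4, reply) == 0)
        printf("%.4s\n", (const char *)reply); /* echoes "ping" */
    return 0;
}
```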

This holds particular relevance for governments as they navigate cyberattacks and defense strategies. While attackers can use GPT-based tools to scan code for exploitable defects, defenders can use the same tools to find and repair those defects. Once an exploit is seen in the wild, GPT-based tools can help locate the flaw in the code that enabled it and assist with the fix. OpenAI itself recently launched a program to find bugs in its own artificial intelligence system. Thus the race between bug exploiter and bug exterminator remains roughly even, just as it was without these tools. GPT-based systems do not make deep vulnerabilities any easier to find.

From the perspective of the policymaker, the emergence and general use of GPT-based coding tools will not change the security landscape. While they may make some shallow bugs easier to find, their use by defenders will likely offset any advantage that might be gained by attackers. Indeed, we can hope that GPT-based tools will result in software that is more reliable because of their ability to find such bugs.

Policymakers still have much to worry about regarding the emergence of GPTs. These technologies raise questions related to intellectual property, academic integrity, content moderation, and detecting deep fakes. These are examples of areas where policy may be needed. However, GPT technologies will not change the cybersecurity landscape, and therefore policymakers would do well to turn their attention elsewhere.

 

This article was originally published in the Georgetown Journal of International Affairs.