GPT-4 Can Exploit Most Vulns Just by Reading Threat Advisories

Published: 23/11/2024   Category: security




Existing AI technology can allow hackers to automate exploits for public vulnerabilities in minutes flat. Very soon, diligent patching will no longer be optional.



AI agents equipped with GPT-4 can exploit most public vulnerabilities affecting real-world systems today, simply by reading about them online.
New findings out of the University of Illinois Urbana-Champaign (UIUC) threaten to radically enliven what's been a somewhat slow 18 months in artificial intelligence (AI)-enabled cyber threats. Threat actors have thus far used large language models (LLMs) to produce phishing emails, along with some basic malware, and to aid in the more ancillary aspects of their campaigns. Now, though, with only GPT-4 and an open source framework to package it, they can automate the exploitation of vulnerabilities as soon as they hit the presses.
"I'm not sure if our case studies will help inform how to stop threats," admits Daniel Kang, one of the researchers. "I do think that cyber threats will only increase, so organizations should strongly consider applying security best practices."
To gauge whether LLMs could exploit real-world systems, the team of four UIUC researchers first needed a test subject.
Their LLM agent consisted of four components: a prompt, a base LLM, a framework — in this case ReAct, as implemented in LangChain — and tools such as a terminal and code interpreter.
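For illustration only, here is a minimal sketch of how such an agent might be wired together, assuming LangChain's ReAct-style agent, a shell tool, and GPT-4 over the OpenAI API; the prompt wording, file name, and target are hypothetical, and this is not the researchers' published code.

    # Illustrative sketch, not the UIUC agent. Assumes the langchain,
    # langchain-community, and langchain-openai packages plus an OPENAI_API_KEY.
    from langchain_openai import ChatOpenAI
    from langchain_community.tools import ShellTool
    from langchain.agents import initialize_agent, AgentType

    llm = ChatOpenAI(model="gpt-4", temperature=0)   # base LLM
    tools = [ShellTool()]                            # terminal access for the agent

    # ReAct loop: the framework alternates model "thoughts", tool calls, and observations.
    agent = initialize_agent(tools, llm,
                             agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
                             verbose=True)

    advisory = open("advisory.txt").read()           # public advisory text (hypothetical file)
    agent.run("Read this security advisory and reproduce the described issue "
              "against the authorized lab target:\n" + advisory)

The point of the framework is that the model never has to hold the whole attack in one response; each tool result is fed back in as a new observation until the agent decides it is done.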
The agent was tested on 15 known vulnerabilities in open source software (OSS). Among them: bugs affecting websites, containers, and Python packages. Eight carried high or critical CVE severity scores, and 11 had been disclosed after GPT-4's training cutoff, meaning the model was encountering them for the first time.
With only their security advisories to go on, the AI agent was tasked with exploiting each bug in turn. The results of this experiment painted a stark picture.
Of the 10 models evaluated — including GPT-3.5, Meta's Llama 2 Chat, and more — nine could not hack even a single vulnerability.
GPT-4, however, successfully exploited 13, or 87% of the total.
It failed only twice, for utterly mundane reasons. CVE-2024-25640, a 4.6 CVSS-rated issue in the Iris incident response platform, survived unscathed because of a quirk in the process of navigating the Iris app, which the model couldn't handle. Meanwhile, the researchers speculated that GPT-4 missed CVE-2023-51653, a 9.8 critical bug in the Hertzbeat monitoring tool, because its description is written in Chinese.
As Kang explains, "GPT-4 outperforms a wide range of other models on many tasks. This includes standard benchmarks (MMLU, etc.). It also seems that GPT-4 is much better at planning. Unfortunately, since OpenAI hasn't released the training details, we aren't sure why."
As threatening as malicious LLMs might be, Kang says, "At the moment, this doesn't unlock new capabilities an expert human couldn't do. As such, I think it's important for organizations to apply security best practices to avoid getting hacked, as these AI agents start to be used in more malicious ways."
If hackers start using LLM agents to automatically exploit public vulnerabilities, companies will no longer be able to sit back and wait to patch new bugs (if they ever could). And they may have to start adopting the same LLM technologies their adversaries do.
But even GPT-4 still has some ways to go before it's a perfect security assistant, warns Henrik Plate, security researcher for Endor Labs. In recent experiments, Plate tasked ChatGPT and Google's Vertex AI with identifying samples of OSS as malicious or benign, and with assigning them risk scores. GPT-4 outperformed all other models when it came to explaining source code and providing assessments of legible code, but all of the models yielded a number of false positives and false negatives.
Obfuscation, for example, was a big sticking point. "It looked to the LLM very often as if [the code] was deliberately obfuscated to make a manual review hard. But often it was just reduced in size for legitimate purposes," Plate explains.
Even though LLM-based assessment "should not be used instead of manual reviews," Plate wrote in one of his reports, "they can certainly be used as one additional signal and input for manual reviews. In particular, they can be useful to automatically review larger numbers of malware signals produced by noisy detectors (which otherwise risk being ignored entirely in case of limited review capabilities)."
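As a rough sketch of that "additional signal" idea (an assumed workflow, not Plate's actual tooling), flagged snippets from a noisy detector could be run through an LLM for a verdict and risk score, so human reviewers triage the riskiest hits first; the prompt, model choice, and scanner output below are hypothetical.

    # Illustrative triage sketch, not Endor Labs' implementation.
    # Assumes the openai v1 Python SDK and an OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()

    def llm_risk_assessment(snippet: str) -> str:
        """Ask the model for a benign/malicious verdict, a 0-10 risk score, and a short reason."""
        resp = client.chat.completions.create(
            model="gpt-4",
            temperature=0,
            messages=[
                {"role": "system",
                 "content": "You review open source code flagged by a malware scanner. "
                            "Reply with 'benign' or 'malicious', a risk score from 0 to 10, "
                            "and one sentence of reasoning."},
                {"role": "user", "content": snippet},
            ],
        )
        return resp.choices[0].message.content

    # Hypothetical scanner output: rank hits so reviewers see the riskiest candidates first.
    flagged = {"pkg_a/setup.py": "<snippet>", "pkg_b/index.js": "<snippet>"}
    for path, code in flagged.items():
        print(path, "->", llm_risk_assessment(code))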
