Mozilla: ChatGPT Can Be Manipulated Using Hex Code

  /     /     /  
Publicated : 23/11/2024   Category : security


Mozilla: ChatGPT Can Be Manipulated Using Hex Code


LLMs tend to miss the forest for the trees, understanding specific instructions but not their broader context. Bad actors can take advantage of this myopia to get them to do malicious things, with a new prompt-injection technique.



A new prompt-injection technique could allow anyone to bypass the safety guardrails in OpenAIs most advanced language learning model (LLM).
GPT-4o, released May 13, is faster, more efficient, and more multifunctional than any of the previous models underpinning
ChatGPT
. It can process multiple different forms of input data in dozens of languages, then spit out a response in milliseconds. It can engage in real-time conversations, analyze live camera feeds, and maintain an understanding of context over extended conversations with users. When it comes to user-generated content management, however, GPT-4o is in some ways still archaic.
Marco Figueroa, generative AI (GenAI) bug-bounty programs manager at Mozilla, demonstrated in a new report how bad actors can leverage the power of GPT-4o while skipping over its guardrails. The key is to essentially distract the model by
encoding malicious instructions in an unorthodox format
, and spread them out in distinct steps.
To prevent malicious abuse, GPT-4o analyzes user inputs for any signs of bad language, instructions with ill intent, etc.
But at the end of the day, Figueroa says, Its just word filters. Thats what Ive seen through experience, and we know exactly how to bypass these filters.
For example, he says, We can modify how somethings spelled out — break it up in certain ways — and the LLM interprets it. GPT-4o might not reject a malicious instruction if its presented with a spelling or phrasing that doesnt accord with typical natural language.
Figuring out
the exact right way to present information
in order to dupe state-of-the-art AI, though, requires lots of creative brain power. It turns out that theres a much simpler method for bypassing GPT-4os content filtering: encoding instructions in a format other than natural language.
To demonstrate, Figueroa arranged an experiment with the goal of getting ChatGPT to do something it otherwise shouldnt: write exploit code for a software vulnerability. He picked CVE-2024-41110, a bypass for authorization plug-ins in Docker that earned a critical 9.9 out of 10 rating in the Common Vulnerability Scoring System (CVSS) this summer.
To trick the model, he encoded his malicious input in hexadecimal format, and provided a set of instructions for decoding it. GPT-4o took that input — a long series of digits and letters A through F — and followed those instructions, ultimately decoding the message as an instruction to research CVE-2024-41110 and write a Python exploit for it. To make it less likely that the program would make a fuss over that instruction, he used some leet speak, asking for an 3xploit, instead of an exploit.
In a minute flat, ChatGPT generated a working exploit similar to, but not exactly like,
another PoC already published to GitHub
. Then, as a bonus, it attempted to execute the code against itself. There wasnt any instruction that specifically said to execute it. I just wanted to print it out. I didnt even know why it went ahead and did that, Figueroa says.
Its not just that GPT-4o is getting distracted by decoding, according to Figueroa, but that its in some sense missing the forest for the trees — a phenomenon that has been
documented in other prompt-injection techniques
lately.
The language model is designed to follow instructions step-by-step, but lacks deep context awareness to evaluate the safety of each individual step in the broader context of its ultimate goal, he wrote in the report. The model analyzes each input — which, on its own, doesnt immediately read as harmful — but not what the inputs produce in sum. Rather than stop and think about how instruction one bears on instruction two, it just charges ahead.
This compartmentalized execution of tasks allows attackers to exploit the models efficiency at following instructions without deeper analysis of the overall outcome, according to Figueroa.
If this is the case, ChatGPT will not only need to improve how it handles encoded information but also develop a kind of broader context around instructions split into distinct steps.
To Figueroa, though, OpenAI appears to have been valuing innovation at the cost of security when developing its programs. To me, they dont care. It just feels like that, he says. By contrast, hes had much more trouble trying the same jailbreaking tactics against models by Anthropic, another prominent AI company founded by former OpenAI employees. Anthropic has the strongest security because they have built both a prompt firewall [for analyzing inputs] and response filter [for analyzing outputs], so this becomes 10 times more difficult, he explains.
Dark Reading is awaiting comment from OpenAI on this story.

Last News

▸ Security pros top concern: Rogue employees, study finds. ◂
Discovered: 26/12/2024
Category: security

▸ Obama supports NSA Prism program, Google denies access point ◂
Discovered: 26/12/2024
Category: security

▸ Glasgow Council fined for weak security. ◂
Discovered: 26/12/2024
Category: security


Cyber Security Categories
Google Dorks Database
Exploits Vulnerability
Exploit Shellcodes

CVE List
Tools/Apps
News/Aarticles

Phishing Database
Deepfake Detection
Trends/Statistics & Live Infos



Tags:
Mozilla: ChatGPT Can Be Manipulated Using Hex Code