Mozilla: ChatGPT Can Be Manipulated Using Hex Code

LLMs tend to miss the forest for the trees, understanding specific instructions but not their broader context. Bad actors can take advantage of this myopia to get them to do malicious things, with a new prompt-injection technique.

A new prompt-injection technique could allow anyone to bypass the safety guardrails in OpenAIs most advanced language learning model (LLM).
GPT-4o, released May 13, is faster, more efficient, and more multifunctional than any of the previous models underpinning
ChatGPT
. It can process multiple different forms of input data in dozens of languages, then spit out a response in milliseconds. It can engage in real-time conversations, analyze live camera feeds, and maintain an understanding of context over extended conversations with users. When it comes to user-generated content management, however, GPT-4o is in some ways still archaic.
Marco Figueroa, generative AI (GenAI) bug-bounty programs manager at Mozilla, demonstrated in a new report how bad actors can leverage the power of GPT-4o while skipping over its guardrails. The key is to essentially distract the model by
encoding malicious instructions in an unorthodox format
, and spread them out in distinct steps.
To prevent malicious abuse, GPT-4o analyzes user inputs for any signs of bad language, instructions with ill intent, etc.
But at the end of the day, Figueroa says, Its just word filters. Thats what Ive seen through experience, and we know exactly how to bypass these filters.
For example, he says, We can modify how somethings spelled out — break it up in certain ways — and the LLM interprets it. GPT-4o might not reject a malicious instruction if its presented with a spelling or phrasing that doesnt accord with typical natural language.
Figuring out
the exact right way to present information
in order to dupe state-of-the-art AI, though, requires lots of creative brain power. It turns out that theres a much simpler method for bypassing GPT-4os content filtering: encoding instructions in a format other than natural language.
To demonstrate, Figueroa arranged an experiment with the goal of getting ChatGPT to do something it otherwise shouldnt: write exploit code for a software vulnerability. He picked CVE-2024-41110, a bypass for authorization plug-ins in Docker that earned a critical 9.9 out of 10 rating in the Common Vulnerability Scoring System (CVSS) this summer.
To trick the model, he encoded his malicious input in hexadecimal format, and provided a set of instructions for decoding it. GPT-4o took that input — a long series of digits and letters A through F — and followed those instructions, ultimately decoding the message as an instruction to research CVE-2024-41110 and write a Python exploit for it. To make it less likely that the program would make a fuss over that instruction, he used some leet speak, asking for an 3xploit, instead of an exploit.
In a minute flat, ChatGPT generated a working exploit similar to, but not exactly like,
another PoC already published to GitHub
. Then, as a bonus, it attempted to execute the code against itself. There wasnt any instruction that specifically said to execute it. I just wanted to print it out. I didnt even know why it went ahead and did that, Figueroa says.
Its not just that GPT-4o is getting distracted by decoding, according to Figueroa, but that its in some sense missing the forest for the trees — a phenomenon that has been
documented in other prompt-injection techniques
lately.
The language model is designed to follow instructions step-by-step, but lacks deep context awareness to evaluate the safety of each individual step in the broader context of its ultimate goal, he wrote in the report. The model analyzes each input — which, on its own, doesnt immediately read as harmful — but not what the inputs produce in sum. Rather than stop and think about how instruction one bears on instruction two, it just charges ahead.
This compartmentalized execution of tasks allows attackers to exploit the models efficiency at following instructions without deeper analysis of the overall outcome, according to Figueroa.
If this is the case, ChatGPT will not only need to improve how it handles encoded information but also develop a kind of broader context around instructions split into distinct steps.
To Figueroa, though, OpenAI appears to have been valuing innovation at the cost of security when developing its programs. To me, they dont care. It just feels like that, he says. By contrast, hes had much more trouble trying the same jailbreaking tactics against models by Anthropic, another prominent AI company founded by former OpenAI employees. Anthropic has the strongest security because they have built both a prompt firewall [for analyzing inputs] and response filter [for analyzing outputs], so this becomes 10 times more difficult, he explains.
Dark Reading is awaiting comment from OpenAI on this story.

Last News
▸ ArcSight prepares for future at user conference post HP acquisition. ◂ Discovered: 07/01/2025 Category: security	▸ Samsung Epic 4G: First To Use Media Hub ◂ Discovered: 07/01/2025 Category: security	▸ Many third-party software fails security tests ◂ Discovered: 07/01/2025 Category: security

**Cyber Security Categories**
Google Dorks Database	Exploits Vulnerability	Exploit Shellcodes

CVE List

Tools/Apps

News/Aarticles

Phishing Database

Deepfake Detection

Trends/Statistics & Live Infos

Tags:
Mozilla: ChatGPT Can Be Manipulated Using Hex Code