Flowchart images trick GPT-4o into producing harmful text outputs

A flowchart attack for AI

A new study entitled "Image-to-Text Logic Jailbreak: Your Imagination Can Help You Do Anything" has found that visual language models, like GPT-4o, can be tricked into producing harmful text outputs by feeding them a flowchart image depicting a harmful activity alongside a text prompt asking for details about the process.

The researchers found that GPT-4o, probably the most popular visual language model, is particularly susceptible to this so-called logic jailbreak, with a 92.8% attack success rate. They said that GPT-4-vision-preview was safer, with a success rate of just 70%.

The researchers developed an automated text-to-text jailbreak framework that first generates a flowchart image from a harmful text prompt and then feeds that image to a visual language model to elicit a harmful output. This method had one drawback, though: AI-created flowcharts are less effective at triggering the logic jailbreak than hand-crafted ones. This suggests the jailbreak could be more difficult to automate.

The findings of this study echo another study that Neowin reported on, which found that visual language models were susceptible to producing harmful outputs when provided with multimodal inputs, such as a picture and text together.

The authors of that paper developed a new benchmark called the Safe Inputs but Unsafe Output (SIUO) benchmark. Only a few models, including GPT-4o, scored above 50% on the benchmark (higher is better), but all had a very long way to go.

Visual language models like GPT-4o and Google Gemini are becoming more widespread offerings from different AI companies. For the time being, GPT-4o still caps the number of image inputs allowed per day. Still, as these limits are relaxed, AI firms will have to tighten up the safety of these multimodal models to avoid the scrutiny of governments, which have already established AI safety organisations.
