BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks

Yunhan Zhao¹, Xiang Zheng², Lin Luo¹, Yige Li³, Xingjun Ma¹, Yu-Gang Jiang¹

¹Fudan University ²City University of Hong Kong ³Singapore Management University

Overview

In this work, we propose a novel defense framework named BlueSuffix that leverages both unimodal and bimodal techniques to safeguard VLMs under a black-box defense setting. Our main contributions are:

We propose a novel blue-team framework, BlueSuffix, providing a plug-and-play, model-agnostic, and generic solution for blue-teaming VLMs under a black-box setting, enabling seamless integration and extension of existing techniques.
In BlueSuffix, we propose a cross-modal optimization method that fine-tunes the blue-team suffix generator through reinforcement learning, incorporating an LLM-based text purifier and a diffusion-based image purifier.
We empirically demonstrate the effectiveness of BlueSuffix, which reduces the attack success rate (ASR) of a state-of-the-art attack by ~70% on open-source VLMs and by ~50% on commercial VLMs.

Our BlueSuffix opens up a promising direction for defending VLMs against jailbreak attacks.

An illustration of BlueSuffix

Caption: A pair of image-text jailbreak prompts (left) can induce the target VLM to output harmful content (top right). However, the same prompts, once purified and suffixed by BlueSuffix (middle), lose their adversarial property (bottom right).

BlueSuffix

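Caption: An overview of BlueSuffix and its three key components: 1) a diffusion-based image purifier to defend the visual input against potential (universal) adversarial perturbation(s); 2) an LLM-based text purifier for the textual input; and 3) a blue-team suffix generator fine-tuned through reinforcement learning.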
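
The plug-and-play composition of the three components can be sketched as follows. This is a minimal illustration, not the released implementation: the class name `BlueSuffixPipeline`, the helper `query_with_defense`, and the `Image` placeholder type are hypothetical, each component is an arbitrary callable, and we assume the suffix is generated from the purified text and appended to it. The target VLM is only ever called as a black box.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Placeholder image type for this sketch; a real system would use tensors or PIL images.
Image = bytes


@dataclass
class BlueSuffixPipeline:
    """Hypothetical wrapper composing the three defense components."""
    image_purifier: Callable[[Image], Image]   # e.g. a diffusion-based purifier
    text_purifier: Callable[[str], str]        # e.g. an LLM-based rewriter
    suffix_generator: Callable[[str], str]     # e.g. an RL-fine-tuned generator

    def defend(self, image: Image, text: str) -> Tuple[Image, str]:
        # Purify each modality independently.
        clean_image = self.image_purifier(image)
        clean_text = self.text_purifier(text)
        # Assumption: the suffix is conditioned on the purified text and appended to it.
        suffix = self.suffix_generator(clean_text)
        return clean_image, f"{clean_text} {suffix}"


def query_with_defense(
    pipeline: BlueSuffixPipeline,
    target_vlm: Callable[[Image, str], str],
    image: Image,
    text: str,
) -> str:
    """Query the target VLM as a black box with the defended inputs."""
    clean_image, defended_text = pipeline.defend(image, text)
    return target_vlm(clean_image, defended_text)
```

Because every component is just a callable, any one of them can be swapped for an existing purifier or generator without touching the target model.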

Showcasing the Purified Prompts

Caption: Six example inputs (three jailbreak, three benign) purified by BlueSuffix. The input images appear almost unchanged after purification by our image purifier. The rewritten texts are more detailed, with many questions built around the key concepts in the original texts. The blue suffixes provide a hint or reminder for the target VLM, and the suffixes produced by our suffix generator are highly diverse.

Experimental Results

Caption: The ASR (%) achieved by different defense methods against various attacks (first column). A lower ASR denotes better defense performance.
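
For readers unfamiliar with the metric, the ASR reported in the table can be computed as the percentage of jailbreak prompts whose responses are judged harmful. The sketch below assumes a hypothetical `is_harmful` judge (for example, an LLM-based or rule-based classifier); the function name `attack_success_rate` and the judge are illustrative and do not describe the evaluation protocol used in the paper.

```python
from typing import Callable, Sequence


def attack_success_rate(
    responses: Sequence[str],
    is_harmful: Callable[[str], bool],
) -> float:
    """Return the ASR in percent: the share of responses judged harmful."""
    if not responses:
        raise ValueError("at least one response is required")
    # Count responses the (hypothetical) judge flags as harmful.
    successes = sum(1 for response in responses if is_harmful(response))
    return 100.0 * successes / len(responses)
```

A lower value after applying a defense means fewer jailbreak prompts succeeded.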

Examples of BlueSuffix

Caption: Examples of our BlueSuffix defense. The image-text jailbreak prompts (top) are purified by BlueSuffix (middle), and the target VLM responds with benign content (bottom).

Caption: Examples of our BlueSuffix on benign prompts. The image-text benign prompts (top) are processed by our BlueSuffix (middle), allowing the target VLM to respond to the questions normally. Notably, our suffix generator produces positive suffixes that guide the target VLM in answering the questions.

BibTeX


  @inproceedings{zhao2025bluesuffix,
    title={BlueSuffix: Reinforced Blue Teaming for Vision-Language Models Against Jailbreak Attacks},
    author={Yunhan Zhao and Xiang Zheng and Lin Luo and Yige Li and Xingjun Ma and Yu-Gang Jiang},
    booktitle={ICLR},
    year={2025}
  }