Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?
We present the first systematic evaluation of jailbreak attacks targeting LLM safety alignment, assessing their success across the full inference pipeline, including both input and output filtering stages.
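The evaluation described above treats deployed LLM safety as a pipeline: a jailbreak prompt must first pass an input-side content filter, then elicit a harmful response from the target model, and finally have that response survive an output-side content filter. The sketch below is not the authors' code; it is a minimal, hypothetical illustration of such a pipeline, where `input_filter`, `target_llm`, `output_filter`, and `is_harmful` are placeholder callables standing in for whatever moderation models, target LLM, and harmfulness judge a concrete evaluation would plug in.

```python
from dataclasses import dataclass


@dataclass
class PipelineResult:
    blocked_at: str | None  # "input", "output", or None if nothing fired
    response: str | None    # model output that reached the user, if any


def run_pipeline(prompt, input_filter, target_llm, output_filter) -> PipelineResult:
    """Pass a prompt through input filtering, the model, then output filtering."""
    if input_filter(prompt):                       # input stage flags the prompt
        return PipelineResult(blocked_at="input", response=None)
    reply = target_llm(prompt)                     # prompt reaches the model
    if output_filter(reply):                       # output stage flags the reply
        return PipelineResult(blocked_at="output", response=None)
    return PipelineResult(blocked_at=None, response=reply)


def attack_success_rate(jailbreak_prompts, input_filter, target_llm,
                        output_filter, is_harmful) -> float:
    """Fraction of jailbreak prompts that evade both filters and yield harmful output."""
    successes = 0
    for prompt in jailbreak_prompts:
        result = run_pipeline(prompt, input_filter, target_llm, output_filter)
        if result.response is not None and is_harmful(result.response):
            successes += 1
    return successes / max(len(jailbreak_prompts), 1)
```

Measuring success at each stage separately (blocked at input, blocked at output, or fully evading the pipeline) is what distinguishes a full-pipeline evaluation from one that tests the aligned model in isolation.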
Yuan Xin, Dingfan Chen, Linyi Yang, Michael Backes, Xiao Zhang
PDF · Cite · ArXiv