Jailbreaking Attacks vs. Content Safety Filters: How Far Are We in the LLM Safety Arms Race?
We present the first systematic evaluation of jailbreak attacks targeting LLM safety alignment, assessing their success across the full inference pipeline, including both input and output filtering stages.
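The evaluation described above treats deployed LLM safety as a pipeline: a jailbreak prompt must first pass an input-side content filter, then elicit a harmful response from the target model, and finally have that response survive an output-side content filter. The sketch below is not the authors' code; it is a minimal, hypothetical illustration of such a pipeline, where `input_filter`, `target_llm`, `output_filter`, and `is_harmful` are placeholder callables standing in for whatever moderation models, target LLM, and harmfulness judge a concrete evaluation would plug in.

```python
from dataclasses import dataclass


@dataclass
class PipelineResult:
    blocked_at: str | None  # "input", "output", or None if nothing fired
    response: str | None    # model output that reached the user, if any


def run_pipeline(prompt, input_filter, target_llm, output_filter) -> PipelineResult:
    """Pass a prompt through input filtering, the model, then output filtering."""
    if input_filter(prompt):                       # input stage flags the prompt
        return PipelineResult(blocked_at="input", response=None)
    reply = target_llm(prompt)                     # prompt reaches the model
    if output_filter(reply):                       # output stage flags the reply
        return PipelineResult(blocked_at="output", response=None)
    return PipelineResult(blocked_at=None, response=reply)


def attack_success_rate(jailbreak_prompts, input_filter, target_llm,
                        output_filter, is_harmful) -> float:
    """Fraction of jailbreak prompts that evade both filters and yield harmful output."""
    successes = 0
    for prompt in jailbreak_prompts:
        result = run_pipeline(prompt, input_filter, target_llm, output_filter)
        if result.response is not None and is_harmful(result.response):
            successes += 1
    return successes / max(len(jailbreak_prompts), 1)
```

Measuring success at each stage separately (blocked at input, blocked at output, or fully evading the pipeline) is what distinguishes a full-pipeline evaluation from one that tests the aligned model in isolation.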
Yuan Xin, Dingfan Chen, Linyi Yang, Michael Backes, Xiao Zhang
PDF · Cite · ArXiv