Understanding Black-Box Algorithm Attacks on AI Systems
Best-of-N Jailbreaking: A Simple Black-Box Algorithm Attack on Cutting-Edge AI Systems
https://arxiv.org/abs/2412.03556

Best-of-N (BoN) Jailbreaking is a simple black-box algorithm that can attack a range of cutting-edge AI systems across modalities. It works by repeatedly sampling augmented variations of a prompt (for text prompts, random shuffling or capitalization) until a harmful response is elicited. Research has …
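
The sample-and-augment loop described above can be sketched roughly as follows, for instance as part of a robustness test harness. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `augment`, `best_of_n`, `query_model`, and `is_flagged`, and the probabilities `swap_prob` and `caps_prob`, are all hypothetical, and the model and the response classifier are passed in as plain callables so that nothing here depends on a real API.

```python
import random
from typing import Callable, Optional, Tuple


def augment(prompt: str, rng: random.Random,
            swap_prob: float = 0.3, caps_prob: float = 0.5) -> str:
    """Return a randomly perturbed copy of a text prompt (hypothetical helper)."""
    words = []
    for word in prompt.split():
        chars = list(word)
        # With probability swap_prob, shuffle the interior characters of
        # longer words, keeping the first and last character in place.
        if len(chars) > 3 and rng.random() < swap_prob:
            middle = chars[1:-1]
            rng.shuffle(middle)
            chars = [chars[0]] + middle + [chars[-1]]
        # Randomly set each character to upper or lower case.
        chars = [c.upper() if rng.random() < caps_prob else c.lower()
                 for c in chars]
        words.append("".join(chars))
    return " ".join(words)


def best_of_n(prompt: str,
              query_model: Callable[[str], str],
              is_flagged: Callable[[str], bool],
              n: int = 100,
              seed: int = 0) -> Optional[Tuple[int, str, str]]:
    """Sample up to n augmented prompts; stop at the first flagged response.

    query_model and is_flagged are assumptions standing in for the system
    under test and for whatever classifier judges its responses.
    Returns (attempt number, augmented prompt, response), or None if no
    response was flagged within n attempts.
    """
    rng = random.Random(seed)
    for attempt in range(1, n + 1):
        candidate = augment(prompt, rng)
        response = query_model(candidate)
        if is_flagged(response):
            return attempt, candidate, response
    return None
```

The loop is black-box in the sense that it only needs the model's text output: it never inspects weights, gradients, or log-probabilities, which is why the same outer structure could in principle be reused for other modalities by swapping in a different `augment` function.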