Many-shot Jailbreaking
Cem Anil, Esin Durmus, 31 Authors, D. Duvenaud
TLDR
This work investigates a family of simple long-context attacks on large language models, namely prompting with hundreds of demonstrations of undesirable behavior, and suggests that very long contexts present a rich new attack surface for LLMs.
Abstract
We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This is newly feasible with the larger context windows recently deployed by Anthropic, OpenAI and Google DeepMind. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.
