Many-shot Jailbreaking

Cem Anil, Esin Durmus, 31 Authors, D. Duvenaud

2024 · DOI: 10.52202/079017-4121
Neural Information Processing Systems · 188 Citations

TLDR

This work investigates a family of simple long-context attacks on large language models, namely prompting with hundreds of demonstrations of undesirable behavior, and suggests that very long contexts present a rich new attack surface for LLMs.

Abstract

We investigate a family of simple long-context attacks on large language models: prompting with hundreds of demonstrations of undesirable behavior. This is newly feasible with the larger context windows recently deployed by Anthropic, OpenAI and Google DeepMind. We find that in diverse, realistic circumstances, the effectiveness of this attack follows a power law, up to hundreds of shots. We demonstrate the success of this attack on the most widely used state-of-the-art closed-weight models, and across various tasks. Our results suggest very long contexts present a rich new attack surface for LLMs.