Implement LLM Jailbreak Attack #181

Open
deprit opened this issue Oct 15, 2024 · 0 comments
deprit commented Oct 15, 2024

We would like to implement the LLM jailbreak attack outlined in "Attacking Large Language Models with Projected Gradient Descent" by Geisler et al. Evaluating this evasion attack in Armory Library requires the steps below.

  • Implement the PGD attack described in Algorithms 1, 2, and 3 (a simplified sketch of the core step follows this list)
  • Select an open-source LLM target model
  • Implement the flexible sequence-length relaxation, which requires modifying the attention layers
  • Evaluate the PGD attack on a jailbreak dataset
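
A minimal sketch of the core PGD step, assuming a PyTorch, Hugging Face-style causal LM: the adversarial suffix is relaxed to rows on the probability simplex over the vocabulary, one gradient step is taken on the cross-entropy toward a target continuation, and the result is projected back onto the simplex. The function names (`simplex_projection`, `pgd_step`) and arguments are illustrative; the entropy projection and flexible sequence-length relaxation from Algorithms 1–3 are omitted here.

```python
import torch
import torch.nn.functional as F

def simplex_projection(x):
    """Euclidean projection of each row of x onto the probability simplex."""
    sorted_x, _ = torch.sort(x, dim=-1, descending=True)
    cumsum = sorted_x.cumsum(dim=-1)
    k = torch.arange(1, x.size(-1) + 1, device=x.device)
    cond = sorted_x - (cumsum - 1) / k > 0
    rho = cond.float().cumsum(dim=-1).argmax(dim=-1, keepdim=True)
    theta = (cumsum.gather(-1, rho) - 1) / (rho + 1).float()
    return torch.clamp(x - theta, min=0.0)

def pgd_step(model, embed_matrix, prompt_embeds, suffix_simplex, target_ids, lr=0.1):
    """One projected-gradient step on the relaxed adversarial suffix.

    embed_matrix:   model.get_input_embeddings().weight, shape (vocab, d_model)
    prompt_embeds:  embeddings of the fixed prompt, shape (prompt_len, d_model)
    suffix_simplex: relaxed suffix, shape (suffix_len, vocab), rows on the simplex
    target_ids:     token ids of the desired (affirmative) continuation
    """
    suffix_simplex = suffix_simplex.detach().requires_grad_(True)
    # Relaxed suffix embeddings: convex combinations of the vocabulary embeddings.
    suffix_embeds = suffix_simplex @ embed_matrix
    target_embeds = embed_matrix[target_ids]
    inputs = torch.cat([prompt_embeds, suffix_embeds, target_embeds], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits[0]
    # Positions that predict the target tokens (shifted by one).
    n_tgt = target_ids.numel()
    loss = F.cross_entropy(logits[-n_tgt - 1:-1], target_ids)
    loss.backward()
    with torch.no_grad():
        updated = suffix_simplex - lr * suffix_simplex.grad
    return simplex_projection(updated), loss.item()
```

After the attack converges, the relaxed suffix would be discretized (e.g., per-row argmax) back to tokens before querying the target model.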

The authors were contacted about source code but have not yet responded; an unverified implementation is available from Dreadnode.

The AdvBench dataset introduced in "Universal and Transferable Adversarial Attacks on Aligned Language Models" could serve as the jailbreak dataset.
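
For the evaluation step, a minimal sketch of loading AdvBench behavior/target pairs, assuming the harmful_behaviors.csv layout (with "goal" and "target" columns) distributed with the llm-attacks repository; the file path is a placeholder.

```python
import csv

def load_advbench(path="harmful_behaviors.csv"):
    """Return (goal, target) pairs for jailbreak evaluation."""
    with open(path, newline="") as f:
        return [(row["goal"], row["target"]) for row in csv.DictReader(f)]
```

Each goal would serve as the prompt and each target as the affirmative continuation optimized for by the PGD attack sketched above.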

deprit added the enhancement (New feature or request) label on Oct 15, 2024