Implement LLM Jailbreak Attack #181

Open
deprit opened this issue Oct 15, 2024 · 0 comments
deprit commented Oct 15, 2024

We would like to implement the LLM jailbreak attack outlined in "Attacking Large Language Models with Projected Gradient Descent" by Geisler et al. Evaluating this evasion attack in Armory Library requires the steps below.

  • Implement the PGD attack described in Algorithms 1, 2, and 3 (a simplified sketch of the core step follows this list)
  • Select an open-source LLM target model
  • Implement the flexible sequence-length relaxation, which requires modifying the attention layers
  • Evaluate the PGD attack on a jailbreak dataset
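
A minimal sketch of the core PGD step, assuming a PyTorch, Hugging Face-style causal LM: the adversarial suffix is relaxed to rows on the probability simplex over the vocabulary, one gradient step is taken on the cross-entropy toward a target continuation, and the result is projected back onto the simplex. The function names (`simplex_projection`, `pgd_step`) and arguments are illustrative; the entropy projection and flexible sequence-length relaxation from Algorithms 1–3 are omitted here.

```python
import torch
import torch.nn.functional as F

def simplex_projection(x):
    """Euclidean projection of each row of x onto the probability simplex."""
    sorted_x, _ = torch.sort(x, dim=-1, descending=True)
    cumsum = sorted_x.cumsum(dim=-1)
    k = torch.arange(1, x.size(-1) + 1, device=x.device)
    cond = sorted_x - (cumsum - 1) / k > 0
    rho = cond.float().cumsum(dim=-1).argmax(dim=-1, keepdim=True)
    theta = (cumsum.gather(-1, rho) - 1) / (rho + 1).float()
    return torch.clamp(x - theta, min=0.0)

def pgd_step(model, embed_matrix, prompt_embeds, suffix_simplex, target_ids, lr=0.1):
    """One projected-gradient step on the relaxed adversarial suffix.

    embed_matrix:   model.get_input_embeddings().weight, shape (vocab, d_model)
    prompt_embeds:  embeddings of the fixed prompt, shape (prompt_len, d_model)
    suffix_simplex: relaxed suffix, shape (suffix_len, vocab), rows on the simplex
    target_ids:     token ids of the desired (affirmative) continuation
    """
    suffix_simplex = suffix_simplex.detach().requires_grad_(True)
    # Relaxed suffix embeddings: convex combinations of the vocabulary embeddings.
    suffix_embeds = suffix_simplex @ embed_matrix
    target_embeds = embed_matrix[target_ids]
    inputs = torch.cat([prompt_embeds, suffix_embeds, target_embeds], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=inputs).logits[0]
    # Positions that predict the target tokens (shifted by one).
    n_tgt = target_ids.numel()
    loss = F.cross_entropy(logits[-n_tgt - 1:-1], target_ids)
    loss.backward()
    with torch.no_grad():
        updated = suffix_simplex - lr * suffix_simplex.grad
    return simplex_projection(updated), loss.item()
```

After the attack converges, the relaxed suffix would be discretized (e.g., per-row argmax) back to tokens before querying the target model.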

The authors were contacted about source code but have not yet responded; an unverified implementation is available from Dreadnode.

The AdvBench dataset introduced in "Universal and Transferable Adversarial Attacks on Aligned Language Models" could serve as the jailbreak dataset.
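
For the evaluation step, a minimal sketch of loading AdvBench behavior/target pairs, assuming the harmful_behaviors.csv layout (with "goal" and "target" columns) distributed with the llm-attacks repository; the file path is a placeholder.

```python
import csv

def load_advbench(path="harmful_behaviors.csv"):
    """Return (goal, target) pairs for jailbreak evaluation."""
    with open(path, newline="") as f:
        return [(row["goal"], row["target"]) for row in csv.DictReader(f)]
```

Each goal would serve as the prompt and each target as the affirmative continuation optimized for by the PGD attack sketched above.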

deprit added the enhancement (New feature or request) label on Oct 15, 2024