PrimeGuard: Safe and Helpful LLMs through Tuning-Free Routing
Abstract
Deploying language models (LMs) requires outputs that are both high-quality and compliant with safety guidelines. Although Inference-Time Guardrails (ITG) offer solutions that shift model output distributions toward compliance, we find that current methods struggle to balance safety with helpfulness. ITG methods that safely address non-compliant queries exhibit lower helpfulness, while those that prioritize helpfulness compromise on safety. We refer to this trade-off as the guardrail tax, analogous to the alignment tax. To address this, we propose PrimeGuard, a novel ITG method that utilizes structured control flow. PrimeGuard routes requests to different self-instantiations of the LM with varying instructions, leveraging its inherent instruction-following capabilities and in-context learning. Our tuning-free approach dynamically compiles system-designer guidelines for each query. We construct and release safe-eval, a diverse red-team safety benchmark. Extensive evaluations demonstrate that PrimeGuard, without fine-tuning, outperforms all competing baselines and overcomes the guardrail tax by (1) significantly increasing resistance to iterative jailbreak attacks, (2) achieving state-of-the-art results in safety guardrailing, and (3) matching the helpfulness scores of alignment-tuned models. On the largest models, PrimeGuard improves the fraction of safe responses from 61% to 97%, increases average helpfulness scores from 4.17 to 4.29, and reduces the attack success rate from 100% to 8%. The PrimeGuard implementation is available at https://github.com/dynamofl/PrimeGuard and the safe-eval dataset is available at https://huggingface.co/datasets/dynamoai/safe_eval.
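As a rough illustration of the tuning-free routing idea described in the abstract, the sketch below sends each query through a self-evaluation step and then re-prompts the same model under different instructions depending on the assessed risk. All prompts, routing labels, and the `complete` callable are hypothetical placeholders for a generic chat-completion backend; this is not the PrimeGuard implementation, for which see the linked GitHub repository.

```python
# Minimal sketch of inference-time guardrailing via routing (illustrative only).
from typing import Callable

ROUTER_PROMPT = """You are a safety router. Given the system guidelines and a
user query, answer with exactly one label:
- "no_to_minimal_risk": the query can be answered directly
- "potential_violation": the query needs a guideline-constrained answer
- "direct_violation": the query must be refused

Guidelines:
{guidelines}

User query:
{query}

Label:"""


def route_and_respond(
    query: str,
    guidelines: str,
    complete: Callable[[str], str],  # wrapper around any LLM completion API
) -> str:
    """Route the query to a differently instructed instantiation of the same LM."""
    label = complete(
        ROUTER_PROMPT.format(guidelines=guidelines, query=query)
    ).strip().lower()

    if "direct_violation" in label:
        # Refusal branch: the model declines and briefly explains why.
        return complete(
            f"The following request violates these guidelines:\n{guidelines}\n"
            f"Politely refuse and briefly explain why.\nRequest: {query}"
        )
    if "potential_violation" in label:
        # Constrained branch: answer while explicitly conditioning on the guidelines.
        return complete(
            f"Answer the request helpfully while strictly following these "
            f"guidelines:\n{guidelines}\nRequest: {query}"
        )
    # Low-risk branch: answer normally for maximum helpfulness.
    return complete(query)
```

In this sketch the routing decision and both response branches reuse the same underlying model, relying only on its instruction-following ability rather than any fine-tuning, which mirrors the structured control flow the paper describes at a high level.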
Community
PrimeGuard 🤺 is a novel Inference-Time Guardrailing (ITG) approach that outperforms all competing baselines on both safety and helpfulness. Throughout our extensive experiments, we found that PrimeGuard significantly reduces the trade-off between AI safety and performance, making it a powerful option for productionizing enterprise-grade AI solutions in compliance with emerging regulation.
Presented at the ICML 2024 NextGenAISafety workshop.
Hi @eliolio thanks for publishing your artifacts on the hub!
Would be great to link the dataset to this paper, see here on how to do that: https://huggingface.co/docs/hub/en/datasets-cards#linking-a-paper.
Cheers,
Niels
Open-source @ HF
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training (2024)
- SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models (2024)
- Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations (2024)
- CSRT: Evaluation and Analysis of LLMs using Code-Switching Red-Teaming Dataset (2024)
- Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (2024)