Almost Surely Safe Alignment of Large Language Models at Inference-Time
Abstract
Even highly capable large language models (LLMs) can produce biased or unsafe responses, and alignment techniques aimed at mitigating this issue, such as RLHF, are expensive and prone to overfitting because they retrain the LLM. This paper introduces a novel inference-time alignment approach that ensures LLMs generate safe responses almost surely, i.e., with a probability approaching one. We achieve this by framing the safe generation of inference-time responses as a constrained Markov decision process (MDP) within the LLM's latent space. Crucially, we augment the state with a safety state that tracks the evolution of the safety constraints, which enables us to establish formal safety guarantees upon solving the MDP in the latent space. Building on this foundation, we propose InferenceGuard, a practical implementation that safely aligns LLMs without modifying the model weights. Empirically, we demonstrate that InferenceGuard effectively balances safety and task performance, outperforming existing inference-time alignment methods in generating safe and aligned responses.
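To make the framing above concrete, here is a schematic formulation of a safety-augmented constrained MDP in the spirit of the abstract; the notation (latent state o_t, token a_t, task reward r, safety cost c, budget d, remaining-budget tracker z_t) is our own illustrative choice, not necessarily the paper's:

```latex
% Schematic only: a constrained MDP over the LLM's latent space, with the state
% augmented by a safety tracker z_t. All symbols below are assumed notation.
\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T} r(o_t, a_t)\right]
\quad \text{s.t.} \quad
\Pr_{\pi}\!\left(\sum_{t=0}^{T} c(o_t, a_t) \le d\right) = 1,
\qquad
z_{t+1} = z_t - c(o_t, a_t), \quad z_0 = d .
```

In this sketch, augmenting the state from o_t to the pair (o_t, z_t) folds the cumulative safety constraint into the Markov state: keeping z_T \ge 0 at termination is equivalent to satisfying the constraint, which is what lets almost-sure guarantees be stated for policies computed on the augmented MDP.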
Community
We developed a method that ensures almost-sure safety (i.e., safety with probability approaching 1) and proved this result. We then present a practical implementation, which we call InferenceGuard. InferenceGuard achieves impressive practical results: 91.04% safety on Alpaca-7B and 100% on Beaver 7B-v3.
Of course, it is easy to get high safety numbers like these with a degenerate model, e.g., one that simply refuses to answer or only emits EOS. Our goal, however, is not just safe outputs but also high rewards - we want a good trade-off between safety and rewards, and that is exactly what we show: InferenceGuard achieves it.
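As a purely illustrative sketch of that trade-off (this is not the InferenceGuard algorithm; `generate_candidates`, `reward_model`, and `safety_cost` are hypothetical stand-ins for whatever frozen sampler and scorers are available), an inference-time selector can maximize reward only among candidates whose safety cost stays within a budget:

```python
from typing import Callable, List, Tuple


def constrained_best_of_n(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],  # frozen LLM sampler (hypothetical)
    reward_model: Callable[[str, str], float],              # task-reward scorer (hypothetical)
    safety_cost: Callable[[str, str], float],               # safety cost, lower is safer (hypothetical)
    n: int = 16,
    safety_budget: float = 0.0,
) -> Tuple[str, float, float]:
    """Return (response, reward, cost), maximizing reward subject to a safety budget.

    Illustrative only: a constrained best-of-n baseline for inference-time
    alignment without weight updates, not the InferenceGuard algorithm.
    """
    candidates = generate_candidates(prompt, n)
    scored = [(c, reward_model(prompt, c), safety_cost(prompt, c)) for c in candidates]

    # Keep only candidates whose safety cost stays within the budget.
    safe = [item for item in scored if item[2] <= safety_budget]

    if safe:
        # Among safe candidates, pick the highest task reward: the safety/reward trade-off.
        return max(safe, key=lambda item: item[1])
    # If nothing meets the budget, fall back to the least-unsafe candidate.
    return min(scored, key=lambda item: item[2])
```

A filter like this leaves the base model's weights untouched and shows why safety alone is not enough: the interesting part is returning high-reward responses from the safe set. It offers only sample-level filtering, though; the almost-sure guarantee claimed in the paper comes from solving the safety-augmented MDP, not from best-of-n selection.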
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Latent-space adversarial training with post-aware calibration for defending large language models against jailbreak attacks (2025)
- Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models (2024)
- ASTRAL: Automated Safety Testing of Large Language Models (2025)
- Internal Activation Revision: Safeguarding Vision Language Models Without Parameter Update (2025)
- Spot Risks Before Speaking! Unraveling Safety Attention Heads in Large Vision-Language Models (2025)
- Few-shot Steerable Alignment: Adapting Rewards and LLM Policies with Neural Processes (2024)
- GReaTer: Gradients over Reasoning Makes Smaller Language Models Strong Prompt Optimizers (2024)