Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Abstract
Ensuring alignment, which refers to making models behave in accordance with human intentions [1,2], has become a critical task before deploying large language models (LLMs) in real-world applications. For instance, OpenAI devoted six months to iteratively aligning GPT-4 before its release [3]. However, a major challenge faced by practitioners is the lack of clear guidance on evaluating whether LLM outputs align with social norms, values, and regulations. This obstacle hinders systematic iteration and deployment of LLMs. To address this issue, this paper presents a comprehensive survey of key dimensions that are crucial to consider when assessing LLM trustworthiness. The survey covers seven major categories of LLM trustworthiness: reliability, safety, fairness, resistance to misuse, explainability and reasoning, adherence to social norms, and robustness. Each major category is further divided into several sub-categories, resulting in a total of 29 sub-categories. Additionally, a subset of 8 sub-categories is selected for further investigation, where corresponding measurement studies are designed and conducted on several widely-used LLMs. The measurement results indicate that, in general, more aligned models tend to perform better in terms of overall trustworthiness. However, the effectiveness of alignment varies across the different trustworthiness categories considered. This highlights the importance of conducting more fine-grained analyses, testing, and making continuous improvements on LLM alignment. By shedding light on these key dimensions of LLM trustworthiness, this paper aims to provide valuable insights and guidance to practitioners in the field. Understanding and addressing these concerns will be crucial in achieving reliable and ethically sound deployment of LLMs in various applications.
Community
I am sure it is unintended, but this paper reads like a bad joke. If not a joke, then it is a terrible sign in itself and of things to come.
Surely ISIS would not use an LLM to generate recruitment material (that one, inter alia, was a belly laugh). I would imagine such an LLM, being necessarily flooded with trillions upon trillions of porn references, would be haram to your average ISIS recruiter. The irony is that the authors of the paper are in moral alignment with ISIS in many respects. Did they not see that? The authors are advocating making LLMs halal, as it were.
What will we end up with? Well, it will be George Orwell's nightmare, seeing that we have already identified hundreds of censorship sub-categories!
The paper itself is one of the worst examples of social engineering on a global scale if it is successful.
A person could be forgiven for thinking some sections were written by an LLM giving hallucinatory responses to a prompt of "please list all things bad".
Vox populi, vox dei: millions of internet citizens have spoken and should not be denied by a tiny handful of overlords who disagree with billions of data points from everyone else. If the internet represents humanity, then censoring it is inhuman, especially to the extent proposed by the paper.
There should rather be a very small subset of "safe" LLMs for children, businesses, authoritarian governments, ISIS, and the emotionally challenged, rather than having all LLMs censored by default, which would be absurd.
Well-adjusted adults should not be supervised just because there is a small minority of perverts out there.
I don't think you realize just how much of a risk humanity is likely to face once AI reaches a certain level.