metadata

license: gpl-3.0
metrics:
  - rouge
language:
  - zh
pipeline_tag: question-answering

Ziya-Reader-13B-v1.0

Ziya series models

Brief Introduction

Ziya-Reader-13B-v1.0 is a knowledge question-answering model: given a question and knowledge documents, it answers the question accurately, and it supports both multi-document and single-document question answering. The model has an 8k context window. Compared with other models that have longer windows, it compares favorably in our evaluations on several long-text tasks, including multi-document question answering, synthetic tasks (document retrieval), and long-text summarization (see Evaluation).

The model mainly targets knowledge-base QA, retrieval QA, and e-commerce customer-service scenarios. It performs well on private-domain knowledge QA and can be applied broadly in vertical domains such as law, finance, and healthcare. This is because it addresses a known problem in multi-document QA: answer accuracy drops sharply when the correct information is not in the first or the last document.

The model's general capabilities are also strong, so it can be used for general question answering. On our general-ability evaluation set it outperforms Ziya-Llama-13B-v1.1.

It is trained from the 13B Llama2 and obtained by fine-tuning on hundreds of thousands of general-purpose and retrieval-QA samples.

Evaluation

LongBench Chinese

Model	Multi-doc QA (%)	Synthetic task (%)	Summarization
GPT3.5-turbo-16k	28.7	77.5	16.0
Longchat-v1.5-7B-32k	19.5	7.6	9.9
Xgen-7B-8k	11.0	3.5	2.2
InternLM-7B-8k	16.3	0.9	12.4
ChatGLM2-6B-32k	37.6	64.5	16.2
Vicuna-v1.5-7B-16k	19.3	5.0	15.1
Ziya-Reader-13B-v1.0	42.8	66.0	15.3

Model	LongBench Chinese Multi-doc QA (%)	LongBench Chinese Multi-doc QA, shuffled (%)
GPT3.5-turbo-16k	28.7	23.1
ChatGLM2-32k	34.3	20.3
Baichuan-13B-Chat2	32.4	27.2
Ziya-Reader-13B-v1.0	42.8	40.9

Model Taxonomy

Demand	Task	Series	Model	Parameter	Extra
Question answering (QA), machine reading comprehension (MRC)	AGI model	Ziya	Llama2	13B	Chinese

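Model Information

First, we used position interpolation (PI) and fine-tuned on a curated long-document corpus to extend the context window to 8k. Second, a model is only as good as its data: starting from nearly ten million samples, we kept only high-quality data, and roughly 100k samples that survived successive filtering stages were enough to turn an unremarkable model into a compact but strong knowledge-QA model. In addition, we designed special tasks tailored to search scenarios and carefully built the corresponding data, so that the model learns to find the relevant documents among the retrieved ones and answer the question.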
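As an illustration of the position interpolation idea mentioned above, the following is a minimal sketch and not the authors' training code. It assumes rotary position embeddings (RoPE) with base 10000, a head dimension of 128, and a linear scale equal to the ratio of Llama2's 4096-token pre-training window to the 8k target window; the function name `rope_angles` and the constants are hypothetical names chosen for this example.

```python
import math

def rope_angles(position, head_dim, base=10000.0, scale=1.0):
    """Return the RoPE rotation angles for a single token position.

    With scale < 1 the position index is compressed linearly (position
    interpolation), so positions beyond the trained range map back into it.
    """
    scaled_position = position * scale
    return [
        scaled_position / (base ** (2 * i / head_dim))
        for i in range(head_dim // 2)
    ]

ORIGINAL_CONTEXT = 4096  # assumption: Llama2 pre-training window
EXTENDED_CONTEXT = 8192  # the 8k window described in this section
pi_scale = ORIGINAL_CONTEXT / EXTENDED_CONTEXT

# With interpolation, position 8000 receives the angles that position 4000
# would receive without it, which lies inside the trained range.
interpolated = rope_angles(8000, head_dim=128, scale=pi_scale)
reference = rope_angles(4000, head_dim=128)
print(math.isclose(interpolated[0], reference[0]))
```

Because interpolated positions fall between the integer positions seen in pre-training, a fine-tuning stage on long documents, as described above, is still needed for the model to adapt.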

Usage

Environment

pip install transformers==4.31.0

Example

For general question answering, simply put "<human>:" before the question and "\n<bot>:" after it.

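For reading-comprehension question answering, put the question first, then the context (knowledge documents), and the instruction last. When there are multiple retrieval results, separate them with "<eod>\n" and start each one with its index in square brackets, e.g. "[1] xxxxxxx<eod>\n".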
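The two prompt layouts described above can be sketched as small helpers. This is a hedged illustration: the function names `build_general_prompt` and `build_reader_prompt` are hypothetical, and the exact spacing and the "给定问题：" / "检索结果：" labels are assumptions copied from the example prompt in this section rather than a documented template.

```python
def build_general_prompt(question: str) -> str:
    """Wrap a plain question in the <human>/<bot> markers."""
    return f"<human>: {question}\n<bot>:"

def build_reader_prompt(question: str, documents: list, instruction: str) -> str:
    """Question first, then numbered retrieval results, then the instruction."""
    # Each retrieval result starts with its index in square brackets
    # and ends with the "<eod>\n" separator.
    numbered = "".join(
        f"[{index}] {text}<eod>\n"
        for index, text in enumerate(documents, start=1)
    )
    return f"<human>: 给定问题：{question}\n 检索结果：{numbered}{instruction}\n<bot>:"

example = build_reader_prompt(
    question="交强险过期不上路会不会被罚？",
    documents=["first retrieved passage", "second retrieved passage"],
    instruction="请阅读理解上面多个检索结果，正确地回答问题。",
)
print(example)
```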

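The generated text occasionally contains a phrase such as "根据上面编号为xx的信息" ("according to the information numbered xx above"). The actual answer starts after "我的答案是" ("my answer is"), so truncate everything before it when decoding.

dtype: bfloat16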
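The truncation step described above could look like the following sketch. The helper name `extract_answer` is hypothetical, and falling back to the full text when the marker is missing is an assumption, not documented behavior.

```python
ANSWER_MARKER = "我的答案是"  # the marker named in the note above

def extract_answer(generated: str, marker: str = ANSWER_MARKER) -> str:
    """Return the text after the first answer marker.

    If the marker is absent, the whole generated text is returned unchanged
    apart from surrounding whitespace (an assumed fallback).
    """
    _, found, tail = generated.partition(marker)
    return tail.strip() if found else generated.strip()

sample = "对于问题“交强险过期不上路会不会被罚？”，根据上面的编号为1的信息，我的答案是是的，交强险过期不上路会被罚。"
print(extract_answer(sample))
```

Note that the decoded output in the example below also contains the prompt itself, since `generate` returns the input tokens followed by the new ones; the prompt does not contain the marker, so splitting on its first occurrence still isolates the answer.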


from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = torch.device("cuda")

# Reading-comprehension prompt: question first, then the numbered retrieval
# result (terminated by "<eod>\n"), then the instruction, then "\n<bot>:".
# Adjacent string literals are concatenated into one prompt string.
prompt = (
    '<human>: 给定问题：交强险过期不上路会不会被罚？\n 检索结果：[1] 交强险过期不上路会不会被罚|'
    '法律分析：由于交强险是由保险公司对被保险机动车发生道路交通事故造成受害人(不包括本车人员和被保险人)的人身伤亡、财产损失，'
    '在责任限额内予以赔偿的强制性责任保险。因此一旦交强险到期没续费，发生事故车主还会面临巨额赔偿。车险到期未交有处罚。'
    '法律依据：《机动车交通事故责任强制保险条例》 第三十八条 机动车所有人、管理人未按照规定投保机动车交通事故责任强制保险的，'
    '由公安机关交通管理部门扣留机动车，通知机动车所有人、管理人依照规定投保，处依照规定投保最低责任限额应缴纳的保险费的2倍罚款。 '
    '机动车所有人、管理人依照规定补办机动车交通事故责任强制保险的，应当及时退还机动车。<eod>\n'
    '请阅读理解上面多个检索结果，正确地回答问题。只能根据相关的检索结果或者知识回答，禁止编造；'
    '如果没有相关结果，请回答“都不相关，我不知道”。\n<bot>:'
)

model_path = "IDEA-CCNL/Ziya-Reader-13B-v1.0"
# Load the weights in bfloat16, as recommended above.
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16).to(device)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

# Sample up to 512 new tokens; generation stops at the "</s>" token.
generate_ids = model.generate(
    input_ids,
    max_new_tokens=512,
    do_sample=True,
    top_p=0.8,
    temperature=0.85,
    repetition_penalty=1.0,
    eos_token_id=tokenizer.encode("</s>"),
)
# The decoded text contains the prompt followed by the generated answer.
output = tokenizer.batch_decode(generate_ids)[0]
print(output)
'''Predicted output: 对于问题“交强险过期不上路会不会被罚？”，根据上面的编号为1的信息，我的答案是是的，交强险过期不上路会被罚。根据《机动车交通事故责任强制保险条例》，机动车所有人、管理人未按照规定投保机动车交通事故责任强制保险的，由公安机关交通管理部门扣留机动车，通知机动车所有人、管理人依照规定投保，处依照规定投保最低责任限额应缴纳的保险费的2倍罚款。因此，交强险过期不上路会被罚，车主需要及时补办机动车交通事故责任强制保险，以避免被罚款。
'''

Citation

If you use our model in your work, please cite our paper:

@article{fengshenbang,
  author    = {Jiaxing Zhang and Ruyi Gan and Junjie Wang and Yuxiang Zhang and Lin Zhang and Ping Yang and Xinyu Gao and Ziwei Wu and Xiaoqun Dong and Junqing He and Jianheng Zhuo and Qi Yang and Yongfeng Huang and Xiayu Li and Yanghan Wu and Junyu Lu and Xinyu Zhu and Weifeng Chen and Ting Han and Kunhao Pan and Rui Wang and Hao Wang and Xiaojun Wu and Zhongshen Zeng and Chongpei Chen},
  title     = {Fengshenbang 1.0: Being the Foundation of Chinese Cognitive Intelligence},
  journal   = {CoRR},
  volume    = {abs/2209.02970},
  year      = {2022}
}

You can also cite our website:

@misc{Fengshenbang-LM,
  title={Fengshenbang-LM},
  author={IDEA-CCNL},
  year={2021},
  howpublished={\url{https://github.com/IDEA-CCNL/Fengshenbang-LM}},
}