---
license: cc-by-nc-sa-4.0
language:
- zh
pipeline_tag: summarization
tags:
- mT5
- summarization
---
# HeackMT5-ZhSum100k: A Summarization Model for Chinese Texts
This model, `heack/HeackMT5-ZhSum100k`, is an mT5 model fine-tuned for Chinese text summarization. It was trained on a diverse set of Chinese datasets and generates coherent, concise summaries for a wide range of texts.
## Model Details
- Model: mT5
- Language: Chinese
- Training data: mainly Chinese financial news sources, with no BBC or CNN content; about 1M lines in total
- Finetuning epochs: 10
## Evaluation Results
The model achieved the following results:
- ROUGE-1: 56.46
- ROUGE-2: 45.81
- ROUGE-L: 52.98
- ROUGE-Lsum: 20.22
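The card does not say how these scores were computed. As a rough illustration of what ROUGE-1 and ROUGE-L measure on Chinese text, here is a minimal sketch that assumes character-level tokens and F1 scoring. Both choices, and the function names, are assumptions made for this example and are not taken from the author's evaluation setup. The functions return values in [0, 1], while the scores above appear to be on a 0 to 100 scale.

```python
from collections import Counter


def rouge_1_f1(reference: str, candidate: str) -> float:
    """Character-level unigram overlap F1 (assumed tokenization: one token per character)."""
    ref_counts = Counter(reference)
    cand_counts = Counter(candidate)
    # Clipped overlap: each character counts at most as often as it appears in both strings.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """Character-level ROUGE-L F1 based on the longest common subsequence."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)
```

Established ROUGE packages differ in tokenization, stemming, and aggregation, so numbers from this sketch are not directly comparable to the figures reported above.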
## Usage
Here is how you can use this model for text summarization:
```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("heack/HeackMT5-ZhSum100k")
tokenizer = T5Tokenizer.from_pretrained("heack/HeackMT5-ZhSum100k")
chunk = """
财联社5月22日讯,据平安包头微信公众号消息,近日,包头警方发布一起利用人工智能(AI)实施电信诈骗的典型案例,福州市某科技公司法人代表郭先生10分钟内被骗430万元。
4月20日中午,郭先生的好友突然通过微信视频联系他,自己的朋友在外地竞标,需要430万保证金,且需要公对公账户过账,想要借郭先生公司的账户走账。
基于对好友的信任,加上已经视频聊天核实了身份,郭先生没有核实钱款是否到账,就分两笔把430万转到了好友朋友的银行卡上。郭先生拨打好友电话,才知道被骗。骗子通过智能AI换脸和拟声技术,佯装好友对他实施了诈骗。
值得注意的是,骗子并没有使用一个仿真的好友微信添加郭先生为好友,而是直接用好友微信发起视频聊天,这也是郭先生被骗的原因之一。骗子极有可能通过技术手段盗用了郭先生好友的微信。幸运的是,接到报警后,福州、包头两地警银迅速启动止付机制,成功止付拦截336.84万元,但仍有93.16万元被转移,目前正在全力追缴中。
"""
inputs = tokenizer.encode("summarize: " + chunk, return_tensors='pt', max_length=512, truncation=True)
summary_ids = model.generate(inputs, max_length=150, num_beams=4, length_penalty=1.5, no_repeat_ngram_size=2)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print(summary)
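# Output:
# 包头警方发布一起利用AI实施电信诈骗典型案例:法人代表10分钟内被骗430万元
# (Baotou police publish a typical case of AI-enabled telecom fraud: a legal representative was defrauded of 4.3 million yuan within 10 minutes)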
```
## If you need a longer summary, refer to the following code:
```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer

model_heack = MT5ForConditionalGeneration.from_pretrained("heack/HeackMT5-ZhSum100k")
tokenizer_heack = T5Tokenizer.from_pretrained("heack/HeackMT5-ZhSum100k")

# Sentence and clause punctuation (ASCII and full-width) at which a chunk may end.
BREAK_CHARS = {'.', '。', ',', ','}


def _split_text(text, length):
    """Split text into chunks of about `length` characters, cutting at punctuation when possible."""
    chunks = []
    start = 0
    while start < len(text):
        if len(text) - start > length:
            pos_forward = start + length
            pos_backward = start + length
            pos = start + length
            # Search up to 20 characters in both directions for a break character.
            while (pos_forward < len(text)) and (pos_backward >= 0) and (pos_forward < 20 + pos) and (pos_backward + 20 > pos) and text[pos_forward] not in BREAK_CHARS and text[pos_backward] not in BREAK_CHARS:
                pos_forward += 1
                pos_backward -= 1
            if pos_forward - pos >= 20 and pos_backward <= pos - 20:
                # No break character within range: cut at the nominal position.
                pos = start + length
            elif text[pos_backward] in BREAK_CHARS:
                pos = pos_backward
            else:
                pos = pos_forward
            chunks.append(text[start:pos+1])
            start = pos + 1
        else:
            chunks.append(text[start:])
            break
    # Combine last chunk with previous one if it's too short
    if len(chunks) > 1 and len(chunks[-1]) < 100:
        chunks[-2] += chunks[-1]
        chunks.pop()
    return chunks


def get_summary_heack(text, each_summary_length=150):
    """Summarize each ~300-character chunk separately and join the partial summaries."""
    chunks = _split_text(text, 300)
    summaries = []
    for chunk in chunks:
        inputs = tokenizer_heack.encode("summarize: " + chunk, return_tensors='pt', max_length=512, truncation=True)
        summary_ids = model_heack.generate(inputs, max_length=each_summary_length, num_beams=4, length_penalty=1.5, no_repeat_ngram_size=2)
        summary = tokenizer_heack.decode(summary_ids[0], skip_special_tokens=True)
        summaries.append(summary)
    return " ".join(summaries)
```
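To try the chunk-then-join flow without downloading the model weights, the pattern can be sketched with the model call replaced by a caller-supplied function. This is a simplified illustration, not the `_split_text` logic above: the splitter only searches backwards for punctuation, and `split_near_punctuation`, `summarize_long`, and `summarize_fn` are hypothetical names introduced for this example.

```python
from typing import Callable, List

# Assumed break characters: ASCII and full-width full stop and comma.
BREAKS = {".", "。", ",", ","}


def split_near_punctuation(text: str, length: int = 300, window: int = 20) -> List[str]:
    """Cut text into pieces of at most `length` characters.

    Each cut is moved back by up to `window` characters so that it lands
    just after a punctuation mark when one is available.
    """
    chunks = []
    start = 0
    while len(text) - start > length:
        cut = start + length
        # Look backwards for the nearest break character inside the window.
        for pos in range(cut, max(cut - window, start), -1):
            if text[pos - 1] in BREAKS:
                cut = pos
                break
        chunks.append(text[start:cut])
        start = cut
    if start < len(text):
        chunks.append(text[start:])
    return chunks


def summarize_long(text: str, summarize_fn: Callable[[str], str], length: int = 300) -> str:
    """Summarize each chunk with `summarize_fn` and join the partial summaries with spaces."""
    return " ".join(summarize_fn(chunk) for chunk in split_near_punctuation(text, length))
```

In practice `summarize_fn` would wrap the `encode`, `generate`, and `decode` calls shown earlier. Passing a stub such as `lambda chunk: chunk[:20]` is enough to check where the text is being cut.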
## Credits
This model was trained and is maintained by Kong Yang of Shanghai Jiao Tong University. For any questions, please contact me on WeChat (ID: kongyang).
**License Agreement**
---
To keep the open-source ecosystem sustainable and to let the developers keep improving model quality, the following terms apply:
## Definitions
**"Derivative Works"** are any variants derived directly or indirectly from this model through technical means such as quantization, pruning, distillation, or architectural modification, including but not limited to:
- Quantized format conversions (GGUF/GGML, etc.)
- Lightweight models obtained via knowledge distillation
- Architectural modifications based on this model's parameters (e.g., changes to the number of layers or to the attention mechanism)
1. **Data & Training Costs**
Training a high-quality AI model consumes substantial resources:
- Data cleaning and annotation account for over 60% of total project cost. All data comes from **compliant domestic sources**, which avoids the distorted "hallucinated translations" of Chinese context found in international media (such as the BBC).
- The project commits to neutral, objective training data, with the aim of promoting technological inclusivity, human understanding, and mutual learning between civilizations.
2. **Commercial License**
**Non-commercial use**: **Free**
**Commercial use** (including enterprise products and services):
| Enterprise Type | Perpetual License Fee (CNY) |
|------------|------------|
| Startups or individuals (annual revenue under ¥1M) | 1,000 |
| Mid-sized enterprises (non-listed, annual revenue of ¥1M or more) | 5,000 |
| Listed companies | 20,000 |
- After you pay by scanning the QR code, your Hugging Face account is granted commercial usage rights
- Each organization may bind only 1 primary account
**Commercial authorization includes:**
Commercial use of derivative works, regardless of format conversion or architectural modification
**Payment method**:
<img src="https://cdn-uploads.huggingface.co/production/uploads/64475c6870338c037608e2de/FuC0FVXOh8hR-Omu7YtJ-.jpeg"
style="max-width: 500px; height: auto; border: 1px solid #eee; border-radius: 8px;"
alt="支付宝/微信收款码">
3. **Raw Data Access**
For the raw, uncleaned training datasets (including multimodal collections), pay **5,000 CNY** via the QR code above, then get in touch by email ([email protected]) or WeChat (ID: kongyang)
---
**Our Belief: Ethical Tech Thrives Through Open Collaboration**
## WeChat ID
kongyang
## Citation
If you use this model in your research, please cite:
```bibtex
@misc{kongyang2023heackmt5zhsum100k,
title={HeackMT5-ZhSum100k: A Large-Scale Multilingual Abstractive Summarization for Chinese Texts},
author={Kong Yang},
year={2023}
}
```