---
license: cc-by-nc-sa-4.0
language:
- zh
pipeline_tag: summarization
tags:
- mT5
- summarization
---

# HeackMT5-ZhSum100k: A Summarization Model for Chinese Texts

This model, `heack/HeackMT5-ZhSum100k`, is an mT5 model fine-tuned for Chinese text summarization. It was trained on a set of Chinese datasets and generates coherent, concise summaries for a wide range of texts.

## Model Details

- Model: mT5
- Language: Chinese
- Training data: mainly Chinese financial news sources; no BBC or CNN sources. The training data contains 1M lines.
- Finetuning epochs: 10

## Evaluation Results

The model achieved the following results:

- ROUGE-1: 56.46
- ROUGE-2: 45.81
- ROUGE-L: 52.98
- ROUGE-Lsum: 20.22
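
The card does not state which tool or tokenization produced these scores. As a rough illustration of what the metrics measure, the following minimal sketch computes character-level ROUGE-N and ROUGE-L F1 in plain Python. Treating each character as a token is an assumption made here for simplicity with Chinese text, and the helper names (`rouge_n`, `rouge_l`) are illustrative only. The functions return values between 0 and 1 rather than on the 0 to 100 scale used above, and they will not necessarily reproduce the reported numbers.

```python
from collections import Counter


def _ngrams(tokens, n):
    """Count the n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def _f1(overlap, ref_total, cand_total):
    """F1 from an overlap count and the reference/candidate totals."""
    if overlap == 0 or ref_total == 0 or cand_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(reference, candidate, n=1):
    """Character-level ROUGE-N F1 (assumption: one character = one token)."""
    ref = _ngrams(list(reference), n)
    cand = _ngrams(list(candidate), n)
    overlap = sum((ref & cand).values())  # clipped n-gram matches
    return _f1(overlap, sum(ref.values()), sum(cand.values()))


def _lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate):
    """Character-level ROUGE-L F1 based on the longest common subsequence."""
    return _f1(_lcs_length(reference, candidate), len(reference), len(candidate))


if __name__ == "__main__":
    reference = "包头警方发布AI电信诈骗典型案例"
    candidate = "包头警方通报一起AI诈骗案例"
    print(rouge_n(reference, candidate, 1))
    print(rouge_n(reference, candidate, 2))
    print(rouge_l(reference, candidate))
```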

## Usage

Here is how you can use this model for text summarization:

```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer

model = MT5ForConditionalGeneration.from_pretrained("heack/HeackMT5-ZhSum100k")
tokenizer = T5Tokenizer.from_pretrained("heack/HeackMT5-ZhSum100k")

chunk = """
财联社5月22日讯，据平安包头微信公众号消息，近日，包头警方发布一起利用人工智能（AI）实施电信诈骗的典型案例，福州市某科技公司法人代表郭先生10分钟内被骗430万元。
4月20日中午，郭先生的好友突然通过微信视频联系他，自己的朋友在外地竞标，需要430万保证金，且需要公对公账户过账，想要借郭先生公司的账户走账。
基于对好友的信任，加上已经视频聊天核实了身份，郭先生没有核实钱款是否到账，就分两笔把430万转到了好友朋友的银行卡上。郭先生拨打好友电话，才知道被骗。骗子通过智能AI换脸和拟声技术，佯装好友对他实施了诈骗。
值得注意的是，骗子并没有使用一个仿真的好友微信添加郭先生为好友，而是直接用好友微信发起视频聊天，这也是郭先生被骗的原因之一。骗子极有可能通过技术手段盗用了郭先生好友的微信。幸运的是，接到报警后，福州、包头两地警银迅速启动止付机制，成功止付拦截336.84万元，但仍有93.16万元被转移，目前正在全力追缴中。
"""
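# Prefix the task instruction and truncate the input to 512 tokens.
inputs = tokenizer.encode("summarize: " + chunk, return_tensors='pt', max_length=512, truncation=True)
# Beam search with 4 beams; no_repeat_ngram_size=2 blocks repeated bigrams.
summary_ids = model.generate(inputs, max_length=150, num_beams=4, length_penalty=1.5, no_repeat_ngram_size=2)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print(summary)
# Example output:
# 包头警方发布一起利用AI实施电信诈骗典型案例:法人代表10分钟内被骗430万元
# (English: "Baotou police publish a typical case of AI-enabled telecom fraud:
#  legal representative defrauded of 4.3 million yuan within 10 minutes")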
```
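
The `length_penalty=1.5` argument in the call to `generate` affects how beam search ranks finished candidates. According to the `transformers` documentation, the summed log-probability of a sequence is divided by its length raised to `length_penalty`, so values above 0 favor longer outputs. The sketch below reproduces only that normalization, with made-up numbers; `beam_score` and `best_hypothesis` are hypothetical helpers written for this illustration, not library functions, and the real beam scorer handles more details (early stopping, end-of-sequence handling).

```python
def beam_score(sum_logprobs, length, length_penalty=1.0):
    """Length-normalized score: summed log-probability / length ** length_penalty."""
    return sum_logprobs / (length ** length_penalty)


def best_hypothesis(hypotheses, length_penalty):
    """Return the name of the hypothesis with the highest normalized score."""
    return max(
        hypotheses,
        key=lambda name: beam_score(*hypotheses[name], length_penalty),
    )


# Illustrative values only: (summed log-probability, length in tokens).
hypotheses = {
    "short": (-4.0, 5),
    "long": (-9.0, 15),
}

if __name__ == "__main__":
    for penalty in (0.0, 1.0, 1.5):
        print(penalty, best_hypothesis(hypotheses, penalty))
```

With no normalization (`length_penalty=0.0`) the shorter candidate ranks first because it has the higher raw log-probability; with the value used in this card the longer candidate ranks first.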

## If you need a longer summary, refer to the following code:

```python
from transformers import MT5ForConditionalGeneration, T5Tokenizer

model_heack = MT5ForConditionalGeneration.from_pretrained("heack/HeackMT5-ZhSum100k")
tokenizer_heack = T5Tokenizer.from_pretrained("heack/HeackMT5-ZhSum100k")


# Punctuation marks that are acceptable places to cut the text.
# Note: '.' is included, so a cut can also land inside a decimal number.
_BREAK_CHARS = {'.', '。', '，', ','}


def _split_text(text, length):
    """Split text into chunks of roughly `length` characters, cutting at nearby punctuation when possible."""
    chunks = []
    start = 0
    while start < len(text):
        if len(text) - start > length:
            pos = start + length
            pos_forward = pos
            pos_backward = pos
            # Search up to 20 characters in both directions for a punctuation mark.
            while (pos_forward < len(text) and pos_backward >= 0
                   and pos_forward < pos + 20 and pos_backward > pos - 20
                   and text[pos_forward] not in _BREAK_CHARS
                   and text[pos_backward] not in _BREAK_CHARS):
                pos_forward += 1
                pos_backward -= 1
            if pos_forward - pos >= 20 and pos_backward <= pos - 20:
                # No punctuation within range: cut at the nominal position.
                pos = start + length
            elif text[pos_backward] in _BREAK_CHARS:
                # Punctuation found before the nominal position.
                pos = pos_backward
            else:
                # Punctuation found after the nominal position, or end of text reached.
                pos = pos_forward
            chunks.append(text[start:pos + 1])
            start = pos + 1
        else:
            chunks.append(text[start:])
            break
    # Combine the last chunk with the previous one if it is too short.
    if len(chunks) > 1 and len(chunks[-1]) < 100:
        chunks[-2] += chunks[-1]
        chunks.pop()
    return chunks


def get_summary_heack(text, each_summary_length=150):
    """Summarize each ~300-character chunk separately and join the partial summaries."""
    chunks = _split_text(text, 300)
    summaries = []
    for chunk in chunks:
        inputs = tokenizer_heack.encode("summarize: " + chunk, return_tensors='pt', max_length=512, truncation=True)
        summary_ids = model_heack.generate(inputs, max_length=each_summary_length, num_beams=4, length_penalty=1.5, no_repeat_ngram_size=2)
        summary = tokenizer_heack.decode(summary_ids[0], skip_special_tokens=True)
        summaries.append(summary)
    return " ".join(summaries)
```
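
The `_split_text` helper cuts near a fixed character offset, so a chunk can end on a comma or in the middle of a sentence. One possible alternative, sketched here under the assumption that whole sentences make better summarization units, packs complete sentences into chunks of at most `max_chars` characters. The summarizer is passed in as a function so the flow can be tried without loading the model; `split_sentences`, `pack_sentences`, and `summarize_long` are hypothetical names, not part of this repository.

```python
import re

# Sentence terminators; the ASCII period is left out on purpose so that
# decimal numbers such as 336.84 are not split (an assumption of this sketch).
_SENTENCE = re.compile(r"[^。！？!?]*[。！？!?]+|[^。！？!?]+")


def split_sentences(text):
    """Split text into sentences, keeping the terminating punctuation."""
    return [s for s in _SENTENCE.findall(text) if s.strip()]


def pack_sentences(sentences, max_chars=300):
    """Greedily pack whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars becomes its own oversized chunk.
    """
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks


def summarize_long(text, summarize_fn, max_chars=300):
    """Summarize each chunk with summarize_fn and join the partial summaries."""
    chunks = pack_sentences(split_sentences(text), max_chars)
    return " ".join(summarize_fn(chunk) for chunk in chunks)


if __name__ == "__main__":
    # Stand-in summarizer for demonstration: keeps the first 10 characters.
    demo = "第一句话在这里。第二句话也在这里！第三句话是一个问题吗？最后一句没有标点"
    print(summarize_long(demo, lambda chunk: chunk[:10], max_chars=20))
```

To use it with the model, `summarize_fn` could be a small wrapper around the `encode`, `generate`, and `decode` calls shown in `get_summary_heack`.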

## Credits

This model was trained and is maintained by KongYang from Shanghai Jiao Tong University. For any questions, please reach out to me on WeChat (ID: kongyang).

**许可协议 / License Agreement**  

---

为维护开源生态的可持续发展，并确保开发者能持续优化模型质量，我们制定以下条款：  

## 定义
**"衍生作品"** 指通过量化、剪枝、蒸馏、架构修改等技术手段，直接或间接基于本模型产生的任何变体，包括但不限于：
  - GGUF/GGML等量化格式转换产物
  - 通过知识蒸馏获得的轻量化模型
  - 基于本模型参数进行的架构调整（如层数修改、注意力机制变更）

1. **数据与训练成本说明**  
   训练高质量AI模型需耗费巨额资源：  
   - 数据清洗与标注成本占项目总投入的60%以上，且全部采用**国内合规数据源**，避免国际媒体（如BBC）对中文语境的曲解性"幻觉翻译"。  
   - 本项目坚持使用中立、客观的语料，旨在传播技术普惠性，促进人类理解与文明互鉴。  

2. **商业授权条款**

   非商业用途: **免费**

   
   若需用于商业场景（包括企业产品/服务）：  

| 企业类型   | 永久授权费（人民币元）     |
|------------|------------|
| 初创企业或个人(年营业额100万以下)   | 1,000元|
| 中型企业(年营业额100万以上的非上市公司)   | 5,000元|
| 上市公司   | 20,000元|
   - 扫码支付后，您的Hugging Face账号将获得商业使用权  
   - 每家企业仅限绑定1个主账号

   **商业授权范围包括：**
   对衍生作品的商业性使用，无论其是否经过格式转换或架构修改

   **支付方式**：  
<img src="https://cdn-uploads.huggingface.co/production/uploads/64475c6870338c037608e2de/FuC0FVXOh8hR-Omu7YtJ-.jpeg" 
     style="max-width: 500px; height: auto; border: 1px solid #eee; border-radius: 8px;"
     alt="支付宝/微信收款码">

3. **原始数据服务**  
   如需获取原始训练数据，请通过上述二维码支付 **5000元** 并邮件联系 weixin: kongyang  

---

To sustain open-source ecosystems and ensure model quality, we establish these terms:  

## Definitions
**"Derivative Works"** refer to any variants directly or indirectly derived from this model through technical means including but not limited to:

  - Quantized format conversions (GGUF/GGML, etc.)
  - Lightweight models obtained via knowledge distillation
  - Architectural modifications based on model parameters (e.g., layer adjustments, attention mechanism alterations)

1. **Data & Training Costs**  
   - Over 60% of project costs are spent on **data cleaning** using **domestic compliant sources**, avoiding biased narratives from international media.  
   - We commit to neutral, objective training data to promote technological inclusivity.  

2. **Commercial License**

   **Non-commercial Use**: **Free**

   **For Commercial Applications** (including enterprise products/services):

   | Enterprise Type | Perpetual License Fee (CNY) |
   |------------|------------|
   | Startups or individuals (annual revenue < ¥1M) | 1,000 |
   | Mid-sized enterprises (non-listed, annual revenue ≥ ¥1M) | 5,000 |
   | Listed companies | 20,000 |

   - After you pay via the QR code, your Hugging Face account is granted commercial usage rights
   - 1 primary account per organization

   **Commercial Authorization Includes:**
   Commercial use of derivative works, regardless of format conversions or architectural modifications

   **Payment Method**:  
<img src="https://cdn-uploads.huggingface.co/production/uploads/64475c6870338c037608e2de/FuC0FVXOh8hR-Omu7YtJ-.jpeg" 
     style="max-width: 500px; height: auto; border: 1px solid #eee; border-radius: 8px;"
     alt="支付宝/微信收款码">

3. **Raw Data Access**  
   For uncleaned raw datasets (including multimodal collections), pay **5000 CNY** via the QR code and email [email protected]  

---

**我们相信：技术向善，开源共荣**  
**Our Belief: Ethical Tech Thrives Through Open Collaboration**
## WeChat ID
kongyang

## Citation

If you use this model in your research, please cite:

```bibtex
@misc{kongyang2023heackmt5zhsum100k,
    title={HeackMT5-ZhSum100k: A Large-Scale Multilingual Abstractive Summarization for Chinese Texts},
    author={Kong Yang},
    year={2023}
}
```