pyx9913
commited on
Commit
•
4d32fc1
1
Parent(s):
93beeec
feat: 🎸 add paint model code
Browse files- README.md +25 -1
- config.json +26 -0
- configuration_cpmbee.py +132 -0
- feature_extractor/preprocessor_config.json +20 -0
- model_index.json +10 -9
- modeling_cpmbee.py +943 -0
- pipeline_stable_diffusion.py +723 -0
- scheduler/scheduler_config.json +14 -0
- tokenization_viscpmbee.py +1008 -0
- tokenizer_config.json +10 -0
- unet/config.json +45 -0
- vae/config.json +29 -0
- vocab.txt +0 -0
README.md
CHANGED
@@ -27,8 +27,32 @@ language:
|
|
27 |
|
28 |
Similar to `VisCPM-Chat`, we found that due to the bilingual capability of `CPM-Bee`, `VisCPM-Paint` can achieve good Chinese text-to-image generation by training only on English text-image pairs, surpassing the performance of Chinese open-source models. By incorporating an additional 20M cleaned native Chinese text-image pairs and 120M translated text-image pairs in Chinese, the model's Chinese text-to-image generation ability can be further improved. We sample 30,000 images from the standard image generation test set MSCOCO and calculated commonly used evaluation metrics FID (Fréchet Inception Distance) to assess the quality of generated images. Similarly, we provide two versions of the model, namely `VisCPM-Paint-balance` and `VisCPM-Paint-zhplus`. The former has a balanced ability in both English and Chinese, while the latter emphasizes Chinese proficiency. `VisCPM-Paint-balance` is trained only using English text-image pairs, while `VisCPM-Paint-zhplus` incorporates an additional 20M native Chinese text-image pairs and 120M translated text-image pairs in Chinese based on `VisCPM-Paint-balance`.
|
29 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
30 |
## 📝 License
|
31 |
|
32 |
VisCPM is governed by the [GML License](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E9%9D%9E%E5%95%86%E4%B8%9A%E5%8C%96.md), and permits individual and research usages. If you intend to utilize the model for commercial purposes, please reach out to [email protected] to negotiate commercial licensing.
|
33 |
|
34 |
-
The CPM-Bee base, governed by the [General Model License (GML)](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E5%95%86%E4%B8%9A%E6%8E%88%E6%9D%83.md), permits commercial usage. If you intend to utilize the model for commercial purposes, please reach out to [email protected] to obtain the certificate of authorization.
|
|
|
27 |
|
28 |
Similar to `VisCPM-Chat`, we found that due to the bilingual capability of `CPM-Bee`, `VisCPM-Paint` can achieve good Chinese text-to-image generation by training only on English text-image pairs, surpassing the performance of Chinese open-source models. By incorporating an additional 20M cleaned native Chinese text-image pairs and 120M translated text-image pairs in Chinese, the model's Chinese text-to-image generation ability can be further improved. We sample 30,000 images from the standard image generation test set MSCOCO and calculated commonly used evaluation metrics FID (Fréchet Inception Distance) to assess the quality of generated images. Similarly, we provide two versions of the model, namely `VisCPM-Paint-balance` and `VisCPM-Paint-zhplus`. The former has a balanced ability in both English and Chinese, while the latter emphasizes Chinese proficiency. `VisCPM-Paint-balance` is trained only using English text-image pairs, while `VisCPM-Paint-zhplus` incorporates an additional 20M native Chinese text-image pairs and 120M translated text-image pairs in Chinese based on `VisCPM-Paint-balance`.
|
29 |
|
30 |
+
|
31 |
+
## How to Use
|
32 |
+
|
33 |
+
```python
|
34 |
+
#!/usr/bin/env python
|
35 |
+
# encoding: utf-8
|
36 |
+
from diffusers import DiffusionPipeline
|
37 |
+
from transformers import AutoModel
|
38 |
+
from transformers import AutoTokenizer
|
39 |
+
|
40 |
+
|
41 |
+
tokenizer = AutoTokenizer.from_pretrained('openbmb/VisCPM-Paint', trust_remote_code=True)
|
42 |
+
text_encoder = AutoModel.from_pretrained('openbmb/VisCPM-Paint', trust_remote_code=True)
|
43 |
+
print('load pipeline')
|
44 |
+
pipeline = DiffusionPipeline.from_pretrained('openbmb/VisCPM-Paint', custom_pipeline="pipeline_stable_diffusion.py", text_encoder=text_encoder, tokenizer=tokenizer)
|
45 |
+
|
46 |
+
pipeline = pipeline.to('cuda')
|
47 |
+
|
48 |
+
prompt = "a photo of an astronaut riding a horse on mars"
|
49 |
+
image = pipeline(prompt).images[0]
|
50 |
+
|
51 |
+
image.save("astronaut_rides_horse.png")
|
52 |
+
```
|
53 |
+
|
54 |
## 📝 License
|
55 |
|
56 |
VisCPM is governed by the [GML License](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E9%9D%9E%E5%95%86%E4%B8%9A%E5%8C%96.md), and permits individual and research usages. If you intend to utilize the model for commercial purposes, please reach out to [email protected] to negotiate commercial licensing.
|
57 |
|
58 |
+
The CPM-Bee base, governed by the [General Model License (GML)](https://github.com/OpenBMB/General-Model-License/blob/main/%E9%80%9A%E7%94%A8%E6%A8%A1%E5%9E%8B%E8%AE%B8%E5%8F%AF%E5%8D%8F%E8%AE%AE-%E6%9D%A5%E6%BA%90%E8%AF%B4%E6%98%8E-%E5%AE%A3%E4%BC%A0%E9%99%90%E5%88%B6-%E5%95%86%E4%B8%9A%E6%8E%88%E6%9D%83.md), permits commercial usage. If you intend to utilize the model for commercial purposes, please reach out to [email protected] to obtain the certificate of authorization.
|
config.json
ADDED
@@ -0,0 +1,26 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"_from_model_config": true,
|
3 |
+
"_name_or_path": "openbmb/cpm-bee-10b",
|
4 |
+
"architectures": [
|
5 |
+
"CpmBeeForWithTransform"
|
6 |
+
],
|
7 |
+
"auto_map": {
|
8 |
+
"AutoConfig": "configuration_cpmbee.CpmBeeConfig",
|
9 |
+
"AutoModel": "modeling_cpmbee.CpmBeeWithTransform",
|
10 |
+
"AutoTokenizer": "tokenization_viscpmbee.VisCpmBeeTokenizer"
|
11 |
+
},
|
12 |
+
"vocab_size": 86583,
|
13 |
+
"hidden_size": 4096,
|
14 |
+
"dim_ff" : 10240,
|
15 |
+
"num_hidden_layers" : 48,
|
16 |
+
"num_attention_heads": 32,
|
17 |
+
"dim_head" : 128,
|
18 |
+
"dropout_p" : 0.0,
|
19 |
+
"position_bias_num_buckets" : 256,
|
20 |
+
"position_bias_num_segment_buckets": 256,
|
21 |
+
"position_bias_max_distance" : 2048,
|
22 |
+
"eps" : 1e-6,
|
23 |
+
"half" : false,
|
24 |
+
"model_type": "viscpmbee",
|
25 |
+
"unet_cross_attention_dim": 1024
|
26 |
+
}
|
configuration_cpmbee.py
ADDED
@@ -0,0 +1,132 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# coding=utf-8
|
2 |
+
# Copyright 2022 The OpenBMB Team and The HuggingFace Inc. team. All rights reserved.
|
3 |
+
#
|
4 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
5 |
+
# you may not use this file except in compliance with the License.
|
6 |
+
# You may obtain a copy of the License at
|
7 |
+
#
|
8 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
9 |
+
#
|
10 |
+
# Unless required by applicable law or agreed to in writing, software
|
11 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
12 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
13 |
+
# See the License for the specific language governing permissions and
|
14 |
+
# limitations under the License.
|
15 |
+
""" CpmBee model configuration"""
|
16 |
+
|
17 |
+
from typing import List, Optional, Tuple, Union
|
18 |
+
|
19 |
+
from transformers.configuration_utils import PretrainedConfig
|
20 |
+
from transformers.utils import logging
|
21 |
+
|
22 |
+
|
23 |
+
logger = logging.get_logger(__name__)
|
24 |
+
|
25 |
+
CPMBEE_PRETRAINED_CONFIG_ARCHIVE_MAP = {
|
26 |
+
"openbmb/cpm-bee-10b": "https://huggingface.co/openbmb/cpm-bee-10b/resolve/main/config.json",
|
27 |
+
"openbmb/cpm-bee-5b": "https://huggingface.co/openbmb/cpm-bee-5b/resolve/main/config.json",
|
28 |
+
"openbmb/cpm-bee-2b": "https://huggingface.co/openbmb/cpm-bee-2b/resolve/main/config.json",
|
29 |
+
"openbmb/cpm-bee-1b": "https://huggingface.co/openbmb/cpm-bee-1b/resolve/main/config.json",
|
30 |
+
# See all CpmBee models at https://huggingface.co/models?filter=cpmbee
|
31 |
+
}
|
32 |
+
|
33 |
+
|
34 |
+
class CpmBeeConfig(PretrainedConfig):
|
35 |
+
r"""
|
36 |
+
This is the configuration class to store the configuration of a [`CpmBeeModel`]. It is used to instbeeiate an
|
37 |
+
CPMBee model according to the specified arguments, defining the model architecture. Instantiating a configuration
|
38 |
+
with the defaults will yield a similar configuration to that of the CPMBee
|
39 |
+
[openbmb/cpm-bee-10b](https://huggingface.co/openbmb/cpm-bee-10b) architecture.
|
40 |
+
|
41 |
+
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
|
42 |
+
documentation from [`PretrainedConfig`] for more information.
|
43 |
+
|
44 |
+
Args:
|
45 |
+
vocab_size (`int`, *optional*, defaults to 30720):
|
46 |
+
Vocabulary size of the CPMBee model. Defines the number of different tokens that can be represented by the
|
47 |
+
`input` passed when calling [`CpmBeeModel`].
|
48 |
+
hidden_size (`int`, *optional*, defaults to 4096):
|
49 |
+
Dimension of the encoder layers.
|
50 |
+
num_attention_heads (`int`, *optional*, defaults to 32):
|
51 |
+
Number of attention heads in the Transformer encoder.
|
52 |
+
dim_head (`int`, *optional*, defaults to 128):
|
53 |
+
Dimension of attention heads for each attention layer in the Transformer encoder.
|
54 |
+
dim_ff (`int`, *optional*, defaults to 10240):
|
55 |
+
Dimension of the "intermediate" (i.e., feed-forward) layer in the Transformer encoder.
|
56 |
+
num_hidden_layers (`int`, *optional*, defaults to 48):
|
57 |
+
Number of layers of the Transformer encoder.
|
58 |
+
dropout_p (`float`, *optional*, defaults to 0.1):
|
59 |
+
The dropout probabilitiy for all fully connected layers in the embeddings, encoder.
|
60 |
+
position_bias_num_buckets (`int`, *optional*, defaults to 512):
|
61 |
+
The number of position_bias buckets.
|
62 |
+
position_bias_num_segment_buckets (`int`, *optional*, defaults to 32):
|
63 |
+
The number of segment buckets.
|
64 |
+
position_bias_max_distance (`int`, *optional*, defaults to 2048):
|
65 |
+
The maximum sequence length that this model might ever be used with. Typically set this to something large
|
66 |
+
just in case (e.g., 512 or 1024 or 2048).
|
67 |
+
eps (`float`, *optional*, defaults to 1e-6):
|
68 |
+
The epsilon used by the layer normalization layers.
|
69 |
+
init_std (`float`, *optional*, defaults to 1.0):
|
70 |
+
Initialize parameters with std = init_std.
|
71 |
+
use_cache (`bool`, *optional*, defaults to `True`):
|
72 |
+
Whether to use cache.
|
73 |
+
distance_scale (`float` or `int`, *optional*, defaults to 16):
|
74 |
+
Scale the rotary embedding.
|
75 |
+
mask_modules (`list` or `tuple`, *optional*, defaults to None):
|
76 |
+
Decides which feedforward block or attention block is pruned.
|
77 |
+
half (`bool`, *optional*, defaults to `False`):
|
78 |
+
Decides the model parameters are half-precision or not.
|
79 |
+
|
80 |
+
Example:
|
81 |
+
|
82 |
+
```python
|
83 |
+
>>> from transformers import CpmBeeModel, CpmBeeConfig
|
84 |
+
|
85 |
+
>>> # Initializing a CPMBee cpm-bee-10b style configuration
|
86 |
+
>>> configuration = CpmBeeConfig()
|
87 |
+
|
88 |
+
>>> # Initializing a model from the cpm-bee-10b style configuration
|
89 |
+
>>> model = CpmBeeModel(configuration)
|
90 |
+
|
91 |
+
>>> # Accessing the model configuration
|
92 |
+
>>> configuration = model.config
|
93 |
+
```"""
|
94 |
+
model_type = "cpmbee"
|
95 |
+
|
96 |
+
def __init__(
|
97 |
+
self,
|
98 |
+
vocab_size: int = 30720,
|
99 |
+
hidden_size: int = 4096,
|
100 |
+
num_attention_heads: int = 64,
|
101 |
+
dim_head: int = 64,
|
102 |
+
dim_ff: int = 10240,
|
103 |
+
num_hidden_layers: int = 32,
|
104 |
+
dropout_p: int = 0.0,
|
105 |
+
position_bias_num_buckets: int = 256,
|
106 |
+
position_bias_num_segment_buckets: int = 32,
|
107 |
+
position_bias_max_distance: int = 2048,
|
108 |
+
eps: int = 1e-6,
|
109 |
+
init_std: float = 1.0,
|
110 |
+
use_cache: bool = True,
|
111 |
+
distance_scale: Union[int, float] = 16,
|
112 |
+
mask_modules: Optional[Union[List, Tuple]] = None,
|
113 |
+
half: bool = False,
|
114 |
+
**kwargs,
|
115 |
+
):
|
116 |
+
super().__init__(**kwargs)
|
117 |
+
self.position_bias_num_segment_buckets = position_bias_num_segment_buckets
|
118 |
+
self.hidden_size = hidden_size
|
119 |
+
self.num_attention_heads = num_attention_heads
|
120 |
+
self.dim_head = dim_head
|
121 |
+
self.dim_ff = dim_ff
|
122 |
+
self.num_hidden_layers = num_hidden_layers
|
123 |
+
self.position_bias_num_buckets = position_bias_num_buckets
|
124 |
+
self.position_bias_max_distance = position_bias_max_distance
|
125 |
+
self.dropout_p = dropout_p
|
126 |
+
self.eps = eps
|
127 |
+
self.use_cache = use_cache
|
128 |
+
self.vocab_size = vocab_size
|
129 |
+
self.init_std = init_std
|
130 |
+
self.distance_scale = distance_scale
|
131 |
+
self.half = half
|
132 |
+
self.mask_modules = mask_modules
|
feature_extractor/preprocessor_config.json
ADDED
@@ -0,0 +1,20 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
{
|
2 |
+
"crop_size": 224,
|
3 |
+
"do_center_crop": true,
|
4 |
+
"do_convert_rgb": true,
|
5 |
+
"do_normalize": true,
|
6 |
+
"do_resize": true,
|
7 |
+
"feature_extractor_type": "CLIPFeatureExtractor",
|
8 |
+
"image_mean": [
|
9 |
+
0.48145466,
|
10 |
+
0.4578275,
|
11 |
+
0.40821073
|
12 |
+
],
|
13 |
+
"image_std": [
|
14 |
+
0.26862954,
|
15 |
+
0.26130258,
|
16 |
+
0.27577711
|
17 |
+
],
|
18 |
+
"resample": 3,
|
19 |
+
"size": 224
|
20 |
+
}
|
model_index.json
CHANGED
@@ -1,13 +1,14 @@
|
|
1 |
{
|
2 |
-
"_class_name": "
|
3 |
"_diffusers_version": "0.3.0",
|
4 |
"feature_extractor": [
|
5 |
"transformers",
|
6 |
"CLIPImageProcessor"
|
7 |
],
|
|
|
8 |
"safety_checker": [
|
9 |
-
|
10 |
-
|
11 |
],
|
12 |
"scheduler": [
|
13 |
"diffusers",
|
@@ -15,7 +16,11 @@
|
|
15 |
],
|
16 |
"text_encoder": [
|
17 |
"transformers",
|
18 |
-
"
|
|
|
|
|
|
|
|
|
19 |
],
|
20 |
"unet": [
|
21 |
"diffusers",
|
@@ -24,9 +29,5 @@
|
|
24 |
"vae": [
|
25 |
"diffusers",
|
26 |
"AutoencoderKL"
|
27 |
-
],
|
28 |
-
"text_safety_checker": [
|
29 |
-
"transformers",
|
30 |
-
"BertForSequenceClassification"
|
31 |
]
|
32 |
-
}
|
|
|
1 |
{
|
2 |
+
"_class_name": "VisCPMPaintBeePipeline",
|
3 |
"_diffusers_version": "0.3.0",
|
4 |
"feature_extractor": [
|
5 |
"transformers",
|
6 |
"CLIPImageProcessor"
|
7 |
],
|
8 |
+
"requires_safety_checker": false,
|
9 |
"safety_checker": [
|
10 |
+
null,
|
11 |
+
null
|
12 |
],
|
13 |
"scheduler": [
|
14 |
"diffusers",
|
|
|
16 |
],
|
17 |
"text_encoder": [
|
18 |
"transformers",
|
19 |
+
"PreTrainedModel"
|
20 |
+
],
|
21 |
+
"tokenizer": [
|
22 |
+
"transformers",
|
23 |
+
"PreTrainedTokenizer"
|
24 |
],
|
25 |
"unet": [
|
26 |
"diffusers",
|
|
|
29 |
"vae": [
|
30 |
"diffusers",
|
31 |
"AutoencoderKL"
|
|
|
|
|
|
|
|
|
32 |
]
|
33 |
+
}
|
modeling_cpmbee.py
ADDED
@@ -0,0 +1,943 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
1 |
+
# coding=utf-8
|
2 |
+
# Copyright 2022 The OpenBMB Team The HuggingFace Inc. team. All rights reserved.
|
3 |
+
#
|
4 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
5 |
+
# you may not use this file except in compliance with the License.
|
6 |
+
# You may obtain a copy of the License at
|
7 |
+
#
|
8 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
9 |
+
#
|
10 |
+
# Unless required by applicable law or agreed to in writing, software
|
11 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
12 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
13 |
+
# See the License for the specific language governing permissions and
|
14 |
+
# limitations under the License.
|
15 |
+
""" PyTorch CpmBee model."""
|
16 |
+
import copy
|
17 |
+
import math
|
18 |
+
from collections import UserDict
|
19 |
+
from typing import Any, Callable, Dict, List, Optional, Tuple, Union
|
20 |
+
|
21 |
+
import torch
|
22 |
+
import torch.nn as nn
|
23 |
+
|
24 |
+
from transformers.generation.beam_search import BeamHypotheses, BeamSearchScorer
|
25 |
+
from transformers.generation.streamers import BaseStreamer
|
26 |
+
from transformers.generation.utils import (
|
27 |
+
GenerationConfig,
|
28 |
+
LogitsProcessorList,
|
29 |
+
StoppingCriteriaList,
|
30 |
+
dist,
|
31 |
+
inspect,
|
32 |
+
is_deepspeed_zero3_enabled,
|
33 |
+
warnings,
|
34 |
+
)
|
35 |
+
from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, ModelOutput
|
36 |
+
from transformers.modeling_utils import PreTrainedModel
|
37 |
+
from transformers.utils import add_code_sample_docstrings, add_start_docstrings, add_start_docstrings_to_model_forward, logging
|
38 |
+
from .configuration_cpmbee import CpmBeeConfig
|
39 |
+
from .tokenization_viscpmbee import VisCpmBeeTokenizer
|
40 |
+
|
41 |
+
|
42 |
+
logger = logging.get_logger(__name__)
|
43 |
+
|
44 |
+
_CHECKPOINT_FOR_DOC = "openbmb/cpm-bee-10b"
|
45 |
+
_CONFIG_FOR_DOC = "CpmBeeConfig"
|
46 |
+
|
47 |
+
CPMBEE_PRETRAINED_MODEL_ARCHIVE_LIST = [
|
48 |
+
"openbmb/cpm-bee-10b",
|
49 |
+
"openbmb/cpm-bee-5b",
|
50 |
+
"openbmb/cpm-bee-2b",
|
51 |
+
"openbmb/cpm-bee-1b",
|
52 |
+
# See all CPMBee models at https://huggingface.co/models?filter=cpmbee
|
53 |
+
]
|
54 |
+
|
55 |
+
|
56 |
+
class CpmBeeLinear(nn.Linear):
|
57 |
+
def __init__(self, dim_in, dim_out, dtype):
|
58 |
+
"""
|
59 |
+
Construct a linear for CPMBee. It contains a scale operation.
|
60 |
+
"""
|
61 |
+
super().__init__(dim_in, dim_out, bias=False)
|
62 |
+
self.dim_in = self.in_features = dim_in
|
63 |
+
self.dim_out = self.out_features = dim_out
|
64 |
+
|
65 |
+
self.weight = torch.nn.parameter.Parameter(torch.empty((dim_out, dim_in), dtype=dtype))
|
66 |
+
|
67 |
+
def forward(self, x: torch.Tensor):
|
68 |
+
"""
|
69 |
+
Args:
|
70 |
+
x (`torch.Tensor` of shape `(batch, seq_len, dim_in)`): The input of linear layer
|
71 |
+
Returns:
|
72 |
+
`torch.Tensor` of shape `(batch, seq_len, dim_out)`: The output of the linear transform y.
|
73 |
+
"""
|
74 |
+
x = nn.functional.linear(x, self.weight)
|
75 |
+
x = x / math.sqrt(self.dim_in)
|
76 |
+
return x
|
77 |
+
|
78 |
+
|
79 |
+
class CpmBeeLayerNorm(nn.Module):
|
80 |
+
"""
|
81 |
+
We use Root Mean Square (RMS) Layer Normalization, please see https://arxiv.org/abs/1910.07467 for details."
|
82 |
+
"""
|
83 |
+
|
84 |
+
def __init__(self, config: CpmBeeConfig):
|
85 |
+
super().__init__()
|
86 |
+
|
87 |
+
self.eps = config.eps
|
88 |
+
self.dim_norm = config.hidden_size
|
89 |
+
self.weight = nn.Parameter(torch.empty(config.hidden_size, dtype=config.torch_dtype))
|
90 |
+
|
91 |
+
def forward(self, hidden_states: torch.Tensor):
|
92 |
+
"""
|
93 |
+
Args:
|
94 |
+
hidden_states (`torch.Tensor` of shape `(batch, seq_len, dim_in)`)
|
95 |
+
"""
|
96 |
+
if hidden_states.size(-1) != self.dim_norm:
|
97 |
+
raise AssertionError("hidden_states.size(-1) != self.dim_norm")
|
98 |
+
old_dtype = hidden_states.dtype
|
99 |
+
variance = hidden_states.to(torch.float32).pow(2).mean(dim=-1, keepdim=True)
|
100 |
+
hidden_states = (hidden_states * torch.rsqrt(variance + self.eps)).to(old_dtype) * self.weight
|
101 |
+
return hidden_states
|
102 |
+
|
103 |
+
|
104 |
+
class CpmBeeAttention(nn.Module):
|
105 |
+
def __init__(self, config: CpmBeeConfig):
|
106 |
+
super().__init__()
|
107 |
+
self.dim_model = config.hidden_size
|
108 |
+
self.num_heads = config.num_attention_heads
|
109 |
+
self.dim_head = config.dim_head
|
110 |
+
|
111 |
+
self.project_q = CpmBeeLinear(self.dim_model, self.num_heads * self.dim_head, dtype=config.torch_dtype)
|
112 |
+
self.project_k = CpmBeeLinear(self.dim_model, self.num_heads * self.dim_head, dtype=config.torch_dtype)
|
113 |
+
self.project_v = CpmBeeLinear(self.dim_model, self.num_heads * self.dim_head, dtype=config.torch_dtype)
|
114 |
+
|
115 |
+
self.attention_out = CpmBeeLinear(self.num_heads * self.dim_head, self.dim_model, dtype=config.torch_dtype)
|
116 |
+
|
117 |
+
self.softmax = torch.nn.Softmax(dim=-1)
|
118 |
+
|
119 |
+
if config.dropout_p is not None:
|
120 |
+
self.dropout = torch.nn.Dropout(p=config.dropout_p)
|
121 |
+
else:
|
122 |
+
self.dropout = None
|
123 |
+
|
124 |
+
def forward(
|
125 |
+
self,
|
126 |
+
hidden_q: torch.Tensor,
|
127 |
+
hidden_kv: torch.Tensor,
|
128 |
+
attention_mask: torch.BoolTensor,
|
129 |
+
position_bias: torch.Tensor,
|
130 |
+
output_attentions: Optional[bool] = False,
|
131 |
+
past_key_values: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
|
132 |
+
use_cache: Optional[bool] = None,
|
133 |
+
):
|
134 |
+
"""
|
135 |
+
Args:
|
136 |
+
hidden_q (`torch.Tensor`):
|
137 |
+
Input of transformer block(self-attention block). It can be the raw embedding of a batch of sequences.
|
138 |
+
hidden_kv (`torch.Tensor` of shape `(batch, len_k, dim_model)`)):
|
139 |
+
Tensor *key_value* and *query* of shape `(batch, len_k, dim_model)`
|
140 |
+
attention_mask (`torch.Tensor` of shape `(batch, len_seq, len_seq)`):
|
141 |
+
Avoid invalid areas to participate in the calculation of self-attention.
|
142 |
+
position_bias (`torch.Tensor` of shape `(batch, len_seq, len_seq)`):
|
143 |
+
Provide positional information to self-attention block.
|
144 |
+
output_attentions (`bool`, *optional*):
|
145 |
+
Whether or not to return the attentions tensors of all attention layers.
|
146 |
+
past_key_values (`Tuple[torch.Tensor, torch.Tensor]`, *optional*):
|
147 |
+
Cached past key and value projection states.
|
148 |
+
use_cache (`bool`, *optional*):
|
149 |
+
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
|
150 |
+
(see `past_key_values`).
|
151 |
+
"""
|
152 |
+
batch_size = hidden_q.size(0)
|
153 |
+
len_q = hidden_q.size(1)
|
154 |
+
len_k = hidden_kv.size(1)
|
155 |
+
|
156 |
+
query = self.project_q(hidden_q)
|
157 |
+
key = self.project_k(hidden_kv)
|
158 |
+
value = self.project_v(hidden_kv)
|
159 |
+
|
160 |
+
query = query.view(batch_size, len_q, self.num_heads, self.dim_head).permute(0, 2, 1, 3)
|
161 |
+
key = key.view(batch_size, len_k, self.num_heads, self.dim_head).permute(0, 2, 1, 3)
|
162 |
+
value = value.view(batch_size, len_k, self.num_heads, self.dim_head).permute(0, 2, 1, 3)
|
163 |
+
|
164 |
+
if past_key_values is not None:
|
165 |
+
key = torch.cat([past_key_values[0], key], dim=-2)
|
166 |
+
value = torch.cat([past_key_values[1], value], dim=-2)
|
167 |
+
len_k = key.size(-2)
|
168 |
+
|
169 |
+
# (batch_size, num_heads, len_q, dim_head) @ (batch_size, num_heads, dim_head, len_k) -> (batch_size, num_heads, len_q, len_k)
|
170 |
+
score = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(self.dim_head)
|
171 |
+
score = score + position_bias
|
172 |
+
|
173 |
+
score = torch.masked_fill(
|
174 |
+
score,
|
175 |
+
attention_mask.view(batch_size, 1, len_q, len_k) == torch.tensor(False),
|
176 |
+
torch.scalar_tensor(float("-inf"), device=score.device, dtype=score.dtype),
|
177 |
+
)
|
178 |
+
score = self.softmax(score)
|
179 |
+
|
180 |
+
score = torch.masked_fill(
|
181 |
+
score,
|
182 |
+
attention_mask.view(batch_size, 1, len_q, len_k) == torch.tensor(False),
|
183 |
+
torch.scalar_tensor(0, device=score.device, dtype=score.dtype),
|
184 |
+
)
|
185 |
+
if output_attentions:
|
186 |
+
attn_weights = score
|
187 |
+
else:
|
188 |
+
attn_weights = None
|
189 |
+
|
190 |
+
if self.dropout is not None:
|
191 |
+
score = self.dropout(score)
|
192 |
+
|
193 |
+
# (batch_size, num_heads, len_q, len_k) @ (batch_size, num_heads, len_k, dim_head) -> (batch_size, num_heads, len_q, dim_head)
|
194 |
+
score = torch.matmul(score, value)
|
195 |
+
|
196 |
+
score = score.view(batch_size, self.num_heads, len_q, self.dim_head).permute(0, 2, 1, 3)
|
197 |
+
score = score.contiguous().view(batch_size, len_q, self.num_heads * self.dim_head)
|
198 |
+
|
199 |
+
score = self.attention_out(score)
|
200 |
+
|
201 |
+
past_key_values = None
|
202 |
+
if use_cache:
|
203 |
+
past_key_values = (key, value)
|
204 |
+
|
205 |
+
return score, attn_weights, past_key_values
|
206 |
+
|
207 |
+
|
208 |
+
class CpmBeeSelfAttentionBlock(nn.Module):
|
209 |
+
def __init__(self, config: CpmBeeConfig):
|
210 |
+
super().__init__()
|
211 |
+
self.layernorm_before_attention = CpmBeeLayerNorm(config)
|
212 |
+
self.self_attention = CpmBeeAttention(config)
|
213 |
+
if config.dropout_p:
|
214 |
+
self.dropout = torch.nn.Dropout(config.dropout_p)
|
215 |
+
else:
|
216 |
+
self.dropout = None
|
217 |
+
|
218 |
+
def forward(
|
219 |
+
self,
|
220 |
+
hidden_states: torch.Tensor,
|
221 |
+
attention_mask: torch.Tensor,
|
222 |
+
position_bias: Optional[torch.Tensor] = None,
|
223 |
+
output_attentions: Optional[bool] = False,
|
224 |
+
past_key_values: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
|
225 |
+
use_cache: Optional[bool] = None,
|
226 |
+
):
|
227 |
+
"""
|
228 |
+
Args:
|
229 |
+
hidden_states (`torch.Tensor` of shape `(batch, len_seq, dim_model)`):
|
230 |
+
Input of transformer block(self-attention block). It can be the raw embedding of a batch of sequences.
|
231 |
+
attention_mask (`torch.Tensor` of shape `(batch, len_seq, len_seq)`):
|
232 |
+
Avoid invalid areas to participate in the calculation of self-attention.
|
233 |
+
position_bias (`torch.Tensor` of shape `(batch, len_seq, len_seq)`):
|
234 |
+
Provide positional information to self-attention block.
|
235 |
+
output_attentions (`bool`, *optional*):
|
236 |
+
Whether or not to return the attentions tensors of all attention layers.
|
237 |
+
past_key_values (`Tuple(torch.FloatTensor)`, *optional*):
|
238 |
+
Cached past key and value projection states.
|
239 |
+
use_cache (`bool`, *optional*):
|
240 |
+
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
|
241 |
+
(see `past_key_values`).
|
242 |
+
"""
|
243 |
+
outputs = self.layernorm_before_attention(hidden_states)
|
244 |
+
outputs = self.self_attention(
|
245 |
+
outputs, outputs, attention_mask, position_bias, output_attentions, past_key_values, use_cache
|
246 |
+
)
|
247 |
+
|
248 |
+
outputs, attn_weights, current_key_value = outputs
|
249 |
+
|
250 |
+
if self.dropout is not None:
|
251 |
+
outputs = self.dropout(outputs)
|
252 |
+
hidden_states = (hidden_states + outputs) / 1.05
|
253 |
+
|
254 |
+
return hidden_states, attn_weights, current_key_value
|
255 |
+
|
256 |
+
|
257 |
+
class CpmBeeDenseGatedACT(nn.Module):
|
258 |
+
def __init__(self, config: CpmBeeConfig):
|
259 |
+
super().__init__()
|
260 |
+
self.w_0 = CpmBeeLinear(config.hidden_size, config.dim_ff, dtype=config.torch_dtype)
|
261 |
+
self.w_1 = CpmBeeLinear(config.hidden_size, config.dim_ff, dtype=config.torch_dtype)
|
262 |
+
self.act = torch.nn.GELU()
|
263 |
+
|
264 |
+
def forward(self, hidden_states: torch.Tensor):
|
265 |
+
"""Transform an input tensor from one feature space to another via a nonlinear operation
|
266 |
+
|
267 |
+
Args:
|
268 |
+
hidden_states (`torch.Tensor` of shape `(batch, seq_len, dim_in)`)
|
269 |
+
"""
|
270 |
+
gate_score = self.act(self.w_0(hidden_states))
|
271 |
+
hidden_states = self.w_1(hidden_states)
|
272 |
+
|
273 |
+
hidden_states = gate_score * hidden_states
|
274 |
+
return hidden_states
|
275 |
+
|
276 |
+
|
277 |
+
class CpmBeeFeedForward(nn.Module):
|
278 |
+
def __init__(self, config: CpmBeeConfig):
|
279 |
+
super().__init__()
|
280 |
+
self.w_in = CpmBeeDenseGatedACT(config)
|
281 |
+
if config.dropout_p is not None:
|
282 |
+
self.dropout = torch.nn.Dropout(config.dropout_p)
|
283 |
+
else:
|
284 |
+
self.dropout = None
|
285 |
+
|
286 |
+
self.w_out = CpmBeeLinear(config.dim_ff, config.hidden_size, dtype=config.torch_dtype)
|
287 |
+
|
288 |
+
def forward(self, hidden_states: torch.Tensor):
|
289 |
+
"""
|
290 |
+
Args:
|
291 |
+
hidden_states (`torch.Tensor` of shape `(batch, seq_len, dim_in)`)
|
292 |
+
"""
|
293 |
+
hidden_states = self.w_in(hidden_states)
|
294 |
+
|
295 |
+
if self.dropout is not None:
|
296 |
+
hidden_states = self.dropout(hidden_states)
|
297 |
+
|
298 |
+
hidden_states = self.w_out(hidden_states)
|
299 |
+
|
300 |
+
return hidden_states
|
301 |
+
|
302 |
+
|
303 |
+
class CpmBeeFFNBlock(nn.Module):
|
304 |
+
def __init__(self, config: CpmBeeConfig):
|
305 |
+
super().__init__()
|
306 |
+
self.layernorm_before_ffn = CpmBeeLayerNorm(config)
|
307 |
+
self.ffn = CpmBeeFeedForward(config)
|
308 |
+
if config.dropout_p:
|
309 |
+
self.dropout = torch.nn.Dropout(config.dropout_p)
|
310 |
+
else:
|
311 |
+
self.dropout = None
|
312 |
+
|
313 |
+
def forward(
|
314 |
+
self,
|
315 |
+
hidden_states: torch.Tensor,
|
316 |
+
):
|
317 |
+
"""
|
318 |
+
Args:
|
319 |
+
hidden_states (`torch.Tensor` of shape `(batch, len_seq, dim_model)`):
|
320 |
+
Hidden states before feed forward layer.
|
321 |
+
"""
|
322 |
+
ln_outputs = self.layernorm_before_ffn(hidden_states)
|
323 |
+
outputs = self.ffn(ln_outputs)
|
324 |
+
if self.dropout is not None:
|
325 |
+
outputs = self.dropout(outputs)
|
326 |
+
hidden_states = (hidden_states + outputs) / 1.05
|
327 |
+
return hidden_states
|
328 |
+
|
329 |
+
|
330 |
+
class CpmBeeTransformerBlock(nn.Module):
|
331 |
+
def __init__(self, config: CpmBeeConfig, mask_att: bool = False, mask_ffn: bool = False):
|
332 |
+
super().__init__()
|
333 |
+
self.mask_att = mask_att
|
334 |
+
self.mask_ffn = mask_ffn
|
335 |
+
|
336 |
+
if not self.mask_att:
|
337 |
+
self.self_att = CpmBeeSelfAttentionBlock(config)
|
338 |
+
if not self.mask_ffn:
|
339 |
+
self.ffn = CpmBeeFFNBlock(config)
|
340 |
+
|
341 |
+
def forward(
|
342 |
+
self,
|
343 |
+
hidden_states: torch.Tensor,
|
344 |
+
attention_mask: torch.Tensor,
|
345 |
+
position_bias: Optional[torch.Tensor] = None,
|
346 |
+
output_attentions: Optional[bool] = False,
|
347 |
+
past_key_values: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
|
348 |
+
use_cache: Optional[bool] = None,
|
349 |
+
):
|
350 |
+
"""
|
351 |
+
Args:
|
352 |
+
hidden_states (`torch.Tensor`):
|
353 |
+
Input to the layer of shape `(batch, seq_len, dim_model)`
|
354 |
+
attention_mask (`torch.Tensor`):
|
355 |
+
Avoid invalid areas to participate in the calculation of shape `(batch, seq_len, seq_len)`
|
356 |
+
position_bias (`torch.Tensor`):
|
357 |
+
Provides position information to attention mechanism of shape `(num_heads, seq_len, seq_len)`
|
358 |
+
output_attentions (`bool`, *optional*):
|
359 |
+
Whether or not to return the attentions tensors of all attention layers.
|
360 |
+
past_key_values (`Tuple[torch.Tensor, torch.Tensor])`, *optional*):
|
361 |
+
Cached past key and value projection states
|
362 |
+
use_cache (`bool`, *optional*):
|
363 |
+
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
|
364 |
+
(see `past_key_values`).
|
365 |
+
"""
|
366 |
+
if not self.mask_att:
|
367 |
+
hidden_states = self.self_att(
|
368 |
+
hidden_states,
|
369 |
+
attention_mask=attention_mask,
|
370 |
+
position_bias=position_bias,
|
371 |
+
output_attentions=output_attentions,
|
372 |
+
past_key_values=past_key_values,
|
373 |
+
use_cache=use_cache,
|
374 |
+
)
|
375 |
+
|
376 |
+
hidden_states, attn_weights, current_key_value = hidden_states
|
377 |
+
else:
|
378 |
+
attn_weights, current_key_value = None, (None, None)
|
379 |
+
|
380 |
+
if not self.mask_ffn:
|
381 |
+
hidden_states = self.ffn(hidden_states)
|
382 |
+
|
383 |
+
return hidden_states, attn_weights, current_key_value
|
384 |
+
|
385 |
+
|
386 |
+
class CpmBeeEncoder(nn.Module):
|
387 |
+
def __init__(self, config: CpmBeeConfig):
|
388 |
+
super().__init__()
|
389 |
+
self.num_layers = config.num_hidden_layers
|
390 |
+
if config.mask_modules is not None:
|
391 |
+
assert len(config.mask_modules) == self.num_layers, "The total number of masks should equal to num_layers"
|
392 |
+
for mask_module in config.mask_modules:
|
393 |
+
assert len(mask_module) == 2, "For encoder, each mask should be (mask_att, mask_ffn)"
|
394 |
+
else:
|
395 |
+
config.mask_modules = [(False, False)] * self.num_layers
|
396 |
+
|
397 |
+
self.layers = nn.ModuleList(
|
398 |
+
[
|
399 |
+
CpmBeeTransformerBlock(
|
400 |
+
config, mask_att=config.mask_modules[ith][0], mask_ffn=config.mask_modules[ith][1]
|
401 |
+
)
|
402 |
+
for ith in range(self.num_layers)
|
403 |
+
]
|
404 |
+
)
|
405 |
+
|
406 |
+
self.output_layernorm = CpmBeeLayerNorm(config)
|
407 |
+
|
408 |
+
def forward(
|
409 |
+
self,
|
410 |
+
hidden_states: torch.Tensor,
|
411 |
+
attention_mask: torch.Tensor,
|
412 |
+
position_bias: torch.Tensor,
|
413 |
+
output_attentions: Optional[bool] = None,
|
414 |
+
output_hidden_states: Optional[bool] = None,
|
415 |
+
past_key_values: Optional[Tuple[torch.Tensor, torch.Tensor]] = None,
|
416 |
+
use_cache: Optional[bool] = None,
|
417 |
+
):
|
418 |
+
"""
|
419 |
+
Args:
|
420 |
+
hidden_states (`torch.Tensor`):
|
421 |
+
Input to the layer of shape `(batch, seq_len, dim_model)`
|
422 |
+
attention_mask (`torch.Tensor`):
|
423 |
+
Avoid invalid areas to participate in the calculation of shape `(batch, seq_len, seq_len)`
|
424 |
+
position_bias (`torch.Tensor`):
|
425 |
+
Provides position information to attention mechanism of shape `(num_heads, seq_len, seq_len)`
|
426 |
+
output_attentions (`bool`, *optional*):
|
427 |
+
Whether or not to return the attentions tensors of all attention layers.
|
428 |
+
output_hidden_states (`bool`, *optional*):
|
429 |
+
Whether or not to return the hidden states of all layers.
|
430 |
+
past_key_values (`Tuple[torch.Tensor, torch.Tensor])`, *optional*):
|
431 |
+
Cached past key and value projection states
|
432 |
+
use_cache (`bool`, *optional*):
|
433 |
+
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
|
434 |
+
(see `past_key_values`).
|
435 |
+
"""
|
436 |
+
all_hidden_states = () if output_hidden_states else None
|
437 |
+
all_self_attns = () if output_attentions else None
|
438 |
+
current_key_values = () if use_cache else None
|
439 |
+
|
440 |
+
for i, layer in enumerate(self.layers):
|
441 |
+
if output_hidden_states:
|
442 |
+
all_hidden_states += (hidden_states,)
|
443 |
+
layer_outputs = layer(
|
444 |
+
hidden_states,
|
445 |
+
attention_mask,
|
446 |
+
position_bias,
|
447 |
+
output_attentions=output_attentions,
|
448 |
+
past_key_values=past_key_values[i] if past_key_values else None,
|
449 |
+
use_cache=use_cache,
|
450 |
+
)
|
451 |
+
hidden_states, attn_weights, current_key_value = layer_outputs
|
452 |
+
if output_attentions:
|
453 |
+
all_self_attns += (attn_weights,)
|
454 |
+
if current_key_value is not None:
|
455 |
+
current_key_values = current_key_values + (current_key_value,)
|
456 |
+
|
457 |
+
hidden_states = self.output_layernorm(hidden_states)
|
458 |
+
|
459 |
+
if output_hidden_states:
|
460 |
+
all_hidden_states += (hidden_states,)
|
461 |
+
|
462 |
+
return hidden_states, current_key_values, all_hidden_states, all_self_attns
|
463 |
+
|
464 |
+
|
465 |
+
class CpmBeeBucketPositionBias(nn.Module):
|
466 |
+
def __init__(self, config: CpmBeeConfig) -> None:
|
467 |
+
super().__init__()
|
468 |
+
|
469 |
+
self.num_heads = config.num_attention_heads
|
470 |
+
self.num_buckets = config.position_bias_num_buckets
|
471 |
+
self.num_segment_bucket = config.position_bias_num_segment_buckets
|
472 |
+
self.max_distance = config.position_bias_max_distance
|
473 |
+
|
474 |
+
self.relative_attention_bias = nn.Parameter(
|
475 |
+
torch.empty(
|
476 |
+
config.position_bias_num_buckets + config.position_bias_num_segment_buckets,
|
477 |
+
config.num_attention_heads,
|
478 |
+
dtype=config.torch_dtype,
|
479 |
+
),
|
480 |
+
)
|
481 |
+
|
482 |
+
def forward(self, query_pos: torch.Tensor, key_pos: torch.Tensor, rel_buckets: torch.Tensor):
|
483 |
+
with torch.no_grad():
|
484 |
+
batch = key_pos.size(0)
|
485 |
+
keylen = key_pos.size(1)
|
486 |
+
querylen = query_pos.size(1)
|
487 |
+
|
488 |
+
if key_pos.size(0) != query_pos.size(0):
|
489 |
+
raise AssertionError(
|
490 |
+
f"key_pos.size(0) should be equal to query_pos.size(0), but got {key_pos.size(0)} and {query_pos.size(0)}!"
|
491 |
+
)
|
492 |
+
if rel_buckets.size(0) != batch:
|
493 |
+
raise AssertionError(
|
494 |
+
f"rel_buckets.size(0) should be equal to batch, but got {rel_buckets.size(0)} and {batch}!"
|
495 |
+
)
|
496 |
+
if rel_buckets.size(1) != querylen:
|
497 |
+
raise AssertionError(
|
498 |
+
f"rel_buckets.size(1) should be equal to querylen, but got {rel_buckets.size(1)} and {querylen}!"
|
499 |
+
)
|
500 |
+
if rel_buckets.size(2) != keylen:
|
501 |
+
raise AssertionError(
|
502 |
+
f"rel_buckets.size(2) should be equal to keylen, but got {rel_buckets.size(2)} and {keylen}!"
|
503 |
+
)
|
504 |
+
|
505 |
+
relative_position_bucket = rel_buckets - 1 + self.num_buckets
|
506 |
+
|
507 |
+
inner_segment_bucket = self._position_bucket(
|
508 |
+
key_pos[..., None, :] - query_pos[..., :, None],
|
509 |
+
num_buckets=self.num_buckets,
|
510 |
+
max_distance=self.max_distance,
|
511 |
+
)
|
512 |
+
relative_position_bucket = torch.where(
|
513 |
+
rel_buckets == 0,
|
514 |
+
inner_segment_bucket,
|
515 |
+
relative_position_bucket,
|
516 |
+
)
|
517 |
+
|
518 |
+
embeds = nn.functional.embedding(relative_position_bucket, self.relative_attention_bias)
|
519 |
+
embeds = embeds.permute(0, 3, 1, 2).contiguous()
|
520 |
+
return embeds
|
521 |
+
|
522 |
+
def _position_bucket(self, relative_position, num_buckets=32, max_distance=128):
|
523 |
+
relative_buckets = 0
|
524 |
+
num_buckets //= 2
|
525 |
+
relative_buckets = (relative_position > 0).to(torch.int32) * num_buckets
|
526 |
+
relative_position = torch.abs(relative_position)
|
527 |
+
max_exact = num_buckets // 2
|
528 |
+
is_small = relative_position < max_exact
|
529 |
+
relative_postion_if_large = max_exact + (
|
530 |
+
torch.log(relative_position.float() / max_exact)
|
531 |
+
/ math.log(max_distance / max_exact)
|
532 |
+
* (num_buckets - max_exact)
|
533 |
+
).to(torch.int32)
|
534 |
+
relative_postion_if_large = torch.min(
|
535 |
+
relative_postion_if_large,
|
536 |
+
torch.full_like(relative_postion_if_large, num_buckets - 1),
|
537 |
+
)
|
538 |
+
relative_buckets += torch.where(is_small, relative_position.to(torch.int32), relative_postion_if_large)
|
539 |
+
return relative_buckets
|
540 |
+
|
541 |
+
|
542 |
+
# Copied from transformers.models.bert.modeling_bert.BertOutput with Bert->CPMBee
|
543 |
+
class CpmBeeOutput(nn.Module):
|
544 |
+
def __init__(self, config):
|
545 |
+
super().__init__()
|
546 |
+
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
|
547 |
+
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
|
548 |
+
self.dropout = nn.Dropout(config.hidden_dropout_prob)
|
549 |
+
|
550 |
+
def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor:
|
551 |
+
hidden_states = self.dense(hidden_states)
|
552 |
+
hidden_states = self.dropout(hidden_states)
|
553 |
+
hidden_states = self.LayerNorm(hidden_states + input_tensor)
|
554 |
+
return hidden_states
|
555 |
+
|
556 |
+
|
557 |
+
class CpmBeeRotaryEmbedding(nn.Module):
|
558 |
+
"""
|
559 |
+
RotaryEmbedding embeds the unk token and special token. It will embeds the "...<mask>...<mask>...<unk>...<unk>..."
|
560 |
+
to "...<mask_0>...<mask_1>...<unk_0>...<unk_1>..."" to help model to specify different special tokens and unk
|
561 |
+
tokens.
|
562 |
+
"""
|
563 |
+
|
564 |
+
def __init__(self, config: CpmBeeConfig):
|
565 |
+
super().__init__()
|
566 |
+
inv_freq = 1.0 / (10000 ** (torch.arange(0, config.hidden_size, 2, dtype=torch.float32) / config.hidden_size))
|
567 |
+
self.distance_scale = config.distance_scale
|
568 |
+
self.dtype = config.torch_dtype
|
569 |
+
self.inv_freq = inv_freq.to(config.torch_dtype)
|
570 |
+
|
571 |
+
def forward(self, x: torch.Tensor, x_pos: torch.Tensor):
|
572 |
+
inv_freq = self.inv_freq.to(device=x.device, dtype=self.dtype)
|
573 |
+
|
574 |
+
x_pos = x_pos * self.distance_scale
|
575 |
+
freqs = x_pos[..., None].to(self.dtype) * inv_freq[None, :] # (..., dim/2)
|
576 |
+
|
577 |
+
emb = torch.cat((freqs, freqs), dim=-1) # (..., dim)
|
578 |
+
emb_cos = emb.cos() # (..., dim)
|
579 |
+
emb_sin = emb.sin() # (..., dim)
|
580 |
+
|
581 |
+
rotate_x = torch.cat([-x[..., x.size(-1) // 2 :], x[..., : x.size(-1) // 2]], dim=-1) # (..., dim)
|
582 |
+
|
583 |
+
return x * emb_cos + rotate_x * emb_sin
|
584 |
+
|
585 |
+
|
586 |
+
class CpmBeeEmbeddingExt(nn.Embedding):
|
587 |
+
"""
|
588 |
+
Contains a RotaryEmbedding.
|
589 |
+
"""
|
590 |
+
|
591 |
+
def __init__(self, config: CpmBeeConfig):
|
592 |
+
super().__init__(config.vocab_size, config.hidden_size, dtype=config.torch_dtype)
|
593 |
+
self.dim_model = config.hidden_size
|
594 |
+
self.rotary_emb = CpmBeeRotaryEmbedding(config)
|
595 |
+
|
596 |
+
def forward(self, ids: torch.Tensor, ids_sub: torch.Tensor):
|
597 |
+
embeds = super().forward(ids) / math.sqrt(self.dim_model)
|
598 |
+
return self.rotary_emb(embeds, ids_sub)
|
599 |
+
|
600 |
+
def projection(self, x: torch.Tensor, ext_table: Optional[torch.Tensor] = None):
|
601 |
+
logits = nn.functional.linear(x / math.sqrt(self.dim_model), self.weight)
|
602 |
+
if ext_table is not None:
|
603 |
+
logits_ext = nn.functional.linear(x, ext_table)
|
604 |
+
logits = torch.cat([logits, logits_ext], dim=-1)
|
605 |
+
return logits
|
606 |
+
|
607 |
+
|
608 |
+
class CpmBeePreTrainedModel(PreTrainedModel):
|
609 |
+
"""
|
610 |
+
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
|
611 |
+
models.
|
612 |
+
"""
|
613 |
+
|
614 |
+
config_class = CpmBeeConfig
|
615 |
+
base_model_prefix = "cpmbee"
|
616 |
+
supports_gradient_checkpointing = True
|
617 |
+
_keys_to_ignore_on_load_missing = [r"position_ids"]
|
618 |
+
|
619 |
+
def _init_weights(self, module):
|
620 |
+
"""Initialize the weights"""
|
621 |
+
if isinstance(module, nn.Linear):
|
622 |
+
module.weight.data.normal_(mean=0.0, std=self.config.init_std)
|
623 |
+
if module.bias is not None:
|
624 |
+
module.bias.data.zero_()
|
625 |
+
# still needed
|
626 |
+
elif isinstance(module, CpmBeeEmbeddingExt):
|
627 |
+
module.weight.data.normal_(mean=0.0, std=self.config.init_std)
|
628 |
+
elif isinstance(module, nn.LayerNorm):
|
629 |
+
module.bias.data.zero_()
|
630 |
+
module.weight.data.fill_(1.0)
|
631 |
+
elif isinstance(module, CpmBeeLayerNorm):
|
632 |
+
module.weight.data.fill_(1.0)
|
633 |
+
elif isinstance(module, CpmBeeBucketPositionBias):
|
634 |
+
module.relative_attention_bias.data.normal_(mean=0.0, std=self.config.init_std)
|
635 |
+
|
636 |
+
def _set_gradient_checkpointing(self, module, value=False):
|
637 |
+
if isinstance(module, CpmBeeEncoder):
|
638 |
+
module.gradient_checkpointing = value
|
639 |
+
|
640 |
+
|
641 |
+
CPMBEE_START_DOCSTRING = r"""
|
642 |
+
This model is a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) sub-class. Use
|
643 |
+
it as a regular PyTorch Module and refer to the PyTorch documentation for all matter related to general usage and
|
644 |
+
behavior.
|
645 |
+
|
646 |
+
Parameters
|
647 |
+
config ([`~CpmBeeConfig`]): Model configuration class with all the parameters of the
|
648 |
+
Initializing with a config file does not load the weights associated with the model, only the
|
649 |
+
configuration. Check out the [`~PreTrainedModel.from_pretrained`] method to load the model weights.
|
650 |
+
"""
|
651 |
+
|
652 |
+
CPMBEE_INPUTS_DOCSTRING = r"""
|
653 |
+
Args:
|
654 |
+
input_ids (`torch.Tensor` of shape `(batch_size, seq_len)`):
|
655 |
+
Indices of input sequence tokens in the vocabulary.
|
656 |
+
|
657 |
+
Indices can be obtained using [`CPMBeeTokenizer`]. See [`PreTrainedTokenizer.encode`] and
|
658 |
+
[`PreTrainedTokenizer.__call__`] for details.
|
659 |
+
|
660 |
+
[What are input IDs?](../glossary#input-ids)
|
661 |
+
input_id_sub (`torch.Tensor` of shape `(batch_size, seq_len)`):
|
662 |
+
Subscription of input sequence tokens in the vocabulary.
|
663 |
+
|
664 |
+
Subscription of normal text will be zero while the special tokens of each group will be the 0, 1, 2, ...
|
665 |
+
<ans_0>, <ans_1>, <ans_2> ... belongs to group <ans>. <mask_0>, <mask_1>, <mask_2> ... belongs to group
|
666 |
+
<mask>.
|
667 |
+
position (`torch.Tensor` of shape `(batch_size, seq_len)`):
|
668 |
+
The position of input sequence tokens in the vocabulary for each segment. if segment1 is 0, 1, 2 and
|
669 |
+
segment2 is 0, 1, 2, 3, the position will be 0, 1, 2, 0, 1, 2, 3
|
670 |
+
context (`torch.Tensor` of shape `(batch_size, seq_len)`):
|
671 |
+
Whether this token id is context or not. If is context, the value is 1. If not, the value is 0. If a token
|
672 |
+
id is context, it does not need to be predicted.
|
673 |
+
sample_ids (`torch.Tensor` of shape `(batch_size, seq_len)`):
|
674 |
+
Give a sample id to every token id. The token ids with same sample ids belongs to the same sample.
|
675 |
+
num_segments (`torch.Tensor` of shape `(batch_size, seq_len)`):
|
676 |
+
Total number of segments in the current input.
|
677 |
+
segment (`torch.Tensor` of shape `(batch_size, seq_len)`):
|
678 |
+
Give a segment id to every token id. The token ids with same segment ids belongs to the same sample.
|
679 |
+
|
680 |
+
Generally, a string key or value in input data will be a segment. For example, input {"input": "hello, ",
|
681 |
+
"<ans>": ""}, the segments includes: "input", "hello, ", "<ans>" and "".
|
682 |
+
segment_rel_offset (`torch.Tensor` of shape `(batch_size, seq_len)`):
|
683 |
+
The offset of segment rel.
|
684 |
+
segment_rel (`torch.Tensor` of shape `(batch_size, seq_len)`):
|
685 |
+
The segment relevance. A relative implementation of measuring the importance of segments.
|
686 |
+
past_states (`Dict[str, Union[torch.Tensor, List]]`):
|
687 |
+
Store the history information including position, context, sample_ids, num_segments, segment and
|
688 |
+
past_key_values.
|
689 |
+
output_attentions (`bool`, *optional*):
|
690 |
+
Whether or not to return the attentions tensors of all attention layers.
|
691 |
+
output_hidden_states (`bool`, *optional*):
|
692 |
+
Whether or not to return the hidden states of all layers.
|
693 |
+
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
|
694 |
+
A dummy arguments for CPMBee. The `past_states` contains pre-computed hidden-states (key and values in the
|
695 |
+
self-attention blocks and in the cross-attention blocks) that can be used (see `past_key_values` input) and
|
696 |
+
other history arguments to speed up sequential decoding.
|
697 |
+
use_cache (`bool`, *optional*):
|
698 |
+
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
|
699 |
+
`past_key_values`).
|
700 |
+
labels (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
|
701 |
+
Labels for computing the masked language modeling loss.
|
702 |
+
return_dict (`bool`, *optional*):
|
703 |
+
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
|
704 |
+
"""
|
705 |
+
|
706 |
+
|
707 |
+
@add_start_docstrings(
|
708 |
+
"The bare CPMBee Model outputting raw hidden-states without any specific head on top.",
|
709 |
+
CPMBEE_START_DOCSTRING,
|
710 |
+
)
|
711 |
+
class CpmBeeModel(CpmBeePreTrainedModel):
|
712 |
+
def __init__(self, config: CpmBeeConfig):
|
713 |
+
super().__init__(config)
|
714 |
+
if config.half:
|
715 |
+
config.torch_dtype = torch.half
|
716 |
+
else:
|
717 |
+
config.torch_dtype = torch.float
|
718 |
+
self.encoder = CpmBeeEncoder(config)
|
719 |
+
self.input_embedding = CpmBeeEmbeddingExt(config)
|
720 |
+
self.position_bias = CpmBeeBucketPositionBias(config)
|
721 |
+
self.vocab_size = config.vocab_size
|
722 |
+
self.post_init()
|
723 |
+
|
724 |
+
def get_input_embeddings(self):
|
725 |
+
return self.input_embedding
|
726 |
+
|
727 |
+
def set_input_embeddings(self, embeddings, **kwargs):
|
728 |
+
self.input_embedding = embeddings
|
729 |
+
|
730 |
+
@add_start_docstrings_to_model_forward(CPMBEE_INPUTS_DOCSTRING)
|
731 |
+
@add_code_sample_docstrings(
|
732 |
+
checkpoint=_CHECKPOINT_FOR_DOC,
|
733 |
+
output_type=BaseModelOutputWithPast,
|
734 |
+
config_class=_CONFIG_FOR_DOC,
|
735 |
+
)
|
736 |
+
def forward(
|
737 |
+
self,
|
738 |
+
input_ids: torch.Tensor,
|
739 |
+
input_id_sub: Optional[torch.Tensor] = None,
|
740 |
+
position: Optional[torch.Tensor] = None,
|
741 |
+
context: Optional[torch.Tensor] = None,
|
742 |
+
        sample_ids: Optional[torch.Tensor] = None,
        num_segments: Optional[torch.Tensor] = None,
        segment: Optional[torch.Tensor] = None,
        segment_rel_offset: Optional[torch.Tensor] = None,
        segment_rel: Optional[torch.Tensor] = None,
        past_states: Optional[Dict] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        past_key_values: Optional[List] = None,
        use_cache: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        **kwargs,
    ):
        output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
        output_hidden_states = (
            output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
        )
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        use_cache = use_cache if use_cache is not None else self.config.use_cache

        # dummy setting for common tests
        if input_id_sub is None:
            dtype, device = input_ids.dtype, input_ids.device
            batch, seq_length = input_ids.size()
            segment = torch.where(input_ids != 0, 2, 0).to(dtype=dtype, device=device)
            context = torch.full((batch, seq_length), 1, dtype=dtype, device=device)
            position = torch.arange(seq_length, dtype=dtype, device=device).repeat(batch, 1)
            input_id_sub = torch.full((batch, seq_length), 0, dtype=dtype, device=device)
            segment_rel_offset = torch.full((batch, seq_length), 0, dtype=dtype, device=device)
            segment_rel = torch.full((batch, seq_length), 0, dtype=dtype, device=device)
            num_segments = torch.full((batch, seq_length), 0, dtype=dtype, device=device)
            sample_ids = torch.zeros_like(input_ids)

        with torch.no_grad():
            if past_states is None:
                present_position = position
                present_context = context
                present_sample_ids = sample_ids
                present_num_segments = num_segments
                present_segments = segment
                present_buffer = None
            else:
                present_position = torch.cat([past_states["buffer_position"], position], dim=-1)
                present_context = torch.cat([past_states["buffer_context"], context], dim=-1)
                present_sample_ids = torch.cat([past_states["buffer_sample_ids"], sample_ids], dim=-1)
                present_num_segments = torch.cat([past_states["buffer_num_segments"], num_segments], dim=-1)
                present_segments = torch.cat([past_states["buffer_segments"], segment], dim=-1)
                present_buffer = past_states["buffer"]

            batch = input_ids.size(0)
            len_q = input_ids.size(1)
            len_buffer = present_position.size(1)

            segment_rel_2d = torch.masked_fill(
                segment[:, :, None] * num_segments[:, :, None]
                + present_segments[:, None, :]
                + segment_rel_offset[:, :, None],
                ~((sample_ids[:, :, None] == present_sample_ids[:, None, :])),  # not in the same sample
                0,  # avoid torch.gather overflow
            ).view(batch, len_q * len_buffer)

            segment_bucket = torch.gather(
                input=segment_rel,
                dim=1,
                index=segment_rel_2d.long(),
            ).view(batch, len_q, len_buffer)

            segment_bucket.masked_fill_(
                ~((sample_ids[:, :, None] == present_sample_ids[:, None, :])),  # not in the same span or sample
                1,  # bucket is used for in-context samples
            )

            # directional mask
            directional_mask_2d = present_position[:, None, :] <= position[:, :, None]
            # sample mask
            sample_mask_2d = (sample_ids[:, :, None] == 0) | (sample_ids[:, :, None] == present_sample_ids[:, None, :])
            # context mask
            attention_mask = present_context[:, None, :] | (
                context[:, :, None].logical_not() & directional_mask_2d.view(batch, len_q, len_buffer)
            )
            # span mask
            attention_mask = attention_mask & sample_mask_2d
            # length mask
            mask_1d = present_num_segments != 0
            attention_mask = mask_1d.view(batch, 1, len_buffer) & attention_mask

        hidden_states = self.input_embedding(input_ids, input_id_sub)
        position_bias = self.position_bias(position, present_position, segment_bucket)
        hidden_states, present_key_values, all_hidden_states, all_attentions = self.encoder(
            hidden_states,
            attention_mask,
            position_bias,
            output_attentions,
            output_hidden_states,
            present_buffer,
            use_cache,
        )

        if not return_dict:
            return tuple(
                v for v in [hidden_states, present_key_values, all_hidden_states, all_attentions] if v is not None
            )

        return BaseModelOutputWithPast(
            last_hidden_state=hidden_states,
            past_key_values=present_key_values,
            hidden_states=all_hidden_states,
            attentions=all_attentions,
        )


class CpmBeeBeamHypotheses(BeamHypotheses):
    def __init__(self, num_beams: int, length_penalty: float, early_stopping: bool, max_length: Optional[int] = None):
        """
        Override BeamHypotheses for CpmBee. The hyp to add is a list rather than a tensor.
        """
        super().__init__(num_beams, length_penalty, early_stopping, max_length)

    def add(self, hyp: List, sum_logprobs: float, beam_indices: Optional[torch.LongTensor] = None):
        """
        Add a new hypothesis to the list.
        """
        score = sum_logprobs / (len(hyp) ** self.length_penalty)
        if len(self) < self.num_beams or score > self.worst_score:
            self.beams.append((score, hyp, beam_indices))
            if len(self) > self.num_beams:
                sorted_next_scores = sorted([(s, idx) for idx, (s, _, _) in enumerate(self.beams)])
                del self.beams[sorted_next_scores[0][1]]
                self.worst_score = sorted_next_scores[1][0]
            else:
                self.worst_score = min(score, self.worst_score)


class CPMBeeTransBlock(torch.nn.Module):
    def __init__(
        self,
        dim_model=4096,
        dim_ff=1024,
        dim_out=768,
        dtype=torch.float,
        eps=1e-6,
        dropout_p=0,
    ):
        super().__init__()
        if dropout_p is not None:
            self.dropout = torch.nn.Dropout(dropout_p)
        else:
            self.dropout = None
        self.w_out_res = torch.nn.Linear(dim_model, dim_out, bias=False)
        self.layernorm = torch.nn.LayerNorm(
            dim_out,
            dtype=dtype,
            eps=eps,
        )

    def forward(self, hidden_states: torch.Tensor):
        x_res = self.w_out_res(hidden_states)
        if self.dropout is not None:
            x_res = self.dropout(x_res)
        hidden_states = self.layernorm(x_res)
        return hidden_states


class CpmBeeWithTransform(CpmBeePreTrainedModel):
    _keys_to_ignore_on_load_missing = [r"lm_head.weight"]

    def __init__(self, config: CpmBeeConfig):
        super().__init__(config)
        self.llm = CpmBeeModel(config)

        self.trans_block = CPMBeeTransBlock(config.hidden_size, config.hidden_size // 4, config.unet_cross_attention_dim)

    def forward(
        self,
        input_ids: torch.Tensor,
        input_id_sub: Optional[torch.Tensor] = None,
        position: Optional[torch.Tensor] = None,
        context: Optional[torch.Tensor] = None,
        sample_ids: Optional[torch.Tensor] = None,
        num_segments: Optional[torch.Tensor] = None,
        segment: Optional[torch.Tensor] = None,
        segment_rel_offset: Optional[torch.Tensor] = None,
        segment_rel: Optional[torch.Tensor] = None,
        past_states: Optional[Dict] = None,
        output_attentions: Optional[bool] = None,
        output_hidden_states: Optional[bool] = None,
        past_key_values: Optional[List] = None,
        use_cache: Optional[bool] = None,
        return_dict: Optional[bool] = None,
        **kwargs,
    ):
        outputs = self.llm(input_ids, input_id_sub, position, context,
                           sample_ids, num_segments, segment, segment_rel_offset,
                           segment_rel, past_states, output_attentions, output_hidden_states,
                           past_key_values, use_cache, return_dict, **kwargs,)
        if return_dict:
            hidden_states = outputs.last_hidden_state
        else:
            hidden_states = outputs[0]
        # if self.trans_block is not None:
        #     hidden_states = self.trans_block(hidden_states)
        return outputs, hidden_states
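The pieces above wire CPM-Bee into the diffusion stack: `CpmBeeModel.forward` builds the segment- and sample-aware attention mask and returns the LLM's last hidden states, and `CpmBeeWithTransform` pairs that LLM with `CPMBeeTransBlock`, a linear-plus-LayerNorm head that maps the LLM width down to the UNet cross-attention width. The following is a minimal sketch of just the projection step with dummy tensors; the sizes are illustrative assumptions, not values read from this repository's config files.

import torch

# Illustrative sizes only -- a real checkpoint takes hidden_size and
# unet_cross_attention_dim from its config.
batch, seq_len, dim_model, dim_out = 2, 64, 4096, 768

trans_block = CPMBeeTransBlock(dim_model=dim_model, dim_ff=dim_model // 4, dim_out=dim_out)
llm_hidden = torch.randn(batch, seq_len, dim_model)      # stands in for the LLM's last_hidden_state
unet_context = trans_block(llm_hidden)                   # linear projection followed by LayerNorm
assert unet_context.shape == (batch, seq_len, dim_out)   # shaped for UNet encoder_hidden_states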
pipeline_stable_diffusion.py
ADDED
@@ -0,0 +1,723 @@
# Copyright 2023 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import inspect
import warnings
from typing import Any, Callable, Dict, List, Optional, Union, Tuple
import numpy as np

import torch
from torch.utils.data.dataloader import default_collate
from packaging import version
from transformers import CLIPImageProcessor, CLIPTextModel, CLIPTokenizer

from diffusers.configuration_utils import FrozenDict
from diffusers.image_processor import VaeImageProcessor
from diffusers.loaders import FromSingleFileMixin, LoraLoaderMixin, TextualInversionLoaderMixin
from diffusers.models import AutoencoderKL, UNet2DConditionModel
from diffusers.schedulers import KarrasDiffusionSchedulers
from diffusers.utils import (
    deprecate,
    is_accelerate_available,
    is_accelerate_version,
    logging,
    randn_tensor,
    replace_example_docstring,
)
from diffusers.pipeline_utils import DiffusionPipeline
from diffusers.pipelines.stable_diffusion.safety_checker import StableDiffusionSafetyChecker

from diffusers.pipelines.stable_diffusion import StableDiffusionPipelineOutput
from diffusers.pipelines.stable_diffusion.pipeline_stable_diffusion import rescale_noise_cfg, StableDiffusionPipeline
from .modeling_cpmbee import CpmBeeModel
from .tokenization_viscpmbee import VisCpmBeeTokenizer

logger = logging.get_logger(__name__)  # pylint: disable=invalid-name


def pad(orig_items, key, max_length=None, padding_value=0, padding_side="left"):
    items = []
    if isinstance(orig_items[0][key], list):
        assert isinstance(orig_items[0][key][0], torch.Tensor)
        for it in orig_items:
            for tr in it[key]:
                items.append({key: tr})
    else:
        assert isinstance(orig_items[0][key], torch.Tensor)
        items = orig_items

    batch_size = len(items)
    shape = items[0][key].shape
    dim = len(shape)
    assert dim <= 3
    if max_length is None:
        max_length = 0
    max_length = max(max_length, max(item[key].shape[-1] for item in items))
    min_length = min(item[key].shape[-1] for item in items)
    dtype = items[0][key].dtype

    if dim == 1:
        return torch.cat([item[key] for item in items], dim=0)
    elif dim == 2:
        if max_length == min_length:
            return torch.cat([item[key] for item in items], dim=0)
        tensor = torch.zeros((batch_size, max_length), dtype=dtype) + padding_value
    else:
        tensor = torch.zeros((batch_size, max_length, shape[-1]), dtype=dtype) + padding_value

    for i, item in enumerate(items):
        if dim == 2:
            if padding_side == "left":
                tensor[i, -len(item[key][0]):] = item[key][0].clone()
            else:
                tensor[i, : len(item[key][0])] = item[key][0].clone()
        elif dim == 3:
            if padding_side == "left":
                tensor[i, -len(item[key][0]):, :] = item[key][0].clone()
            else:
                tensor[i, : len(item[key][0]), :] = item[key][0].clone()

    return tensor


class CPMBeeCollater:
    """
    Collate function for CPM-Bee model inputs, corresponding to the _MixedDatasetBatchPacker in cpm-live.
    The native torch Dataloader is currently not a good fit for in-context-learning, and the best_fit step
    of the original implementation (used to maximize the ratio of effective tokens) is not supported here either.
    todo: rewrite the Dataloader or BatchPacker
    """

    def __init__(self, tokenizer: VisCpmBeeTokenizer, max_len):
        self.tokenizer = tokenizer
        self._max_length = max_len
        self.pad_keys = ['input_ids', 'input_id_subs', 'context', 'segment_ids', 'segment_rel_offset',
                         'segment_rel', 'sample_ids', 'num_segments']

    def __call__(self, batch):
        batch_size = len(batch)

        tgt = np.full((batch_size, self._max_length), -100, dtype=np.int32)
        # no best_fit packing yet, so span is all zeros
        span = np.zeros((batch_size, self._max_length), dtype=np.int32)
        length = np.zeros((batch_size,), dtype=np.int32)

        batch_ext_table_map: Dict[Tuple[int, int], int] = {}
        batch_ext_table_ids: List[int] = []
        batch_ext_table_sub: List[int] = []
        raw_data_list: List[Any] = []

        for i in range(batch_size):
            instance_length = batch[i]['input_ids'][0].shape[0]
            length[i] = instance_length
            raw_data_list.extend(batch[i]['raw_data'])

            for j in range(instance_length):
                idx, idx_sub = batch[i]['input_ids'][0, j], batch[i]['input_id_subs'][0, j]
                tgt_idx = idx
                if idx_sub > 0:
                    # need to be in ext table
                    if (idx, idx_sub) not in batch_ext_table_map:
                        batch_ext_table_map[(idx, idx_sub)] = len(batch_ext_table_map)
                        batch_ext_table_ids.append(idx)
                        batch_ext_table_sub.append(idx_sub)
                    tgt_idx = batch_ext_table_map[(idx, idx_sub)] + self.tokenizer.vocab_size
                if j > 1 and batch[i]['context'][0, j - 1] == 0:
                    if idx != self.tokenizer.bos_id:
                        tgt[i, j - 1] = tgt_idx
                    else:
                        tgt[i, j - 1] = self.tokenizer.eos_id
            if batch[i]['context'][0, instance_length - 1] == 0:
                tgt[i, instance_length - 1] = self.tokenizer.eos_id

        if len(batch_ext_table_map) == 0:
            # placeholder
            batch_ext_table_ids.append(0)
            batch_ext_table_sub.append(1)

        # image
        if 'pixel_values' in batch[0]:
            data = {'pixel_values': default_collate([i['pixel_values'] for i in batch])}
        else:
            data = {}

        # image_bound
        if 'image_bound' in batch[0]:
            data['image_bound'] = default_collate([i['image_bound'] for i in batch])

        # bee inp
        for key in self.pad_keys:
            data[key] = pad(batch, key, max_length=self._max_length, padding_value=0, padding_side='right')

        data['context'] = data['context'] > 0
        data['length'] = torch.from_numpy(length)
        data['span'] = torch.from_numpy(span)
        data['target'] = torch.from_numpy(tgt)
        data['ext_table_ids'] = torch.from_numpy(np.array(batch_ext_table_ids))
        data['ext_table_sub'] = torch.from_numpy(np.array(batch_ext_table_sub))
        data['raw_data'] = raw_data_list

        return data

class VisCPMPaintBeePipeline(StableDiffusionPipeline):
    _optional_components = ["safety_checker", "feature_extractor"]

    def __init__(
        self,
        vae: AutoencoderKL,
        text_encoder: CpmBeeModel,
        tokenizer: VisCpmBeeTokenizer,
        unet: UNet2DConditionModel,
        scheduler: KarrasDiffusionSchedulers,
        safety_checker: StableDiffusionSafetyChecker,
        feature_extractor: CLIPImageProcessor,
        requires_safety_checker: bool = True,
    ):
        super().__init__(
            vae=vae,
            text_encoder=text_encoder,
            tokenizer=tokenizer,
            unet=unet,
            scheduler=scheduler,
            safety_checker=safety_checker,
            feature_extractor=feature_extractor,
            requires_safety_checker=requires_safety_checker
        )

    def build_input(
        self,
        prompt: str,
        negative_prompt: Optional[str] = None,
        image_size: int = 512
    ):
        data_input = {'caption': prompt, 'objects': ''}
        (
            input_ids,
            input_id_subs,
            context,
            segment_ids,
            segment_rel,
            n_segments,
            table_states,
            image_bound
        ) = self.tokenizer.convert_data_to_id(data=data_input, shuffle_answer=False, max_depth=8)
        sample_ids = np.zeros(input_ids.shape, dtype=np.int32)
        segment_rel_offset = np.zeros(input_ids.shape, dtype=np.int32)
        num_segments = np.full(input_ids.shape, n_segments, dtype=np.int32)
        data = {
            'pixel_values': torch.zeros(3, image_size, image_size).unsqueeze(0),
            'input_ids': torch.from_numpy(input_ids).unsqueeze(0),
            'input_id_subs': torch.from_numpy(input_id_subs).unsqueeze(0),
            'context': torch.from_numpy(context).unsqueeze(0),
            'segment_ids': torch.from_numpy(segment_ids).unsqueeze(0),
            'segment_rel_offset': torch.from_numpy(segment_rel_offset).unsqueeze(0),
            'segment_rel': torch.from_numpy(segment_rel).unsqueeze(0),
            'sample_ids': torch.from_numpy(sample_ids).unsqueeze(0),
            'num_segments': torch.from_numpy(num_segments).unsqueeze(0),
            'image_bound': image_bound,
            'raw_data': prompt,
        }

        uncond_data_input = {
            'caption': "" if negative_prompt is None else negative_prompt,
            'objects': ''
        }
        (
            input_ids,
            input_id_subs,
            context,
            segment_ids,
            segment_rel,
            n_segments,
            table_states,
            image_bound
        ) = self.tokenizer.convert_data_to_id(data=uncond_data_input, shuffle_answer=False, max_depth=8)
        sample_ids = np.zeros(input_ids.shape, dtype=np.int32)
        segment_rel_offset = np.zeros(input_ids.shape, dtype=np.int32)
        num_segments = np.full(input_ids.shape, n_segments, dtype=np.int32)
        uncond_data = {
            'pixel_values': torch.zeros(3, image_size, image_size).unsqueeze(0),
            'input_ids': torch.from_numpy(input_ids).unsqueeze(0),
            'input_id_subs': torch.from_numpy(input_id_subs).unsqueeze(0),
            'context': torch.from_numpy(context).unsqueeze(0),
            'segment_ids': torch.from_numpy(segment_ids).unsqueeze(0),
            'segment_rel_offset': torch.from_numpy(segment_rel_offset).unsqueeze(0),
            'segment_rel': torch.from_numpy(segment_rel).unsqueeze(0),
            'sample_ids': torch.from_numpy(sample_ids).unsqueeze(0),
            'num_segments': torch.from_numpy(num_segments).unsqueeze(0),
            'image_bound': image_bound,
            'raw_data': "" if negative_prompt is None else negative_prompt,
        }
        packer = CPMBeeCollater(
            tokenizer=self.tokenizer,
            max_len=max(data['input_ids'].size(-1), uncond_data['input_ids'].size(-1))
        )
        data = packer([data])
        uncond_data = packer([uncond_data])
        return data, uncond_data

    def _encode_prompt(
        self,
        prompt,
        device,
        num_images_per_prompt,
        do_classifier_free_guidance,
        negative_prompt=None,
        prompt_embeds: Optional[torch.FloatTensor] = None,
        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
        lora_scale: Optional[float] = None,
    ):
        r"""
        Encodes the prompt into text encoder hidden states.

        Args:
            prompt (`str` or `List[str]`, *optional*):
                prompt to be encoded
            device: (`torch.device`):
                torch device
            num_images_per_prompt (`int`):
                number of images that should be generated per prompt
            do_classifier_free_guidance (`bool`):
                whether to use classifier free guidance or not
            negative_prompt (`str` or `List[str]`, *optional*):
                The prompt or prompts not to guide the image generation. If not defined, one has to pass
                `negative_prompt_embeds` instead. Ignored when not using guidance (i.e., ignored if `guidance_scale` is
                less than `1`).
            prompt_embeds (`torch.FloatTensor`, *optional*):
                Pre-generated text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt weighting. If not
                provided, text embeddings will be generated from `prompt` input argument.
            negative_prompt_embeds (`torch.FloatTensor`, *optional*):
                Pre-generated negative text embeddings. Can be used to easily tweak text inputs, *e.g.* prompt
                weighting. If not provided, negative_prompt_embeds will be generated from `negative_prompt` input
                argument.
            lora_scale (`float`, *optional*):
                A lora scale that will be applied to all LoRA layers of the text encoder if LoRA layers are loaded.
        """
        # set lora scale so that monkey patched LoRA
        # function of text encoder can correctly access it
        if lora_scale is not None and isinstance(self, LoraLoaderMixin):
            self._lora_scale = lora_scale

        data, uncond_data = self.build_input(prompt, negative_prompt, image_size=512)
        for key, value in data.items():
            if isinstance(value, torch.Tensor):
                data[key] = value.to(self.device)
        for key, value in uncond_data.items():
            if isinstance(value, torch.Tensor):
                uncond_data[key] = value.to(self.device)

        batch, seq_length = data['input_ids'].size()
        dtype, device = data['input_ids'].dtype, data['input_ids'].device
        data['position'] = torch.arange(seq_length, dtype=dtype, device=device).repeat(batch, 1)

        batch, seq_length = uncond_data['input_ids'].size()
        dtype, device = uncond_data['input_ids'].dtype, uncond_data['input_ids'].device
        uncond_data['position'] = torch.arange(seq_length, dtype=dtype, device=device).repeat(batch, 1)

        with torch.no_grad():
            # llm_hidden_state = self.text_encoder.llm.input_embedding(data['input_ids'], data['input_id_subs'])
            _, hidden_states = self.text_encoder(
                input_ids=data['input_ids'],
                input_id_sub=data['input_id_subs'],
                position=data['position'],
                # length=data['length'],
                context=data['context'],
                sample_ids=data['sample_ids'],
                num_segments=data['num_segments'],
                segment=data['segment_ids'],
                segment_rel_offset=data['segment_rel_offset'],
                segment_rel=data['segment_rel'],
                # span=data['span'],
                # ext_table_ids=data['ext_table_ids'],
                # ext_table_sub=data['ext_table_sub'],
                # hidden_states=llm_hidden_state
            )

        with torch.no_grad():
            # uncond_llm_hidden_state = self.text_encoder.llm.input_embedding(uncond_data['input_ids'], uncond_data['input_id_subs'])
            _, uncond_hidden_states = self.text_encoder(
                input_ids=uncond_data['input_ids'],
                input_id_sub=uncond_data['input_id_subs'],
                position=uncond_data['position'],
                # length=uncond_data['length'],
                context=uncond_data['context'],
                sample_ids=uncond_data['sample_ids'],
                num_segments=uncond_data['num_segments'],
                segment=uncond_data['segment_ids'],
                segment_rel_offset=uncond_data['segment_rel_offset'],
                segment_rel=uncond_data['segment_rel'],
                # span=uncond_data['span'],
                # ext_table_ids=uncond_data['ext_table_ids'],
                # ext_table_sub=uncond_data['ext_table_sub'],
                # hidden_states=uncond_llm_hidden_state
            )

        text_hidden_states, uncond_text_hidden_states = hidden_states, uncond_hidden_states
        if self.text_encoder.trans_block is not None:
            text_hidden_states = self.text_encoder.trans_block(text_hidden_states)
            uncond_text_hidden_states = self.text_encoder.trans_block(uncond_text_hidden_states)
        bs_embed, seq_len, _ = text_hidden_states.shape
        text_hidden_states = text_hidden_states.repeat(1, num_images_per_prompt, 1)
        text_hidden_states = text_hidden_states.view(bs_embed * num_images_per_prompt, seq_len, -1)

        bs_embed, seq_len, _ = uncond_text_hidden_states.shape
        uncond_text_hidden_states = uncond_text_hidden_states.repeat(1, num_images_per_prompt, 1)
        uncond_text_hidden_states = uncond_text_hidden_states.view(bs_embed * num_images_per_prompt, seq_len, -1)

        prompt_embeds = torch.cat([uncond_text_hidden_states, text_hidden_states])
        return prompt_embeds

        # if prompt is not None and isinstance(prompt, str):
        #     batch_size = 1
        # elif prompt is not None and isinstance(prompt, list):
        #     batch_size = len(prompt)
        # else:
        #     batch_size = prompt_embeds.shape[0]

        # if prompt_embeds is None:
        #     # textual inversion: procecss multi-vector tokens if necessary
        #     if isinstance(self, TextualInversionLoaderMixin):
        #         prompt = self.maybe_convert_prompt(prompt, self.tokenizer)

        #     text_inputs = self.tokenizer(
        #         prompt,
        #         padding="max_length",
        #         max_length=self.tokenizer.model_max_length,
        #         truncation=True,
        #         return_tensors="pt",
        #     )
        #     text_input_ids = text_inputs.input_ids
        #     untruncated_ids = self.tokenizer(prompt, padding="longest", return_tensors="pt").input_ids

        #     if untruncated_ids.shape[-1] >= text_input_ids.shape[-1] and not torch.equal(
        #         text_input_ids, untruncated_ids
        #     ):
        #         removed_text = self.tokenizer.batch_decode(
        #             untruncated_ids[:, self.tokenizer.model_max_length - 1 : -1]
        #         )
        #         logger.warning(
        #             "The following part of your input was truncated because CLIP can only handle sequences up to"
        #             f" {self.tokenizer.model_max_length} tokens: {removed_text}"
        #         )

        #     if hasattr(self.text_encoder.config, "use_attention_mask") and self.text_encoder.config.use_attention_mask:
        #         attention_mask = text_inputs.attention_mask.to(device)
        #     else:
        #         attention_mask = None

        #     prompt_embeds = self.text_encoder(
        #         text_input_ids.to(device),
        #         attention_mask=attention_mask,
        #     )
        #     prompt_embeds = prompt_embeds[0]

        # prompt_embeds = prompt_embeds.to(dtype=self.text_encoder.dtype, device=device)

        # bs_embed, seq_len, _ = prompt_embeds.shape
        # # duplicate text embeddings for each generation per prompt, using mps friendly method
        # prompt_embeds = prompt_embeds.repeat(1, num_images_per_prompt, 1)
        # prompt_embeds = prompt_embeds.view(bs_embed * num_images_per_prompt, seq_len, -1)

        # # get unconditional embeddings for classifier free guidance
        # if do_classifier_free_guidance and negative_prompt_embeds is None:
        #     uncond_tokens: List[str]
        #     if negative_prompt is None:
        #         uncond_tokens = [""] * batch_size
        #     elif prompt is not None and type(prompt) is not type(negative_prompt):
        #         raise TypeError(
        #             f"`negative_prompt` should be the same type to `prompt`, but got {type(negative_prompt)} !="
        #             f" {type(prompt)}."
        #         )
        #     elif isinstance(negative_prompt, str):
        #         uncond_tokens = [negative_prompt]
        #     elif batch_size != len(negative_prompt):
        #         raise ValueError(
        #             f"`negative_prompt`: {negative_prompt} has batch size {len(negative_prompt)}, but `prompt`:"
        #             f" {prompt} has batch size {batch_size}. Please make sure that passed `negative_prompt` matches"
        #             " the batch size of `prompt`."
        #         )
        #     else:
        #         uncond_tokens = negative_prompt

        #     # textual inversion: procecss multi-vector tokens if necessary
        #     if isinstance(self, TextualInversionLoaderMixin):
        #         uncond_tokens = self.maybe_convert_prompt(uncond_tokens, self.tokenizer)

        #     max_length = prompt_embeds.shape[1]
        #     uncond_input = self.tokenizer(
        #         uncond_tokens,
        #         padding="max_length",
        #         max_length=max_length,
        #         truncation=True,
        #         return_tensors="pt",
        #     )

        #     if hasattr(self.text_encoder.config, "use_attention_mask") and self.text_encoder.config.use_attention_mask:
        #         attention_mask = uncond_input.attention_mask.to(device)
        #     else:
        #         attention_mask = None

        #     negative_prompt_embeds = self.text_encoder(
        #         uncond_input.input_ids.to(device),
        #         attention_mask=attention_mask,
        #     )
        #     negative_prompt_embeds = negative_prompt_embeds[0]

        # if do_classifier_free_guidance:
        #     # duplicate unconditional embeddings for each generation per prompt, using mps friendly method
        #     seq_len = negative_prompt_embeds.shape[1]

        #     negative_prompt_embeds = negative_prompt_embeds.to(dtype=self.text_encoder.dtype, device=device)

        #     negative_prompt_embeds = negative_prompt_embeds.repeat(1, num_images_per_prompt, 1)
        #     negative_prompt_embeds = negative_prompt_embeds.view(batch_size * num_images_per_prompt, seq_len, -1)

        #     # For classifier free guidance, we need to do two forward passes.
        #     # Here we concatenate the unconditional and text embeddings into a single batch
        #     # to avoid doing two forward passes
        #     prompt_embeds = torch.cat([negative_prompt_embeds, prompt_embeds])

        # return prompt_embeds

    def decode_latents(self, latents):
        warnings.warn(
            "The decode_latents method is deprecated and will be removed in a future version. Please"
            " use VaeImageProcessor instead",
            FutureWarning,
        )
        latents = 1 / self.vae.config.scaling_factor * latents
        image = self.vae.decode(latents, return_dict=False)[0]
        image = (image / 2 + 0.5).clamp(0, 1)
        # we always cast to float32 as this does not cause significant overhead and is compatible with bfloat16
        image = image.cpu().permute(0, 2, 3, 1).float().numpy()
        return image

    def prepare_extra_step_kwargs(self, generator, eta):
        # prepare extra kwargs for the scheduler step, since not all schedulers have the same signature
        # eta (η) is only used with the DDIMScheduler, it will be ignored for other schedulers.
        # eta corresponds to η in DDIM paper: https://arxiv.org/abs/2010.02502
        # and should be between [0, 1]

        accepts_eta = "eta" in set(inspect.signature(self.scheduler.step).parameters.keys())
        extra_step_kwargs = {}
        if accepts_eta:
            extra_step_kwargs["eta"] = eta

        # check if the scheduler accepts generator
        accepts_generator = "generator" in set(inspect.signature(self.scheduler.step).parameters.keys())
        if accepts_generator:
            extra_step_kwargs["generator"] = generator
        return extra_step_kwargs

    def check_inputs(
        self,
        prompt,
        height,
        width,
        callback_steps,
        negative_prompt=None,
        prompt_embeds=None,
        negative_prompt_embeds=None,
    ):
        if height % 8 != 0 or width % 8 != 0:
            raise ValueError(f"`height` and `width` have to be divisible by 8 but are {height} and {width}.")

        if (callback_steps is None) or (
            callback_steps is not None and (not isinstance(callback_steps, int) or callback_steps <= 0)
        ):
            raise ValueError(
                f"`callback_steps` has to be a positive integer but is {callback_steps} of type"
                f" {type(callback_steps)}."
            )

        if prompt is not None and prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `prompt`: {prompt} and `prompt_embeds`: {prompt_embeds}. Please make sure to"
                " only forward one of the two."
            )
        elif prompt is None and prompt_embeds is None:
            raise ValueError(
                "Provide either `prompt` or `prompt_embeds`. Cannot leave both `prompt` and `prompt_embeds` undefined."
            )
        elif prompt is not None and (not isinstance(prompt, str) and not isinstance(prompt, list)):
            raise ValueError(f"`prompt` has to be of type `str` or `list` but is {type(prompt)}")

        if negative_prompt is not None and negative_prompt_embeds is not None:
            raise ValueError(
                f"Cannot forward both `negative_prompt`: {negative_prompt} and `negative_prompt_embeds`:"
                f" {negative_prompt_embeds}. Please make sure to only forward one of the two."
            )

        if prompt_embeds is not None and negative_prompt_embeds is not None:
            if prompt_embeds.shape != negative_prompt_embeds.shape:
                raise ValueError(
                    "`prompt_embeds` and `negative_prompt_embeds` must have the same shape when passed directly, but"
                    f" got: `prompt_embeds` {prompt_embeds.shape} != `negative_prompt_embeds`"
                    f" {negative_prompt_embeds.shape}."
                )

    def prepare_latents(self, batch_size, num_channels_latents, height, width, dtype, device, generator, latents=None):
        shape = (batch_size, num_channels_latents, height // self.vae_scale_factor, width // self.vae_scale_factor)
        if isinstance(generator, list) and len(generator) != batch_size:
            raise ValueError(
                f"You have passed a list of generators of length {len(generator)}, but requested an effective batch"
                f" size of {batch_size}. Make sure the batch size matches the length of the generators."
            )

        if latents is None:
            latents = randn_tensor(shape, generator=generator, device=device, dtype=dtype)
        else:
            latents = latents.to(device)

        # scale the initial noise by the standard deviation required by the scheduler
        latents = latents * self.scheduler.init_noise_sigma
        return latents

    @torch.no_grad()
    def __call__(
        self,
        prompt: Union[str, List[str]] = None,
        height: Optional[int] = None,
        width: Optional[int] = None,
        num_inference_steps: int = 50,
        guidance_scale: float = 7.5,
        negative_prompt: Optional[Union[str, List[str]]] = None,
        num_images_per_prompt: Optional[int] = 1,
        eta: float = 0.0,
        generator: Optional[Union[torch.Generator, List[torch.Generator]]] = None,
        latents: Optional[torch.FloatTensor] = None,
        prompt_embeds: Optional[torch.FloatTensor] = None,
        negative_prompt_embeds: Optional[torch.FloatTensor] = None,
        output_type: Optional[str] = "pil",
        return_dict: bool = True,
        callback: Optional[Callable[[int, int, torch.FloatTensor], None]] = None,
        callback_steps: int = 1,
        cross_attention_kwargs: Optional[Dict[str, Any]] = None,
        guidance_rescale: float = 0.0,
    ):
        # 0. Default height and width to unet
        height = height or self.unet.config.sample_size * self.vae_scale_factor
        width = width or self.unet.config.sample_size * self.vae_scale_factor

        # 1. Check inputs. Raise error if not correct
        self.check_inputs(
            prompt, height, width, callback_steps, negative_prompt, prompt_embeds, negative_prompt_embeds
        )

        # 2. Define call parameters
        if prompt is not None and isinstance(prompt, str):
            batch_size = 1
        elif prompt is not None and isinstance(prompt, list):
            batch_size = len(prompt)
        else:
            batch_size = prompt_embeds.shape[0]

        device = self._execution_device
        # here `guidance_scale` is defined analog to the guidance weight `w` of equation (2)
        # of the Imagen paper: https://arxiv.org/pdf/2205.11487.pdf . `guidance_scale = 1`
        # corresponds to doing no classifier free guidance.
        do_classifier_free_guidance = guidance_scale > 1.0

        # 3. Encode input prompt
        text_encoder_lora_scale = (
            cross_attention_kwargs.get("scale", None) if cross_attention_kwargs is not None else None
        )

        prompt_embeds = self._encode_prompt(
            prompt,
            device,
            num_images_per_prompt,
            do_classifier_free_guidance,
            negative_prompt,
            prompt_embeds=prompt_embeds,
            negative_prompt_embeds=negative_prompt_embeds,
            lora_scale=text_encoder_lora_scale,
        )

        # 4. Prepare timesteps
        self.scheduler.set_timesteps(num_inference_steps, device=device)
        timesteps = self.scheduler.timesteps

        # 5. Prepare latent variables
        num_channels_latents = self.unet.config.in_channels
        latents = self.prepare_latents(
            batch_size * num_images_per_prompt,
            num_channels_latents,
            height,
            width,
            prompt_embeds.dtype,
            device,
            generator,
            latents,
        )

        # 6. Prepare extra step kwargs. TODO: Logic should ideally just be moved out of the pipeline
        extra_step_kwargs = self.prepare_extra_step_kwargs(generator, eta)

        # 7. Denoising loop
        num_warmup_steps = len(timesteps) - num_inference_steps * self.scheduler.order
        with self.progress_bar(total=num_inference_steps) as progress_bar:
            for i, t in enumerate(timesteps):
                # expand the latents if we are doing classifier free guidance
                latent_model_input = torch.cat([latents] * 2) if do_classifier_free_guidance else latents
                latent_model_input = self.scheduler.scale_model_input(latent_model_input, t)

                # predict the noise residual
                noise_pred = self.unet(
                    latent_model_input,
                    t,
                    encoder_hidden_states=prompt_embeds,
                    cross_attention_kwargs=cross_attention_kwargs,
                    return_dict=False,
                )[0]

                # perform guidance
                if do_classifier_free_guidance:
                    noise_pred_uncond, noise_pred_text = noise_pred.chunk(2)
                    noise_pred = noise_pred_uncond + guidance_scale * (noise_pred_text - noise_pred_uncond)

                if do_classifier_free_guidance and guidance_rescale > 0.0:
                    # Based on 3.4. in https://arxiv.org/pdf/2305.08891.pdf
                    noise_pred = rescale_noise_cfg(noise_pred, noise_pred_text, guidance_rescale=guidance_rescale)

                # compute the previous noisy sample x_t -> x_t-1
                latents = self.scheduler.step(noise_pred, t, latents, **extra_step_kwargs, return_dict=False)[0]

                # call the callback, if provided
                if i == len(timesteps) - 1 or ((i + 1) > num_warmup_steps and (i + 1) % self.scheduler.order == 0):
                    progress_bar.update()
                    if callback is not None and i % callback_steps == 0:
                        callback(i, t, latents)

        if not output_type == "latent":
            image = self.vae.decode(latents / self.vae.config.scaling_factor, return_dict=False)[0]
            image, has_nsfw_concept = self.run_safety_checker(image, device, prompt_embeds.dtype)
        else:
            image = latents
            has_nsfw_concept = None

        if has_nsfw_concept is None:
            do_denormalize = [True] * image.shape[0]
        else:
            do_denormalize = [not has_nsfw for has_nsfw in has_nsfw_concept]

        image = self.image_processor.postprocess(image, output_type=output_type, do_denormalize=do_denormalize)

        # Offload last model to CPU
        if hasattr(self, "final_offload_hook") and self.final_offload_hook is not None:
            self.final_offload_hook.offload()

        if not return_dict:
            return (image, has_nsfw_concept)

        return StableDiffusionPipelineOutput(images=image, nsfw_content_detected=has_nsfw_concept)
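Taken together, `VisCPMPaintBeePipeline` follows the standard Stable Diffusion denoising loop, except that `_encode_prompt` routes the prompt through the CPM-Bee tokenizer and LLM (via `build_input` and `CPMBeeCollater`) instead of CLIP, then projects the hidden states with `trans_block` and concatenates the unconditional and conditional embeddings for classifier-free guidance. A rough usage sketch follows; the repository id and the `trust_remote_code` loading flag are assumptions and depend on the diffusers version in use, so adjust them to the actual model repository.

import torch
from diffusers import DiffusionPipeline

# Hypothetical repo id; trust_remote_code is assumed to let diffusers import the
# custom pipeline, model, and tokenizer classes shipped alongside the weights.
pipe = DiffusionPipeline.from_pretrained("openbmb/VisCPM-Paint", trust_remote_code=True)
pipe = pipe.to("cuda" if torch.cuda.is_available() else "cpu")

result = pipe(
    prompt="A vase of pink peonies on a wooden table, oil painting",
    num_inference_steps=50,   # DDIM steps, matching scheduler/scheduler_config.json
    guidance_scale=7.5,       # classifier-free guidance weight w
)
result.images[0].save("peonies.png")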
scheduler/scheduler_config.json
ADDED
@@ -0,0 +1,14 @@
{
  "_class_name": "DDIMScheduler",
  "_diffusers_version": "0.8.0",
  "beta_end": 0.012,
  "beta_schedule": "scaled_linear",
  "beta_start": 0.00085,
  "clip_sample": false,
  "num_train_timesteps": 1000,
  "prediction_type": "epsilon",
  "set_alpha_to_one": false,
  "skip_prk_steps": true,
  "steps_offset": 1,
  "trained_betas": null
}
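The scheduler config above is a stock DDIM setup (scaled-linear betas from 0.00085 to 0.012, epsilon prediction, 1000 training timesteps), which is what the pipeline's `num_inference_steps` and `eta` arguments drive. A small sketch of how diffusers would instantiate it, assuming the config sits in a local `scheduler/` subfolder as in this repository (the local path is an assumption):

from diffusers import DDIMScheduler

scheduler = DDIMScheduler.from_pretrained(".", subfolder="scheduler")  # local checkout assumed
scheduler.set_timesteps(50)                     # the 50 inference steps used by the pipeline default
print(scheduler.timesteps[:5])                  # subsampled from the 1000 training timesteps
print(scheduler.config.beta_schedule, scheduler.config.prediction_type)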
tokenization_viscpmbee.py
ADDED
@@ -0,0 +1,1008 @@
1 |
+
# coding=utf-8
|
2 |
+
# Copyright 2022 The OpenBMB Team and The HuggingFace Inc. team. All rights reserved.
|
3 |
+
#
|
4 |
+
# Licensed under the Apache License, Version 2.0 (the "License");
|
5 |
+
# you may not use this file except in compliance with the License.
|
6 |
+
# You may obtain a copy of the License at
|
7 |
+
#
|
8 |
+
# http://www.apache.org/licenses/LICENSE-2.0
|
9 |
+
#
|
10 |
+
# Unless required by applicable law or agreed to in writing, software
|
11 |
+
# distributed under the License is distributed on an "AS IS" BASIS,
|
12 |
+
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
13 |
+
# See the License for the specific language governing permissions and
|
14 |
+
# limitations under the License.
|
15 |
+
"""Tokenization classes for CpmBee."""
|
16 |
+
import json
|
17 |
+
import os
|
18 |
+
from typing import Any, Dict, List, Optional, Tuple, Union
|
19 |
+
|
20 |
+
import numpy as np
|
21 |
+
from numpy.typing import NDArray
|
22 |
+
from typing_extensions import TypedDict
|
23 |
+
|
24 |
+
from transformers.tokenization_utils import PaddingStrategy, PreTrainedTokenizer, TensorType
|
25 |
+
from transformers.tokenization_utils_base import AddedToken, BatchEncoding, TextInput, TruncationStrategy
|
26 |
+
from transformers.utils import logging
|
27 |
+
|
28 |
+
|
29 |
+
logger = logging.get_logger(__name__)
|
30 |
+
|
31 |
+
VOCAB_FILES_NAMES = {"vocab_file": "vocab.txt"}
|
32 |
+
|
33 |
+
PRETRAINED_VOCAB_FILES_MAP = {
|
34 |
+
"vocab_file": {
|
35 |
+
"openbmb/viscpmchat-bee-10b": "https://huggingface.co/openbmb/VisCPM-Chat/blob/main/vocab.txt",
|
36 |
+
},
|
37 |
+
}
|
38 |
+
|
39 |
+
PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES = {
|
40 |
+
"openbmb/viscpmchat-bee-10b": 4096,
|
41 |
+
}
|
42 |
+
|
43 |
+
|
44 |
+
class _PrevExtTableStates(TypedDict):
|
45 |
+
ext_table: Dict[int, str]
|
46 |
+
token_id_table: Dict[str, Dict[int, int]]
|
47 |
+
|
48 |
+
|
49 |
+
CPMBeeInputType = Union[str, Dict[str, "CPMBeeInputType"]]
|
50 |
+
|
51 |
+
|
52 |
+
def rel_to_bucket(n_up: int, n_down: int, max_depth: int = 8):
|
53 |
+
ret = n_up * max_depth + n_down
|
54 |
+
if ret == 0:
|
55 |
+
return ret
|
56 |
+
else:
|
57 |
+
# bucket 1 is reserved for incontext samples
|
58 |
+
return ret + 1
|
59 |
+
|
60 |
+
|
61 |
+
class _DictTree(TypedDict):
|
62 |
+
value: str
|
63 |
+
children: List["_DictTree"]
|
64 |
+
depth: int
|
65 |
+
segment_id: int
|
66 |
+
need_predict: bool
|
67 |
+
is_image: bool
|
68 |
+
|
69 |
+
|
70 |
+
class VisCpmBeeTokenizer(PreTrainedTokenizer):
|
71 |
+
"""
|
72 |
+
Construct a CPMBee tokenizer.
|
73 |
+
|
74 |
+
Args:
|
75 |
+
vocab_file (`str`):
|
76 |
+
Path to the vocabulary file.
|
77 |
+
bos_token (`str`, *optional*, defaults to `"<s>"`):
|
78 |
+
The beginning of sequence token.
|
79 |
+
eos_token (`str`, *optional*, defaults to `"</s>"`):
|
80 |
+
The end of sequence token.
|
81 |
+
line_token (`str`, *optional*, defaults to `"\n"`):
|
82 |
+
The line token.
|
83 |
+
space_token (`str`, *optional*, defaults to `" "`):
|
84 |
+
The space token.
|
85 |
+
unk_token (`str`, *optional*, defaults to `"<unk>"`):
|
86 |
+
The unknown token.
|
87 |
+
mask_token (`str`, *optional*, defaults to `"<mask>"`):
|
88 |
+
The mask token.
|
89 |
+
pad_token (`str`, *optional*, defaults to `"<pad>"`):
|
90 |
+
The token used for padding.
|
91 |
+
padding_side (`str`, *optional*, defaults to `"left"`):
|
92 |
+
The padding side. CPM-Bee will use left padding by default.
|
93 |
+
"""
|
94 |
+
|
95 |
+
vocab_files_names = VOCAB_FILES_NAMES
|
96 |
+
pretrained_vocab_files_map = PRETRAINED_VOCAB_FILES_MAP
|
97 |
+
max_model_input_sizes = PRETRAINED_POSITIONAL_EMBEDDINGS_SIZES
|
98 |
+
model_input_names: List[str] = [
|
99 |
+
"input_ids",
|
100 |
+
"attention_mask",
|
101 |
+
"input_id_sub",
|
102 |
+
"position",
|
103 |
+
"context",
|
104 |
+
"sample_ids",
|
105 |
+
"num_segments",
|
106 |
+
"segment",
|
107 |
+
"segment_rel_offset",
|
108 |
+
"segment_rel",
|
109 |
+
]
|
110 |
+
add_prefix_space = False
|
111 |
+
|
112 |
+
def __init__(
|
113 |
+
self,
|
114 |
+
vocab_file,
|
115 |
+
bos_token="<s>",
|
116 |
+
eos_token="</s>",
|
117 |
+
line_token="\n",
|
118 |
+
space_token=" ",
|
119 |
+
unk_token="<unk>",
|
120 |
+
mask_token="<mask>",
|
121 |
+
pad_token="<pad>",
|
122 |
+
padding_side="left",
|
123 |
+
**kwargs,
|
124 |
+
):
|
125 |
+
super().__init__(
|
126 |
+
bos_token=bos_token,
|
127 |
+
eos_token=eos_token,
|
128 |
+
line_token=line_token,
|
129 |
+
space_token=space_token,
|
130 |
+
unk_token=unk_token,
|
131 |
+
mask_token=mask_token,
|
132 |
+
pad_token=pad_token,
|
133 |
+
padding_side=padding_side,
|
134 |
+
**kwargs,
|
135 |
+
)
|
136 |
+
|
137 |
+
self.encoder: Dict[str, int] = {}
|
138 |
+
|
139 |
+
with open(vocab_file, "r", encoding="utf-8") as reader:
|
140 |
+
for token in reader.readlines():
|
141 |
+
token = token.rstrip("\n")
|
142 |
+
if len(token) == 0:
|
143 |
+
continue
|
144 |
+
self.encoder[token] = len(self.encoder)
|
145 |
+
|
146 |
+
self.encoder[" "] = self.encoder["</_>"]
|
147 |
+
self.encoder["\n"] = self.encoder["</n>"]
|
148 |
+
del self.encoder["</_>"]
|
149 |
+
del self.encoder["</n>"]
|
150 |
+
|
151 |
+
self.decoder = {v: k for k, v in self.encoder.items()}
|
152 |
+
|
153 |
+
self._max_word_len = max([len(x) for x in self.encoder.keys()])
|
154 |
+
self.cpmbee_special_tokens = {k: v for k, v in self.encoder.items() if k.startswith("<") and k.endswith(">")}
|
155 |
+
|
156 |
+
self.ext_table: Dict[int, str] = {}
|
157 |
+
self.ext_table_rev: Dict[str, int] = {}
|
158 |
+
|
159 |
+
self.token_id_table: Dict[str, Dict[int, int]] = {}
|
160 |
+
self.ext_special_tokens = []
|
161 |
+
|
162 |
+
self.ext_args_for_model = [
|
163 |
+
"input_id_subs",
|
164 |
+
"input_pos",
|
165 |
+
"context",
|
166 |
+
"segment_ids",
|
167 |
+
"segment_rel_offset",
|
168 |
+
"segment_rel",
|
169 |
+
"sample_ids",
|
170 |
+
"num_segments",
|
171 |
+
"predict_segments",
|
172 |
+
"answer_placeholders",
|
173 |
+
"ext_table",
|
174 |
+
"token_id_table",
|
175 |
+
"image_bound"
|
176 |
+
]
|
177 |
+
|
178 |
+
@property
|
179 |
+
def bod_token_id(self):
|
180 |
+
return self.encoder[self.bod_token]
|
181 |
+
|
182 |
+
@property
|
183 |
+
def eod_token_id(self):
|
184 |
+
return self.encoder[self.eod_token]
|
185 |
+
|
186 |
+
@property
|
187 |
+
def newline_id(self):
|
188 |
+
return self.encoder[self.line_token]
|
189 |
+
|
190 |
+
@property
|
191 |
+
def vocab_size(self) -> int:
|
192 |
+
return len(self.encoder)
|
193 |
+
|
194 |
+
def __len__(self):
|
195 |
+
"""
|
196 |
+
Size of the full vocabulary with the added tokens.
|
197 |
+
"""
|
198 |
+
return self.vocab_size + len(self.added_tokens_encoder)
|
199 |
+
|
200 |
+
def get_vocab(self):
|
201 |
+
return dict(self.encoder, **self.added_tokens_encoder)
|
202 |
+
|
203 |
+
def get_piece(self, text: str) -> str:
|
204 |
+
"""
|
205 |
+
Match with maximum length.
|
206 |
+
"""
|
207 |
+
len_text = len(text)
|
208 |
+
for i in range(len(text)):
|
209 |
+
sub = text[: len_text - i]
|
210 |
+
if (sub in self.encoder) or (sub in self.added_tokens_encoder):
|
211 |
+
return sub
|
212 |
+
return text[0]
|
213 |
+
|
214 |
+
def tokenize(self, text: TextInput, **kwargs) -> List[str]:
|
215 |
+
r"""
|
216 |
+
Override the `tokenize` to meet the needs of CPMBee:
|
217 |
+
1. Mark the special token with `<` and `>`. The `<>` will be ignored.
|
218 |
+
2. Split sentences by the marked special tokens.
|
219 |
+
3. Record the marked special token by `ext_table` and `ext_table_rev`.
|
220 |
+
4. Tokenize the sentence without special tokens.
|
221 |
+
"""
|
222 |
+
for_cpmbee = kwargs.get("for_cpmbee", False)
|
223 |
+
all_special_tokens_extended = {
|
224 |
+
str(t): t for t in self.all_special_tokens_extended if isinstance(t, AddedToken)
|
225 |
+
}
|
226 |
+
|
227 |
+
sentence_split = [""]
|
228 |
+
is_special_token = False
|
229 |
+
for i, c in enumerate(text):
|
230 |
+
if is_special_token:
|
231 |
+
if c == "<":
|
232 |
+
tail = sentence_split.pop(-1)
|
233 |
+
sentence_split[-1] += tail
|
234 |
+
sentence_split.append(c)
|
235 |
+
elif c == ">":
|
236 |
+
# end of special token
|
237 |
+
sentence_split[-1] += c
|
238 |
+
if sentence_split[-1] == "<>":
|
239 |
+
continue
|
240 |
+
is_special_token = False
|
241 |
+
sentence_split.append("")
|
242 |
+
else:
|
243 |
+
sentence_split[-1] += c
|
244 |
+
else:
|
245 |
+
if c == "<":
|
246 |
+
is_special_token = True
|
247 |
+
sentence_split.append(c)
|
248 |
+
else:
|
249 |
+
sentence_split[-1] += c
|
250 |
+
if is_special_token:
|
251 |
+
tail = sentence_split.pop(-1)
|
252 |
+
sentence_split[-1] += tail
|
253 |
+
|
254 |
+
output_tokens = []
|
255 |
+
for i, part in enumerate(sentence_split):
|
256 |
+
if (i & 1) == 1:
|
257 |
+
# special token
|
258 |
+
output_tokens.append(part)
|
259 |
+
if for_cpmbee and (part not in self.encoder) and (part not in self.ext_table_rev):
|
260 |
+
self.ext_table_rev[part] = len(self.ext_table_rev) + self.vocab_size
|
261 |
+
self.ext_table[self.ext_table_rev[part]] = part
|
262 |
+
else:
|
263 |
+
output_tokens.extend(self._tokenize(part, for_cpmbee=for_cpmbee))
|
264 |
+
|
265 |
+
# drop spaces
|
266 |
+
for i, token in enumerate(output_tokens):
|
267 |
+
if token in self.added_tokens_encoder:
|
268 |
+
token = all_special_tokens_extended.get(token, None)
|
269 |
+
left = output_tokens[i - 1] if i > 0 else None
|
270 |
+
right = output_tokens[i + 1] if i < len(output_tokens) - 1 else None
|
271 |
+
if isinstance(token, AddedToken):
|
272 |
+
if token.rstrip and right:
|
273 |
+
# A bit counter-intuitive but we strip the left of the string
|
274 |
+
# since tok_extended.rstrip means the special token is eating all white spaces on its right
|
275 |
+
output_tokens[i + 1] = right.lstrip()
|
276 |
+
# Strip white spaces on the left
|
277 |
+
if token.lstrip and left:
|
278 |
+
output_tokens[i - 1] = left.rstrip() # Opposite here
|
279 |
+
else:
|
280 |
+
if right:
|
281 |
+
output_tokens[i + 1] = right.lstrip()
|
282 |
+
if left:
|
283 |
+
output_tokens[i - 1] = left.rstrip()
|
284 |
+
|
285 |
+
skipped_tokens = []
|
286 |
+
for token in output_tokens:
|
287 |
+
if not token:
|
288 |
+
continue
|
289 |
+
else:
|
290 |
+
skipped_tokens.append(token)
|
291 |
+
|
292 |
+
return skipped_tokens
|
293 |
+
|
294 |
+
def _tokenize(self, text, **kwargs):
|
295 |
+
"""
|
296 |
+
Converts a string in a sequence of tokens (string), using the tokenizer. Split in words for word-based
|
297 |
+
vocabulary.
|
298 |
+
|
299 |
+
Do NOT take care of added tokens. Record the unk tokens and special tokens in `ext_table` and `ext_table_rev`.
|
300 |
+
"""
|
301 |
+
for_cpmbee = kwargs.get("for_cpmbee", False)
|
302 |
+
output_tokens = []
|
303 |
+
|
304 |
+
part_st = 0
|
305 |
+
last_unk = None
|
306 |
+
while part_st < len(text):
|
307 |
+
piece = self.get_piece(text[part_st:])
|
308 |
+
            if piece in self.encoder or piece in self.added_tokens_encoder:
|
309 |
+
if last_unk is None:
|
310 |
+
output_tokens.append(piece)
|
311 |
+
else:
|
312 |
+
if for_cpmbee and (last_unk not in self.ext_table_rev):
|
313 |
+
self.ext_table_rev[last_unk] = len(self.ext_table_rev) + self.vocab_size
|
314 |
+
self.ext_table[self.ext_table_rev[last_unk]] = last_unk
|
315 |
+
output_tokens.append(last_unk)
|
316 |
+
output_tokens.append(piece)
|
317 |
+
last_unk = None
|
318 |
+
else:
|
319 |
+
if last_unk is None:
|
320 |
+
last_unk = piece
|
321 |
+
else:
|
322 |
+
last_unk += piece
|
323 |
+
part_st += len(piece)
|
324 |
+
if last_unk is not None:
|
325 |
+
# part end with UNK
|
326 |
+
if for_cpmbee and (last_unk not in self.ext_table_rev):
|
327 |
+
self.ext_table_rev[last_unk] = len(self.ext_table_rev) + self.vocab_size
|
328 |
+
self.ext_table[self.ext_table_rev[last_unk]] = last_unk
|
329 |
+
output_tokens.append(last_unk)
|
330 |
+
|
331 |
+
return output_tokens
|
332 |
+
|
333 |
+
def check(self, token):
|
334 |
+
return token in self.encoder
|
335 |
+
|
336 |
+
def convert_tokens_to_string(self, tokens: List[str]) -> str:
|
337 |
+
return "".join(tokens)
|
338 |
+
|
339 |
+
def _convert_token_to_id(self, token: str):
|
340 |
+
"""Converts a token (str) in an id using the vocab and ext_table."""
|
341 |
+
if token in self.encoder:
|
342 |
+
return self.encoder.get(token)
|
343 |
+
elif token in self.ext_table_rev:
|
344 |
+
return self.ext_table_rev[token]
|
345 |
+
elif token in self.added_tokens_encoder:
|
346 |
+
return self.added_tokens_encoder[token]
|
347 |
+
else:
|
348 |
+
return self.unk_token_id
|
349 |
+
|
350 |
+
def _convert_id_to_token(self, index):
|
351 |
+
"""Converts an index (integer) in a token (str) using the vocab and ext_table."""
|
352 |
+
if index in self.ext_table:
|
353 |
+
return self.ext_table[index]
|
354 |
+
elif index in self.added_tokens_decoder:
|
355 |
+
return self.added_tokens_decoder[index]
|
356 |
+
else:
|
357 |
+
if index >= 0:
|
358 |
+
return self.decoder[index]
|
359 |
+
|
360 |
+
def save_vocabulary(self, save_directory: str, filename_prefix: Optional[str] = None) -> Tuple[str]:
|
361 |
+
if os.path.isdir(save_directory):
|
362 |
+
vocab_file = os.path.join(
|
363 |
+
save_directory, (filename_prefix + "-" if filename_prefix else "") + VOCAB_FILES_NAMES["vocab_file"]
|
364 |
+
)
|
365 |
+
else:
|
366 |
+
vocab_file = (filename_prefix + "-" if filename_prefix else "") + save_directory
|
367 |
+
index = 0
|
368 |
+
self.encoder["</n>"] = self.encoder["\n"]
|
369 |
+
del self.encoder["\n"]
|
370 |
+
self.encoder["</_>"] = self.encoder[" "]
|
371 |
+
del self.encoder[" "]
|
372 |
+
with open(vocab_file, "w", encoding="utf-8") as writer:
|
373 |
+
for token, token_index in sorted(self.encoder.items(), key=lambda x: x[1]):
|
374 |
+
if index != token_index:
|
375 |
+
logger.warning(
|
376 |
+
f"Saving vocabulary to {vocab_file}: vocabulary indices are not consecutive."
|
377 |
+
" Please check that the vocabulary is not corrupted!"
|
378 |
+
)
|
379 |
+
index = token_index
|
380 |
+
writer.write(token + "\n")
|
381 |
+
index += 1
|
382 |
+
return (vocab_file,)
|
383 |
+
|
384 |
+
def __call__(self, text, *args, **kwargs):
|
385 |
+
r"""
|
386 |
+
        The CPMBee `__call__` method dispatches to `_batch_tokenize_cpmbee` when the input is a dict or a list of dicts.
|
387 |
+
"""
|
388 |
+
if isinstance(text, dict):
|
389 |
+
return self._batch_tokenize_cpmbee([text], *args, **kwargs)
|
390 |
+
elif isinstance(text, (list, tuple)):
|
391 |
+
if isinstance(text[0], dict):
|
392 |
+
return self._batch_tokenize_cpmbee(text, *args, **kwargs)
|
393 |
+
else:
|
394 |
+
return super().__call__(text, *args, **kwargs)
|
395 |
+
else:
|
396 |
+
return super().__call__(text, *args, **kwargs)
|
397 |
+
|
398 |
+
    # tokenization
|
399 |
+
def _tokenize_cpmbee(self, data: TextInput, *args, **kwargs) -> List[str]:
|
400 |
+
"""
|
401 |
+
        A tokenize method that processes dict data. Exclusive to CPMBee.
|
402 |
+
"""
|
403 |
+
if isinstance(data, str):
|
404 |
+
data = json.loads(data)
|
405 |
+
if not isinstance(data, Dict):
|
406 |
+
raise TypeError(
|
407 |
+
"CpmBeeTokenizer input data should be dict or str in dict format, but got {}".format(type(data))
|
408 |
+
)
|
409 |
+
|
410 |
+
# 1. prepare answer placeholder
|
411 |
+
answer_placeholders = []
|
412 |
+
|
413 |
+
def _put_placeholder(data: Any, path: List[str] = []):
|
414 |
+
if isinstance(data, dict):
|
415 |
+
ret = {}
|
416 |
+
for k, v in data.items():
|
417 |
+
ret[k] = _put_placeholder(v, path + [k])
|
418 |
+
return ret
|
419 |
+
else:
|
420 |
+
answer_placeholders.append(path)
|
421 |
+
return "<ans_{}>".format(len(answer_placeholders))
|
422 |
+
|
423 |
+
data["<ans>"] = _put_placeholder(data["<ans>"])
|
424 |
+
|
425 |
+
(
|
426 |
+
input_ids,
|
427 |
+
input_id_subs,
|
428 |
+
context,
|
429 |
+
segment_ids,
|
430 |
+
segment_rel,
|
431 |
+
n_segments,
|
432 |
+
table_states,
|
433 |
+
image_bound
|
434 |
+
) = self.convert_data_to_id(data, shuffle_answer=False, max_depth=8)
|
435 |
+
|
436 |
+
# <ans> mapping from sub to id
|
437 |
+
sub_ans_map: Dict[int, int] = {}
|
438 |
+
for fake_id, token_sub in table_states["token_id_table"]["<ans>"].items():
|
439 |
+
token = table_states["ext_table"][fake_id]
|
440 |
+
if token.startswith("<ans_") and token.endswith(">"):
|
441 |
+
ans_id = int(token[5:-1])
|
442 |
+
sub_ans_map[token_sub] = ans_id
|
443 |
+
|
444 |
+
tmp_input_ids = []
|
445 |
+
tmp_input_sub = []
|
446 |
+
tmp_input_seg = []
|
447 |
+
|
448 |
+
# get predict segments
|
449 |
+
predict_segments: List[Tuple[int, int]] = []
|
450 |
+
for i in range(input_ids.shape[0]):
|
451 |
+
if context[i] == 0:
|
452 |
+
if input_ids[i] == self.encoder["<ans>"]:
|
453 |
+
# is ans
|
454 |
+
# (segment_id, ans_id)
|
455 |
+
predict_segments.append((segment_ids[i], sub_ans_map[input_id_subs[i]]))
|
456 |
+
else:
|
457 |
+
tmp_input_ids.append(input_ids[i])
|
458 |
+
tmp_input_sub.append(input_id_subs[i])
|
459 |
+
tmp_input_seg.append(segment_ids[i])
|
460 |
+
|
461 |
+
if len(predict_segments) == 0:
|
462 |
+
raise ValueError("No answer to predict")
|
463 |
+
|
464 |
+
input_ids = np.array(tmp_input_ids, dtype=np.int32) # all context
|
465 |
+
input_id_subs = np.array(tmp_input_sub, dtype=np.int32) # [0, 0, 0, 0, 1, 0, 0, 2, 0, ...]
|
466 |
+
context = np.full_like(tmp_input_ids, 1, dtype=np.int8) # [1, 1, 1, ...]
|
467 |
+
segment_ids = np.array(tmp_input_seg, dtype=np.int32) # [0, 0, 0, 1, 1, 1, 2, 2, 2, 2, ...]
|
468 |
+
sample_ids = np.zeros(input_ids.shape, dtype=np.int32) # [0, 0, 0, 0, ...]
|
469 |
+
segment_rel_offset = np.zeros(input_ids.shape, dtype=np.int32) # [0, 0, 0, ...]
|
470 |
+
num_segments = np.full(input_ids.shape, n_segments, dtype=np.int32) # [n_seg, n_seg, n_seg, ...]
|
471 |
+
input_pos = np.arange(input_ids.shape[0], dtype=np.int32) # [0, 1, 2, 3, 4, ...]
|
472 |
+
image_bound = np.array(image_bound)
|
473 |
+
|
474 |
+
return (
|
475 |
+
self.prepare_for_model(
|
476 |
+
input_ids.tolist(),
|
477 |
+
input_id_subs=input_id_subs.tolist(),
|
478 |
+
input_pos=input_pos.tolist(),
|
479 |
+
context=context.tolist(),
|
480 |
+
segment_ids=segment_ids.tolist(),
|
481 |
+
segment_rel_offset=segment_rel_offset.tolist(),
|
482 |
+
segment_rel=segment_rel.tolist(),
|
483 |
+
sample_ids=sample_ids.tolist(),
|
484 |
+
num_segments=num_segments.tolist(),
|
485 |
+
image_bound=image_bound,
|
486 |
+
**kwargs,
|
487 |
+
),
|
488 |
+
predict_segments,
|
489 |
+
answer_placeholders,
|
490 |
+
table_states["ext_table"],
|
491 |
+
table_states["token_id_table"],
|
492 |
+
)
|
493 |
+
|
494 |
+
def _batch_tokenize_cpmbee(self, data_lst, *args, **kwargs):
|
495 |
+
"""
|
496 |
+
        Batched version of `_tokenize_cpmbee`.
|
497 |
+
"""
|
498 |
+
device = kwargs.get("device", "cpu")
|
499 |
+
return_tensors = kwargs.get("return_tensors", None)
|
500 |
+
batch_outputs = {}
|
501 |
+
segment_rel_pack = []
|
502 |
+
other_info = []
|
503 |
+
|
504 |
+
batch_ext_table_map: Dict[Tuple[int, int], int] = {}
|
505 |
+
batch_ext_table_ids: List[int] = []
|
506 |
+
batch_ext_table_sub: List[int] = []
|
507 |
+
|
508 |
+
for data in data_lst:
|
509 |
+
self.ext_table = {}
|
510 |
+
self.ext_table_rev = {}
|
511 |
+
self.token_id_table = {}
|
512 |
+
(outputs, predict_segments, answer_placeholders, ext_table, token_id_table) = self._tokenize_cpmbee(
|
513 |
+
data,
|
514 |
+
truncation=None,
|
515 |
+
padding=PaddingStrategy.DO_NOT_PAD.value,
|
516 |
+
max_length=None,
|
517 |
+
pad_to_multiple_of=None,
|
518 |
+
return_attention_mask=False,
|
519 |
+
return_tensors=None,
|
520 |
+
)
|
521 |
+
rev_ext_table = {}
|
522 |
+
for token, mp in token_id_table.items():
|
523 |
+
if token == "<ans>":
|
524 |
+
continue
|
525 |
+
token_id = self.encoder[token]
|
526 |
+
for fake_id, token_sub in mp.items():
|
527 |
+
if token_sub > 0:
|
528 |
+
if (token_id, token_sub) not in batch_ext_table_map:
|
529 |
+
batch_ext_table_map[(token_id, token_sub)] = len(batch_ext_table_ids) + self.vocab_size
|
530 |
+
batch_ext_table_ids.append(token_id)
|
531 |
+
batch_ext_table_sub.append(token_sub)
|
532 |
+
rev_ext_table[batch_ext_table_map[(token_id, token_sub)]] = ext_table[fake_id]
|
533 |
+
else:
|
534 |
+
rev_ext_table[token_id] = ext_table[fake_id]
|
535 |
+
|
536 |
+
segment_rel_pack.append(np.array(outputs.pop("segment_rel")))
|
537 |
+
other_info.append(
|
538 |
+
{
|
539 |
+
"predict_segments": predict_segments,
|
540 |
+
"answer_placeholders": answer_placeholders,
|
541 |
+
"ext_table": rev_ext_table,
|
542 |
+
}
|
543 |
+
)
|
544 |
+
|
545 |
+
for key, value in outputs.items():
|
546 |
+
if key not in batch_outputs:
|
547 |
+
batch_outputs[key] = []
|
548 |
+
batch_outputs[key].append(value)
|
549 |
+
|
550 |
+
max_length = max([len(item) for item in batch_outputs[self.model_input_names[0]]])
|
551 |
+
batch_size = len(batch_outputs[self.model_input_names[0]])
|
552 |
+
for i in range(batch_size):
|
553 |
+
inputs = {k: v[i] for k, v in batch_outputs.items()}
|
554 |
+
|
555 |
+
for k, v in inputs.items():
|
556 |
+
required_input = v
|
557 |
+
|
558 |
+
needs_to_be_padded = len(required_input) != max_length and k != 'image_bound'
|
559 |
+
|
560 |
+
if needs_to_be_padded:
|
561 |
+
difference = max_length - len(required_input)
|
562 |
+
batch_outputs[k][i] = [self.pad_token_id] * difference + required_input
|
563 |
+
|
564 |
+
max_num_rels = 0
|
565 |
+
for rel in segment_rel_pack:
|
566 |
+
max_num_rels = max(max_num_rels, rel.shape[0])
|
567 |
+
padded_rels = np.zeros((len(segment_rel_pack), max_num_rels), dtype=np.int32)
|
568 |
+
for i, rel in enumerate(segment_rel_pack):
|
569 |
+
padded_rels[i, : rel.shape[0]] = rel
|
570 |
+
batch_outputs["segment_rel"] = padded_rels
|
571 |
+
batch_outputs["batch_ext_table_ids"] = np.array(batch_ext_table_ids, dtype=np.int32)
|
572 |
+
batch_outputs["batch_ext_table_sub"] = np.array(batch_ext_table_sub, dtype=np.int32)
|
573 |
+
batch_outputs = BatchEncoding(batch_outputs, tensor_type=return_tensors)
|
574 |
+
if return_tensors == "pt":
|
575 |
+
batch_outputs = batch_outputs.to(device=device)
|
576 |
+
batch_outputs["other_info"] = other_info
|
577 |
+
|
578 |
+
return batch_outputs
|
579 |
+
|
580 |
+
def convert_data_to_id(
|
581 |
+
self,
|
582 |
+
data: Any,
|
583 |
+
prev_ext_states: Optional[_PrevExtTableStates] = None,
|
584 |
+
shuffle_answer: bool = True,
|
585 |
+
max_depth: int = 8,
|
586 |
+
):
|
587 |
+
"""
|
588 |
+
Parse a dict to data ids. Exclusive for CPMBee. It will
|
589 |
+
        1. parse the dict into segments and get segment_rel, which is used to compute position_bias.
|
590 |
+
2. tokenize every segment.
|
591 |
+
"""
|
592 |
+
root: _DictTree = {
|
593 |
+
"value": "<root>",
|
594 |
+
"children": [],
|
595 |
+
"depth": 0,
|
596 |
+
"segment_id": 0,
|
597 |
+
"need_predict": False,
|
598 |
+
"is_image": False
|
599 |
+
}
|
600 |
+
|
601 |
+
segments = [root]
|
602 |
+
|
603 |
+
def _build_dict_tree(data: CPMBeeInputType, depth: int, need_predict: bool, is_image: bool) -> List[_DictTree]:
|
604 |
+
if isinstance(data, dict):
|
605 |
+
ret_list: List[_DictTree] = []
|
606 |
+
curr_items = list(data.items())
|
607 |
+
if need_predict and shuffle_answer:
|
608 |
+
access_idx = np.arange(len(curr_items))
|
609 |
+
np.random.shuffle(access_idx)
|
610 |
+
curr_items = [curr_items[idx] for idx in access_idx]
|
611 |
+
for k, v in curr_items:
|
612 |
+
child_info: _DictTree = {
|
613 |
+
"value": k,
|
614 |
+
"children": [],
|
615 |
+
"depth": depth,
|
616 |
+
"segment_id": len(segments),
|
617 |
+
"need_predict": False, # only leaves are contexts
|
618 |
+
"is_image": False,
|
619 |
+
}
|
620 |
+
segments.append(child_info)
|
621 |
+
child_info["children"] = _build_dict_tree(
|
622 |
+
v, depth + 1,
|
623 |
+
need_predict=need_predict or (depth == 1 and k == "<ans>"),
|
624 |
+
is_image=is_image or (depth == 1 and k == "image")
|
625 |
+
) # elements in <root>.<ans>
|
626 |
+
|
627 |
+
ret_list.append(child_info)
|
628 |
+
return ret_list
|
629 |
+
else:
|
630 |
+
assert isinstance(data, str), "Invalid data {}".format(data)
|
631 |
+
ret: _DictTree = {
|
632 |
+
"value": data,
|
633 |
+
"children": [],
|
634 |
+
"depth": depth,
|
635 |
+
"segment_id": len(segments),
|
636 |
+
"need_predict": need_predict,
|
637 |
+
"is_image": is_image,
|
638 |
+
}
|
639 |
+
segments.append(ret)
|
640 |
+
return [ret]
|
641 |
+
|
642 |
+
root["children"] = _build_dict_tree(data, 1, False, False)
|
643 |
+
|
644 |
+
num_segments = len(segments)
|
645 |
+
segment_rel = np.zeros((num_segments * num_segments,), dtype=np.int32)
|
646 |
+
|
647 |
+
def _build_segment_rel(node: _DictTree) -> List[Tuple[int, int]]:
|
648 |
+
ret: List[Tuple[int, int]] = [(node["segment_id"], node["depth"])]
|
649 |
+
for child in node["children"]:
|
650 |
+
sub = _build_segment_rel(child)
|
651 |
+
for seg_id_1, depth_1 in sub:
|
652 |
+
for seg_id_2, depth_2 in ret:
|
653 |
+
n_up = min(depth_1 - node["depth"], max_depth - 1)
|
654 |
+
n_down = min(depth_2 - node["depth"], max_depth - 1)
|
655 |
+
segment_rel[seg_id_1 * num_segments + seg_id_2] = rel_to_bucket(
|
656 |
+
n_up, n_down, max_depth=max_depth
|
657 |
+
)
|
658 |
+
segment_rel[seg_id_2 * num_segments + seg_id_1] = rel_to_bucket(
|
659 |
+
n_down, n_up, max_depth=max_depth
|
660 |
+
)
|
661 |
+
ret.extend(sub)
|
662 |
+
return ret
|
663 |
+
|
664 |
+
_build_segment_rel(root)
|
665 |
+
|
666 |
+
input_ids: List[int] = []
|
667 |
+
input_id_subs: List[int] = []
|
668 |
+
segment_bound: List[Tuple[int, int]] = []
|
669 |
+
image_bound: List[Tuple[int, int]] = []
|
670 |
+
|
671 |
+
|
672 |
+
if prev_ext_states is not None:
|
673 |
+
self.ext_table = prev_ext_states["ext_table"]
|
674 |
+
self.token_id_table = prev_ext_states["token_id_table"]
|
675 |
+
|
676 |
+
for seg in segments:
|
677 |
+
# tokenize
|
678 |
+
tokens = self.convert_tokens_to_ids(self.tokenize(seg["value"], for_cpmbee=True))
|
679 |
+
|
680 |
+
token_id_subs = []
|
681 |
+
reid_token_ids = []
|
682 |
+
for idx in tokens:
|
683 |
+
if idx in self.ext_table:
|
684 |
+
# unk or special token
|
685 |
+
token = self.ext_table[idx]
|
686 |
+
if token.startswith("<") and token.endswith(">"):
|
687 |
+
# special token
|
688 |
+
if "_" in token:
|
689 |
+
token_name = token[1:-1].split("_", maxsplit=1)[0]
|
690 |
+
else:
|
691 |
+
token_name = token[1:-1]
|
692 |
+
token_name = "<{}>".format(token_name)
|
693 |
+
else:
|
694 |
+
token_name = "<unk>"
|
695 |
+
|
696 |
+
if token_name not in self.token_id_table:
|
697 |
+
self.token_id_table[token_name] = {}
|
698 |
+
if idx not in self.token_id_table[token_name]:
|
699 |
+
self.token_id_table[token_name][idx] = len(self.token_id_table[token_name])
|
700 |
+
if token_name not in self.encoder:
|
701 |
+
raise ValueError("Invalid token {}".format(token))
|
702 |
+
reid_token_ids.append(self.encoder[token_name])
|
703 |
+
token_id_subs.append(self.token_id_table[token_name][idx])
|
704 |
+
else:
|
705 |
+
reid_token_ids.append(idx)
|
706 |
+
token_id_subs.append(0)
|
707 |
+
tokens = [self.bos_token_id] + reid_token_ids
|
708 |
+
token_id_subs = [0] + token_id_subs
|
709 |
+
            # eos_id indicates no need_predict
|
710 |
+
if not seg["need_predict"]: # eos
|
711 |
+
tokens = tokens + [self.eos_token_id]
|
712 |
+
token_id_subs = token_id_subs + [0]
|
713 |
+
else:
|
714 |
+
# no eos
|
715 |
+
pass
|
716 |
+
begin = len(input_ids)
|
717 |
+
input_ids.extend(tokens)
|
718 |
+
input_id_subs.extend(token_id_subs)
|
719 |
+
end = len(input_ids)
|
720 |
+
segment_bound.append((begin, end))
|
721 |
+
|
722 |
+
ids = np.array(input_ids, dtype=np.int32)
|
723 |
+
id_subs = np.array(input_id_subs, dtype=np.int32)
|
724 |
+
        segs = np.zeros((ids.shape[0],), dtype=np.int32)  # assign segment ids according to segment_bound
|
725 |
+
context = np.zeros((ids.shape[0],), dtype=np.int8)
|
726 |
+
for i, (begin, end) in enumerate(segment_bound):
|
727 |
+
if not segments[i]["need_predict"]:
|
728 |
+
context[begin:end] = 1
|
729 |
+
if segments[i]["is_image"]:
|
730 |
+
image_bound.append((begin + 1, end - 1))
|
731 |
+
segs[begin:end] = i
|
732 |
+
|
733 |
+
curr_ext_table_states: _PrevExtTableStates = {
|
734 |
+
"ext_table": self.ext_table,
|
735 |
+
"token_id_table": self.token_id_table,
|
736 |
+
}
|
737 |
+
image_bound = np.array(image_bound, dtype=np.int32)
|
738 |
+
return ids, id_subs, context, segs, segment_rel, num_segments, curr_ext_table_states, image_bound
|
739 |
+
|
740 |
+
def prepare_for_model(
|
741 |
+
self,
|
742 |
+
ids: List[int],
|
743 |
+
pair_ids: Optional[List[int]] = None,
|
744 |
+
add_special_tokens: bool = True,
|
745 |
+
padding: Union[bool, str, PaddingStrategy] = False,
|
746 |
+
truncation: Union[bool, str, TruncationStrategy] = None,
|
747 |
+
max_length: Optional[int] = None,
|
748 |
+
stride: int = 0,
|
749 |
+
pad_to_multiple_of: Optional[int] = None,
|
750 |
+
return_tensors: Optional[Union[str, TensorType]] = None,
|
751 |
+
return_token_type_ids: Optional[bool] = None,
|
752 |
+
return_attention_mask: Optional[bool] = None,
|
753 |
+
return_overflowing_tokens: bool = False,
|
754 |
+
return_special_tokens_mask: bool = False,
|
755 |
+
return_length: bool = False,
|
756 |
+
verbose: bool = True,
|
757 |
+
prepend_batch_axis: bool = False,
|
758 |
+
**kwargs,
|
759 |
+
) -> BatchEncoding:
|
760 |
+
"""
|
761 |
+
        Prepares a sequence of input ids, or a pair of sequences of input ids, so that it can be used by the model. It
|
762 |
+
adds special tokens, truncates sequences if overflowing while taking into account the special tokens and
|
763 |
+
        manages a moving window (with user-defined stride) for overflowing tokens. Please note that for *pair_ids*
|
764 |
+
different than `None` and *truncation_strategy = longest_first* or `True`, it is not possible to return
|
765 |
+
overflowing tokens. Such a combination of arguments will raise an error.
|
766 |
+
|
767 |
+
Args:
|
768 |
+
ids (`List[int]`):
|
769 |
+
Tokenized input ids of the first sequence. Can be obtained from a string by chaining the `tokenize` and
|
770 |
+
`convert_tokens_to_ids` methods.
|
771 |
+
pair_ids (`List[int]`, *optional*):
|
772 |
+
Tokenized input ids of the second sequence. Can be obtained from a string by chaining the `tokenize`
|
773 |
+
and `convert_tokens_to_ids` methods.
|
774 |
+
"""
|
775 |
+
|
776 |
+
# Backward compatibility for 'truncation_strategy', 'pad_to_max_length'
|
777 |
+
padding_strategy, truncation_strategy, max_length, kwargs = self._get_padding_truncation_strategies(
|
778 |
+
padding=padding,
|
779 |
+
truncation=truncation,
|
780 |
+
max_length=max_length,
|
781 |
+
pad_to_multiple_of=pad_to_multiple_of,
|
782 |
+
verbose=verbose,
|
783 |
+
**kwargs,
|
784 |
+
)
|
785 |
+
|
786 |
+
pair = bool(pair_ids is not None)
|
787 |
+
len_ids = len(ids)
|
788 |
+
len_pair_ids = len(pair_ids) if pair else 0
|
789 |
+
|
790 |
+
if return_token_type_ids and not add_special_tokens:
|
791 |
+
raise ValueError(
|
792 |
+
"Asking to return token_type_ids while setting add_special_tokens to False "
|
793 |
+
"results in an undefined behavior. Please set add_special_tokens to True or "
|
794 |
+
"set return_token_type_ids to None."
|
795 |
+
)
|
796 |
+
|
797 |
+
if (
|
798 |
+
return_overflowing_tokens
|
799 |
+
and truncation_strategy == TruncationStrategy.LONGEST_FIRST
|
800 |
+
and pair_ids is not None
|
801 |
+
):
|
802 |
+
raise ValueError(
|
803 |
+
"Not possible to return overflowing tokens for pair of sequences with the "
|
804 |
+
"`longest_first`. Please select another truncation strategy than `longest_first`, "
|
805 |
+
"for instance `only_second` or `only_first`."
|
806 |
+
)
|
807 |
+
|
808 |
+
# Load from model defaults
|
809 |
+
if return_token_type_ids is None:
|
810 |
+
return_token_type_ids = "token_type_ids" in self.model_input_names
|
811 |
+
if return_attention_mask is None:
|
812 |
+
return_attention_mask = "attention_mask" in self.model_input_names
|
813 |
+
|
814 |
+
encoded_inputs = {}
|
815 |
+
|
816 |
+
# Compute the total size of the returned encodings
|
817 |
+
total_len = len_ids + len_pair_ids + (self.num_special_tokens_to_add(pair=pair) if add_special_tokens else 0)
|
818 |
+
|
819 |
+
# Truncation: Handle max sequence length
|
820 |
+
overflowing_tokens = []
|
821 |
+
if truncation_strategy != TruncationStrategy.DO_NOT_TRUNCATE and max_length and total_len > max_length:
|
822 |
+
ids, pair_ids, overflowing_tokens = self.truncate_sequences(
|
823 |
+
ids,
|
824 |
+
pair_ids=pair_ids,
|
825 |
+
num_tokens_to_remove=total_len - max_length,
|
826 |
+
truncation_strategy=truncation_strategy,
|
827 |
+
stride=stride,
|
828 |
+
)
|
829 |
+
|
830 |
+
if return_overflowing_tokens:
|
831 |
+
encoded_inputs["overflowing_tokens"] = overflowing_tokens
|
832 |
+
encoded_inputs["num_truncated_tokens"] = total_len - max_length
|
833 |
+
|
834 |
+
# Add special tokens
|
835 |
+
if add_special_tokens:
|
836 |
+
sequence = self.build_inputs_with_special_tokens(ids, pair_ids)
|
837 |
+
token_type_ids = self.create_token_type_ids_from_sequences(ids, pair_ids)
|
838 |
+
else:
|
839 |
+
sequence = ids + pair_ids if pair else ids
|
840 |
+
token_type_ids = [0] * len(ids) + ([0] * len(pair_ids) if pair else [])
|
841 |
+
|
842 |
+
# Build output dictionary
|
843 |
+
encoded_inputs["input_ids"] = sequence
|
844 |
+
if return_token_type_ids:
|
845 |
+
encoded_inputs["token_type_ids"] = token_type_ids
|
846 |
+
if return_special_tokens_mask:
|
847 |
+
if add_special_tokens:
|
848 |
+
encoded_inputs["special_tokens_mask"] = self.get_special_tokens_mask(ids, pair_ids)
|
849 |
+
else:
|
850 |
+
encoded_inputs["special_tokens_mask"] = [0] * len(sequence)
|
851 |
+
|
852 |
+
# Check lengths
|
853 |
+
self._eventual_warn_about_too_long_sequence(encoded_inputs["input_ids"], max_length, verbose)
|
854 |
+
|
855 |
+
# Padding
|
856 |
+
if padding_strategy != PaddingStrategy.DO_NOT_PAD or return_attention_mask:
|
857 |
+
encoded_inputs = self.pad(
|
858 |
+
encoded_inputs,
|
859 |
+
max_length=max_length,
|
860 |
+
padding=padding_strategy.value,
|
861 |
+
pad_to_multiple_of=pad_to_multiple_of,
|
862 |
+
return_attention_mask=return_attention_mask,
|
863 |
+
)
|
864 |
+
|
865 |
+
if return_length:
|
866 |
+
encoded_inputs["length"] = len(encoded_inputs["input_ids"])
|
867 |
+
|
868 |
+
# for CPMBee, encode all the model arguments
|
869 |
+
for arg in self.ext_args_for_model:
|
870 |
+
v = kwargs.get(arg, None)
|
871 |
+
if v is not None:
|
872 |
+
encoded_inputs[arg] = v
|
873 |
+
|
874 |
+
batch_outputs = BatchEncoding(
|
875 |
+
encoded_inputs, tensor_type=return_tensors, prepend_batch_axis=prepend_batch_axis
|
876 |
+
)
|
877 |
+
|
878 |
+
return batch_outputs
|
879 |
+
|
880 |
+
def prepare_for_finetune(
|
881 |
+
self,
|
882 |
+
data_list: List[Dict],
|
883 |
+
max_length: int = 2048
|
884 |
+
):
|
885 |
+
_inputs: List[NDArray[np.int32]] = []
|
886 |
+
_inputs_sub: List[NDArray[np.int32]] = []
|
887 |
+
_context: List[NDArray[np.int8]] = []
|
888 |
+
_sample_ids: List[NDArray[np.int32]] = []
|
889 |
+
_segments: List[NDArray[np.int32]] = []
|
890 |
+
_num_segments: List[NDArray[np.int32]] = []
|
891 |
+
_segment_rel_offset: List[NDArray[np.int32]] = []
|
892 |
+
_segment_rel: List[NDArray[np.int32]] = []
|
893 |
+
_spans: List[List[int]] = []
|
894 |
+
_raw_data: List[List[Any]] = []
|
895 |
+
|
896 |
+
raw_data = {}
|
897 |
+
for data in data_list:
|
898 |
+
(
|
899 |
+
input_ids,
|
900 |
+
input_id_subs,
|
901 |
+
context,
|
902 |
+
segment_ids,
|
903 |
+
segment_rel,
|
904 |
+
n_segments,
|
905 |
+
_
|
906 |
+
) = self.convert_data_to_id(data)
|
907 |
+
|
908 |
+
input_ids = input_ids[: max_length]
|
909 |
+
context = context[: max_length]
|
910 |
+
segment_ids = segment_ids[: max_length]
|
911 |
+
raw_data["input"] = data
|
912 |
+
raw_data["samples"] = []
|
913 |
+
|
914 |
+
sample_ids = np.zeros(input_ids.shape, dtype=np.int32)
|
915 |
+
segment_rel_offset = np.zeros(input_ids.shape, dtype=np.int32)
|
916 |
+
num_segments = np.full(input_ids.shape, n_segments, dtype=np.int32)
|
917 |
+
|
918 |
+
_inputs.append(input_ids)
|
919 |
+
_inputs_sub.append(input_id_subs)
|
920 |
+
_context.append(context)
|
921 |
+
_sample_ids.append(sample_ids)
|
922 |
+
_segments.append(segment_ids)
|
923 |
+
_num_segments.append(num_segments)
|
924 |
+
_segment_rel_offset.append(segment_rel_offset)
|
925 |
+
_segment_rel.append(segment_rel)
|
926 |
+
_spans.append([input_ids.shape[0]])
|
927 |
+
_raw_data.append([raw_data])
|
928 |
+
|
929 |
+
batch_size = len(_inputs)
|
930 |
+
inputs = np.zeros((batch_size, max_length), dtype=np.int32)
|
931 |
+
inputs_sub = np.zeros((batch_size, max_length), dtype=np.int32)
|
932 |
+
context = np.zeros((batch_size, max_length), dtype=np.int8)
|
933 |
+
sample_ids = np.zeros((batch_size, max_length), dtype=np.int32)
|
934 |
+
segments = np.zeros((batch_size, max_length), dtype=np.int32)
|
935 |
+
num_segments = np.zeros((batch_size, max_length), dtype=np.int32)
|
936 |
+
segment_rel_offset = np.zeros((batch_size, max_length), dtype=np.int32)
|
937 |
+
tgt = np.full((batch_size, max_length), -100, dtype=np.int32)
|
938 |
+
|
939 |
+
max_rel = 0
|
940 |
+
for i in range(batch_size):
|
941 |
+
max_rel = max(max_rel, _segment_rel[i].shape[0])
|
942 |
+
segment_rel = np.zeros((batch_size, max_rel), dtype=np.int32)
|
943 |
+
spans = np.zeros((batch_size, max_length), dtype=np.int32)
|
944 |
+
length = np.zeros((batch_size,), dtype=np.int32)
|
945 |
+
|
946 |
+
batch_ext_table_map: Dict[Tuple[int, int], int] = {}
|
947 |
+
batch_ext_table_ids: List[int] = []
|
948 |
+
batch_ext_table_sub: List[int] = []
|
949 |
+
raw_data_list: List[Any] = []
|
950 |
+
|
951 |
+
for i in range(batch_size):
|
952 |
+
instance_length = _inputs[i].shape[0]
|
953 |
+
rel_size = _segment_rel[i].shape[0]
|
954 |
+
inputs[i, :instance_length] = _inputs[i]
|
955 |
+
inputs_sub[i, :instance_length] = _inputs_sub[i]
|
956 |
+
context[i, :instance_length] = _context[i]
|
957 |
+
sample_ids[i, :instance_length] = _sample_ids[i]
|
958 |
+
segments[i, :instance_length] = _segments[i]
|
959 |
+
num_segments[i, :instance_length] = _num_segments[i]
|
960 |
+
segment_rel_offset[i, :instance_length] = _segment_rel_offset[i]
|
961 |
+
segment_rel[i, :rel_size] = _segment_rel[i]
|
962 |
+
|
963 |
+
span_begin = 0
|
964 |
+
for span_id, span_end in enumerate(_spans[i]):
|
965 |
+
spans[i, span_begin:span_end] = span_id
|
966 |
+
span_begin = span_end
|
967 |
+
length[i] = instance_length
|
968 |
+
raw_data_list.extend(_raw_data[i])
|
969 |
+
|
970 |
+
for j in range(instance_length):
|
971 |
+
idx, idx_sub = _inputs[i][j], _inputs_sub[i][j]
|
972 |
+
tgt_idx = idx
|
973 |
+
if idx_sub > 0:
|
974 |
+
# need to be in ext table
|
975 |
+
if (idx, idx_sub) not in batch_ext_table_map:
|
976 |
+
batch_ext_table_map[(idx, idx_sub)] = len(batch_ext_table_map)
|
977 |
+
batch_ext_table_ids.append(idx)
|
978 |
+
batch_ext_table_sub.append(idx_sub)
|
979 |
+
tgt_idx = batch_ext_table_map[(idx, idx_sub)] + self.vocab_size
|
980 |
+
if j > 1 and context[i, j - 1] == 0:
|
981 |
+
if idx != self.bos_token_id:
|
982 |
+
tgt[i, j - 1] = tgt_idx
|
983 |
+
else:
|
984 |
+
tgt[i, j - 1] = self.eos_token_id
|
985 |
+
if context[i, instance_length - 1] == 0:
|
986 |
+
tgt[i, instance_length - 1] = self.eos_token_id
|
987 |
+
|
988 |
+
if len(batch_ext_table_map) == 0:
|
989 |
+
# placeholder
|
990 |
+
batch_ext_table_ids.append(0)
|
991 |
+
batch_ext_table_sub.append(1)
|
992 |
+
|
993 |
+
return BatchEncoding({
|
994 |
+
"input_ids": inputs,
|
995 |
+
"input_id_sub": inputs_sub,
|
996 |
+
"length": length,
|
997 |
+
"context": context > 0,
|
998 |
+
"sample_ids": sample_ids,
|
999 |
+
"num_segments": num_segments,
|
1000 |
+
"segment": segments,
|
1001 |
+
"segment_rel_offset": segment_rel_offset,
|
1002 |
+
"segment_rel": segment_rel,
|
1003 |
+
"span": spans,
|
1004 |
+
"labels": tgt,
|
1005 |
+
"ext_table_ids": np.array(batch_ext_table_ids, dtype=np.int32),
|
1006 |
+
"ext_table_sub": np.array(batch_ext_table_sub, dtype=np.int32)
|
1007 |
+
}, tensor_type="pt")
|
1008 |
+
|
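A minimal usage sketch for the dict-based entry point above (paths are placeholders; it assumes a local checkout of this repository with `trust_remote_code=True` enabled, since `auto_map` points at a custom tokenizer class):

```python
# Illustrative sketch only: the checkpoint path is a placeholder, not a confirmed repo id.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "./VisCPM-Paint",        # placeholder: local checkout containing tokenization_viscpmbee.py
    trust_remote_code=True,  # required because the tokenizer class is shipped with the repo
)

# CPM-Bee style structured input; "<ans>" marks the field to be predicted.
sample = {"input": "a watercolor painting of a cat", "<ans>": ""}

batch = tokenizer(sample, return_tensors="pt")
print(batch["input_ids"].shape)       # (1, sequence_length)
print(batch["other_info"][0].keys())  # predict_segments, answer_placeholders, ext_table
```

The same call accepts a list of such dicts, in which case the per-sample `segment_rel` arrays are padded to a common length and a shared external token table (`batch_ext_table_ids` / `batch_ext_table_sub`) is built across the batch.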
tokenizer_config.json
ADDED
@@ -0,0 +1,10 @@
1 |
+
{
|
2 |
+
"name_or_path": "openbmb/cpm-bee-10b",
|
3 |
+
"tokenizer_class": "CpmBeeTokenizer",
|
4 |
+
"auto_map": {
|
5 |
+
"AutoTokenizer": [
|
6 |
+
"tokenization_viscpmbee.VisCpmBeeTokenizer",
|
7 |
+
null
|
8 |
+
]
|
9 |
+
}
|
10 |
+
}
|
unet/config.json
ADDED
@@ -0,0 +1,45 @@
1 |
+
{
|
2 |
+
"_class_name": "UNet2DConditionModel",
|
3 |
+
"_diffusers_version": "0.10.0.dev0",
|
4 |
+
"act_fn": "silu",
|
5 |
+
"attention_head_dim": [
|
6 |
+
5,
|
7 |
+
10,
|
8 |
+
20,
|
9 |
+
20
|
10 |
+
],
|
11 |
+
"block_out_channels": [
|
12 |
+
320,
|
13 |
+
640,
|
14 |
+
1280,
|
15 |
+
1280
|
16 |
+
],
|
17 |
+
"center_input_sample": false,
|
18 |
+
"cross_attention_dim": 1024,
|
19 |
+
"down_block_types": [
|
20 |
+
"CrossAttnDownBlock2D",
|
21 |
+
"CrossAttnDownBlock2D",
|
22 |
+
"CrossAttnDownBlock2D",
|
23 |
+
"DownBlock2D"
|
24 |
+
],
|
25 |
+
"downsample_padding": 1,
|
26 |
+
"dual_cross_attention": false,
|
27 |
+
"flip_sin_to_cos": true,
|
28 |
+
"freq_shift": 0,
|
29 |
+
"in_channels": 4,
|
30 |
+
"layers_per_block": 2,
|
31 |
+
"mid_block_scale_factor": 1,
|
32 |
+
"norm_eps": 1e-05,
|
33 |
+
"norm_num_groups": 32,
|
34 |
+
"num_class_embeds": null,
|
35 |
+
"only_cross_attention": false,
|
36 |
+
"out_channels": 4,
|
37 |
+
"sample_size": 64,
|
38 |
+
"up_block_types": [
|
39 |
+
"UpBlock2D",
|
40 |
+
"CrossAttnUpBlock2D",
|
41 |
+
"CrossAttnUpBlock2D",
|
42 |
+
"CrossAttnUpBlock2D"
|
43 |
+
],
|
44 |
+
"use_linear_projection": true
|
45 |
+
}
|
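For orientation, the hyper-parameters above (`cross_attention_dim` 1024, `block_out_channels` 320/640/1280/1280, `use_linear_projection` true, latent `sample_size` 64) follow the Stable Diffusion 2.x UNet layout. Below is a hedged sketch of instantiating it with `diffusers`; the local path is a placeholder and assumes the UNet weights are present next to this config.

```python
# Sketch: build the denoising UNet from the "unet" subfolder (placeholder path).
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "./VisCPM-Paint",  # placeholder: local checkout with unet/config.json and weights
    subfolder="unet",
)
print(unet.config.cross_attention_dim)  # 1024: width of the text features fed to cross-attention
print(unet.config.sample_size)          # 64: default latent spatial size (image side / 8)
```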
vae/config.json
ADDED
@@ -0,0 +1,29 @@
1 |
+
{
|
2 |
+
"_class_name": "AutoencoderKL",
|
3 |
+
"_diffusers_version": "0.8.0",
|
4 |
+
"act_fn": "silu",
|
5 |
+
"block_out_channels": [
|
6 |
+
128,
|
7 |
+
256,
|
8 |
+
512,
|
9 |
+
512
|
10 |
+
],
|
11 |
+
"down_block_types": [
|
12 |
+
"DownEncoderBlock2D",
|
13 |
+
"DownEncoderBlock2D",
|
14 |
+
"DownEncoderBlock2D",
|
15 |
+
"DownEncoderBlock2D"
|
16 |
+
],
|
17 |
+
"in_channels": 3,
|
18 |
+
"latent_channels": 4,
|
19 |
+
"layers_per_block": 2,
|
20 |
+
"norm_num_groups": 32,
|
21 |
+
"out_channels": 3,
|
22 |
+
"sample_size": 768,
|
23 |
+
"up_block_types": [
|
24 |
+
"UpDecoderBlock2D",
|
25 |
+
"UpDecoderBlock2D",
|
26 |
+
"UpDecoderBlock2D",
|
27 |
+
"UpDecoderBlock2D"
|
28 |
+
]
|
29 |
+
}
|
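The autoencoder config matches the standard Stable Diffusion `AutoencoderKL`: 3-channel RGB in and out, 4 latent channels, and 8x spatial downsampling (the first three encoder blocks each halve the resolution). A hedged shape check, again with a placeholder path and assuming the VAE weights are available:

```python
# Sketch: the VAE maps 768x768 RGB images to 4-channel latents at 1/8 resolution.
import torch
from diffusers import AutoencoderKL

vae = AutoencoderKL.from_pretrained("./VisCPM-Paint", subfolder="vae")  # placeholder path

with torch.no_grad():
    image = torch.randn(1, 3, 768, 768)               # in_channels=3, sample_size=768
    latents = vae.encode(image).latent_dist.sample()  # sample from the diagonal Gaussian posterior
    print(latents.shape)                              # torch.Size([1, 4, 96, 96])
```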
vocab.txt
ADDED
The diff for this file is too large to render.
See raw diff