Upload Typhoon2Audio2AudioForConditionalGeneration

Files changed (10)

README.md +199 -0
config.json +135 -0
configuration_typhoon2audio.py +166 -0
generation_config.json +4 -0
modeling_typhoon2audio.py +0 -0
pytorch_model-00001-of-00004.bin +3 -0
pytorch_model-00002-of-00004.bin +3 -0
pytorch_model-00003-of-00004.bin +3 -0
pytorch_model-00004-of-00004.bin +3 -0
pytorch_model.bin.index.json +0 -0

README.md ADDED

	@@ -0,0 +1,199 @@

+---
+library_name: transformers
+tags: []
+---
+# Model Card for Model ID
+<!-- Provide a quick summary of what the model is/does. -->
+## Model Details
+### Model Description
+<!-- Provide a longer summary of what this model is. -->
+This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
+- **Developed by:** [More Information Needed]
+- **Funded by [optional]:** [More Information Needed]
+- **Shared by [optional]:** [More Information Needed]
+- **Model type:** [More Information Needed]
+- **Language(s) (NLP):** [More Information Needed]
+- **License:** [More Information Needed]
+- **Finetuned from model [optional]:** [More Information Needed]
+### Model Sources [optional]
+<!-- Provide the basic links for the model. -->
+- **Repository:** [More Information Needed]
+- **Paper [optional]:** [More Information Needed]
+- **Demo [optional]:** [More Information Needed]
+## Uses
+<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+### Direct Use
+<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+[More Information Needed]
+### Downstream Use [optional]
+<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
+[More Information Needed]
+### Out-of-Scope Use
+<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
+[More Information Needed]
+## Bias, Risks, and Limitations
+<!-- This section is meant to convey both technical and sociotechnical limitations. -->
+[More Information Needed]
+### Recommendations
+<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
+Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+## How to Get Started with the Model
+Use the code below to get started with the model.
+[More Information Needed]
+## Training Details
+### Training Data
+<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
+[More Information Needed]
+### Training Procedure
+<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
+#### Preprocessing [optional]
+[More Information Needed]
+#### Training Hyperparameters
+- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
+#### Speeds, Sizes, Times [optional]
+<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
+[More Information Needed]
+## Evaluation
+<!-- This section describes the evaluation protocols and provides the results. -->
+### Testing Data, Factors & Metrics
+#### Testing Data
+<!-- This should link to a Dataset Card if possible. -->
+[More Information Needed]
+#### Factors
+<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
+[More Information Needed]
+#### Metrics
+<!-- These are the evaluation metrics being used, ideally with a description of why. -->
+[More Information Needed]
+### Results
+[More Information Needed]
+#### Summary
+## Model Examination [optional]
+<!-- Relevant interpretability work for the model goes here -->
+[More Information Needed]
+## Environmental Impact
+<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
+Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hardware Type:** [More Information Needed]
+- **Hours used:** [More Information Needed]
+- **Cloud Provider:** [More Information Needed]
+- **Compute Region:** [More Information Needed]
+- **Carbon Emitted:** [More Information Needed]
+## Technical Specifications [optional]
+### Model Architecture and Objective
+[More Information Needed]
+### Compute Infrastructure
+[More Information Needed]
+#### Hardware
+[More Information Needed]
+#### Software
+[More Information Needed]
+## Citation [optional]
+<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
+**BibTeX:**
+[More Information Needed]
+**APA:**
+[More Information Needed]
+## Glossary [optional]
+<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
+[More Information Needed]
+## More Information [optional]
+[More Information Needed]
+## Model Card Authors [optional]
+[More Information Needed]
+## Model Card Contact
+[More Information Needed]

config.json ADDED

	@@ -0,0 +1,135 @@

+{
+  "architectures": [
+    "Typhoon2Audio2AudioForConditionalGeneration"
+  ],
+  "attention_bias": false,
+  "attention_dropout": 0.0,
+  "auto_map": {
+    "AutoConfig": "configuration_typhoon2audio.Typhoon2AudioConfig",
+    "AutoModel": "modeling_typhoon2audio.Typhoon2Audio2AudioForConditionalGeneration"
+  },
+  "beats": {
+    "model_type": ""
+  },
+  "ctc_decoder_config": "(4,4096,32,11008)",
+  "ctc_loss_weight": 1.0,
+  "ctc_upsample_factor": 25,
+  "head_dim": 128,
+  "hidden_act": "silu",
+  "hidden_size": 4096,
+  "intermediate_size": 14336,
+  "llama_base_model": "scb10x/typhoon-2-llama31-8b-instruct-beta-v1",
+  "max_position_embeddings": 131072,
+  "mlp_bias": false,
+  "model_type": "typhoon2audio",
+  "num_attention_heads": 32,
+  "num_hidden_layers": 32,
+  "num_key_value_heads": 8,
+  "pretraining_tp": 1,
+  "rms_norm_eps": 1e-05,
+  "rope_scaling": {
+    "factor": 8.0,
+    "high_freq_factor": 4.0,
+    "low_freq_factor": 1.0,
+    "original_max_position_embeddings": 8192,
+    "rope_type": "llama3"
+  },
+  "rope_theta": 500000.0,
+  "second_per_frame": 0.333333,
+  "second_stride": 0.333333,
+  "speech_decoder_ignore_index": -100,
+  "speech_qformer_layer": 2,
+  "speech_qformer_token_num": 1,
+  "torch_dtype": "float16",
+  "transformers_version": "4.45.0",
+  "unit_vocab_size": 1000,
+  "vocab_size": 128256,
+  "vocoder_config": {
+    "code_hop_size": 320,
+    "dur_prediction_weight": 1.0,
+    "dur_predictor_params": {
+      "encoder_embed_dim": 512,
+      "var_pred_dropout": 0.5,
+      "var_pred_hidden_dim": 512,
+      "var_pred_kernel_size": 3
+    },
+    "embedding_dim": 512,
+    "hop_size": 256,
+    "model_in_dim": 512,
+    "n_fft": 1024,
+    "num_embeddings": 1000,
+    "num_freq": 1025,
+    "num_mels": 80,
+    "resblock": 1,
+    "resblock_dilation_sizes": [
+      [
+        1,
+        3,
+        5
+      ],
+      [
+        1,
+        3,
+        5
+      ],
+      [
+        1,
+        3,
+        5
+      ]
+    ],
+    "resblock_kernel_sizes": [
+      3,
+      7,
+      11
+    ],
+    "sampling_rate": 16000,
+    "segment_size": 8960,
+    "upsample_initial_channel": 512,
+    "upsample_kernel_sizes": [
+      11,
+      8,
+      8,
+      4,
+      4
+    ],
+    "upsample_rates": [
+      5,
+      4,
+      4,
+      2,
+      2
+    ],
+    "win_size": 1024
+  },
+  "vocoder_path": {
+    "filename": "checkpoint.pt",
+    "repo_id": "scb10x/unit-vocoder-gcp-th-v1-00206600"
+  },
+  "whisper": {
+    "apply_spec_augment": true,
+    "begin_suppress_tokens": [
+      220,
+      50257
+    ],
+    "bos_token_id": 50257,
+    "d_model": 1280,
+    "decoder_attention_heads": 20,
+    "decoder_ffn_dim": 5120,
+    "decoder_layers": 32,
+    "decoder_start_token_id": 50258,
+    "encoder_attention_heads": 20,
+    "encoder_ffn_dim": 5120,
+    "encoder_layers": 32,
+    "eos_token_id": 50257,
+    "mask_feature_length": 64,
+    "mask_feature_prob": 0.1,
+    "mask_time_prob": 0.1,
+    "max_length": 448,
+    "model_type": "whisper",
+    "num_hidden_layers": 32,
+    "num_mel_bins": 128,
+    "vocab_size": 51866
+  },
+  "whisper_extractor_feature_size": 128
+}
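
Reviewer note: the decoder fields in this config.json can be cross-checked with a short Python sketch. The values are copied from the diff above. Reading `ctc_decoder_config` as (layers, hidden size, heads, FFN size) is an assumption based on the field name and on the values matching `hidden_size` and `num_attention_heads`; the diff itself does not confirm the field order.

```python
import ast
import json

# Subset of config.json copied from the diff above.
config = json.loads("""
{
  "ctc_decoder_config": "(4,4096,32,11008)",
  "head_dim": 128,
  "hidden_size": 4096,
  "num_attention_heads": 32,
  "num_key_value_heads": 8
}
""")

# head_dim should equal hidden_size / num_attention_heads.
derived_head_dim = config["hidden_size"] // config["num_attention_heads"]

# Grouped-query attention: number of query heads sharing each key/value head.
queries_per_kv_head = config["num_attention_heads"] // config["num_key_value_heads"]

# The CTC decoder config is stored as a string, so it has to be parsed first.
# ASSUMPTION: the field order is (layers, hidden size, heads, FFN size).
ctc_layers, ctc_hidden, ctc_heads, ctc_ffn = ast.literal_eval(
    config["ctc_decoder_config"]
)

print(derived_head_dim, queries_per_kv_head)
print(ctc_layers, ctc_hidden, ctc_heads, ctc_ffn)
```

Storing the tuple as a string keeps the JSON flat, at the cost of requiring a parse step like `ast.literal_eval` wherever it is consumed.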

configuration_typhoon2audio.py ADDED

	@@ -0,0 +1,166 @@

+from transformers import PretrainedConfig, WhisperConfig
+class BEATsConfig(PretrainedConfig):
+    def __init__(self, cfg=None):
+        # update the default values to BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt
+        self.input_patch_size: int = 16  # patch size of patch embedding
+        self.embed_dim: int = 512  # patch embedding dimension
+        self.conv_bias: bool = False  # include bias in conv encoder
+        self.encoder_layers: int = 12  # num encoder layers in the transformer
+        self.encoder_embed_dim: int = 768  # encoder embedding dimension
+        self.encoder_ffn_embed_dim: int = 3072  # encoder embedding dimension for FFN
+        self.encoder_attention_heads: int = 12  # num encoder attention heads
+        self.activation_fn: str = "gelu"  # activation function to use
+        self.layer_wise_gradient_decay_ratio: float = 0.6  # ratio for layer-wise gradient decay
+        self.layer_norm_first: bool = False  # apply layernorm first in the transformer
+        self.deep_norm: bool = True  # apply deep_norm first in the transformer
+        # dropouts
+        self.dropout: float = 0.0  # dropout probability for the transformer
+        self.attention_dropout: float = 0.0  # dropout probability for attention weights
+        self.activation_dropout: float = 0.0  # dropout probability after activation in FFN
+        self.encoder_layerdrop: float = 0.05  # probability of dropping a transformer layer
+        self.dropout_input: float = 0.0  # dropout to apply to the input (after feat extr)
+        # positional embeddings
+        self.conv_pos: int = 128  # number of filters for convolutional positional embeddings
+        self.conv_pos_groups: int = 16  # number of groups for convolutional positional embedding
+        # relative position embedding
+        self.relative_position_embedding: bool = True  # apply relative position embedding
+        self.num_buckets: int = 320  # number of buckets for relative position embedding
+        self.max_distance: int = 800  # maximum distance for relative position embedding
+        self.gru_rel_pos: bool = True  # apply gated relative position embedding
+        # label predictor
+        self.finetuned_model: bool = True  # whether the model is a fine-tuned model.
+        self.predictor_dropout: float = 0.0  # dropout probability for the predictor
+        self.predictor_class: int = 527  # target class number for the predictor
+        if cfg is not None:
+            self.update(cfg)
+    def update(self, cfg: dict):
+        self.__dict__.update(cfg)
+class Typhoon2AudioConfig(PretrainedConfig):
+    model_type = "typhoon2audio"
+    def __init__(self, **kwargs):
+        # LLM -- Llama3
+        self.llama_base_model = "scb10x/typhoon-2-llama31-8b-instruct-beta-v1"
+        # Whisper
+        self.whisper_extractor_feature_size=128
+        self.whisper = WhisperConfig(
+            activation_dropout=0.0,
+            activation_function="gelu",
+            apply_spec_augment=True,
+            attention_dropout=0.0,
+            begin_suppress_tokens=[220, 50257],
+            bos_token_id=50257,
+            d_model=1280,
+            decoder_attention_heads=20,
+            decoder_ffn_dim=5120,
+            decoder_layerdrop=0.0,
+            decoder_layers=32,
+            decoder_start_token_id=50258,
+            dropout=0.0,
+            encoder_attention_heads=20,
+            encoder_ffn_dim=5120,
+            encoder_layerdrop=0.0,
+            encoder_layers=32,
+            eos_token_id=50257,
+            init_std=0.02,
+            mask_feature_length=64,
+            mask_feature_min_masks=0,
+            mask_feature_prob=0.1,
+            mask_time_length=10,
+            mask_time_min_masks=2,
+            mask_time_prob=0.1,
+            max_length=448,
+            max_source_positions=1500,
+            max_target_positions=448,
+            median_filter_width=7,
+            num_hidden_layers=32,
+            num_mel_bins=128,
+            pad_token_id=50256,
+            scale_embedding=False,
+            use_weighted_layer_sum=False,
+            vocab_size=51866,
+        )
+        # BEATs
+        self.beats = BEATsConfig()
+        # Speech QFormer
+        self.speech_qformer_token_num=1
+        self.speech_qformer_layer=2
+        self.second_per_frame=0.333333
+        self.second_stride=0.333333
+        # SpeechDecoder CTC
+        self.pretraining_tp = 1
+        self.ctc_decoder_config='(4,4096,32,11008)'
+        self.ctc_upsample_factor=25
+        self.ctc_loss_weight=1.0
+        self.unit_vocab_size=1000
+        self.speech_decoder_ignore_index=-100
+        self.attention_bias=False
+        self.attention_dropout=0.0
+        self.bos_token_id=128000
+        self.eos_token_id=128009
+        self.head_dim=128
+        self.hidden_act="silu"
+        self.hidden_size=4096
+        self.intermediate_size=14336
+        self.max_position_embeddings=131072
+        self.mlp_bias=False
+        self.num_attention_heads=32
+        self.num_hidden_layers=32
+        self.num_key_value_heads=8
+        self.rms_norm_eps=1e-05
+        self.rope_scaling={
+            "factor": 8.0,
+            "high_freq_factor": 4.0,
+            "low_freq_factor": 1.0,
+            "original_max_position_embeddings": 8192,
+            "rope_type": "llama3"
+        }
+        self.rope_theta=500000.0
+        self.vocab_size=128256
+        # Unit Vocoder (HiFiGAN)
+        self.vocoder_path = {
+            'repo_id': 'scb10x/unit-vocoder-gcp-th-v1-00206600',
+            'filename': 'checkpoint.pt'
+        }
+        self.vocoder_config = {
+            'resblock': 1,
+            'upsample_rates': [5, 4, 4, 2, 2],
+            'upsample_kernel_sizes':  [11, 8, 8, 4, 4],
+            'upsample_initial_channel': 512,
+            'resblock_kernel_sizes': [3, 7, 11],
+            'resblock_dilation_sizes': [[1, 3, 5], [1, 3, 5], [1, 3, 5]],
+            'num_embeddings': 1000,
+            'embedding_dim': 512,
+            'model_in_dim': 512,
+            'segment_size': 8960,
+            'code_hop_size': 320,
+            'num_mels': 80,
+            'num_freq': 1025,
+            'n_fft': 1024,
+            'hop_size': 256,
+            'win_size': 1024,
+            'sampling_rate': 16000,
+            'dur_prediction_weight': 1.0,
+            'dur_predictor_params': {
+                'encoder_embed_dim': 512,
+                'var_pred_hidden_dim': 512,
+                'var_pred_kernel_size': 3,
+                'var_pred_dropout': 0.5
+            }
+        }
+        super().__init__(**kwargs)
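
Reviewer note: the unit vocoder settings above are internally consistent, which a few lines of Python can confirm. The numbers are copied from `vocoder_config`; the reading that one discrete unit corresponds to `code_hop_size` waveform samples is an assumption based on the field names, since the vocoder implementation is not shown in this diff.

```python
import math

# Values copied from vocoder_config in the diff above.
upsample_rates = [5, 4, 4, 2, 2]
code_hop_size = 320
segment_size = 8960
sampling_rate = 16000

# Total upsampling of the generator: output samples per input frame.
total_upsampling = math.prod(upsample_rates)

# ASSUMPTION: each discrete unit spans code_hop_size waveform samples.
seconds_per_unit = code_hop_size / sampling_rate
units_per_second = sampling_rate / code_hop_size

# Number of units covered by one training segment under the same assumption.
units_per_segment = segment_size // code_hop_size

print(total_upsampling, seconds_per_unit, units_per_second, units_per_segment)
```

Under that assumption, the product of `upsample_rates` matching `code_hop_size` means the generator emits exactly one hop of audio per unit, with no extra resampling step.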

generation_config.json ADDED

	@@ -0,0 +1,4 @@

+{
+  "_from_model_config": true,
+  "transformers_version": "4.45.0"
+}

modeling_typhoon2audio.py ADDED

The diff for this file is too large to render. See raw diff

pytorch_model-00001-of-00004.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:18a5ab4024b1f86f96f88917e374fa65a20faed26604387f537e5cef8fd02a72
+size 4884845301

pytorch_model-00002-of-00004.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:791f35f49a17984139a624e681d6034265b9f0af6e703e24fda5b61c72ffbf85
+size 4915939914

pytorch_model-00003-of-00004.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:fb1b59f02df617d2da6d37ec18556fdc758b7d4851c1e160ee5ba3035f376c21
+size 4915939978

pytorch_model-00004-of-00004.bin ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:7ec6b6fb06af427ed91f3cdc51c9228bf0e13eeb0d98c88f4a397a15ef841170
+size 4647114854
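
Reviewer note: the four `.bin` entries are Git LFS pointer files, not the weights themselves. Each pointer records only a spec version, a SHA-256 object ID, and a byte size. A minimal sketch for reading a pointer and totalling the shard sizes listed in this commit follows; `parse_lfs_pointer` is a hypothetical helper written for illustration, not part of any library.

```python
def parse_lfs_pointer(text: str) -> dict:
    """Parse a Git LFS pointer file into a dict of its key/value lines."""
    fields = {}
    for line in text.strip().splitlines():
        # Each pointer line is "<key> <value>", split on the first space.
        key, _, value = line.partition(" ")
        fields[key] = value
    return fields


# Pointer for the first shard, copied from the diff above.
pointer_text = """\
version https://git-lfs.github.com/spec/v1
oid sha256:18a5ab4024b1f86f96f88917e374fa65a20faed26604387f537e5cef8fd02a72
size 4884845301
"""

first_shard = parse_lfs_pointer(pointer_text)

# Byte sizes of all four shards, copied from the pointers in this commit.
shard_sizes = [4884845301, 4915939914, 4915939978, 4647114854]
total_bytes = sum(shard_sizes)
total_gib = total_bytes / 2**30

print(first_shard["oid"], first_shard["size"])
print(total_bytes, round(total_gib, 2))
```

The total gives a rough idea of the download size and of the disk space needed before loading the checkpoint.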

pytorch_model.bin.index.json ADDED

The diff for this file is too large to render. See raw diff