tanthinhdt committed on
Commit 0bce006 · verified · 1 Parent(s): 0afb517

Upload feature extractor

Files changed (7)
  1. README.md +199 -0
  2. configuration.py +188 -0
  3. encoder.py +110 -0
  4. modelling.py +797 -0
  5. preprocessor_config.json +10 -0
  6. resnet.py +216 -0
  7. utils.py +187 -0
README.md ADDED
@@ -0,0 +1,199 @@
1
+ ---
2
+ library_name: transformers
3
+ tags: []
4
+ ---
5
+
6
+ # Model Card for Model ID
7
+
8
+ <!-- Provide a quick summary of what the model is/does. -->
9
+
10
+
11
+
12
+ ## Model Details
13
+
14
+ ### Model Description
15
+
16
+ <!-- Provide a longer summary of what this model is. -->
17
+
18
+ This is the model card of a 🤗 Transformers model that has been pushed to the Hub. This model card has been automatically generated.
19
+
20
+ - **Developed by:** [More Information Needed]
21
+ - **Funded by [optional]:** [More Information Needed]
22
+ - **Shared by [optional]:** [More Information Needed]
23
+ - **Model type:** [More Information Needed]
24
+ - **Language(s) (NLP):** [More Information Needed]
25
+ - **License:** [More Information Needed]
26
+ - **Finetuned from model [optional]:** [More Information Needed]
27
+
28
+ ### Model Sources [optional]
29
+
30
+ <!-- Provide the basic links for the model. -->
31
+
32
+ - **Repository:** [More Information Needed]
33
+ - **Paper [optional]:** [More Information Needed]
34
+ - **Demo [optional]:** [More Information Needed]
35
+
36
+ ## Uses
37
+
38
+ <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
39
+
40
+ ### Direct Use
41
+
42
+ <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
43
+
44
+ [More Information Needed]
45
+
46
+ ### Downstream Use [optional]
47
+
48
+ <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
49
+
50
+ [More Information Needed]
51
+
52
+ ### Out-of-Scope Use
53
+
54
+ <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
55
+
56
+ [More Information Needed]
57
+
58
+ ## Bias, Risks, and Limitations
59
+
60
+ <!-- This section is meant to convey both technical and sociotechnical limitations. -->
61
+
62
+ [More Information Needed]
63
+
64
+ ### Recommendations
65
+
66
+ <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
67
+
68
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
69
+
70
+ ## How to Get Started with the Model
71
+
72
+ Use the code below to get started with the model.
73
+
74
+ [More Information Needed]
75
+
76
+ ## Training Details
77
+
78
+ ### Training Data
79
+
80
+ <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
81
+
82
+ [More Information Needed]
83
+
84
+ ### Training Procedure
85
+
86
+ <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
87
+
88
+ #### Preprocessing [optional]
89
+
90
+ [More Information Needed]
91
+
92
+
93
+ #### Training Hyperparameters
94
+
95
+ - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
96
+
97
+ #### Speeds, Sizes, Times [optional]
98
+
99
+ <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
100
+
101
+ [More Information Needed]
102
+
103
+ ## Evaluation
104
+
105
+ <!-- This section describes the evaluation protocols and provides the results. -->
106
+
107
+ ### Testing Data, Factors & Metrics
108
+
109
+ #### Testing Data
110
+
111
+ <!-- This should link to a Dataset Card if possible. -->
112
+
113
+ [More Information Needed]
114
+
115
+ #### Factors
116
+
117
+ <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
118
+
119
+ [More Information Needed]
120
+
121
+ #### Metrics
122
+
123
+ <!-- These are the evaluation metrics being used, ideally with a description of why. -->
124
+
125
+ [More Information Needed]
126
+
127
+ ### Results
128
+
129
+ [More Information Needed]
130
+
131
+ #### Summary
132
+
133
+
134
+
135
+ ## Model Examination [optional]
136
+
137
+ <!-- Relevant interpretability work for the model goes here -->
138
+
139
+ [More Information Needed]
140
+
141
+ ## Environmental Impact
142
+
143
+ <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
144
+
145
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
146
+
147
+ - **Hardware Type:** [More Information Needed]
148
+ - **Hours used:** [More Information Needed]
149
+ - **Cloud Provider:** [More Information Needed]
150
+ - **Compute Region:** [More Information Needed]
151
+ - **Carbon Emitted:** [More Information Needed]
152
+
153
+ ## Technical Specifications [optional]
154
+
155
+ ### Model Architecture and Objective
156
+
157
+ [More Information Needed]
158
+
159
+ ### Compute Infrastructure
160
+
161
+ [More Information Needed]
162
+
163
+ #### Hardware
164
+
165
+ [More Information Needed]
166
+
167
+ #### Software
168
+
169
+ [More Information Needed]
170
+
171
+ ## Citation [optional]
172
+
173
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
174
+
175
+ **BibTeX:**
176
+
177
+ [More Information Needed]
178
+
179
+ **APA:**
180
+
181
+ [More Information Needed]
182
+
183
+ ## Glossary [optional]
184
+
185
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
186
+
187
+ [More Information Needed]
188
+
189
+ ## More Information [optional]
190
+
191
+ [More Information Needed]
192
+
193
+ ## Model Card Authors [optional]
194
+
195
+ [More Information Needed]
196
+
197
+ ## Model Card Contact
198
+
199
+ [More Information Needed]
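
A minimal loading sketch for the "How to Get Started with the Model" section above. It assumes the model weights, `config.json`, and the corresponding `auto_map` entries are pushed to a Hub repository, and that the repository's Python dependencies (e.g. fairseq, peft) are installed; the repo id `user/avsp-llm` is a placeholder.

```python
from transformers import AutoFeatureExtractor, AutoModel

# Placeholder repo id; replace with the actual Hub repository.
repo_id = "user/avsp-llm"

# trust_remote_code is required because the config, feature extractor, and model
# classes (AVSPLLMConfig, AVSPLLMFeatureExtractor, AVSPLLMModel) live in this repo.
feature_extractor = AutoFeatureExtractor.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

print(feature_extractor.size, feature_extractor.num_frames)  # 88 76
```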
configuration.py ADDED
@@ -0,0 +1,188 @@
1
+ from typing import Tuple
2
+ from transformers import PretrainedConfig
3
+
4
+
5
+ class AVHubertConfig(PretrainedConfig):
6
+ model_type = "av_hubert"
7
+
8
+ def __init__(
9
+ self,
10
+ label_rate: int = 25,
11
+ sample_rate: int = 25,
12
+ input_modality: str = "video",
13
+ extractor_mode: str = "default",
14
+ encoder_layers: int = 24,
15
+ encoder_embed_dim: int = 1024,
16
+ encoder_ffn_embed_dim: int = 4096,
17
+ encoder_attention_heads: int = 16,
18
+ activation_fn: str = "gelu",
19
+ dropout: float = 0.1,
20
+ attention_dropout: float = 0.1,
21
+ activation_dropout: float = 0.1,
22
+ encoder_layerdrop: float = 0.0,
23
+ dropout_input: float = 0.0,
24
+ dropout_features: float = 0.0,
25
+ final_dim: int = 256,
26
+ untie_final_proj: bool = False,
27
+ layer_norm_first: bool = False,
28
+ conv_feature_layers: str = "[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2",
29
+ conv_bias: bool = False,
30
+ logit_temp: float = 0.1,
31
+ target_glu: bool = False,
32
+ feature_grad_mult: float = 1.0,
33
+ mask_length_audio: int = 10,
34
+ mask_prob_audio: float = 0.65,
35
+ mask_length_image: int = 10,
36
+ mask_prob_image: float = 0.65,
37
+ mask_selection: str = "static",
38
+ mask_other: float = 0.0,
39
+ no_mask_overlap: bool = False,
40
+ mask_min_space: int = 1,
41
+ mask_channel_length: int = 64,
42
+ mask_channel_prob: float = 0.5,
43
+ mask_channel_selection: str = "static",
44
+ mask_channel_other: float = 0.0,
45
+ no_mask_channel_overlap: bool = False,
46
+ mask_channel_min_space: int = 1,
47
+ conv_pos: int = 128,
48
+ conv_pos_groups: int = 16,
49
+ latent_temp: Tuple[float, float, float] = (2.0, 0.5, 0.999995),
50
+ skip_masked: bool = False,
51
+ skip_nomask: bool = False,
52
+ resnet_relu_type: str = "prelu",
53
+ resnet_weights: str = None,
54
+ sim_type: str = "cosine",
55
+ sub_encoder_layers: int = 0,
56
+ audio_feat_dim: int = 104,
57
+ modality_dropout: float = 0.0,
58
+ audio_dropout: float = 0.0,
59
+ modality_fuse: str = "concat",
60
+ selection_type: str = "same_other_seq",
61
+ masking_type: str = "input",
62
+ decoder_embed_dim: int = 2560,
63
+ decoder_ffn_embed_dim: int = 3072,
64
+ decoder_layers: int = 6,
65
+ decoder_layerdrop: float = 0.0,
66
+ decoder_attention_heads: int = 4,
67
+ decoder_learned_pos: bool = False,
68
+ decoder_normalize_before: bool = False,
69
+ no_token_positional_embeddings: bool = False,
70
+ decoder_dropout: float = 0.1,
71
+ decoder_attention_dropout: float = 0.1,
72
+ decoder_activation_dropout: float = 0.0,
73
+ max_target_positions: int = 2048,
74
+ share_decoder_input_output_embed: bool = False,
75
+ no_scale_embedding: bool = True,
76
+ num_classes: int = 2004,
77
+ feature_ds_rate: int = 1,
78
+ **kwargs,
79
+ ) -> None:
80
+ super().__init__(**kwargs)
81
+
82
+ self.label_rate = label_rate
83
+ self.sample_rate = sample_rate
84
+ self.input_modality = input_modality
85
+ self.extractor_mode = extractor_mode
86
+ self.encoder_layers = encoder_layers
87
+ self.encoder_embed_dim = encoder_embed_dim
88
+ self.encoder_ffn_embed_dim = encoder_ffn_embed_dim
89
+ self.encoder_attention_heads = encoder_attention_heads
90
+ self.activation_fn = activation_fn
91
+ self.dropout = dropout
92
+ self.attention_dropout = attention_dropout
93
+ self.activation_dropout = activation_dropout
94
+ self.encoder_layerdrop = encoder_layerdrop
95
+ self.dropout_input = dropout_input
96
+ self.dropout_features = dropout_features
97
+ self.final_dim = final_dim
98
+ self.untie_final_proj = untie_final_proj
99
+ self.layer_norm_first = layer_norm_first
100
+ self.conv_feature_layers = conv_feature_layers
101
+ self.conv_bias = conv_bias
102
+ self.logit_temp = logit_temp
103
+ self.target_glu = target_glu
104
+ self.feature_grad_mult = feature_grad_mult
105
+ self.mask_length_audio = mask_length_audio
106
+ self.mask_prob_audio = mask_prob_audio
107
+ self.mask_length_image = mask_length_image
108
+ self.mask_prob_image = mask_prob_image
109
+ self.mask_selection = mask_selection
110
+ self.mask_other = mask_other
111
+ self.no_mask_overlap = no_mask_overlap
112
+ self.mask_min_space = mask_min_space
113
+ self.mask_channel_length = mask_channel_length
114
+ self.mask_channel_prob = mask_channel_prob
115
+ self.mask_channel_selection = mask_channel_selection
116
+ self.mask_channel_other = mask_channel_other
117
+ self.no_mask_channel_overlap = no_mask_channel_overlap
118
+ self.mask_channel_min_space = mask_channel_min_space
119
+ self.conv_pos = conv_pos
120
+ self.conv_pos_groups = conv_pos_groups
121
+ self.latent_temp = latent_temp
122
+ self.skip_masked = skip_masked
123
+ self.skip_nomask = skip_nomask
124
+ self.resnet_relu_type = resnet_relu_type
125
+ self.resnet_weights = resnet_weights
126
+ self.sim_type = sim_type
127
+ self.sub_encoder_layers = sub_encoder_layers
128
+ self.audio_feat_dim = audio_feat_dim
129
+ self.modality_dropout = modality_dropout
130
+ self.audio_dropout = audio_dropout
131
+ self.modality_fuse = modality_fuse
132
+ self.selection_type = selection_type
133
+ self.masking_type = masking_type
134
+ self.decoder_embed_dim = decoder_embed_dim
135
+ self.decoder_ffn_embed_dim = decoder_ffn_embed_dim
136
+ self.decoder_layers = decoder_layers
137
+ self.decoder_layerdrop = decoder_layerdrop
138
+ self.decoder_attention_heads = decoder_attention_heads
139
+ self.decoder_learned_pos = decoder_learned_pos
140
+ self.decoder_normalize_before = decoder_normalize_before
141
+ self.no_token_positional_embeddings = no_token_positional_embeddings
142
+ self.decoder_dropout = decoder_dropout
143
+ self.decoder_attention_dropout = decoder_attention_dropout
144
+ self.decoder_activation_dropout = decoder_activation_dropout
145
+ self.max_target_positions = max_target_positions
146
+ self.share_decoder_input_output_embed = share_decoder_input_output_embed
147
+ self.no_scale_embedding = no_scale_embedding
148
+ self.num_classes = num_classes
149
+ self.feature_ds_rate = feature_ds_rate
150
+
151
+
152
+ class AVSPLLMConfig(AVHubertConfig):
153
+ model_type = "avsp_llm"
154
+
155
+ def __init__(
156
+ self,
157
+ llm_ckpt_path: str = "vilm/vinallama-2.7b",
158
+ cache_dir: str = "models/huggingface",
159
+ no_pretrained_weights: bool = False,
160
+ final_dropout: float = 0.1,
161
+ apply_mask: bool = False,
162
+ mask_length: int = 10,
163
+ mask_prob: float = 0.5,
164
+ masking_updates: int = 0,
165
+ layerdrop: float = 0.0,
166
+ normalize: bool = False,
167
+ data: str = None,
168
+ w2v_args: dict = None,
169
+ freeze_finetune_updates: int = 0,
170
+ km_path: str = "model.km",
171
+ **kwargs,
172
+ ) -> None:
173
+ super().__init__(**kwargs)
174
+
175
+ self.llm_ckpt_path = llm_ckpt_path
176
+ self.cache_dir = cache_dir
177
+ self.no_pretrained_weights = no_pretrained_weights
178
+ self.final_dropout = final_dropout
179
+ self.apply_mask = apply_mask
180
+ self.mask_length = mask_length
181
+ self.mask_prob = mask_prob
182
+ self.masking_updates = masking_updates
183
+ self.layerdrop = layerdrop
184
+ self.normalize = normalize
185
+ self.data = data
186
+ self.w2v_args = w2v_args
187
+ self.freeze_finetune_updates = freeze_finetune_updates
188
+ self.km_path = km_path
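
As a quick sanity check of the defaults above, the configs can be instantiated and serialized directly. This is a sketch; the flat `from configuration import ...` import assumes it is run from the repository directory.

```python
from configuration import AVSPLLMConfig

config = AVSPLLMConfig()               # inherits all AVHubertConfig defaults
print(config.model_type)               # "avsp_llm"
print(config.encoder_embed_dim)        # 1024
print(config.llm_ckpt_path)            # "vilm/vinallama-2.7b"

config.save_pretrained("avsp_llm")     # writes avsp_llm/config.json
reloaded = AVSPLLMConfig.from_pretrained("avsp_llm")
assert reloaded.decoder_embed_dim == 2560
```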
encoder.py ADDED
@@ -0,0 +1,110 @@
1
+ import math
2
+ import torch
3
+ import numpy as np
4
+ import torch.nn as nn
5
+ import torch.nn.functional as F
6
+ from typing import List, Optional, Tuple
7
+ from .configuration import AVHubertConfig
8
+ from fairseq.utils import index_put
9
+ from fairseq.modules import LayerNorm, SamePad
10
+ from fairseq.models.wav2vec.wav2vec2 import TransformerSentenceEncoderLayer
11
+ from fairseq.modules.transformer_sentence_encoder import init_bert_params
12
+
13
+
14
+ class TransformerEncoder(nn.Module):
15
+ def __init__(self, config: AVHubertConfig) -> None:
16
+ super().__init__()
17
+
18
+ self.dropout = config.dropout
19
+ self.embedding_dim = config.encoder_embed_dim
20
+
21
+ self.pos_conv = nn.Conv1d(
22
+ self.embedding_dim,
23
+ self.embedding_dim,
24
+ kernel_size=config.conv_pos,
25
+ padding=config.conv_pos // 2,
26
+ groups=config.conv_pos_groups,
27
+ )
28
+ dropout = 0
29
+ std = math.sqrt((4 * (1.0 - dropout)) / (config.conv_pos * self.embedding_dim))
30
+ nn.init.normal_(self.pos_conv.weight, mean=0, std=std)
31
+ nn.init.constant_(self.pos_conv.bias, 0)
32
+
33
+ self.pos_conv = nn.utils.weight_norm(
34
+ self.pos_conv, name="weight", dim=2
35
+ )
36
+ self.pos_conv = nn.Sequential(self.pos_conv, SamePad(config.conv_pos), nn.GELU())
37
+
38
+ self.layers = nn.ModuleList(
39
+ [
40
+ TransformerSentenceEncoderLayer(
41
+ embedding_dim=self.embedding_dim,
42
+ ffn_embedding_dim=config.encoder_ffn_embed_dim,
43
+ num_attention_heads=config.encoder_attention_heads,
44
+ dropout=self.dropout,
45
+ attention_dropout=config.attention_dropout,
46
+ activation_dropout=config.activation_dropout,
47
+ activation_fn=config.activation_fn,
48
+ layer_norm_first=config.layer_norm_first,
49
+ )
50
+ for _ in range(config.encoder_layers)
51
+ ]
52
+ )
53
+
54
+ self.layer_norm_first = config.layer_norm_first
55
+ self.layer_norm = LayerNorm(self.embedding_dim)
56
+ self.layerdrop = config.encoder_layerdrop
57
+
58
+ self.apply(init_bert_params)
59
+
60
+ def forward(
61
+ self,
62
+ x: torch.Tensor,
63
+ padding_mask: Optional[torch.Tensor] = None,
64
+ layer: Optional[int] = None,
65
+ ) -> Tuple[torch.Tensor, List[Tuple[torch.Tensor, torch.Tensor]]]:
66
+ x, layer_results = self.extract_features(x, padding_mask, layer)
67
+ if self.layer_norm_first and layer is None:
68
+ x = self.layer_norm(x)
69
+ return x, layer_results
70
+
71
+ def extract_features(
72
+ self,
73
+ x: torch.Tensor,
74
+ padding_mask: Optional[torch.Tensor] = None,
75
+ tgt_layer: Optional[int] = None,
76
+ ) -> Tuple[torch.Tensor, List[Tuple[torch.Tensor, torch.Tensor]]]:
77
+ if padding_mask is not None:
78
+ x = index_put(x, padding_mask, 0)
79
+
80
+ x_conv = self.pos_conv(x.transpose(1, 2))
81
+ x_conv = x_conv.transpose(1, 2)
82
+ x = x + x_conv
83
+
84
+ if not self.layer_norm_first:
85
+ x = self.layer_norm(x)
86
+
87
+ x = F.dropout(x, p=self.dropout, training=self.training)
88
+
89
+ # B x T x C -> T x B x C
90
+ x = x.transpose(0, 1)
91
+
92
+ layer_results = []
93
+ r = None
94
+ for i, layer in enumerate(self.layers):
95
+ dropout_probability = np.random.random()
96
+ if not self.training or (dropout_probability > self.layerdrop):
97
+ x, z = layer(x, self_attn_padding_mask=padding_mask, need_weights=False)
98
+ if tgt_layer is not None:
99
+ layer_results.append((x, z))
100
+ if i == tgt_layer:
101
+ r = x
102
+ break
103
+
104
+ if r is not None:
105
+ x = r
106
+
107
+ # T x B x C -> B x T x C
108
+ x = x.transpose(0, 1)
109
+
110
+ return x, layer_results
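
A shape-level sketch of the encoder above, using a deliberately small config so it runs quickly (the shipped defaults use 24 layers with a 1024-dim embedding). It assumes fairseq is installed and the repository files are importable as a package, here hypothetically named `avsp_llm`, so the relative imports resolve.

```python
import torch
from avsp_llm.configuration import AVHubertConfig   # hypothetical package name
from avsp_llm.encoder import TransformerEncoder

config = AVHubertConfig(
    encoder_layers=2,
    encoder_embed_dim=64,
    encoder_ffn_embed_dim=128,
    encoder_attention_heads=4,
    conv_pos=16,
    conv_pos_groups=4,
)
encoder = TransformerEncoder(config).eval()

x = torch.randn(2, 50, 64)          # (batch, time, embed_dim)
with torch.no_grad():
    out, layer_results = encoder(x)
print(out.shape)                    # torch.Size([2, 50, 64]); layer_results is [] when layer=None
```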
modelling.py ADDED
@@ -0,0 +1,797 @@
1
+ import torch
2
+ import logging
3
+ import contextlib
4
+ import numpy as np
5
+ import torch.nn as nn
6
+ from pathlib import Path
7
+ from .resnet import ResNetEncoder
8
+ from .encoder import TransformerEncoder
9
+ from .configuration import AVHubertConfig, AVSPLLMConfig
10
+ from .utils import compute_mask_indices, load_kmeans_model
11
+ from typing import Optional, Tuple, List, Dict, Any
12
+ from peft import get_peft_model, LoraConfig
13
+ from fairseq.modules import GradMultiply, LayerNorm
14
+ from transformers.modeling_outputs import CausalLMOutputWithPast
15
+ from transformers import (
16
+ FeatureExtractionMixin,
17
+ PreTrainedModel,
18
+ BitsAndBytesConfig,
19
+ AutoModelForCausalLM,
20
+ GenerationConfig,
21
+ )
22
+
23
+
24
+ logging.root.setLevel(logging.WARNING)
25
+
26
+
27
+ class AVHubertFeatureExtractor(FeatureExtractionMixin):
28
+ def __init__(self, config: AVHubertConfig = AVHubertConfig(), **kwargs) -> None:
29
+ super().__init__(**kwargs)
30
+ self.audio_feat_dim = config.audio_feat_dim
31
+
32
+ self.size = 88
33
+ self.num_frames = 76
34
+ self.num_channels = 1
35
+
36
+
37
+ class AVSPLLMFeatureExtractor(AVHubertFeatureExtractor):
38
+ def __init__(self, config: AVSPLLMConfig = AVSPLLMConfig(), **kwargs) -> None:
39
+ super().__init__(config, **kwargs)
40
+
41
+
42
+ class AVHubertVideoFeatureEncoder(nn.Module):
43
+ def __init__(self, config: AVHubertConfig) -> None:
44
+ super().__init__()
45
+ self.resnet = ResNetEncoder(relu_type=config.resnet_relu_type)
46
+ self.proj = nn.Linear(self.resnet.backend_out, config.encoder_embed_dim)
47
+ self.encoder = (
48
+ TransformerEncoder(config)
49
+ if config.sub_encoder_layers > 0
50
+ else None
51
+ )
52
+
53
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
54
+ x = self.resnet(x)
55
+ x = self.proj(x.transpose(1, 2))
56
+ if self.encoder is not None:
57
+ x = self.encoder(x)[0].transpose(1, 2)
58
+ else:
59
+ x = x.transpose(1, 2)
60
+ return x
61
+
62
+
63
+ class AVHubertAudioFeatureEncoder(nn.Module):
64
+ def __init__(self, config: AVHubertConfig) -> None:
65
+ super().__init__()
66
+ self.proj = nn.Linear(config.audio_feat_dim, config.encoder_embed_dim)
67
+ self.encoder = (
68
+ TransformerEncoder(config)
69
+ if config.sub_encoder_layers > 0
70
+ else None
71
+ )
72
+
73
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
74
+ x = self.proj(x.transpose(1, 2))
75
+ if self.encoder is not None:
76
+ x = self.encoder(x)[0].transpose(1, 2)
77
+ else:
78
+ x = x.transpose(1, 2)
79
+ return x
80
+
81
+
82
+ class AVHubertModel(PreTrainedModel):
83
+ config_class = AVHubertConfig
84
+
85
+ def __init__(
86
+ self,
87
+ config: AVHubertConfig = AVHubertConfig(),
88
+ dictionaries: List = [None],
89
+ ) -> None:
90
+ super().__init__(config=config)
91
+ label_rate = config.label_rate
92
+ feature_ds_rate = config.feature_ds_rate
93
+ sample_rate = config.sample_rate
94
+ self.feat2tar_ratio = label_rate * feature_ds_rate / sample_rate
95
+
96
+ self.feature_extractor_video = AVHubertVideoFeatureEncoder(config)
97
+ self.feature_extractor_audio = AVHubertAudioFeatureEncoder(config)
98
+
99
+ if config.modality_fuse == "concat":
100
+ self.encoder_embed_dim = config.encoder_embed_dim * 2
101
+ elif config.modality_fuse == "add":
102
+ self.encoder_embed_dim = config.encoder_embed_dim
103
+
104
+ self.post_extract_proj = (
105
+ nn.Linear(self.encoder_embed_dim, config.encoder_embed_dim)
106
+ if self.encoder_embed_dim != config.encoder_embed_dim
107
+ else None
108
+ )
109
+
110
+ self.dropout_input = nn.Dropout(config.dropout_input)
111
+ self.dropout_features = nn.Dropout(config.dropout_features)
112
+
113
+ if self.config.final_dim > 0:
114
+ final_dim = config.final_dim
115
+ else:
116
+ final_dim = config.encoder_embed_dim
117
+
118
+ self.mask_emb = nn.Parameter(
119
+ torch.FloatTensor(config.audio_feat_dim).uniform_()
120
+ if config.masking_type == "input"
121
+ else torch.FloatTensor(config.encoder_embed_dim).uniform_()
122
+ )
123
+
124
+ self.encoder = TransformerEncoder(self.config)
125
+ self.layer_norm = LayerNorm(self.encoder_embed_dim)
126
+
127
+ self.target_glu = None
128
+ if config.target_glu:
129
+ self.target_glu = nn.Sequential(
130
+ nn.Linear(config.final_dim, config.final_dim * 2),
131
+ nn.GLU(),
132
+ )
133
+
134
+ if config.untie_final_proj:
135
+ self.final_proj = nn.Linear(
136
+ config.encoder_embed_dim,
137
+ final_dim * len(dictionaries),
138
+ )
139
+ else:
140
+ self.final_proj = nn.Linear(config.encoder_embed_dim, final_dim)
141
+
142
+ # modules below are not needed during fine-tuning
143
+ if any([d is None for d in dictionaries]):
144
+ self.num_classes = [config.num_classes]
145
+ else:
146
+ self.num_classes = [len(d) for d in dictionaries]
147
+ self.label_embs_concat = nn.Parameter(
148
+ torch.FloatTensor(sum(self.num_classes), final_dim)
149
+ )
150
+ nn.init.uniform_(self.label_embs_concat)
151
+
152
+ def apply_input_mask(
153
+ self,
154
+ x: torch.Tensor,
155
+ padding_mask: torch.Tensor,
156
+ target_list: List[torch.Tensor],
157
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
158
+ B, C, T = x.shape[:3]
159
+ is_audio = True if len(x.shape) == 3 else False
160
+
161
+ if is_audio:
162
+ mask_prob = self.config.mask_prob_audio
163
+ mask_length = self.config.mask_length_audio
164
+ else:
165
+ mask_prob = self.config.mask_prob_image
166
+ mask_length = self.config.mask_length_image
167
+
168
+ if mask_prob > 0:
169
+ mask_indices, starts, ends, batch_indexes = compute_mask_indices(
170
+ (B, T),
171
+ padding_mask,
172
+ mask_prob,
173
+ mask_length,
174
+ self.config.mask_selection,
175
+ self.config.mask_other,
176
+ min_masks=2,
177
+ no_overlap=self.config.no_mask_overlap,
178
+ min_space=self.config.mask_min_space,
179
+ )
180
+ mask_indices = torch.from_numpy(mask_indices).to(x.device)
181
+ x = x.transpose(1, 2).contiguous() # [B, T, C, H, W]
182
+ if B == 1:
183
+ x[mask_indices] = 0
184
+ elif is_audio:
185
+ x[mask_indices] = self.mask_emb
186
+ elif self.config.selection_type == "same_other_seq":
187
+ perm = (torch.arange(B) + torch.randint(low=1, high=B, size=(1,))) % B
188
+ x_perm = x[perm]
189
+ x[mask_indices] = x_perm[mask_indices]
190
+ elif self.config.selection_type == "same_seq":
191
+ batch_indexes_, other_indexes = [], []
192
+ for batch_index, start, end in zip(batch_indexes, starts, ends):
193
+ length = end - start
194
+ other_start = np.setdiff1d(
195
+ np.arange(T), np.arange(max(0, start - length), end)
196
+ )
197
+ if len(other_start) > 0:
198
+ other_start = np.random.choice(other_start, size=1)
199
+ else:
200
+ other_start = 0
201
+ other_end = other_start + length
202
+ other_indexes.append(
203
+ np.arange(other_start, other_end).clip(max=T - 1)
204
+ )
205
+ batch_indexes_.append(
206
+ np.zeros([length], dtype=np.int64) + batch_index
207
+ )
208
+ batch_indexes = np.concatenate(batch_indexes_)
209
+ other_indexes = np.concatenate(other_indexes)
210
+ x[mask_indices] = x[batch_indexes, other_indexes]
211
+ x = x.transpose(1, 2).contiguous()
212
+ else:
213
+ mask_indices = None
214
+
215
+ if self.config.mask_channel_prob > 0:
216
+ logging.warning("No mask channel prob for input masking")
217
+ return x, mask_indices
218
+
219
+ def apply_feature_mask(
220
+ self,
221
+ x: torch.Tensor,
222
+ padding_mask: torch.Tensor,
223
+ target_list: List[torch.Tensor],
224
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
225
+ B, T, C = x.shape
226
+ assert all((
227
+ self.config.mask_prob_audio == self.config.mask_prob_image,
228
+ self.config.mask_length_audio == self.config.mask_length_image,
229
+ )), "masking prob/length for image/audio be same for feature masking"
230
+
231
+ mask_prob = self.config.mask_prob_audio
232
+ mask_length = self.config.mask_length_image
233
+ if mask_prob > 0:
234
+ mask_indices, _, _, _ = compute_mask_indices(
235
+ (B, T),
236
+ padding_mask,
237
+ mask_prob,
238
+ mask_length,
239
+ self.config.mask_selection,
240
+ self.config.mask_other,
241
+ min_masks=2,
242
+ no_overlap=self.config.no_mask_overlap,
243
+ min_space=self.config.mask_min_space,
244
+ )
245
+ mask_indices = torch.from_numpy(mask_indices).to(x.device)
246
+ x[mask_indices] = self.mask_emb
247
+ else:
248
+ mask_indices = None
249
+
250
+ if self.config.mask_channel_prob > 0:
251
+ mask_channel_indices, _, _, _ = compute_mask_indices(
252
+ (B, C),
253
+ None,
254
+ self.config.mask_channel_prob,
255
+ self.config.mask_channel_length,
256
+ self.config.mask_channel_selection,
257
+ self.config.mask_channel_other,
258
+ no_overlap=self.config.no_mask_channel_overlap,
259
+ min_space=self.config.mask_channel_min_space,
260
+ )
261
+ mask_channel_indices = (
262
+ torch.from_numpy(mask_channel_indices)
263
+ .to(x.device)
264
+ .unsqueeze(1)
265
+ .expand(-1, T, -1)
266
+ )
267
+ x[mask_channel_indices] = 0
268
+
269
+ return x, mask_indices
270
+
271
+ def forward_features(
272
+ self,
273
+ source: Dict[str, torch.Tensor],
274
+ modality: str,
275
+ ) -> torch.Tensor:
276
+ extractor = getattr(self, f"feature_extractor_{modality}")
277
+ if self.config.feature_grad_mult > 0:
278
+ features = extractor(source)
279
+ if self.config.feature_grad_mult != 1.0:
280
+ features = GradMultiply.apply(features, self.config.feature_grad_mult)
281
+ else:
282
+ with torch.no_grad():
283
+ features = extractor(source)
284
+ return features
285
+
286
+ def forward_targets(
287
+ self,
288
+ features: torch.Tensor,
289
+ mask_indices: torch.Tensor,
290
+ target_list: List[torch.Tensor],
291
+ ) -> Tuple[torch.Tensor, torch.Tensor, List[torch.Tensor]]:
292
+ # Trim features to ensure labels exist and then get aligned labels
293
+ feat_tsz = features.size(2)
294
+ targ_tsz = min([t.size(1) for t in target_list])
295
+ if self.feat2tar_ratio * feat_tsz > targ_tsz:
296
+ feat_tsz = int(targ_tsz / self.feat2tar_ratio)
297
+ features = features[..., :feat_tsz]
298
+ if mask_indices is not None:
299
+ mask_indices = mask_indices[..., :feat_tsz]
300
+ target_inds = torch.arange(feat_tsz).float() * self.feat2tar_ratio
301
+ target_list = [t[:, target_inds.long()] for t in target_list]
302
+ return features, mask_indices, target_list
303
+
304
+ def forward_padding_mask(
305
+ self,
306
+ features: torch.Tensor,
307
+ padding_mask: torch.Tensor,
308
+ ) -> torch.Tensor:
309
+ extra = padding_mask.size(1) % features.size(1)
310
+ if extra > 0:
311
+ padding_mask = padding_mask[:, :-extra]
312
+ padding_mask = padding_mask.view(padding_mask.size(0), features.size(1), -1)
313
+ padding_mask = padding_mask.all(-1)
314
+ return padding_mask
315
+
316
+ def compute_logits(self, feats: torch.Tensor, emb_mat: torch.Tensor) -> torch.Tensor:
317
+ # feats: [B, T, F], emb_mat: [V, F]
318
+ if self.config.sim_type == "dot":
319
+ logits = torch.matmul(feats, emb_mat.transpose(0, 1))
320
+ elif self.config.sim_type == "cosine":
321
+ batch_size, timesteps, emb_dim = feats.size()
322
+ feats_ = feats.view(-1, emb_dim)
323
+ # [B*T, V]
324
+ nom = (feats_.unsqueeze(dim=1) * emb_mat.unsqueeze(dim=0)).sum(dim=-1)
325
+ # [B*T, V]
326
+ denom = (
327
+ (feats_**2).sum(dim=-1).sqrt().unsqueeze(dim=1)
328
+ * (emb_mat**2).sum(dim=-1).sqrt().unsqueeze(dim=0)
329
+ )
330
+ logits = (nom / denom.clamp(min=1e-6)).view(batch_size, timesteps, -1)
331
+ else:
332
+ raise NotImplementedError
333
+ logits = logits / self.config.logit_temp
334
+ return logits
335
+
336
+ def forward(
337
+ self,
338
+ source: Dict[str, torch.Tensor],
339
+ target_list: Optional[List[torch.Tensor]] = None,
340
+ padding_mask: Optional[torch.Tensor] = None,
341
+ mask: bool = True,
342
+ features_only: bool = False,
343
+ output_layer: Optional[int] = None,
344
+ ) -> Dict[str, torch.Tensor]:
345
+ """output layer is 1-based"""
346
+ src_audio, src_video = source["audio"], source["video"]
347
+ if mask and self.config.masking_type == "input":
348
+ src_video, mask_indices_video = self.apply_input_mask(
349
+ src_video, padding_mask, target_list
350
+ )
351
+ src_audio, mask_indices_audio = self.apply_input_mask(
352
+ src_audio, padding_mask, target_list
353
+ )
354
+ mask_indices = torch.logical_or(mask_indices_audio, mask_indices_video)
355
+ else:
356
+ src_audio, src_video, mask_indices = src_audio, src_video, None
357
+
358
+ # [B, F, T]
359
+ features_audio = self.forward_features(src_audio, modality="audio")
360
+ features_video = self.forward_features(src_video, modality="video")
361
+
362
+ if self.training:
363
+ modality_drop_prob, audio_drop_prob = np.random.random(), np.random.random()
364
+ if modality_drop_prob < self.config.modality_dropout:
365
+ if audio_drop_prob < self.config.audio_dropout:
366
+ features_audio = 0 * features_audio
367
+ else:
368
+ features_video = 0 * features_video
369
+
370
+ if self.config.modality_fuse == "concat":
371
+ features = torch.cat([features_audio, features_video], dim=1)
372
+ elif self.config.modality_fuse == "add":
373
+ features = features_audio + features_video
374
+
375
+ if target_list is not None:
376
+ features, mask_indices, target_list = self.forward_targets(
377
+ features, mask_indices, target_list
378
+ )
379
+
380
+ features_pen = features.float().pow(2).mean()
381
+
382
+ features = features.transpose(1, 2)
383
+ features = self.layer_norm(features)
384
+
385
+ if padding_mask is not None:
386
+ padding_mask = self.forward_padding_mask(features, padding_mask)
387
+
388
+ if self.post_extract_proj is not None:
389
+ features = self.post_extract_proj(features)
390
+
391
+ features = self.dropout_input(features)
392
+ if self.config.masking_type == "feature" and mask:
393
+ x, mask_indices = self.apply_feature_mask(
394
+ features, padding_mask, target_list
395
+ )
396
+ else:
397
+ x = features
398
+
399
+ # feature: (B, T, D), float
400
+ # target: (B, T), long
401
+ # x: (B, T, D), float
402
+ # padding_mask: (B, T), bool
403
+ # mask_indices: (B, T), bool
404
+ x, _ = self.encoder(
405
+ x,
406
+ padding_mask=padding_mask,
407
+ layer=None if output_layer is None else output_layer - 1,
408
+ )
409
+
410
+ if features_only:
411
+ return {"x": x, "padding_mask": padding_mask, "features": features}
412
+
413
+ label_embs_list = self.label_embs_concat.split(self.num_classes, 0)
414
+ proj_x = self.final_proj(x)
415
+ if self.config.untie_final_proj:
416
+ proj_x_list = proj_x.chunk(len(self.num_classes), dim=-1)
417
+ else:
418
+ proj_x_list = [proj_x for _ in self.num_classes]
419
+
420
+ # [[B*T, V]]
421
+ logit_list = [
422
+ self.compute_logits(proj, emb).view(-1, num_class)
423
+ for proj, emb, num_class in zip(
424
+ proj_x_list, label_embs_list, self.num_classes
425
+ )
426
+ ]
427
+
428
+ mask = torch.logical_and(mask_indices, ~padding_mask).view(-1)
429
+ unmask = torch.logical_and(~mask_indices, ~padding_mask).view(-1) # [B*T]
430
+ logit_m_list = [logit[mask] for logit in logit_list]
431
+ logit_u_list = [logit[unmask] for logit in logit_list]
432
+ target_m_list = [target.view(-1)[mask].long() for target in target_list]
433
+ target_u_list = [target.view(-1)[unmask].long() for target in target_list]
434
+
435
+ return {
436
+ "logit_m_list": logit_m_list,
437
+ "logit_u_list": logit_u_list,
438
+ "target_m_list": target_m_list,
439
+ "target_u_list": target_u_list,
440
+ "padding_mask": padding_mask,
441
+ "features_pen": features_pen,
442
+ }
443
+
444
+ def extract_features(
445
+ self,
446
+ source: Dict[str, torch.Tensor],
447
+ padding_mask: Optional[torch.Tensor] = None,
448
+ mask: bool = False,
449
+ ret_conv: bool = False,
450
+ output_layer: Optional[int] = None,
451
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
452
+ res = self.forward(
453
+ source,
454
+ padding_mask=padding_mask,
455
+ mask=mask,
456
+ features_only=True,
457
+ output_layer=output_layer,
458
+ )
459
+ feature = res["features"] if ret_conv else res["x"]
460
+ return feature, res["padding_mask"]
461
+
462
+ def extract_units(
463
+ self,
464
+ source: Dict[str, torch.Tensor],
465
+ padding_mask: torch.Tensor = None,
466
+ mask: bool = False,
467
+ ret_conv: bool = False,
468
+ output_layer: Optional[int] = None,
469
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
470
+ res = self.forward(
471
+ source,
472
+ padding_mask=padding_mask,
473
+ mask=mask,
474
+ features_only=True,
475
+ output_layer=None,
476
+ )
477
+
478
+ feature = res["features"] if ret_conv else res["x"]
479
+ proj_x = self.final_proj(feature)
480
+ # B T
481
+ units = (
482
+ torch
483
+ .matmul(proj_x, self.label_embs_concat.transpose(0, 1))
484
+ .argmax(dim=-1)
485
+ )
486
+ return units
487
+
488
+ def extract_finetune(
489
+ self,
490
+ source: Dict[str, torch.Tensor],
491
+ padding_mask: torch.Tensor = None,
492
+ mask: bool = False,
493
+ ret_conv: bool = False,
494
+ output_layer: Optional[int] = None,
495
+ ) -> Tuple[torch.Tensor, torch.Tensor]:
496
+ src_audio, src_video = source["audio"], source["video"]
497
+ if mask and self.config.masking_type == "input":
498
+ src_video, _ = self.apply_input_mask(
499
+ src_video, padding_mask, target_list=None
500
+ )
501
+ src_audio, _ = self.apply_input_mask(
502
+ src_audio, padding_mask, target_list=None
503
+ )
504
+ else:
505
+ src_audio, src_video, _ = src_audio, src_video, None
506
+
507
+ # features: [B, F, T]
508
+ if src_audio is not None and src_video is None:
509
+ features_audio = self.forward_features(
510
+ src_audio, modality="audio"
511
+ )
512
+ features_video = features_audio.new_zeros(
513
+ features_audio.size(0),
514
+ self.encoder_embed_dim,
515
+ features_audio.size(-1)
516
+ )
517
+ elif src_audio is None and src_video is not None:
518
+ features_video = self.forward_features(src_video, modality="video")
519
+ features_audio = features_video.new_zeros(
520
+ features_video.size(0),
521
+ self.encoder_embed_dim,
522
+ features_video.size(-1)
523
+ )
524
+ elif src_audio is not None and src_video is not None:
525
+ features_video = self.forward_features(src_video, modality="video")
526
+ features_audio = self.forward_features(
527
+ src_audio, modality="audio"
528
+ )
529
+
530
+ if self.config.modality_fuse == "concat":
531
+ features = torch.cat([features_audio, features_video], dim=1)
532
+ elif self.config.modality_fuse == "add":
533
+ features = features_audio + features_video
534
+
535
+ features = features.transpose(1, 2)
536
+ features = self.layer_norm(features)
537
+ unmasked_features = features.clone()
538
+
539
+ if padding_mask is not None:
540
+ padding_mask = self.forward_padding_mask(features, padding_mask)
541
+
542
+ if self.post_extract_proj is not None:
543
+ features = self.post_extract_proj(features)
544
+
545
+ features = self.dropout_input(features)
546
+ unmasked_features = self.dropout_features(unmasked_features)
547
+
548
+ # feature: (B, T, D), float
549
+ # target: (B, T), long
550
+ # x: (B, T, D), float
551
+ # padding_mask: (B, T), bool
552
+ # mask_indices: (B, T), bool
553
+ x, _ = self.encoder(
554
+ features,
555
+ padding_mask=padding_mask,
556
+ layer=None if output_layer is None else output_layer - 1,
557
+ )
558
+
559
+ return x, padding_mask
560
+
561
+ def get_extra_losses(
562
+ self,
563
+ net_output: Dict[str, torch.Tensor],
564
+ ) -> Tuple[List[torch.Tensor], List[str]]:
565
+ extra_losses = []
566
+ names = []
567
+ if "features_pen" in net_output:
568
+ extra_losses.append(net_output["features_pen"])
569
+ names.append("features_pen")
570
+
571
+ return extra_losses, names
572
+
573
+ def remove_pretraining_modules(self) -> None:
574
+ self.target_glu = None
575
+ self.final_proj = None
576
+
577
+ def compute_nce(
578
+ self,
579
+ x: torch.Tensor,
580
+ pos: torch.Tensor,
581
+ negs: torch.Tensor,
582
+ ) -> torch.Tensor:
583
+ neg_is_pos = (pos == negs).all(-1)
584
+ pos = pos.unsqueeze(0)
585
+ targets = torch.cat([pos, negs], dim=0)
586
+
587
+ logits = torch.cosine_similarity(x.float(), targets.float(), dim=-1).type_as(x)
588
+ logits /= self.config.logit_temp
589
+ if neg_is_pos.any():
590
+ logits[1:][neg_is_pos] = float("-inf")
591
+ logits = logits.transpose(0, 1) # (num_x, num_cls+1)
592
+ return logits
593
+
594
+
595
+ class HubertEncoderWrapper(nn.Module):
596
+ def __init__(
597
+ self,
598
+ config: AVHubertConfig,
599
+ dictionaries: List = [None],
600
+ ) -> None:
601
+ super().__init__()
602
+ self.w2v_model = AVHubertModel(config, dictionaries)
603
+
604
+ def forward(
605
+ self,
606
+ source: Dict[str, torch.Tensor],
607
+ padding_mask: torch.Tensor,
608
+ **kwargs,
609
+ ) -> Dict[str, torch.Tensor]:
610
+ w2v_args = {
611
+ "source": source,
612
+ "padding_mask": padding_mask,
613
+ }
614
+ x, padding_mask = self.w2v_model.extract_finetune(**w2v_args)
615
+ return {
616
+ "encoder_out": x, # T x B x C
617
+ "encoder_padding_mask": padding_mask, # B x T
618
+ "padding_mask": padding_mask,
619
+ }
620
+
621
+ def reorder_encoder_out(
622
+ self,
623
+ encoder_out: Dict[str, torch.Tensor],
624
+ new_order: torch.Tensor,
625
+ ) -> Dict[str, torch.Tensor]:
626
+ if encoder_out["encoder_out"] is not None:
627
+ encoder_out["encoder_out"] = encoder_out["encoder_out"].index_select(
628
+ 1, new_order
629
+ )
630
+ if encoder_out["encoder_padding_mask"] is not None:
631
+ encoder_out["encoder_padding_mask"] = encoder_out[
632
+ "encoder_padding_mask"
633
+ ].index_select(0, new_order)
634
+ if encoder_out["padding_mask"] is not None:
635
+ encoder_out["padding_mask"] = encoder_out["padding_mask"].index_select(
636
+ 0, new_order
637
+ )
638
+ return encoder_out
639
+
640
+
641
+ class AVSPLLMModel(PreTrainedModel):
642
+ config_class = AVSPLLMConfig
643
+
644
+ def __init__(
645
+ self,
646
+ config: AVSPLLMConfig = AVSPLLMConfig(),
647
+ dictionaries: List = [None],
648
+ ) -> None:
649
+ super().__init__(config=config)
650
+ current_dir = Path(__file__).resolve().parent
651
+ self.km_path = current_dir / config.km_path
652
+ if not self.km_path.is_file():
653
+ repo_id = self.config._name_or_path
654
+ self.km_path = f"{repo_id}/model.km"
655
+ self.km_path = str(self.km_path)
656
+ self.C, self.Cnorm = load_kmeans_model(self.km_path)
657
+
658
+ self.encoder = HubertEncoderWrapper(config, dictionaries)
659
+ self.encoder.w2v_model.remove_pretraining_modules()
660
+
661
+ self.avfeat_to_llm = nn.Linear(
662
+ config.encoder_embed_dim, config.decoder_embed_dim
663
+ )
664
+
665
+ bnb_config = BitsAndBytesConfig(
666
+ load_in_4bit=True,
667
+ bnb_4bit_use_double_quant=True,
668
+ bnb_4bit_quant_type="nf4",
669
+ bnb_4bit_compute_dtype=torch.bfloat16,
670
+ )
671
+ decoder_4bit = AutoModelForCausalLM.from_pretrained(
672
+ config.llm_ckpt_path,
673
+ quantization_config=bnb_config,
674
+ cache_dir=config.cache_dir,
675
+ trust_remote_code=True,
676
+ )
677
+ lora_config = LoraConfig(
678
+ r=16,
679
+ lora_alpha=32,
680
+ target_modules=["q_proj", "v_proj", "k_proj"],
681
+ lora_dropout=0.05,
682
+ bias="none",
683
+ task_type="CAUSAL_LM",
684
+ )
685
+ self.decoder = get_peft_model(decoder_4bit, lora_config)
686
+ self.decoder.print_trainable_parameters()
687
+
688
+ def apply_kmeans(self, feat: torch.Tensor) -> torch.Tensor:
689
+ dist = (
690
+ feat.squeeze(0).pow(2).sum(1, keepdim=True)
691
+ - 2 * torch.matmul(feat.squeeze(0), self.C)
692
+ + self.Cnorm
693
+ )
694
+ cluster_counts = dist.argmin(dim=1)
695
+
696
+ current_counts = 1
697
+ counts = []
698
+ for i in range(1, len(cluster_counts)):
699
+ if cluster_counts[i] == cluster_counts[i - 1]:
700
+ current_counts += 1
701
+ else:
702
+ counts.append(current_counts)
703
+ current_counts = 1
704
+ counts.append(current_counts)
705
+
706
+ return torch.tensor(counts)
707
+
708
+ def deduplicate(
709
+ self,
710
+ feat: torch.Tensor,
711
+ cluster_counts: torch.Tensor,
712
+ ) -> torch.Tensor:
713
+ results_tensor = []
714
+ start_idx = 0
715
+ for clutser_num in cluster_counts:
716
+ end_idx = start_idx + clutser_num
717
+ slice = feat[:, start_idx:end_idx, :]
718
+ mean_tensor = torch.mean(slice, dim=1, keepdim=True)
719
+ results_tensor.append(mean_tensor)
720
+ start_idx = end_idx
721
+
722
+ assert cluster_counts.sum().item() == feat.size()[1], \
723
+ f"{cluster_counts.sum().item()} != {feat.size()[1]}"
724
+
725
+ return torch.cat(results_tensor, dim=1)
726
+
727
+ def embed(
728
+ self,
729
+ source: Dict[str, torch.Tensor],
730
+ padding_mask: torch.Tensor,
731
+ target_list: torch.Tensor = None,
732
+ **kwargs,
733
+ ) -> torch.Tensor:
734
+ ft = self.config.freeze_finetune_updates <= kwargs.get("num_updates", -1)
735
+ with torch.no_grad() if not ft else contextlib.ExitStack():
736
+ output = self.encoder(source, padding_mask, **kwargs)
737
+
738
+ cluster_counts = self.apply_kmeans(output["encoder_out"])
739
+
740
+ output["encoder_out"] = self.avfeat_to_llm(output["encoder_out"])
741
+
742
+ reduced_enc_out = self.deduplicate(output["encoder_out"], cluster_counts)
743
+ reduced_enc_out = reduced_enc_out.to(self.decoder.device)
744
+ B, T, D = reduced_enc_out.size()
745
+
746
+ instruction = source["text"]
747
+ instruction_embedding = self.decoder.model.model.embed_tokens(instruction)
748
+
749
+ llm_input = torch.cat((instruction_embedding, reduced_enc_out), dim=1)
750
+
751
+ if target_list is None:
752
+ return llm_input, None
753
+
754
+ labels = target_list.clone()
755
+ labels_embedding = self.decoder.model.model.embed_tokens(labels)
756
+
757
+ llm_input = torch.cat((llm_input, labels_embedding), dim=1)
758
+
759
+ llm_labels = labels.clone()
760
+ llm_labels[llm_labels == 0] = -100
761
+
762
+ _, instruction_embedding_t, _ = instruction_embedding.size()
763
+ target_ids = (
764
+ torch.full((B, T + instruction_embedding_t), -100).long().to(labels.device)
765
+ )
766
+ llm_labels = torch.cat((target_ids, llm_labels), dim=1)
767
+
768
+ return llm_input, llm_labels
769
+
770
+ def forward(
771
+ self,
772
+ source: Dict[str, torch.Tensor],
773
+ padding_mask: torch.Tensor,
774
+ target_list: torch.Tensor = None,
775
+ **kwargs,
776
+ ) -> CausalLMOutputWithPast:
777
+ llm_input, llm_labels = self.embed(
778
+ source, padding_mask, target_list, **kwargs
779
+ )
780
+ return self.decoder(
781
+ inputs_embeds=llm_input.to(torch.float16), labels=llm_labels, return_dict=True
782
+ )
783
+
784
+ @torch.no_grad()
785
+ def generate(
786
+ self,
787
+ inputs: Optional[Dict[str, torch.Tensor]] = None,
788
+ generation_config: Optional[GenerationConfig] = None,
789
+ **kwargs,
790
+ ) -> Any:
791
+ llm_input, _ = self.embed(**inputs, **kwargs)
792
+ self.decoder.config.use_cache = True
793
+ return self.decoder.generate(
794
+ inputs_embeds=llm_input,
795
+ generation_config=generation_config,
796
+ **kwargs,
797
+ )
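
A feature-extraction sketch for `AVHubertModel.extract_finetune` with a scaled-down config (the defaults build a 24-layer, 1024-dim encoder). It assumes fairseq and peft are installed and that the repository files are importable as a package, hypothetically named `avsp_llm`.

```python
import torch
from avsp_llm.configuration import AVHubertConfig
from avsp_llm.modelling import AVHubertModel

config = AVHubertConfig(
    encoder_layers=2,
    encoder_embed_dim=64,
    encoder_ffn_embed_dim=128,
    encoder_attention_heads=4,
    conv_pos=16,
    conv_pos_groups=4,
)
model = AVHubertModel(config).eval()

source = {
    "video": torch.randn(1, 1, 76, 88, 88),  # (B, C, T, H, W): 76 grayscale 88x88 lip frames
    "audio": torch.randn(1, 104, 76),        # (B, audio_feat_dim, T)
}
with torch.no_grad():
    feats, padding_mask = model.extract_finetune(source, padding_mask=None, mask=False)
print(feats.shape)                           # torch.Size([1, 76, 64])
```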
preprocessor_config.json ADDED
@@ -0,0 +1,10 @@
1
+ {
2
+ "audio_feat_dim": 104,
3
+ "auto_map": {
4
+ "AutoFeatureExtractor": "modelling.AVSPLLMFeatureExtractor"
5
+ },
6
+ "feature_extractor_type": "AVSPLLMFeatureExtractor",
7
+ "num_channels": 1,
8
+ "num_frames": 76,
9
+ "size": 88
10
+ }
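
The `auto_map` above is what lets `AutoFeatureExtractor` resolve to the custom `AVSPLLMFeatureExtractor` class in `modelling.py`. A loading sketch follows; the repo id is a placeholder, `trust_remote_code=True` is mandatory for custom code, and the repository's dependencies (e.g. fairseq, peft) must be installed because the custom module is imported.

```python
from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained(
    "user/avsp-llm",           # placeholder repo id
    trust_remote_code=True,    # loads modelling.AVSPLLMFeatureExtractor from the repo
)
print(feature_extractor.size)            # 88
print(feature_extractor.num_frames)      # 76
print(feature_extractor.num_channels)    # 1
print(feature_extractor.audio_feat_dim)  # 104
```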
resnet.py ADDED
@@ -0,0 +1,216 @@
1
+ import math
2
+ import torch
3
+ import torch.nn as nn
4
+ from collections import OrderedDict
5
+
6
+
7
+ def conv3x3(in_channels: int, out_channels: int, stride: int = 1) -> nn.Conv2d:
8
+ return nn.Conv2d(
9
+ in_channels=in_channels,
10
+ out_channels=out_channels,
11
+ kernel_size=3,
12
+ stride=stride,
13
+ padding=1,
14
+ bias=False
15
+ )
16
+
17
+
18
+ def downsample_basic_block(
19
+ in_channels: int,
20
+ out_channels: int,
21
+ stride: int,
22
+ ) -> nn.Sequential:
23
+ return nn.Sequential(
24
+ nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=stride, bias=False),
25
+ nn.BatchNorm2d(out_channels),
26
+ )
27
+
28
+
29
+ def downsample_basic_block_v2(
30
+ in_channels: int,
31
+ out_channels: int,
32
+ stride: int,
33
+ ) -> nn.Sequential:
34
+ return nn.Sequential(
35
+ nn.AvgPool2d(
36
+ kernel_size=stride,
37
+ stride=stride,
38
+ ceil_mode=True,
39
+ count_include_pad=False,
40
+ ),
41
+ nn.Conv2d(in_channels, out_channels, kernel_size=1, stride=1, bias=False),
42
+ nn.BatchNorm2d(out_channels),
43
+ )
44
+
45
+
46
+ class BasicBlock(nn.Module):
47
+ expansion = 1
48
+
49
+ def __init__(
50
+ self,
51
+ in_channels: int,
52
+ channels: int,
53
+ stride: int = 1,
54
+ downsample: nn.Sequential = None,
55
+ relu_type: str = "relu",
56
+ ) -> None:
57
+ super(BasicBlock, self).__init__()
58
+ assert relu_type in ["relu", "prelu"]
59
+
60
+ self.conv1 = conv3x3(in_channels, channels, stride)
61
+ self.bn1 = nn.BatchNorm2d(channels)
62
+
63
+ if relu_type == "relu":
64
+ self.relu1 = nn.ReLU(inplace=True)
65
+ self.relu2 = nn.ReLU(inplace=True)
66
+ elif relu_type == "prelu":
67
+ self.relu1 = nn.PReLU(num_parameters=channels)
68
+ self.relu2 = nn.PReLU(num_parameters=channels)
69
+ else:
70
+ raise Exception("relu type not implemented")
71
+
72
+ self.conv2 = conv3x3(channels, channels)
73
+ self.bn2 = nn.BatchNorm2d(channels)
74
+
75
+ self.downsample = downsample
76
+ self.stride = stride
77
+
78
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
79
+ residual = x
80
+ out = self.conv1(x)
81
+ out = self.bn1(out)
82
+ out = self.relu1(out)
83
+ out = self.conv2(out)
84
+ out = self.bn2(out)
85
+ if self.downsample is not None:
86
+ residual = self.downsample(x)
87
+ out += residual
88
+ out = self.relu2(out)
89
+ return out
90
+
91
+
92
+ class ResNet(nn.Module):
93
+ def __init__(
94
+ self,
95
+ block: nn.Module,
96
+ layers: list,
97
+ relu_type: str = "relu",
98
+ gamma_zero: bool = False,
99
+ avg_pool_downsample: bool = False,
100
+ ) -> None:
101
+ self.in_channels = 64
102
+ self.relu_type = relu_type
103
+ self.gamma_zero = gamma_zero
104
+ self.downsample_block = (
105
+ downsample_basic_block_v2 if avg_pool_downsample else downsample_basic_block
106
+ )
107
+
108
+ super(ResNet, self).__init__()
109
+ self.layer1 = self._make_layer(block, 64, layers[0])
110
+ self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
111
+ self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
112
+ self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
113
+ self.avgpool = nn.AdaptiveAvgPool2d(1)
114
+
115
+ for m in self.modules():
116
+ if isinstance(m, nn.Conv2d):
117
+ n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
118
+ m.weight.data.normal_(0, math.sqrt(2.0 / n))
119
+ elif isinstance(m, nn.BatchNorm2d):
120
+ m.weight.data.fill_(1)
121
+ m.bias.data.zero_()
122
+
123
+ if self.gamma_zero:
124
+ for m in self.modules():
125
+ if isinstance(m, BasicBlock):
126
+ m.bn2.weight.data.zero_()
127
+
128
+ def _make_layer(
129
+ self,
130
+ block: nn.Module,
131
+ channels: int,
132
+ n_blocks: int,
133
+ stride: int = 1,
134
+ ) -> nn.Sequential:
135
+ downsample = None
136
+ if stride != 1 or self.in_channels != channels * block.expansion:
137
+ downsample = self.downsample_block(
138
+ in_channels=self.in_channels,
139
+ out_channels=channels * block.expansion,
140
+ stride=stride,
141
+ )
142
+
143
+ layers = [
144
+ block(
145
+ self.in_channels, channels, stride, downsample, relu_type=self.relu_type
146
+ )
147
+ ]
148
+ self.in_channels = channels * block.expansion
149
+ for _ in range(1, n_blocks):
150
+ layers.append(block(self.in_channels, channels, relu_type=self.relu_type))
151
+
152
+ return nn.Sequential(*layers)
153
+
154
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
155
+ x = self.layer1(x)
156
+ x = self.layer2(x)
157
+ x = self.layer3(x)
158
+ x = self.layer4(x)
159
+ x = self.avgpool(x)
160
+ x = x.view(x.size(0), -1)
161
+ return x
162
+
163
+
164
+ class ResNetEncoder(nn.Module):
165
+ def __init__(self, relu_type: str, weight_file: str = None) -> None:
166
+ super(ResNetEncoder, self).__init__()
167
+ self.frontend_out = 64
168
+ self.backend_out = 512
169
+ frontend_relu = (
170
+ nn.PReLU(num_parameters=self.frontend_out)
171
+ if relu_type == "prelu"
172
+ else nn.ReLU()
173
+ )
174
+
175
+ self.frontend3D = nn.Sequential(
176
+ nn.Conv3d(
177
+ 1,
178
+ self.frontend_out,
179
+ kernel_size=(5, 7, 7),
180
+ stride=(1, 2, 2),
181
+ padding=(2, 3, 3),
182
+ bias=False,
183
+ ),
184
+ nn.BatchNorm3d(self.frontend_out),
185
+ frontend_relu,
186
+ nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
187
+ )
188
+ self.trunk = ResNet(BasicBlock, [2, 2, 2, 2], relu_type=relu_type)
189
+
190
+ if weight_file is not None:
191
+ model_state_dict = torch.load(weight_file, map_location=torch.device("cpu"))
192
+ model_state_dict = model_state_dict["model_state_dict"]
193
+ frontend_state_dict, trunk_state_dict = OrderedDict(), OrderedDict()
194
+ for key, val in model_state_dict.items():
195
+ new_key = ".".join(key.split(".")[1:])
196
+ if "frontend3D" in key:
197
+ frontend_state_dict[new_key] = val
198
+ if "trunk" in key:
199
+ trunk_state_dict[new_key] = val
200
+ self.frontend3D.load_state_dict(frontend_state_dict)
201
+ self.trunk.load_state_dict(trunk_state_dict)
202
+
203
+ def forward(self, x: torch.Tensor) -> torch.Tensor:
204
+ B, C, T, H, W = x.size()
205
+ x = self.frontend3D(x)
206
+ Tnew = x.shape[2]
207
+ x = self.convert_3D_to_2D(x)
208
+ x = self.trunk(x)
209
+ x = x.view(B, Tnew, x.size(1))
210
+ x = x.transpose(1, 2).contiguous()
211
+ return x
212
+
213
+ def convert_3D_to_2D(self, x: torch.Tensor) -> torch.Tensor:
214
+ n_batches, n_channels, s_time, sx, sy = x.shape
215
+ x = x.transpose(1, 2).contiguous()
216
+ return x.reshape(n_batches * s_time, n_channels, sx, sy)
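
A quick shape check for the visual front-end above (a sketch; `resnet.py` has no relative imports, so it can be run directly from the repository directory).

```python
import torch
from resnet import ResNetEncoder

encoder = ResNetEncoder(relu_type="prelu").eval()

# (B, C, T, H, W): a batch of 76 grayscale 88x88 mouth-region frames,
# matching num_frames/size/num_channels in preprocessor_config.json.
frames = torch.randn(1, 1, 76, 88, 88)
with torch.no_grad():
    feats = encoder(frames)
print(feats.shape)  # torch.Size([1, 512, 76]) -> one 512-dim feature per frame
```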
utils.py ADDED
@@ -0,0 +1,187 @@
1
+ import torch
2
+ import joblib
3
+ import numpy as np
4
+ from io import BytesIO
5
+ from pathlib import Path
6
+ from typing import Tuple, Optional
7
+ from huggingface_hub import HfFileSystem
8
+
9
+
10
+ def load_kmeans_model(km_path: str) -> Tuple[torch.Tensor, torch.Tensor]:
11
+ """Load the k-means model."""
12
+ fs = HfFileSystem()
13
+
14
+ if Path(km_path).exists():
15
+ km_file = Path(km_path)
16
+ elif fs.exists(km_path):
17
+ km_file = BytesIO(fs.read_bytes(km_path))
18
+ else:
19
+ raise FileNotFoundError(f"K-means model not found at {km_path}")
20
+
21
+ kmeans_model = joblib.load(km_file)
22
+ C = torch.from_numpy(kmeans_model.cluster_centers_.transpose())
23
+ Cnorm = C.pow(2).sum(0, keepdim=True)
24
+ return C, Cnorm
25
+
26
+
27
+ def find_runs(x: np.ndarray) -> Tuple[np.ndarray, np.ndarray, np.ndarray]:
28
+ """Find runs of consecutive items in an array."""
29
+
30
+ # ensure array
31
+ x = np.asanyarray(x)
32
+ if x.ndim != 1:
33
+ raise ValueError("only 1D array supported")
34
+ n = x.shape[0]
35
+
36
+ # handle empty array
37
+ if n == 0:
38
+ return np.array([]), np.array([]), np.array([])
39
+ else:
40
+ # find run starts
41
+ loc_run_start = np.empty(n, dtype=bool)
42
+ loc_run_start[0] = True
43
+ np.not_equal(x[:-1], x[1:], out=loc_run_start[1:])
44
+ run_starts = np.nonzero(loc_run_start)[0]
45
+
46
+ # find run values
47
+ run_values = x[loc_run_start]
48
+
49
+ # find run lengths
50
+ run_lengths = np.diff(np.append(run_starts, n))
51
+
52
+ return run_values, run_starts, run_lengths
53
+
54
+
55
+ def compute_mask_indices(
56
+ shape: Tuple[int, int],
57
+ padding_mask: Optional[torch.Tensor],
58
+ mask_prob: float,
59
+ mask_length: int,
60
+ mask_type: str = "static",
61
+ mask_other: float = 0.0,
62
+ min_masks: int = 0,
63
+ no_overlap: bool = False,
64
+ min_space: int = 0,
65
+ ) -> np.ndarray:
66
+ """
67
+ Computes random mask spans for a given shape
68
+ Args:
69
+ shape: the shape for which to compute masks.
70
+ should be of size 2 where first element is batch size and 2nd is timesteps
71
+ padding_mask: optional padding mask of the same size as shape, which will prevent masking padded elements
72
+ mask_prob: probability for each token to be chosen as start of the span to be masked. this will be multiplied by
73
+ number of timesteps divided by length of mask span to mask approximately this percentage of all elements.
74
+ however due to overlaps, the actual number will be smaller (unless no_overlap is True)
75
+ mask_type: how to compute mask lengths
76
+ static = fixed size
77
+ uniform = sample from uniform distribution [mask_other, mask_length*2]
78
+ normal = sample from a normal distribution with mean mask_length and stdev mask_other; each mask is at least 1 element long
79
+ poisson = sample from a Poisson distribution with lambda = mask_length
80
+ min_masks: minimum number of masked spans
81
+ no_overlap: if true, switches to an alternative recursive algorithm that prevents spans from overlapping
82
+ min_space: only used if no_overlap is True, this is how many elements to keep unmasked between spans
83
+ """
84
+ bsz, all_sz = shape
85
+ mask = np.full((bsz, all_sz), False)
86
+
87
+ all_num_mask = int(
88
+ # add a random number for probabilistic rounding
89
+ mask_prob * all_sz / float(mask_length)
90
+ + np.random.rand()
91
+ )
92
+
93
+ all_num_mask = max(min_masks, all_num_mask)
94
+
95
+ mask_idcs = []
96
+ for i in range(bsz):
97
+ if padding_mask is not None:
98
+ sz = all_sz - padding_mask[i].long().sum().item()
99
+ num_mask = int(
100
+ # add a random number for probabilistic rounding
101
+ mask_prob * sz / float(mask_length)
102
+ + np.random.rand()
103
+ )
104
+ num_mask = max(min_masks, num_mask)
105
+ else:
106
+ sz = all_sz
107
+ num_mask = all_num_mask
108
+
109
+ if mask_type == "static":
110
+ lengths = np.full(num_mask, mask_length)
111
+ elif mask_type == "uniform":
112
+ lengths = np.random.randint(mask_other, mask_length * 2 + 1, size=num_mask)
113
+ elif mask_type == "normal":
114
+ lengths = np.random.normal(mask_length, mask_other, size=num_mask)
115
+ lengths = [max(1, int(round(x))) for x in lengths]
116
+ elif mask_type == "poisson":
117
+ lengths = np.random.poisson(mask_length, size=num_mask)
118
+ lengths = [int(round(x)) for x in lengths]
119
+ else:
120
+ raise Exception("unknown mask selection " + mask_type)
121
+
122
+ if sum(lengths) == 0:
123
+ lengths[0] = min(mask_length, sz - 1)
124
+
125
+ if no_overlap:
126
+ mask_idc = []
127
+
128
+ def arrange(s, e, length, keep_length):
129
+ span_start = np.random.randint(s, e - length)
130
+ mask_idc.extend(span_start + i for i in range(length))
131
+
132
+ new_parts = []
133
+ if span_start - s - min_space >= keep_length:
134
+ new_parts.append((s, span_start - min_space + 1))
135
+ if e - span_start - keep_length - min_space > keep_length:
136
+ new_parts.append((span_start + length + min_space, e))
137
+ return new_parts
138
+
139
+ parts = [(0, sz)]
140
+ min_length = min(lengths)
141
+ for length in sorted(lengths, reverse=True):
142
+ lens = np.fromiter(
143
+ (e - s if e - s >= length + min_space else 0 for s, e in parts),
144
+ int,
145
+ )
146
+ l_sum = np.sum(lens)
147
+ if l_sum == 0:
148
+ break
149
+ probs = lens / np.sum(lens)
150
+ c = np.random.choice(len(parts), p=probs)
151
+ s, e = parts.pop(c)
152
+ parts.extend(arrange(s, e, length, min_length))
153
+ mask_idc = np.asarray(mask_idc)
154
+ else:
155
+ min_len = min(lengths)
156
+ if sz - min_len <= num_mask:
157
+ min_len = sz - num_mask - 1
158
+
159
+ mask_idc = np.random.choice(sz - min_len, num_mask, replace=False)
160
+
161
+ mask_idc = np.asarray(
162
+ [
163
+ mask_idc[j] + offset
164
+ for j in range(len(mask_idc))
165
+ for offset in range(lengths[j])
166
+ ]
167
+ )
168
+
169
+ mask_idcs.append(np.unique(mask_idc[mask_idc < sz]))
170
+
171
+ min_len = min([len(m) for m in mask_idcs])
172
+ batch_indexes, starts, ends = [], [], []
173
+ for i, mask_idc in enumerate(mask_idcs):
174
+ if len(mask_idc) > min_len:
175
+ mask_idc = np.random.choice(mask_idc, min_len, replace=False)
176
+ mask[i, mask_idc] = True
177
+ vals, run_starts, run_lengths = find_runs(mask[i])
178
+ start_indices, lengths = run_starts[vals], run_lengths[vals]
179
+ starts.append(start_indices)
180
+ ends.append(start_indices + lengths)
181
+ batch_indexes.append(np.zeros([len(start_indices)]) + i)
182
+ return (
183
+ mask,
184
+ np.concatenate(starts).astype(np.int64),
185
+ np.concatenate(ends).astype(np.int64),
186
+ np.concatenate(batch_indexes).astype(np.int64),
187
+ )
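
A usage sketch for `compute_mask_indices`, mirroring how `apply_input_mask` calls it in `modelling.py`. `utils.py` only needs torch, numpy, joblib, and huggingface_hub, so it can be imported directly from the repository directory.

```python
import numpy as np
from utils import compute_mask_indices

np.random.seed(0)
mask, starts, ends, batch_indexes = compute_mask_indices(
    shape=(2, 100),      # 2 sequences of 100 frames
    padding_mask=None,
    mask_prob=0.65,
    mask_length=10,
    mask_type="static",
    min_masks=2,
)
print(mask.shape)        # (2, 100), boolean span mask
print(mask.sum(axis=1))  # same number of masked frames per sequence
print(starts[:3], ends[:3], batch_indexes[:3])  # per-span boundaries and batch ids
```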