updated source code

- src/readme.md  +146 -37
- src/run_base.sh  +44 -0
- src/{run.sh → run_small.sh}  +5 -3
- src/run_speech_recognition_seq2seq_streaming.py  +91 -49
- src/{run_debug.sh → run_tiny_debug.sh}  +0 -0
src/readme.md
CHANGED
@@ -18,11 +18,24 @@ The code in this repository is a modified version of code from
 ```
 
 ## Fine-tuning todos:
+* logs are printed only right before the evaluation:<br>
+  ```
+  --logging_steps="50"
+  --eval_steps="1000"
+  ```
+* on the next run:
+  * download the whole dataset before the launch.
+    this will probably save some time for data processing,
+    and allow loading and preparing data in parallel
+  * can also decrease the eval batch size. currently it's probably causing the GPU to wait for the CPU to prepare the next batch
 * perform evaluation of fine-tuned model on CommonVoice test set
+* add the [Whisper fine-tuning Event repo](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event)
+  to remotes and merge updates from this original event repo
 * Learning rate:
   * max learning rate is not the same as LR passed as a parameter to training script. it is actually lower.
   * when resuming training, LR scheduling behaves incorrectly
 * check exact sizes of train, eval, test sets of CommonVoice 11
+* fill the TODOs in the Notes section with answers and discussions from Discord
 
 ## Resuming training from exising checkpoint
 When resuming training from existing checkpoint:
@@ -55,6 +68,138 @@ When resuming training from existing checkpoint:
   How is it overwritten when resuming training from existing checkpoint?
 * does `ShuffleCallback` work with StreamingDataset? it reshuffles data `on_epoch_begin()`,
   but does StreamingDataset have any epochs?
+* does streaming mode support parallel data loading and processing?<br>
+  when using non-streaming mode we can use `dataset.map(..., num_proc=<num_proc>)`
+
+
+## Notes:
+* using CommonVoice 11 dataset in a streaming way.<br>
+  use `streaming=True` for train & validation & test.<br>
+  as an alternative, we can use `streaming=False` for validation & test sets to save time on data processing.
+  but the sizes of the validation and test sets are unknown (need to check).
+  it's likely they are going to be large - thus pre-download of these sets might not reduce
+  overall fine-tuning time compared to streaming mode.
+* size of train set is ~370'000 audiofiles. if using `batch_size=64`, then
+  1 epoch will have ~5782 steps.<br>
+  Because of `--eval_steps="1000"`, will use `--max_steps="6000"` instead of `--max_steps="5800"`
+  to have evaluation metrics computed at the end of training.
+* if using Google Colab, need to execute `sudo chmod -R 777 .git` inside the hf repo
+  to set the right permissions to be able to push trained models to the HuggingFace Hub
+* Log tracking in Jupyter (not working) and in bash (works as expected with `tee`)
+* Loggers in `run_speech.....py` do not control `transformers` and `datasets` loggers.
+  can't redirect their outputs using handlers. it's better and easier to redirect output in bash
+* Need to set `use_cache` to False since we're using gradient checkpointing, and the two are incompatible
+* Default Linear scheduler is used
+* Default Adam optimizer is used
+
+### Logs not printed when expected
+* Train logs are printed only before the start of a validation.
+  During training they are not printed to stdout.
+  All worked fine in Colab.
+* No progressbar for validation (at least when using streaming and an iterable dataset).
+  possible reason is that when using streaming, the dataset length is unknown.
+* Evaluation metrics get printed to stdout only before the next validation call.
+  All worked fine in Colab.
+* Possible reason: usage of `... | tee file.log`. But it's unlikely
+
+### Text normalization
+* Whisper's BasicTextNormalizer splits words containing an apostrophe:
+  ```python
+  > from transformers.models.whisper.english_normalizer import BasicTextNormalizer
+  > normalizer = BasicTextNormalizer()
+  > normalizer("раз'яднаць")
+  'раз яднаць'
+  ```
+* That's why `BelarusianTextNormalizer` (an edited version of `BasicTextNormalizer`) was added to the training script:
+  ```python
+  > from run_speech_recognition_seq2seq_streaming import BelarusianTextNormalizer
+  > normalizer_be = BelarusianTextNormalizer()
+  > normalizer_be("раз'яднаць")
+  "раз'яднаць"
+  ```
+
+### Different batch sizes for train and evaluation:
+* Theoretically you can use a larger batch size for evaluation vs training!
+* Training: we do a forward pass, storing all the activations, and then a backwards pass, storing all the gradients
+* Inference (evaluation): we only do a forward pass, and don't store any activations
+* So the memory required for evaluation is much lower than it is for training
+  (we're only doing the forward pass and not storing any values)
+* In my experience, altering the eval batch size has little effect on eval speed ->
+  I set it to a lower value as this tends to give a more responsive progress bar
+  when evaluating in non-streaming mode (the bar updates faster and more frequently)
+
+### Slow inference. Long evaluation compared to training:
+* Slower inference is an inherent limitation of the sequence-to-sequence architecture.
+  The auto-regressive decoding means that you have to do as many decoder forward passes as tokens generated.
+* This is much slower than CTC, where you do a single encoder forward pass
+* Note that 1 evaluation step **will take much longer** than 1 training step, even with the same batch sizes.
+* With training, we do one forward pass of the encoder, one forward pass of the decoder,
+  one backward pass of the decoder and one backward pass of the encoder (=4 passes total):<br>
+  ```
+  audio -> encoder -> decoder -> labels
+           encoder <- decoder <- loss
+  ```
+* During evaluation we do one forward pass of the encoder, and then auto-regressively generate tokens in the decoder.
+  Here, we do as many forward passes of the decoder as tokens generated.
+  So in total, we do one forward pass of the encoder, and N forward passes of the decoder,
+  where N is the number of tokens generated (can be up to the max length, which is 448...).
+  You can see that for 4 or more generated tokens, evaluation is going to be slower than training:<br>
+  ```
+  audio -> encoder -> decoder -> decoder -> decoder -> ... -> decoder -> end of sentence token
+  ```
+* I've made a bit of a simplification here in saying that one forward pass
+  takes the same amount of time as one backward pass, but for the purpose of illustration
+  this demonstrates why evaluation is much slower than training
+* Essentially it doesn't really matter what you set your eval batch size to, as we're not aggregating any statistics
+  over the eval batch (in contrast, during training we compute a true gradient value based on a given batch).
+* Since we just do a forward pass, we could even run eval with a batch size of 1 and get exactly the same results!
+* Because we don't get much of an improvement with batch sizes beyond around 8, it's set somewhat arbitrarily
+
+### Ways to decrease evaluation time during fine-tuning:
+* reduce `generation_max_length` param:
+  * During training, we can limit the generation max length to a lower number to cut off the generation
+    after fewer tokens (e.g. 40). This will give worse results during training,
+    but we can still infer the evolution of WER performance over training.
+  * For the final eval step, we can bump the generation max length back up to 448.
+  * WER performance varies monotonically with generation max length
+    (WER can only stay equal or improve by increasing generation max length),
+    so we know that our final eval WER will be less than (improved) or equal to the WER during training
+* We can evaluate at less frequent eval_steps: this reduces the number of times we have to perform evaluation
+
+### Decrease inference time more generally
+* PyTorch 2.0 and compiling the model could get you a decent speed-up
+  (https://pytorch.org/blog/Accelerating-Hugging-Face-and-TIMM-models/#hugging-face-models)
+* Downcasting to fp16
+
+### Memory saving and training larger models:
+To save memory (and increase either the model size or batch_size) we can experiment with:
+* using Adafactor instead of Adam.
+  Adam requires two optimiser params per model param, but Adafactor uses only one.
+  > A word of caution: Adafactor is untested for fine-tuning Whisper,
+  so we are unsure how Adafactor performance compares to Adam!
+* using Adam 8bit from the `bitsandbytes` module.
+  need to provide the `optim="adamw_bnb_8bit"` param to `Seq2SeqTrainingArguments`
+* use `deepspeed`. scripts are available in the
+  [Whisper fine-tuning Event repo](https://github.com/huggingface/community-events/tree/main/whisper-fine-tuning-event)
+* load the model and processor in 8bit mode:
+  ```python
+  from transformers import WhisperForConditionalGeneration, WhisperProcessor
+  model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large", device_map="auto", load_in_8bit=True)
+  processor = WhisperProcessor.from_pretrained("openai/whisper-large", load_in_8bit=True)
+  ```
+  inference loop:
+  ```python
+  for data in dataset:
+      inputs = processor.feature_extractor(data["audio"]["array"], return_tensors="pt", sampling_rate=16_000).input_features.half().to(device)
+      forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")
+      predicted_ids = model.generate(inputs, forced_decoder_ids=forced_decoder_ids)
+      text = processor.tokenizer.batch_decode(predicted_ids, skip_special_tokens=True, normalize=False)[0]
+      print(text)
+  ```
+  * 8bit will slow down inference compared to full/half-precision
+  * But the memory saving you get is immense (up to 4x vs full-precision).<br>
+    This is the recommended approach when you're limited on VRAM.<br>
+    If you care about inference speed, stick to full precision
 
 ### Prepended tokens
 * Why are there following lines in Data Collator?
@@ -90,40 +235,4 @@ When resuming training from existing checkpoint:
 
 * We need to tell the model what language the audio corresponds to and what task it's performing during fine-tuning. This way, it learns what audio corresponds to what language, and the difference between transcribing audio vs translating it
 
-
-* using CommonVoice 11 dataset in a streaming way.<br>
-  use `streaming=True` for train & validation & test.<br>
-  as an alternative, we can use `streaming=False` for validation & test sets to save time on data processing.
-  but the size of validation and test sets are unknown (need to check).
-  it's likely they are going to be large - thus pre-download of these sets might not reduce
-  overall fine-tuning time compared to streaming mode.
-* size of train set is ~370'000 audiofiles. if using `batch_size=64`, then
-  1 epoch will have ~5782 steps. <br>
-  Because of `--eval_steps="1000"` will use `--max_steps="6000"` instead of `--max_steps="5800"`
-  to have evaluation metrics computed in the end of training.
-* if using Google Colab, need to execute `sudo chmod -R 777 .git` inside hf repo to
-  to set right permissions to be able to push trained models to HuggingFace Hub
-* Whispers BasicTextNormalizer splits words containing apostrophe:
-  ```python
-  > from transformers.models.whisper.english_normalizer import BasicTextNormalizer
-  > normalizer = BasicTextNormalizer()
-  > normalizer("раз'яднаць")
-  'раз яднаць'
-  ```
-* That's why `BelarusianTextNormalizer` (edited version of `BasicTextNormalizer`) was added to training script:
-  ```python
-  > from run_speech_recognition_seq2seq_streaming import BelarusianTextNormalizer
-  > normalizer_be = BelarusianTextNormalizer()
-  > normalizer_be("раз'яднаць")
-  "раз'яднаць"
-  ```
-* Need to set `use_cache` to False since we're using gradient checkpointing, and the two are incompatible
-* Default Linear scheduler is used
-* Default Adam optimizer is used
-* To save memory (and increase either model or batch_size) can experiment with:
-  * using Adafactor instead of Adam.
-    Adam requires two optimiser params per one model param, but Adafactor uses only one.
-    > A word of caution: Adafactor is untested for fine-tuning Whisper,
-    so we are unsure sure how Adafactor performance compares to Adam!
-  * using Adam 8bit from `bitsandbytes` module.
-    need to provide `optim="adamw_bnb_8bit"` param to `Seq2SeqTrainingArguments`
+
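The streaming setup described in the Notes above maps onto `datasets.load_dataset` roughly as follows. This is a minimal sketch for reference, not part of the commit; the dataset name, config and splits are taken from the run scripts below, everything else is illustrative:

```python
from datasets import load_dataset

# Train split: streamed, so audio is downloaded and decoded lazily.
train_ds = load_dataset(
    "mozilla-foundation/common_voice_11_0", "be",
    split="train", streaming=True, use_auth_token=True,
)

# Validation split: optionally non-streaming (downloaded and cached once),
# which makes parallel preprocessing via dataset.map(..., num_proc=...) possible.
eval_ds = load_dataset(
    "mozilla-foundation/common_voice_11_0", "be",
    split="validation", streaming=False, use_auth_token=True,
)

print(next(iter(train_ds))["sentence"])
```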
src/run_base.sh
ADDED
@@ -0,0 +1,44 @@
+python src/run_speech_recognition_seq2seq_streaming.py \
+    --model_name_or_path="openai/whisper-base" \
+    --dataset_name="mozilla-foundation/common_voice_11_0" \
+    --dataset_config_name="be" \
+    --language="be" \
+    --train_split_name="train" \
+    --eval_split_name="validation" \
+    --model_index_name="Whisper Base Belarusian" \
+    \
+    --max_steps="6000" \
+    --output_dir="./" \
+    --per_device_train_batch_size="64" \
+    --per_device_eval_batch_size="32" \
+    --logging_steps="50" \
+    --logging_first_step \
+    --learning_rate="1e-4" \
+    --warmup_steps="500" \
+    --evaluation_strategy="steps" \
+    --eval_steps="1000" \
+    --save_strategy="steps" \
+    --save_steps="1000" \
+    --gradient_checkpointing \
+    --fp16 \
+    \
+    --shuffle_buffer_size="500" \
+    --generation_max_length="225" \
+    --max_duration_in_seconds="30" \
+    --text_column_name="sentence" \
+    --freeze_feature_encoder="False" \
+    --report_to="tensorboard" \
+    --metric_for_best_model="wer" \
+    --greater_is_better="False" \
+    --load_best_model_at_end \
+    \
+    --do_train \
+    --do_eval \
+    --ignore_data_skip \
+    --predict_with_generate \
+    --do_normalize_eval \
+    --streaming_train="True" \
+    --streaming_eval="False" \
+    --use_auth_token \
+    --push_to_hub \
+    --hub_model_id="ales/whisper-base-belarusian"
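The `--max_steps="6000"` above follows the arithmetic from the Notes: with roughly 370'000 training files and `per_device_train_batch_size=64`, one epoch is about 5782 steps, and rounding up to a multiple of `--eval_steps="1000"` guarantees that evaluation also runs at the very end of training. A quick check of that arithmetic (plain Python, illustrative only):

```python
train_files = 370_000   # approximate CommonVoice 11 'be' train size quoted in the notes
batch_size = 64
eval_steps = 1_000

steps_per_epoch = -(-train_files // batch_size)              # ceiling division -> 5782
max_steps = -(-steps_per_epoch // eval_steps) * eval_steps   # round up to a multiple of eval_steps
print(steps_per_epoch, max_steps)                            # 5782 6000
```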
src/{run.sh → run_small.sh}
RENAMED
@@ -7,10 +7,10 @@ python src/run_speech_recognition_seq2seq_streaming.py \
     --eval_split_name="validation" \
     --model_index_name="Whisper Small Belarusian" \
     \
-    --max_steps="…" \
+    --max_steps="18000" \
     --output_dir="./" \
     --per_device_train_batch_size="64" \
-    --per_device_eval_batch_size="…" \
+    --per_device_eval_batch_size="32" \
     --logging_steps="50" \
     --logging_first_step \
     --learning_rate="1e-4" \
@@ -34,10 +34,12 @@ python src/run_speech_recognition_seq2seq_streaming.py \
     \
     --do_train \
     --do_eval \
+    --resume_from_checkpoint="checkpoint-12000" \
     --ignore_data_skip \
     --predict_with_generate \
     --do_normalize_eval \
-    --… \
+    --streaming_train="True" \
+    --streaming_eval="False" \
     --use_auth_token \
     --push_to_hub \
     --hub_model_id="ales/whisper-small-belarusian"
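The new `--resume_from_checkpoint="checkpoint-12000"` flag goes through the standard `Seq2SeqTrainer` resume path. A minimal sketch of how the HF example scripts usually wire it up (assuming the `training_args`, `last_checkpoint` and `trainer` objects defined in the training script; not copied verbatim from this repo):

```python
# Decide which checkpoint (if any) to resume from.
checkpoint = None
if training_args.resume_from_checkpoint is not None:
    checkpoint = training_args.resume_from_checkpoint   # e.g. "checkpoint-12000"
elif last_checkpoint is not None:
    checkpoint = last_checkpoint                         # auto-detected in output_dir

train_result = trainer.train(resume_from_checkpoint=checkpoint)
trainer.save_model()
```

`--ignore_data_skip` pairs with this: it tells the Trainer not to fast-forward the dataloader to the batch it stopped at, which avoids a long replay over a streamed dataset at the cost of re-seeing early samples.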
src/run_speech_recognition_seq2seq_streaming.py
CHANGED
@@ -220,9 +220,13 @@ class DataTrainingArguments:
             )
         },
     )
-    …
+    streaming_train: bool = field(
         default=True,
-        metadata={"help": "Whether to use streaming mode to load and pre-process the …"},
+        metadata={"help": "Whether to use streaming mode to load and pre-process the train split."},
+    )
+    streaming_eval: bool = field(
+        default=True,
+        metadata={"help": "Whether to use streaming mode to load and pre-process the evaluation split."},
     )


@@ -360,12 +364,14 @@ def main():
         f"Process rank: {training_args.local_rank}, device: {training_args.device}, n_gpu: {training_args.n_gpu}"
         f"distributed training: {bool(training_args.local_rank != -1)}, 16-bits training: {training_args.fp16}"
     )
+
     logger.info(f"Training/evaluation parameters {training_args}")
+    logger.info(f"Data parameters: {data_args}")
+    logger.info(f"Model parameters: {model_args}")

     # Set the verbosity to info of the Transformers logger (on main process only):
     if is_main_process(training_args.local_rank):
         transformers.utils.logging.set_verbosity_info()
-        logger.info("Training/evaluation parameters %s", training_args)

     # 3. Detecting last checkpoint and eventually continue from last checkpoint
     last_checkpoint = None
@@ -423,27 +429,31 @@ def main():
     set_seed(training_args.seed)

     # 4. Load dataset
-    …
+
+    # TODO: replace dataset dicts with a single key to IterableDataset and to Dataset.
+    # don't know how to do it now - using dicts simply because they work.
+    raw_train = IterableDatasetDict() if data_args.streaming_train else DatasetDict()
+    raw_eval = IterableDatasetDict() if data_args.streaming_eval else DatasetDict()

     if training_args.do_train:
-        …
+        raw_train['train'] = load_maybe_streaming_dataset(
             data_args.dataset_name,
             data_args.dataset_config_name,
             split=data_args.train_split_name,
             use_auth_token=True if model_args.use_auth_token else None,
-            streaming=data_args.…
+            streaming=data_args.streaming_train,
         )

     if training_args.do_eval:
-        …
+        raw_eval['eval'] = load_maybe_streaming_dataset(
             data_args.dataset_name,
             data_args.dataset_config_name,
             split=data_args.eval_split_name,
             use_auth_token=True if model_args.use_auth_token else None,
-            streaming=data_args.…
+            streaming=data_args.streaming_eval,
         )

-    raw_datasets_features = list(next(iter(…
+    raw_datasets_features = list(next(iter(raw_train.values())).features.keys())

     if data_args.audio_column_name not in raw_datasets_features:
         raise ValueError(
@@ -510,7 +520,13 @@ def main():
     tokenizer.set_prefix_tokens(language=data_args.language, task=data_args.task)

     # 6. Explicitly resample speech dataset
-    …
+    raw_train = raw_train.cast_column(
+        data_args.audio_column_name, datasets.features.Audio(
+            sampling_rate=feature_extractor.sampling_rate,
+            mono=True
+        )
+    )
+    raw_eval = raw_eval.cast_column(
         data_args.audio_column_name, datasets.features.Audio(
             sampling_rate=feature_extractor.sampling_rate,
             mono=True
@@ -531,60 +547,84 @@ def main():
     normalizer = BelarusianTextNormalizer()  # custom normalizer based on 'official' text normalizer from OpenAI

     if data_args.max_train_samples is not None:
-        …
+        raw_train['train'] = (
+            raw_train['train'].take(data_args.max_train_samples)
+            if data_args.streaming_train
+            else raw_train['train'].select(range(data_args.max_train_samples))
         )

     if data_args.max_eval_samples is not None:
-        …
+        raw_eval['eval'] = (
+            raw_eval['eval'].take(data_args.max_eval_samples)
+            if data_args.streaming_eval
+            else raw_eval['eval'].select(range(data_args.max_eval_samples))
         )

-    def prepare_dataset(…
+    def prepare_dataset(sample, labels_max_len: int = None):
         # process audio
-        …
+        audio = sample[audio_column_name]
+        inputs = feature_extractor(audio["array"], sampling_rate=audio["sampling_rate"])
         # process audio length
-        …
+        sample[model_input_name] = inputs.get(model_input_name)[0]
+        sample["input_length"] = len(audio["array"])

         # process targets
-        input_str = …
+        input_str = sample[text_column_name].lower() if do_lower_case else sample[text_column_name]
         if do_remove_punctuation:
             input_str = normalizer(input_str).strip()
-        …
+        sample['labels'] = tokenizer(input_str).input_ids
+        sample['labels_length'] = len(sample['labels'])  # include special characters

-        …
+        sample['labels_truncated'] = 0
         # need to truncate validation and test labels that are longer that model.config.max_length.
         # can't drop such examples because this will affect validation and test scores.
         # thus need to truncate.
         if labels_max_len is not None:
-            if len(…
+            if len(sample['labels']) > labels_max_len:
+                sample['labels'] = sample['labels'][:labels_max_len]
+                sample['labels_truncated'] = 1

-        return …
+        return sample

     with training_args.main_process_first(desc="dataset map pre-processing"):
-        …
+        logger.info(f'vectorizing dataset')
+
+        # TODO: replace dataset dicts with a single key to IterableDataset and to Dataset.
+        # don't know how to do it now - using dicts simply because they work.
+        vectorized_train = IterableDatasetDict() if data_args.streaming_train else DatasetDict()
+        vectorized_eval = IterableDatasetDict() if data_args.streaming_eval else DatasetDict()
+
+        num_proc = None
+        if data_args.streaming_train or data_args.streaming_eval:
+            logger.info(f'will preprocess data using {num_proc} processes.')
+
+        if data_args.streaming_train:
+            vectorized_train['train'] = raw_train['train'].map(
+                prepare_dataset, remove_columns=raw_datasets_features,
+                fn_kwargs=dict(labels_max_len=None),
+            ).with_format("torch")
+        else:
+            vectorized_train['train'] = raw_train['train'].map(
+                prepare_dataset, remove_columns=raw_datasets_features,
+                num_proc=num_proc,
+                fn_kwargs=dict(labels_max_len=None),
+            ).with_format("torch")
+
+        if data_args.streaming_eval:
+            vectorized_eval['eval'] = raw_eval['eval'].map(
+                prepare_dataset, remove_columns=raw_datasets_features,
+                fn_kwargs=dict(labels_max_len=max_labels_length),
+            ).with_format("torch")
+        else:
+            vectorized_eval['eval'] = raw_eval['eval'].map(
+                prepare_dataset, remove_columns=raw_datasets_features,
+                num_proc=num_proc,
+                fn_kwargs=dict(labels_max_len=max_labels_length),
+            ).with_format("torch")
+
+        if training_args.do_train and data_args.streaming_train:
             # manually shuffle if streaming (done by the trainer for non-streaming)
-            …
+            vectorized_train['train'] = vectorized_train['train'].shuffle(
                 buffer_size=data_args.shuffle_buffer_size,
                 seed=training_args.seed,
             )
@@ -601,11 +641,11 @@ def main():
     if training_args.do_train:
         # Filter items from train set only.
         # Should keep them in eval set not to affect eval metrics.
-        …
+        vectorized_train['train'] = vectorized_train['train'].filter(
             is_audio_in_length_range,
             input_columns=["input_length"],
         )
-        …
+        vectorized_train['train'] = vectorized_train['train'].filter(
             are_labels_in_length_range,
             input_columns=["labels_length"],
         )
@@ -657,18 +697,20 @@ def main():
         if isinstance(train_dataloader.dataset, IterableDatasetShard):
             pass  # set_epoch() is handled by the Trainer
         elif isinstance(train_dataloader.dataset, IterableDataset):
+            logger.info(f'ShuffleCallback. shuffling train dataset. '
+                        f'seed: {training_args.seed}. dataset epoch: {train_dataloader.dataset._epoch}')
             train_dataloader.dataset.set_epoch(train_dataloader.dataset._epoch + 1)

     # Initialize Trainer
     trainer = Seq2SeqTrainer(
         model=model,
         args=training_args,
-        train_dataset=…
-        eval_dataset=…
+        train_dataset=vectorized_train['train'] if training_args.do_train else None,
+        eval_dataset=vectorized_eval['eval'] if training_args.do_eval else None,
         tokenizer=processor,
         data_collator=data_collator,
         compute_metrics=compute_metrics if training_args.predict_with_generate else None,
-        callbacks=[ShuffleCallback()] if data_args.…
+        callbacks=[ShuffleCallback()] if data_args.streaming_train else None,
     )

     # 12. Training
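For context on the `ShuffleCallback` referenced in the last hunk: a callback of this shape (consistent with the context lines above; the repo's own class additionally logs the seed and dataset epoch, as the diff shows) bumps the dataset epoch so a streamed dataset reshuffles its buffer:

```python
from datasets import IterableDataset
from transformers import TrainerCallback
from transformers.trainer_pt_utils import IterableDatasetShard


class ShuffleCallback(TrainerCallback):
    """Re-shuffle a streaming dataset at the beginning of each epoch (sketch)."""

    def on_epoch_begin(self, args, state, control, train_dataloader=None, **kwargs):
        if isinstance(train_dataloader.dataset, IterableDatasetShard):
            pass  # set_epoch() is handled by the Trainer
        elif isinstance(train_dataloader.dataset, IterableDataset):
            # advancing the epoch changes the effective shuffle seed of the buffer
            train_dataloader.dataset.set_epoch(train_dataloader.dataset._epoch + 1)
```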
src/{run_debug.sh → run_tiny_debug.sh}
RENAMED
File without changes
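Tying back to the "Ways to decrease evaluation time" notes in the readme diff: the relevant knobs all live on `Seq2SeqTrainingArguments`. A sketch of the kind of configuration those notes describe (values are illustrative, not taken from the run scripts):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./",
    per_device_eval_batch_size=8,    # little eval speed gain beyond ~8 per the notes
    predict_with_generate=True,
    generation_max_length=40,        # short cap while training; raise back to 448 for the final eval
    evaluation_strategy="steps",
    eval_steps=2000,                 # evaluating less often also cuts wall-clock time
)
```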