sander-wood committed · Commit 7ade5b9 · Parent(s): 85c6db6

Update README.md

README.md CHANGED
@@ -74,7 +74,7 @@ The `config.py` file contains critical settings for training and inference, allo

- **TRAIN_FOLDERS**: Specify the dataset folders for training. Multiple folders can be included.
- **EVAL_FOLDERS**: Specify evaluation dataset folders.
- **PRE_WEIGHTS_PATH**: Path to pre-trained weights for transfer learning and fine-tuning.
- **WEIGHTS_PATH & LOGS_PATH**: Define locations to save trained weights and logs, respectively.
- **NUM_EPOCHS, LEARNING_RATE, BATCH_SIZE**: Control training duration, learning rate, and batch size for optimal learning.
- **ACCUMULATION_STEPS**: Set accumulation steps to emulate larger batch sizes, managing memory usage efficiently.
@@ -92,39 +92,19 @@ Generative modelling with bGPT is a flexible and powerful approach to learning a

1. **Prepare Your Data**: bGPT can handle any computer file type, including text, images, audio, executables, and encrypted or proprietary formats, without needing specific adjustments for each modality. This capability allows for straightforward and versatile training on a wide array of data. All you need to do here is split your data into training and evaluation sets (a small splitting sketch follows this list).

2. **Adjust Configuration Settings**: Modify the `config.py` file to tailor the training process to your needs. At a minimum, update `TRAIN_FOLDERS` and `EVAL_FOLDERS` to point to your actual data directories. Also, specify where to save the trained model weights and logs by setting `WEIGHTS_PATH` and `LOGS_PATH`. You may adjust other parameters based on your specific requirements. For instance, with the default `PATCH_SIZE=16` and `PATCH_LENGTH=512`, bGPT can model byte sequences up to 8KB. If your training files are larger and you have sufficient computational resources, consider increasing these parameters to accommodate the larger file sizes.

3. **Leverage Pre-trained Weights (Optional)**: If you wish to fine-tune a pre-trained bGPT model, set `PRE_WEIGHTS_PATH` to the location of the pre-trained weights and ensure `LOAD_FROM_PRE_CHECKPOINT=True`. To train a model from scratch, simply set `LOAD_FROM_PRE_CHECKPOINT=False`.

4. **Start Training**: Run `train-gen.py` to begin the training process. The script will use the configurations set in `config.py` and apply the training data to learn generative models capable of producing new, unseen outputs in the format of your training data.
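Since step 1 only asks for a train/eval split, here is a small illustrative sketch that randomly splits a folder of arbitrary files into training and evaluation directories; the function and directory names are assumptions for the example, not bGPT utilities:

```python
# Randomly split files into train/eval folders (illustrative helper,
# not part of the bGPT repository).
import random
import shutil
from pathlib import Path

def split_dataset(source: str, train_dir: str, eval_dir: str, eval_ratio: float = 0.1) -> None:
    files = [p for p in Path(source).rglob("*") if p.is_file()]
    random.shuffle(files)
    n_eval = int(len(files) * eval_ratio)
    for i, path in enumerate(files):
        dest = Path(eval_dir if i < n_eval else train_dir) / path.name
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, dest)

split_dataset("data/raw", "data/train", "data/eval")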
-### Data Conversion
-
-The conversion mode in bGPT adds a specialized functionality for transforming data from one format to another, leveraging the model's understanding of byte sequences across different file types. This mode supports both unidirectional and bidirectional conversions, enabling a wide range of data transformation tasks. Here's how to utilize the conversion mode effectively:
-
-1. **Define Conversion Mode**: In your `config.py` file, you'll define the `CONVERSION_MODE` setting, which governs how files are transformed. This setting offers two options: unidirectional and bidirectional conversion, denoted by `"->"` and `"&"` respectively.
-
-   - Unidirectional Conversion: Denoted by `"->"`, this mode signifies a one-way transformation from one format to another. For instance, if you want to convert text files to HTML, you'd set `CONVERSION_MODE = "txt->html"`. This means the model will learn to convert text files specifically into HTML format, but not vice versa.
-
-   - Bidirectional Conversion: Denoted by `"&"`, this mode implies a two-way transformation between formats. For example, setting `CONVERSION_MODE = "wav&mp3"` instructs the model to learn bidirectional conversion between WAV and MP3 audio formats. In this mode, the model learns to convert files from WAV to MP3 and vice versa, allowing flexibility in both directions of conversion.
-
-2. **Prepare Your Data**: Ensure your data pairs are stored within the same directory path in both `TRAIN_FOLDERS` and `EVAL_FOLDERS`. Each pair should share identical paths, including filenames, differing only in their file extensions. For instance, if converting between WAV and MP3 formats, ensure files like "path/audio.wav" and "path/audio.mp3" are paired accordingly. This strict alignment guarantees the script correctly associates files for conversion based on the specified mode.
-
-3. **Adjust Training Parameters**: Although the conversion mode operates under the same training principles as generative modelling, you might want to adjust certain parameters in `config.py` to optimize the conversion process. This could include tuning the `PATCH_SIZE` and `PATCH_LENGTH` settings to better accommodate the file sizes commonly encountered in your conversion tasks.
-
-4. **Leverage Pre-trained Weights (Optional)**: Same as regular generative modelling, if you wish to fine-tune a pre-trained bGPT model, set `PRETRAINED_PATH` to the location of the pre-trained weights and ensure `LOAD_FROM_PRETRAINED=True`. To train a model from scratch, simply set `LOAD_FROM_PRETRAINED=False`.
-
-5. **Start Training for Conversion**: When training bGPT in conversion mode, the model learns to map byte sequences from the source format to the target format (or vice versa in bidirectional mode). Execute `train-gen.py` to start the training process, ensuring that the `CONVERSION_MODE` is correctly set in your configuration file.
-
-By leveraging the conversion mode, bGPT enables simulating and reverse engineering the behaviors of algorithms or hardware through paired inputs and outputs, opening up new possibilities for data processing and content generation tasks.
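The two notations described above are easy to illustrate in code. Below is a hedged sketch of how a `CONVERSION_MODE` string could be parsed into a source format, a target format, and a directionality flag; the helper name and return format are assumptions for illustration, not repository code:

```python
# Illustrative parser for the CONVERSION_MODE notation described above.
# The function name and return format are assumptions, not bGPT code.
def parse_conversion_mode(mode: str) -> tuple[str, str, bool]:
    """Return (source_ext, target_ext, bidirectional)."""
    if "->" in mode:
        src, dst = mode.split("->")
        return src, dst, False          # one-way, e.g. "txt->html"
    if "&" in mode:
        src, dst = mode.split("&")
        return src, dst, True           # two-way, e.g. "wav&mp3"
    raise ValueError(f"Unrecognized CONVERSION_MODE: {mode!r}")

print(parse_conversion_mode("txt->html"))  # ('txt', 'html', False)
print(parse_conversion_mode("wav&mp3"))    # ('wav', 'mp3', True)
```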
### Classification

Classification with bGPT leverages the model's ability to understand and differentiate between various types of data at a fundamental level. This involves extracting a global feature from the byte sequence, which is then processed by a classification head. Here's how to approach classification tasks with bGPT:

1. **Prepare Labelled Data**: Ensure your dataset consists of labelled data, which can be a mix of different formats. The model reads labels from filenames using a `label_*.ext` convention, where the label is derived via `filename.split('_')[0]`. This means the classification label should be clearly reflected in the file name, as in "Business_1.txt". It is crucial to organize your files accordingly to facilitate accurate classification (a quick check of this convention follows this list).

2. **Generative Modelling Before Classification (Strongly Recommended)**: Before embarking on classification tasks, it is highly recommended to perform generative modelling on the same dataset. Starting with weights trained through generative modelling provides a solid foundation for further fine-tuning in classification tasks. To do this, set `PRE_WEIGHTS_PATH` to your generative model weights and ensure `LOAD_FROM_PRE_CHECKPOINT=True`. Directly training a classification model from scratch without this pre-training step has been observed to result in significantly poorer performance. When fine-tuning for classification, ensure that `WEIGHTS_PATH` and `LOGS_PATH` are set to different locations to prevent overwriting previous models. Note that the classification model will inherit bGPT's patch-level decoder and discard the byte-level decoder, so it is essential to keep the model parameters unchanged during this phase.

3. **Start Training for Classification**: Run `train-cls.py` to begin the classification training process. The script will utilize the previously set configurations and apply them to your labelled dataset. The model will learn to classify the input data into the defined categories based on the labels extracted from the filenames.
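As a quick sanity check of the labelling convention, the snippet below derives labels from filenames exactly as `filename.split('_')[0]` would; the file paths are made up for the example:

```python
# Derive classification labels from filenames, per the convention above.
from pathlib import Path

def label_from_path(path: str) -> str:
    return Path(path).name.split("_")[0]

print(label_from_path("data/train/Business_1.txt"))  # Business
print(label_from_path("data/train/Sport_42.wav"))    # Sport (hypothetical file)
```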
@@ -134,15 +114,15 @@ Inference with bGPT allows you to apply trained models to new data, performing t

1. **Set Up the Inference Configuration**: First, ensure your `config.py` includes the configuration for inference as shown above. Adjust the parameters according to the task you want to perform (a configuration sketch follows this list):

- `INFERENCE_WEIGHTS_PATH`: This should point to the location of the trained model weights you intend to use for inference. For example, `weights-conversion` indicates a model trained for converting files from one format to another.
- `INPUT_EXT` and `TARGET_EXT`: These parameters define the extensions of the input and target files, respectively. In the given configuration, the model expects input files with the `.mid` extension and will aim to convert them into files with the `.abc` extension.
- `MODE`: Determines the mode of inference. `convert` mode is used for converting files from one format to another, while `generate` mode is used for generating new content.
- `NUM_SAMPLES`, `TOP_K`, `TOP_P`, and `TEMPERATURE`: These parameters control the sampling strategy for generation tasks, influencing the diversity and creativity of the output.
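Here is a minimal sketch of what such an inference block in `config.py` might look like, assuming the `.mid`-to-`.abc` conversion example above; all values are illustrative assumptions:

```python
# Hypothetical inference settings -- names follow the README, values are
# illustrative assumptions for the MIDI-to-ABC conversion example.
INFERENCE_WEIGHTS_PATH = "weights-conversion"  # trained conversion model
INPUT_EXT = "mid"    # extension of input files
TARGET_EXT = "abc"   # extension of generated target files
MODE = "convert"     # "convert" or "generate"

NUM_SAMPLES = 100    # number of samples in "generate" mode
TOP_K = 0            # top-k filtering (0 disables it)
TOP_P = 1.0          # nucleus (top-p) filtering
TEMPERATURE = 1.0    # softmax temperature
```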
2. **Performing Conversion or Generation**: Depending on the `MODE` you've set, the inference process will either convert input files to a new format or generate new content:

- **Conversion**: In `convert` mode, ensure your input files (e.g., `.mid`) are placed in a designated directory. The model will read these files, apply the conversion process, and output files in the target format (e.g., `.abc`) in the specified output directory.
- **Generation**: In `generate` mode, the model will generate samples from scratch. The number of samples to generate is specified by `NUM_SAMPLES`. The generated samples will be placed in the output directory.

3. **Executing Inference**: To start the inference process, run the script `inference.py`. Make sure the script is configured to use the settings from `config.py` for conducting the desired inference task.
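To make `TOP_K`, `TOP_P`, and `TEMPERATURE` concrete, below is a generic sketch of how these filters are commonly applied to next-byte logits before sampling; this illustrates the standard technique, not bGPT's exact implementation:

```python
# Generic temperature / top-k / top-p (nucleus) sampling over next-byte logits.
# Standard technique shown for illustration; not bGPT's exact code.
import numpy as np

def sample_byte(logits: np.ndarray, temperature=1.0, top_k=0, top_p=1.0) -> int:
    logits = logits / temperature                   # sharpen or flatten the distribution
    if top_k > 0:                                   # keep only the k largest logits
        cutoff = np.sort(logits)[-top_k]
        logits = np.where(logits < cutoff, -np.inf, logits)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_p < 1.0:                                 # keep the smallest nucleus of mass top_p
        order = np.argsort(probs)[::-1]
        keep = order[np.cumsum(probs[order]) <= top_p]
        if keep.size == 0:                          # always keep the single best byte
            keep = order[:1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask / mask.sum()
    return int(np.random.choice(len(probs), p=probs))

print(sample_byte(np.random.randn(256), temperature=0.8, top_k=40, top_p=0.9))
```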
@@ -180,4 +160,4 @@ The CPU States Dataset contains 2.1 million instances, each featuring a 1KB memo

  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```