# Model Card for Opera Bullet Interpreter
An unofficial United States Air Force and Space Force performance statement "translation" model. It takes a properly formatted performance statement, also known as a "bullet," as input and outputs a long-form, plain English sentence describing the accomplishments captured within the bullet.
This checkpoint is a fine-tuned version of LaMini-Flan-T5-783M, trained on the private justinthelaw/opera-bullet-completions dataset.
To learn more about this project, please visit the [Opera GitHub Repository](https://github.com/justinthelaw/opera).
# Table of Contents
- [Model Card for Opera Bullet Interpreter](#model-card-for-opera-bullet-interpreter)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Model Examination](#model-examination)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
- [Citation](#citation)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)
# Model Details
## Model Description
An unofficial United States Air Force and Space Force performance statement "translation" model. It takes a properly formatted performance statement, also known as a "bullet," as input and outputs a long-form, plain English sentence describing the accomplishments captured within the bullet.
This is a fine-tuned version of LaMini-Flan-T5-783M, trained on the private justinthelaw/opera-bullet-completions dataset.
- **Developed by:** Justin Law, Alden Davidson, Christopher Kodama, My Tran
- **Model type:** Language Model
- **Language(s) (NLP):** en
- **License:** apache-2.0
- **Parent Model:** [LaMini-Flan-T5-783M](https://huggingface.co/MBZUAI/LaMini-Flan-T5-783M)
- **Resources for more information:**
  - [GitHub Repo](https://github.com/justinthelaw/opera)
  - [Associated Paper](https://huggingface.co/MBZUAI/LaMini-Flan-T5-783M)
# Uses
## Direct Use
Used to programmatically produce training data for Opera's Bullet Forge (see GitHub repository for details).
## Downstream Use
Used to quickly interpret bullets written by Airmen (Air Force) or Guardians (Space Force) into long-form, plain English sentences.
## Out-of-Scope Use
This model is not intended for generating bullets from long-form, plain English sentences, or for general NLP functionality.
# Bias, Risks, and Limitations
Acronyms or abbreviations specific to small units may not be expanded properly. Bullets in highly non-standard formats may produce lower-quality results.
## Recommendations
Look up acronyms to ensure the correct narrative is being formed. Spot-check bullets that contain more complex acronyms and abbreviations for narrative precision.
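One way to support this spot check is to flag candidate acronyms before reading the model output. The sketch below is an illustration only and is not part of Opera: the `find_candidate_acronyms` helper, its regular expression, and the example bullet are all hypothetical. It treats any token that starts with a capital letter and continues with capitals or digits as a candidate.

```python
import re

# Hypothetical pattern: a capital letter followed by one or more capitals or digits
ACRONYM_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def find_candidate_acronyms(bullet: str) -> list:
    """Return unique candidate acronyms in order of first appearance."""
    seen = []
    for token in ACRONYM_PATTERN.findall(bullet):
        if token not in seen:
            seen.append(token)
    return seen


# Made-up bullet used purely for illustration
example_bullet = "- Led 5 mbr team; repaired 12 radar units--saved USAF $2M, sq won MAJCOM award"
print(find_candidate_acronyms(example_bullet))
```

Lowercase abbreviations such as "mbr" or "sq" are not caught by this pattern, so it complements a manual read rather than replacing it.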
# Training Details
## Training Data
The model was fine-tuned on the justinthelaw/opera-bullet-completions dataset, part of which is available in the GitHub repository.
## Training Procedure
### Preprocessing
The justinthelaw/opera-bullet-completions dataset was created using a custom Python web scraper and custom cleaning functions, all of which can be found in the GitHub repository.
### Speeds, Sizes, Times
It takes approximately 3-5 seconds per inference when using any standard-sized Air and Space Force bullet statement.
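Latency depends on hardware and generation settings, so this figure is worth re-measuring locally. A minimal timing sketch follows, assuming a hypothetical `interpret` callable that wraps the model's generation call. The stand-in `fake_interpret` only echoes its input so the snippet runs without downloading the model.

```python
import time


def time_inference(interpret, bullet, repeats=3):
    """Call interpret(bullet) several times; return (last output, average seconds per call)."""
    durations = []
    output = None
    for _ in range(repeats):
        start = time.perf_counter()
        output = interpret(bullet)
        durations.append(time.perf_counter() - start)
    return output, sum(durations) / len(durations)


def fake_interpret(bullet):
    # Stand-in for the real model call (assumption for illustration)
    return bullet.upper()


output, average_seconds = time_inference(fake_interpret, "- example bullet")
print(f"{average_seconds:.6f} s per inference")
```

To time the real model, pass a function that tokenizes the bullet, calls `generate`, and decodes the result.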
# Evaluation
## Testing Data, Factors & Metrics
### Testing Data
20% of the justinthelaw/opera-bullet-completions dataset was used to validate the model's performance.
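A 20% hold-out of this kind can be sketched as follows. The record fields (`input`, `output`), the `split_dataset` helper, and the fixed seed are assumptions for illustration; the exact split procedure used for this checkpoint is not documented here.

```python
import random


def split_dataset(records, validation_fraction=0.2, seed=42):
    """Shuffle a copy of records and return (train, validation) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder records standing in for bullet/completion pairs
records = [{"input": f"bullet {i}", "output": f"sentence {i}"} for i in range(10)]
train, validation = split_dataset(records)
print(len(train), len(validation))
```

Fixing the seed keeps the validation set stable across runs, which makes scores comparable between training iterations.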
### Factors
Repetition, contextual loss, and bullet format are all loss factors tied into the backward propagation calculations and validation steps.
### Metrics
ROUGE scores were computed and averaged. These may be provided in future iterations of this model's development.
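For readers unfamiliar with the metric, the sketch below computes a simplified ROUGE-1 F1 score (unigram overlap) and averages it over reference/candidate pairs. It is an illustration under stated assumptions, not the evaluation code used for this model: the function names are hypothetical, tokenization is a plain lowercase whitespace split, and no stemming is applied, so values will differ from those of a full ROUGE implementation.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Simplified ROUGE-1 F1: clipped unigram overlap between reference and candidate."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def average_rouge1(pairs) -> float:
    """Average the simplified ROUGE-1 F1 over (reference, candidate) pairs."""
    scores = [rouge1_f1(reference, candidate) for reference, candidate in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up sentence pairs for illustration
pairs = [
    ("the team repaired twelve radar units", "the team repaired twelve radar units"),
    ("the team repaired twelve radar units", "a squadron won an award"),
]
print(average_rouge1(pairs))
```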
## Results
# Model Examination
More information needed
# Environmental Impact
- **Hardware Type:** 2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4, AMD Radeon Pro 5300M 4 GB
- **Hours used:** 18
- **Cloud Provider:** N/A
- **Compute Region:** N/A
- **Carbon Emitted:** N/A
# Technical Specifications
## Hardware
2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4, AMD Radeon Pro 5300M 4 GB
## Software
VSCode, Jupyter Notebook, Python3, PyTorch, Transformers, Pandas, Asyncio, Loguru, Rich
# Citation
**BibTeX:**
@article{lamini-lm,
author = {Minghao Wu and
Abdul Waheed and
Chiyu Zhang and
Muhammad Abdul-Mageed and
Alham Fikri Aji
},
title = {LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions},
journal = {CoRR},
volume = {abs/2304.14402},
year = {2023},
url = {https://arxiv.org/abs/2304.14402},
eprinttype = {arXiv},
eprint = {2304.14402}
}
# Model Card Authors
Justin Law, Alden Davidson, Christopher Kodama, My Tran
# Model Card Contact
Email: [email protected]
# How to Get Started with the Model
Use the code below to get started with the model.
<details>
<summary> Click to expand </summary>
```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

bullet_data_creation_prefix = (
    "Using upwards of 3 sentences, expand upon the following Air and Space Force bullet statement by "
    + "spelling-out acronyms and adding additional context that is not already included in the Air and Space Force bullet statement: "
)

# Path of the pre-trained model that will be used
model_path = "justinthelaw/opera-bullet-interpreter"
# Path of the pre-trained model tokenizer that will be used
# Must match the model checkpoint's signature
tokenizer_path = "justinthelaw/opera-bullet-interpreter"
# Max number of input tokens; longer inputs are truncated
# Increasing this beyond 512 may increase compute time significantly
max_input_token_length = 512
# Max number of tokens the model should output for the interpretation
max_output_token_length = 512
# Beams to use for beam search algorithm
# Increased beams means increased quality, but increased compute time
number_of_beams = 6
# Scales logits before soft-max to control randomness
# Lower values (~0) make output more deterministic
temperature = 0.5
# Limits generated tokens to top K probabilities
# Reduces chances of rare word predictions
top_k = 50
# Applies nucleus sampling, limiting token selection to a cumulative probability
# Creates a balance between randomness and determinism
top_p = 0.90

try:
    print(f"Loading {model_path}...")
    tokenizer = T5Tokenizer.from_pretrained(
        tokenizer_path,
        model_max_length=max_input_token_length,
        add_special_tokens=False,
    )
    input_model = T5ForConditionalGeneration.from_pretrained(model_path)

    # Set device to be used based on GPU availability
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Model is sent to device for use
    model = input_model.to(device)  # type: ignore

    input_text = bullet_data_creation_prefix + input("Input a US Air or Space Force bullet: ")

    # Tokenize the prompt and move the tensors to the same device as the model
    encoded_input_text = tokenizer(
        input_text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_token_length,
    ).to(device)

    # Generate the long-form interpretation
    summary_ids = model.generate(
        encoded_input_text["input_ids"],
        attention_mask=encoded_input_text["attention_mask"],
        max_length=max_output_token_length,
        num_beams=number_of_beams,
        # temperature, top_k and top_p only take effect when sampling is enabled;
        # set do_sample=False for plain deterministic beam search
        do_sample=True,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        early_stopping=True,
    )

    output_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
    # input_text and output_text can be inserted into data sets
    print("\n\t" + output_text)

except KeyboardInterrupt:
    print("Received interrupt, stopping script...")
except Exception as e:
    print(f"An error occurred during generation: {e}")
```
</details>