Model Card for Opera Bullet Interpreter

An unofficial United States Air Force and Space Force performance statement "translation" model. It takes a properly formatted performance statement, also known as a "bullet," as input and outputs a long-form, plain-English sentence describing the accomplishments captured in the bullet.

This checkpoint is a fine-tuned version of the LaMini-Flan-T5-783M, using the justinthelaw/opera-bullet-completions (private) dataset.

To learn more about this project, please visit the Opera GitHub Repository.

Table of Contents
Model Details
Uses
Bias, Risks, and Limitations
Training Details
Evaluation
Model Examination
Environmental Impact
Technical Specifications [optional]
Citation
Model Card Authors
Model Card Contact
How to Get Started with the Model

Model Details

Model Description

This is a fine-tuned version of the LaMini-Flan-T5-783M, using the justinthelaw/opera-bullet-completions (private) dataset.

Developed by: Justin Law, Alden Davidson, Christopher Kodama, My Tran
Model type: Language Model
Language(s) (NLP): en
License: apache-2.0
Parent Model: LaMini-Flan-T5-783M
Resources for more information:
- GitHub Repo
- Associated Paper

Uses

Direct Use

Used to programmatically produce training data for Opera's Bullet Forge (see GitHub repository for details).

Downstream Use [Optional]

Used to quickly interpret bullets written by Airmen (Air Force) or Guardians (Space Force) into long-form, plain English sentences.

Out-of-Scope Use

Generating bullets from long-form, plain English sentences. General NLP functionality.

Bias, Risks, and Limitations

Specialized acronyms or abbreviations specific to small units may not be transformed properly. Bullets in highly non-standard formats may produce lower-quality output.

Recommendations

Look up acronyms to ensure the correct narrative is being formed. Spot-check bullets that contain more complex acronyms and abbreviations for narrative precision.
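As an aid to that spot check, the sketch below lists acronym-like tokens in a bullet so they can be looked up by hand. The helper name `find_acronyms`, the regular expression, and the example bullet are all illustrative assumptions; they are not part of the Opera codebase, and the pattern will miss hyphenated designators such as aircraft names.

```python
import re

# Assumed heuristic: an acronym is a run of two or more capital letters or
# digits that starts with a capital letter. This is illustrative only.
ACRONYM_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def find_acronyms(bullet: str) -> list:
    """Return unique acronym-like tokens in order of first appearance."""
    seen = []
    for token in ACRONYM_PATTERN.findall(bullet):
        if token not in seen:
            seen.append(token)
    return seen


# Made-up bullet for illustration, not taken from the training data
example = "- Led 5-mbr team; repaired 12 F-16 radar units for AFCENT--saved $1.2M, cut NMC rate 15%"
print(find_acronyms(example))
```

Each token returned is a candidate for manual lookup before the model's interpretation is accepted.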

Training Details

Training Data


The model was fine-tuned on the justinthelaw/opera-bullet-completions dataset, which can be partially found at the GitHub repository.

Training Procedure

Preprocessing

The justinthelaw/opera-bullet-completions dataset was created using a custom Python web-scraper, along with some custom cleaning functions, all of which can be found at the GitHub repository.

Speeds, Sizes, Times

It takes approximately 3-5 seconds per inference when using any standard-sized Air and Space Force bullet statement.
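To check this figure on other hardware, a small timing harness along the following lines may help. It is a minimal sketch: `time_inference` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in for a real call into the model.

```python
import time
from statistics import mean


def time_inference(generate_fn, bullets, repeats=1):
    """Return the mean wall-clock seconds per call of generate_fn."""
    durations = []
    for bullet in bullets:
        for _ in range(repeats):
            start = time.perf_counter()
            generate_fn(bullet)
            durations.append(time.perf_counter() - start)
    return mean(durations)


def fake_generate(bullet):
    # Stand-in for the real model call; sleeps briefly to simulate work
    time.sleep(0.01)
    return bullet.upper()


seconds = time_inference(fake_generate, ["- Led team; fixed radar--saved $1M"])
print(f"{seconds:.3f} s per inference")
```

Passing a function that wraps the model's generate call in place of `fake_generate` would give a per-bullet latency for a given machine.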

Evaluation

Testing Data, Factors & Metrics

Testing Data

20% of the justinthelaw/opera-bullet-completions dataset was used to validate the model's performance.
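The card does not say how the 20% was selected. The sketch below shows one common approach, a seeded random hold-out, purely as an assumption; `train_validation_split` is a hypothetical helper and the integer records stand in for dataset rows.

```python
import random


def train_validation_split(records, validation_fraction=0.2, seed=42):
    """Shuffle a copy of records and hold out a fraction for validation."""
    shuffled = list(records)
    # A seeded generator keeps the split reproducible between runs
    random.Random(seed).shuffle(shuffled)
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder records standing in for bullet/completion pairs
train_rows, validation_rows = train_validation_split(range(100))
print(len(train_rows), len(validation_rows))
```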

Factors

Repetition, contextual loss, and bullet format are all loss factors tied into the backward propagation calculations and validation steps.

Metrics

ROUGE scores were computed and averaged. These may be provided in future iterations of this model's development.
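For readers unfamiliar with the metric, the following is a simplified sketch of ROUGE-1 F1 (unigram overlap) averaged over reference/candidate pairs. It is not the scoring code used for this model: it assumes whitespace tokenization and lowercasing, applies no stemming, and the function names are illustrative.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate string."""
    reference_counts = Counter(reference.lower().split())
    candidate_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((reference_counts & candidate_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(candidate_counts.values())
    recall = overlap / sum(reference_counts.values())
    return 2 * precision * recall / (precision + recall)


def mean_rouge1(pairs) -> float:
    """Average ROUGE-1 F1 over (reference, candidate) pairs."""
    scores = [rouge1_f1(reference, candidate) for reference, candidate in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

A maintained ROUGE implementation would be the better choice for reported results, since it also covers ROUGE-2 and ROUGE-L.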

Results

Model Examination

More information needed

Environmental Impact

Hardware Type: 2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4, AMD Radeon Pro 5300M 4 GB
Hours used: 18
Cloud Provider: N/A
Compute Region: N/A
Carbon Emitted: N/A

Technical Specifications

Hardware

2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4, AMD Radeon Pro 5300M 4 GB

Software

VSCode, Jupyter Notebook, Python3, PyTorch, Transformers, Pandas, Asyncio, Loguru, Rich

Citation

BibTeX:

@article{lamini-lm,
  author = {Minghao Wu and Abdul Waheed and Chiyu Zhang and Muhammad Abdul-Mageed and Alham Fikri Aji},
  title = {LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions},
  journal = {CoRR},
  volume = {abs/2304.14402},
  year = {2023},
  url = {https://arxiv.org/abs/2304.14402},
  eprinttype = {arXiv},
  eprint = {2304.14402}
}

Model Card Authors


Justin Law, Alden Davidson, Christopher Kodama, My Tran

Model Card Contact

Email: [email protected]

How to Get Started with the Model

Use the code below to get started with the model.


import torch
from loguru import logger
from transformers import T5ForConditionalGeneration, T5Tokenizer

bullet_data_creation_prefix = (
    "Using upwards of 3 sentences, expand upon the following Air and Space Force bullet statement by "
    + "spelling-out acronyms and adding additional context that is not already included in the Air and Space Force bullet statement: "
)

# Path of the pre-trained model that will be used
model_path = "justinthelaw/opera-bullet-interpreter"
# Path of the pre-trained model tokenizer that will be used
# Must match the model checkpoint's signature
tokenizer_path = "justinthelaw/opera-bullet-interpreter"
# Max length of tokens a user may enter for summarization
# Increasing this beyond 512 may increase compute time significantly
max_input_token_length = 512
# Max length of tokens the model should output for the summary
# Approximately the number of tokens it may take to generate a bullet
max_output_token_length = 512
# Beams to use for beam search algorithm
# Increased beams means increased quality, but increased compute time
number_of_beams = 6
# Scales logits before soft-max to control randomness
# Lower values (~0) make output more deterministic
temperature = 0.5
# Limits generated tokens to top K probabilities
# Reduces chances of rare word predictions
top_k = 50
# Applies nucleus sampling, limiting token selection to a cumulative probability
# Creates a balance between randomness and determinism
top_p = 0.90

try:
    logger.info(f"Loading {model_path}...")
    tokenizer = T5Tokenizer.from_pretrained(
        tokenizer_path,
        model_max_length=max_input_token_length,
        add_special_tokens=False,
    )
    input_model = T5ForConditionalGeneration.from_pretrained(model_path)
    # Set device to be used based on GPU availability
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Model is sent to device for use
    model = input_model.to(device)  # type: ignore

    input_text = bullet_data_creation_prefix + input("Input a US Air or Space Force bullet: ")

    encoded_input_text = tokenizer.encode_plus(
        input_text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_token_length,
    )

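    # Generate the long-form interpretation.
    # Note: temperature, top_k, and top_p only take effect when do_sample=True
    # is also passed; as written, generation uses beam search alone.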
    summary_ids = model.generate(
        encoded_input_text["input_ids"],
        attention_mask=encoded_input_text["attention_mask"],
        max_length=max_output_token_length,
        num_beams=number_of_beams,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        early_stopping=True,
    )

    output_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Print the original bullet (prompt prefix stripped), then the interpretation
    print(input_text[len(bullet_data_creation_prefix):] + "\n\t" + output_text)

except KeyboardInterrupt:
    print("Received interrupt, stopping script...")
except Exception as e:
    print(f"An error occurred during generation: {e}")