# Model Card for Opera Bullet Interpreter

An unofficial United States Air Force and Space Force performance statement "translation" model. It takes a properly formatted performance statement, also known as a "bullet," as input and outputs a long-form, plain-English sentence describing the accomplishments captured in the bullet.

This checkpoint is a fine-tuned version of the LaMini-Flan-T5-783M, using the justinthelaw/opera-bullet-completions (private) dataset.

To learn more about this project, please visit the [Opera GitHub Repository](https://github.com/justinthelaw/opera).

# Table of Contents

- [Model Card for Opera Bullet Interpreter](#model-card-for-opera-bullet-interpreter)
- [Table of Contents](#table-of-contents)
- [Model Details](#model-details)
- [Uses](#uses)
- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
- [Training Details](#training-details)
- [Evaluation](#evaluation)
- [Model Examination](#model-examination)
- [Environmental Impact](#environmental-impact)
- [Technical Specifications](#technical-specifications)
- [Citation](#citation)
- [Model Card Authors](#model-card-authors)
- [Model Card Contact](#model-card-contact)
- [How to Get Started with the Model](#how-to-get-started-with-the-model)

# Model Details

## Model Description

An unofficial United States Air Force and Space Force performance statement "translation" model. It takes a properly formatted performance statement, also known as a "bullet," as input and outputs a long-form, plain-English sentence describing the accomplishments captured in the bullet.

This is a fine-tuned version of the LaMini-Flan-T5-783M, using the justinthelaw/opera-bullet-completions (private) dataset.

- **Developed by:** Justin Law, Alden Davidson, Christopher Kodama, My Tran
- **Model type:** Language Model
- **Language(s) (NLP):** en
- **License:** apache-2.0
- **Parent Model:** [LaMini-Flan-T5-783M](https://huggingface.co/MBZUAI/LaMini-Flan-T5-783M)
- **Resources for more information:**
  - [GitHub Repo](https://github.com/justinthelaw/opera)
  - [Associated Paper](https://arxiv.org/abs/2304.14402)

# Uses

## Direct Use

Used to programmatically produce training data for Opera's Bullet Forge (see GitHub repository for details).

## Downstream Use

Used to quickly interpret bullets written by Airmen (Air Force) or Guardians (Space Force) into long-form, plain English sentences.

## Out-of-Scope Use

Generating bullets from long-form, plain English sentences. General NLP functionality.

# Bias, Risks, and Limitations

Specialized acronyms or abbreviations specific to small units may not be expanded correctly. Bullets in highly non-standard formats may produce lower-quality results.

## Recommendations

Look up acronyms to ensure the correct narrative is being formed. Spot-check bullets that contain more complex acronyms and abbreviations for narrative precision.
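One way to support this spot check is to list the acronym-like tokens in a bullet before reading the model output. The sketch below is not part of Opera: the helper name `find_acronyms`, the regular expression, and the sample bullet are all assumptions made for illustration, and the pattern will miss lowercase abbreviations such as "mbr".

```python
import re
from typing import List

# Assumption: an "acronym" is an uppercase letter followed by one or more
# uppercase letters or digits. Real bullets may need a broader pattern.
ACRONYM_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+\b")


def find_acronyms(bullet: str) -> List[str]:
    """Return unique acronym-like tokens in order of first appearance."""
    seen: List[str] = []
    for token in ACRONYM_PATTERN.findall(bullet):
        if token not in seen:
            seen.append(token)
    return seen


# Made-up bullet used only to exercise the helper
example_bullet = "- Led 5-mbr team; upgraded SIPR network for AOC--cut outages 40%, lauded by MAJCOM/CC"
print(find_acronyms(example_bullet))
```

Each token returned can then be checked against the expanded sentence to confirm the model spelled it out as intended.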

# Training Details

## Training Data

The model was fine-tuned on the justinthelaw/opera-bullet-completions dataset, which can be partially found at the GitHub repository.

## Training Procedure

### Preprocessing

The justinthelaw/opera-bullet-completions dataset was created using a custom Python web-scraper, along with some custom cleaning functions, all of which can be found at the GitHub repository.

### Speeds, Sizes, Times

It takes approximately 3-5 seconds per inference when using any standard-sized Air and Space Force bullet statement.
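To check this figure on other hardware, a small timing wrapper is enough. The sketch below is an assumption-laden illustration: `time_inference` and `fake_generate` are hypothetical names, and `fake_generate` is a stand-in that should be replaced with a function that actually calls the model.

```python
import time


def time_inference(generate_fn, text, runs=3):
    """Return the mean wall-clock seconds per call over `runs` calls."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        generate_fn(text)
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


def fake_generate(text):
    # Stand-in for a real model call; sleeps briefly to simulate work
    time.sleep(0.01)
    return text.upper()


mean_seconds = time_inference(fake_generate, "example bullet")
print(f"{mean_seconds:.3f} s per inference")
```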

# Evaluation

## Testing Data, Factors & Metrics

### Testing Data

20% of the justinthelaw/opera-bullet-completions dataset was used to validate the model's performance.
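The card does not say how the 20% was selected, so the following is only a sketch of one common approach: a seeded random 80/20 split. The function name, the seed, and the record layout (`input`/`output` keys) are assumptions, not details taken from the Opera repository.

```python
import random


def split_train_validation(records, validation_fraction=0.2, seed=42):
    """Shuffle a copy of `records` and hold out a fraction for validation."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_validation = int(len(shuffled) * validation_fraction)
    return shuffled[n_validation:], shuffled[:n_validation]


# Placeholder records standing in for bullet/sentence pairs
pairs = [{"input": f"bullet {i}", "output": f"sentence {i}"} for i in range(10)]
train, validation = split_train_validation(pairs)
print(len(train), len(validation))
```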

### Factors

Repetition, contextual loss, and bullet format are all loss factors tied into the backpropagation calculations and validation steps.

### Metrics

ROUGE scores were computed and averaged. These may be provided in future iterations of this model's development.
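For readers unfamiliar with the metric, here is a minimal sketch of ROUGE-1 (unigram overlap) F1, averaged over reference/candidate pairs. It is a simplified illustration using whitespace tokenization, not the evaluation code used for this model; which ROUGE variants were actually computed is not stated in this card.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate string."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def average_rouge1(pairs) -> float:
    """Mean ROUGE-1 F1 over (reference, candidate) pairs."""
    scores = [rouge1_f1(ref, cand) for ref, cand in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

A production evaluation would more likely rely on an established ROUGE implementation with stemming and the ROUGE-2 and ROUGE-L variants.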

## Results

# Model Examination

More information needed

# Environmental Impact

- **Hardware Type:** 2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4, AMD Radeon Pro 5300M 4 GB
- **Hours used:** 18
- **Cloud Provider:** N/A
- **Compute Region:** N/A
- **Carbon Emitted:** N/A

# Technical Specifications

### Hardware

2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4, AMD Radeon Pro 5300M 4 GB

### Software

VSCode, Jupyter Notebook, Python3, PyTorch, Transformers, Pandas, Asyncio, Loguru, Rich

# Citation

**BibTeX:**

@article{lamini-lm,
author = {Minghao Wu and
Abdul Waheed and
Chiyu Zhang and
Muhammad Abdul-Mageed and
Alham Fikri Aji
},
title = {LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions},
journal = {CoRR},
volume = {abs/2304.14402},
year = {2023},
url = {https://arxiv.org/abs/2304.14402},
eprinttype = {arXiv},
eprint = {2304.14402}
}

# Model Card Authors

Justin Law, Alden Davidson, Christopher Kodama, My Tran

# Model Card Contact

Email: [email protected]

# How to Get Started with the Model

Use the code below to get started with the model.

<details>
<summary> Click to expand </summary>

```python
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

bullet_data_creation_prefix = (
    "Using upwards of 3 sentences, expand upon the following Air and Space Force bullet statement by "
    + "spelling-out acronyms and adding additional context that is not already included in the Air and Space Force bullet statement: "
)

# Path of the pre-trained model that will be used
model_path = "justinthelaw/opera-bullet-interpreter"
# Path of the pre-trained model tokenizer that will be used
# Must match the model checkpoint's signature
tokenizer_path = "justinthelaw/opera-bullet-interpreter"
# Max number of tokens a user may enter as input
# Increasing this beyond 512 may increase compute time significantly
max_input_token_length = 512
# Max number of tokens the model should output for the expanded sentence(s)
max_output_token_length = 512
# Beams to use for beam search algorithm
# Increased beams means increased quality, but increased compute time
number_of_beams = 6
# Scales logits before soft-max to control randomness
# Lower values (~0) make output more deterministic
temperature = 0.5
# Limits generated tokens to top K probabilities
# Reduces chances of rare word predictions
top_k = 50
# Applies nucleus sampling, limiting token selection to a cumulative probability
# Creates a balance between randomness and determinism
top_p = 0.90

try:
    tokenizer = T5Tokenizer.from_pretrained(
        tokenizer_path,
        model_max_length=max_input_token_length,
        add_special_tokens=False,
    )
    input_model = T5ForConditionalGeneration.from_pretrained(f"{model_path}")
    print(f"Loaded {model_path}")
    # Set device to be used based on GPU availability
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
    # Model is sent to device for use
    model = input_model.to(device)  # type: ignore

    input_text = bullet_data_creation_prefix + input("Input a US Air or Space Force bullet: ")

    encoded_input_text = tokenizer.encode_plus(
        input_text,
        return_tensors="pt",
        truncation=True,
        max_length=max_input_token_length,
    )

    # Generate the expanded text
    # Note: temperature, top_k, and top_p only take effect if do_sample=True is also passed
    summary_ids = model.generate(
        encoded_input_text["input_ids"],
        attention_mask=encoded_input_text["attention_mask"],
        max_length=max_output_token_length,
        num_beams=number_of_beams,
        temperature=temperature,
        top_k=top_k,
        top_p=top_p,
        early_stopping=True,
    )

    output_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

    # Print the full prompt followed by the model's long-form output
    print(input_text + "\n\t" + output_text)

except KeyboardInterrupt:
    print("Received interrupt, stopping script...")
except Exception as e:
    print(f"An error occurred during generation: {e}")
```

</details>