	---
	language:
	- en
	license: apache-2.0
	tags:
	- united states air force
	- united states space force
	- department of defense
	- air force
	- space force
	- bullets
	- performance reports
	- OPR
	- EPR
	- narratives
	- interpreter
	- translation
	- t5
	- MBZUAI
	- LaMini-Flan-T5-783M
	- flan-t5
	- google
	- opera
	- justinthelaw
	---

	# Model Card for Opera Bullet Interpreter

	An unofficial United States Air Force and Space Force performance statement "translation" model. It takes a properly formatted performance statement, also known as a "bullet," as input and outputs a long-form, plain English sentence describing the accomplishments captured in the bullet.

	This checkpoint is a fine-tuned version of LaMini-Flan-T5-783M, trained on the justinthelaw/opera-bullet-completions (private) dataset.

	To learn more about this project, please visit the [Opera GitHub Repository](https://github.com/justinthelaw/opera).

	# Table of Contents

	- [Model Card for Opera Bullet Interpreter](#model-card-for-opera-bullet-interpreter)
	- [Table of Contents](#table-of-contents)
	- [Model Details](#model-details)
	- [Uses](#uses)
	- [Bias, Risks, and Limitations](#bias-risks-and-limitations)
	- [Training Details](#training-details)
	- [Evaluation](#evaluation)
	- [Model Examination](#model-examination)
	- [Environmental Impact](#environmental-impact)
	- [Technical Specifications](#technical-specifications)
	- [Citation](#citation)
	- [Model Card Authors](#model-card-authors)
	- [Model Card Contact](#model-card-contact)
	- [How to Get Started with the Model](#how-to-get-started-with-the-model)

	# Model Details

	## Model Description

	An unofficial United States Air Force and Space Force performance statement "translation" model. It takes a properly formatted performance statement, also known as a "bullet," as input and outputs a long-form, plain English sentence describing the accomplishments captured in the bullet.

	This is a fine-tuned version of LaMini-Flan-T5-783M, trained on the justinthelaw/opera-bullet-completions (private) dataset.

	- Developed by: Justin Law, Alden Davidson, Christopher Kodama, My Tran
	- Model type: Language Model
	- Language(s) (NLP): en
	- License: apache-2.0
	- Parent Model: [LaMini-Flan-T5-783M](https://huggingface.co/MBZUAI/LaMini-Flan-T5-783M)
	- Resources for more information:
	  - [GitHub Repo](https://github.com/justinthelaw/opera)
	  - [Associated Paper](https://huggingface.co/MBZUAI/LaMini-Flan-T5-783M)

	# Uses

	## Direct Use

	Used to programmatically produce training data for Opera's Bullet Forge (see GitHub repository for details).

	## Downstream Use [Optional]

	Used to quickly interpret bullets written by Airmen (Air Force) or Guardians (Space Force) into long-form, plain English sentences.

	## Out-of-Scope Use

	Generating bullets from long-form, plain English sentences. General NLP functionality.

	# Bias, Risks, and Limitations

	Acronyms or abbreviations specific to small units may not be expanded correctly. Bullets in highly non-standard formats may produce lower-quality results.

	## Recommendations

	Look up acronyms to ensure the correct narrative is being formed. Spot-check bullets that contain more complex acronyms and abbreviations to confirm the narrative is accurate.
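The spot check recommended above can be partly automated. The following is a minimal sketch, not part of the Opera codebase: it flags acronym-like tokens in a bullet that are missing from a glossary so a reviewer knows which ones to look up. The glossary contents, the regular expression, and the function name `find_unknown_acronyms` are all assumptions made for illustration.

```python
import re

# Hypothetical glossary; a real one would come from the unit's own references.
KNOWN_ACRONYMS = {
    "USAF": "United States Air Force",
    "USSF": "United States Space Force",
}

# Assumed heuristic: two or more capital letters, optionally followed by digits.
ACRONYM_PATTERN = re.compile(r"\b[A-Z]{2,}\d*\b")


def find_unknown_acronyms(bullet, glossary=KNOWN_ACRONYMS):
    """Return acronym-like tokens not in the glossary, in order of first appearance."""
    unknown = []
    for token in ACRONYM_PATTERN.findall(bullet):
        if token not in glossary and token not in unknown:
            unknown.append(token)
    return unknown


# Made-up bullet used only to exercise the function.
example_bullet = "- Led 12-mbr USAF team; fixed 3 XYZ faults--saved $2M"
print(find_unknown_acronyms(example_bullet))
```

Any token the function returns is a candidate for manual lookup before trusting the generated narrative. Lowercase abbreviations such as "mbr" are not caught by this heuristic.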

	# Training Details

	## Training Data

	The model was fine-tuned on the justinthelaw/opera-bullet-completions dataset, part of which is available in the GitHub repository.

	## Training Procedure

	### Preprocessing

	The justinthelaw/opera-bullet-completions dataset was created using a custom Python web-scraper, along with some custom cleaning functions, all of which can be found at the GitHub repository.

	### Speeds, Sizes, Times

	It takes approximately 3-5 seconds per inference when using any standard-sized Air and Space Force bullet statement.

	# Evaluation

	## Testing Data, Factors & Metrics

	### Testing Data

	20% of the justinthelaw/opera-bullet-completions dataset was used to validate the model's performance.
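For readers who want to reproduce a comparable hold-out, here is a minimal sketch of an 80/20 split in plain Python. It is an illustration under assumptions, not the project's actual splitting code: the function name `split_dataset`, the shuffling step, and the fixed seed are all hypothetical choices.

```python
import random


def split_dataset(records, validation_fraction=0.2, seed=42):
    """Shuffle a copy of the records and return (train, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(records)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)
    validation_size = round(len(shuffled) * validation_fraction)
    return shuffled[validation_size:], shuffled[:validation_size]


# Placeholder records standing in for (bullet, completion) pairs.
train, validation = split_dataset(list(range(10)))
print(len(train), len(validation))
```

Fixing the seed means the same records land in the validation set on every run, which keeps validation scores comparable between training iterations.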

	### Factors

	Repetition, contextual loss, and bullet format are all loss factors tied into the backpropagation calculations and validation steps.

	### Metrics

	ROUGE scores were computed and averaged, but are not reported here. They may be provided in a future iteration of this model's development.
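To make "computed and averaged" concrete, the sketch below computes a simplified ROUGE-1 F1 score (unigram overlap) for each reference/candidate pair and averages the results. It is a stand-in for illustration only: it uses lowercased whitespace tokenization with no stemming, so its numbers will not match an official ROUGE implementation, and it is not the evaluation code used for this model. The function names are hypothetical.

```python
from collections import Counter


def rouge1_f1(reference, candidate):
    """Simplified ROUGE-1 F1: unigram overlap on lowercased whitespace tokens."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def average_rouge1(pairs):
    """Average the simplified ROUGE-1 F1 over (reference, candidate) pairs."""
    scores = [rouge1_f1(reference, candidate) for reference, candidate in pairs]
    return sum(scores) / len(scores) if scores else 0.0


# Made-up sentences used only to exercise the functions.
print(average_rouge1([("the cat sat", "the cat sat"), ("a b", "c d")]))
```

Averaging per-pair F1 scores gives one number per validation run, which is convenient for comparing checkpoints.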

	## Results

	# Model Examination

	More information needed

	# Environmental Impact

	- Hardware Type: 2019 MacBook Pro, 16 inch
	- Hours used: 18
	- Cloud Provider: N/A
	- Compute Region: N/A
	- Carbon Emitted: N/A

	# Technical Specifications

	## Hardware

	2.6 GHz 6-Core Intel Core i7, 16 GB 2667 MHz DDR4, AMD Radeon Pro 5300M 4 GB

	## Software

	VSCode, Jupyter Notebook, Python3, PyTorch, Transformers, Pandas, Asyncio, Loguru, Rich

	# Citation

	BibTeX:

	```
	@article{lamini-lm,
	author = {Minghao Wu and
	Abdul Waheed and
	Chiyu Zhang and
	Muhammad Abdul-Mageed and
	Alham Fikri Aji
	},
	title = {LaMini-LM: A Diverse Herd of Distilled Models from Large-Scale Instructions},
	journal = {CoRR},
	volume = {abs/2304.14402},
	year = {2023},
	url = {https://arxiv.org/abs/2304.14402},
	eprinttype = {arXiv},
	eprint = {2304.14402}
	}
	```

	# Model Card Authors

	Justin Law, Alden Davidson, Christopher Kodama, My Tran

	# Model Card Contact

	Email: [email protected]

	# How to Get Started with the Model

	Use the code below to get started with the model.

	<details>
	<summary> Click to expand </summary>

	```python
	import torch
	from transformers import T5ForConditionalGeneration, T5Tokenizer

	# Instruction prefix prepended to every bullet before it is sent to the model
	bullet_data_creation_prefix = (
	    "Using upwards of 3 sentences, expand upon the following Air and Space Force bullet statement by "
	    + "spelling-out acronyms and adding additional context that is not already included in the Air and Space Force bullet statement: "
	)

	# Path of the pre-trained model that will be used
	model_path = "justinthelaw/opera-bullet-interpreter"
	# Path of the pre-trained model tokenizer that will be used
	# Must match the model checkpoint's signature
	tokenizer_path = "justinthelaw/opera-bullet-interpreter"
	# Max number of input tokens (prefix plus bullet); longer inputs are truncated
	# Increasing this beyond 512 may increase compute time significantly
	max_input_token_length = 512
	# Max number of tokens the model may generate for the interpretation
	max_output_token_length = 512
	# Beams to use for beam search algorithm
	# Increased beams means increased quality, but increased compute time
	number_of_beams = 6
	# Scales logits before soft-max to control randomness
	# Lower values (~0) make output more deterministic
	temperature = 0.5
	# Limits generated tokens to top K probabilities
	# Reduces chances of rare word predictions
	top_k = 50
	# Applies nucleus sampling, limiting token selection to a cumulative probability
	# Creates a balance between randomness and determinism
	top_p = 0.90

	try:
	    print(f"Loading {model_path}...")
	    tokenizer = T5Tokenizer.from_pretrained(
	        tokenizer_path,
	        model_max_length=max_input_token_length,
	        add_special_tokens=False,
	    )
	    input_model = T5ForConditionalGeneration.from_pretrained(model_path)
	    # Set device to be used based on GPU availability
	    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
	    # Model is sent to device for use
	    model = input_model.to(device)  # type: ignore

	    bullet = input("Input a US Air or Space Force bullet: ")
	    input_text = bullet_data_creation_prefix + bullet

	    encoded_input_text = tokenizer(
	        input_text,
	        return_tensors="pt",
	        truncation=True,
	        max_length=max_input_token_length,
	    )

	    # Generate the long-form interpretation
	    # Input tensors must be on the same device as the model
	    summary_ids = model.generate(
	        encoded_input_text["input_ids"].to(device),
	        attention_mask=encoded_input_text["attention_mask"].to(device),
	        max_length=max_output_token_length,
	        num_beams=number_of_beams,
	        # Sampling must be enabled for temperature, top_k, and top_p to take effect
	        do_sample=True,
	        temperature=temperature,
	        top_k=top_k,
	        top_p=top_p,
	        early_stopping=True,
	    )

	    output_text = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

	    # Print the original bullet followed by its interpretation
	    print(bullet + "\n\t" + output_text)

	except KeyboardInterrupt:
	    print("Received interrupt, stopping script...")
	except Exception as e:
	    print(f"An error occurred during generation: {e}")
	```

	</details>