---
license: apache-2.0
datasets:
- imdb
language:
- en
metrics:
- f1
- accuracy
- recall
- precision
library_name: peft
pipeline_tag: text-classification
---

# A Finetuned Bloom 1b1 Model for Sequence Classification

The model was developed as a personal learning experience: to fine-tune a pretrained language model for text classification and to use it for sentiment analysis on real-life data from the internet. This model card was generated from [this raw template](https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/templates/modelcard_template.md?plain=1).

## Model Details

The model achieved the following scores on the evaluation set during fine-tuning:

![Screenshot 2024-01-03 at 16.08.46.png](https://cdn-uploads.huggingface.co/production/uploads/64857e2b745fb671250a5beb/26EB2jJDKI0gsnvjHA9WP.png)

Here is the train/eval/test split:

```
DatasetDict({
    train: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 36000
    })
    test: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 5000
    })
    eval: Dataset({
        features: ['review', 'sentiment'],
        num_rows: 9000
    })
})
```

### Model Description

- **Developed by:** Snoop088
- **Model type:** Text Classification / Sequence Classification
- **Language(s) (NLP):** English
- **License:** Apache 2.0
- **Finetuned from model:** bigscience/bloom-1b1

### Model Sources [optional]

- **Repository:** https://huggingface.co/snoop088/imdb_tuned-bloom1b1-sentiment-classifier/tree/main
- **Paper [optional]:** [More Information Needed]
- **Demo [optional]:** [More Information Needed]

## Uses

The model is intended to be used for Text Classification.

### Direct Use

Example script to use the model.
Please note that this is a PEFT adapter on top of the Bloom 1b1 model:

```
import pandas as pd
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
model_name = 'snoop088/imdb_tuned-bloom1b1-sentiment-classifier'

# The repository holds a PEFT adapter, so `peft` must be installed
loaded_model = AutoModelForSequenceClassification.from_pretrained(model_name, trust_remote_code=True, num_labels=2, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token

# Any CSV with a "review" text column works here
my_set = pd.read_csv("./data/df_manual.csv")
inputs = tokenizer(list(my_set["review"]), truncation=True, padding="max_length", max_length=256, return_tensors="pt").to(DEVICE)
with torch.no_grad():  # inference only, no gradients needed
    outputs = loaded_model(**inputs)
# Index of the highest logit per review: 1 = positive, 0 = negative
outcome = outputs.logits.argmax(dim=-1).cpu().numpy()
```

### Downstream Use [optional]

The model is intended for sentiment analysis on datasets similar to IMDB. In my opinion, it should also work well on product reviews.

### Out-of-Scope Use

[More Information Needed]

## Bias, Risks, and Limitations

[More Information Needed]

### Recommendations

Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information is needed for further recommendations.

## How to Get Started with the Model

Use the code below to get started with the model.
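Once the logits are available (for example via `outputs.logits.tolist()` in the Direct Use script), they still need to be turned into readable labels. The following is a minimal, dependency-free sketch of that step. It assumes the label convention from the Preprocessing section of this card (`positive` maps to 1, anything else to 0); the `ID2LABEL` mapping, the helper names `softmax` and `decode_logits`, and the example logits are illustrative and not part of the released model.

```python
import math

# Assumed mapping, mirroring the preprocessing in this card:
# 'positive' -> 1, any other sentiment -> 0.
ID2LABEL = {0: "negative", 1: "positive"}


def softmax(row):
    """Convert one row of raw logits into probabilities (numerically stable)."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def decode_logits(logits):
    """Map each row of logits to a (label, confidence) pair."""
    results = []
    for row in logits:
        probs = softmax(row)
        best = max(range(len(probs)), key=probs.__getitem__)
        results.append((ID2LABEL[best], probs[best]))
    return results


# Hypothetical logits for two reviews, e.g. from outputs.logits.tolist()
example_logits = [[-1.2, 2.3], [0.9, -0.4]]
for label, confidence in decode_logits(example_logits):
    print(f"{label}: {confidence:.3f}")
```

The confidence is simply the softmax probability of the winning class, so it is only a rough indication of how far apart the two logits are, not a calibrated score.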
[More Information Needed]

## Training Details

### Training Data

Training is done on the IMDB dataset available on the Hub: [imdb](https://huggingface.co/datasets/imdb)

### Training Procedure

```
from peft import LoraConfig, get_peft_model
from transformers import TrainingArguments

training_arguments = TrainingArguments(
    output_dir="your_tuned_model_name",
    save_strategy="epoch",
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,
    optim="adamw_torch",
    evaluation_strategy="steps",
    logging_steps=5,
    learning_rate=1e-5,
    max_grad_norm=0.3,
    eval_steps=0.2,  # a float is read as a fraction of total training steps
    num_train_epochs=2,
    warmup_ratio=0.1,
    # group_by_length=True,
    fp16=False,
    weight_decay=0.001,
    lr_scheduler_type="constant",
)

# `model` is the base model (bigscience/bloom-1b1) loaded for sequence classification
peft_model = get_peft_model(model, LoraConfig(
    task_type="SEQ_CLS",
    r=16,
    lora_alpha=16,
    target_modules=[
        'query_key_value',
        'dense'
    ],
    bias="none",
    lora_dropout=0.05,  # Conventional
))
```

With this LoRA configuration, the parameter counts are: trainable params: 3,542,016 || all params: 1,068,859,392 || trainable%: 0.3313827830405592

#### Preprocessing [optional]

Simple preprocessing with DataCollator:

```
from transformers import DataCollatorWithPadding

def process_data(example):
    # No padding here: DataCollatorWithPadding pads each batch dynamically
    item = tokenizer(example["review"], truncation=True, max_length=320)
    item["labels"] = [1 if sent == 'positive' else 0 for sent in example["sentiment"]]
    return item

# `dataset` is assumed to be the DatasetDict shown above; batched=True because process_data handles lists
tokenised_data = dataset.map(process_data, batched=True)
tokenised_data = tokenised_data.remove_columns(["review", "sentiment"])
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
```

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times [optional]

[More Information Needed]

## Evaluation

Evaluation function:

```
import evaluate
import numpy as np

def compute_metrics(eval_pred):
    # All metrics are already predefined in the HF `evaluate` package
    precision_metric = evaluate.load("precision")
    recall_metric = evaluate.load("recall")
    f1_metric = evaluate.load("f1")
    accuracy_metric = evaluate.load("accuracy")

    logits, labels = eval_pred  # eval_pred is the tuple of predictions and labels returned by the model
    predictions = np.argmax(logits, axis=-1)
    precision = precision_metric.compute(predictions=predictions, references=labels)["precision"]
    recall = recall_metric.compute(predictions=predictions, references=labels)["recall"]
    f1 = f1_metric.compute(predictions=predictions, references=labels)["f1"]
    accuracy = accuracy_metric.compute(predictions=predictions, references=labels)["accuracy"]
    # The trainer expects a dictionary where the keys are the metric names and the values are the scores.
    return {"precision": precision, "recall": recall, "f1-score": f1, 'accuracy': accuracy}
```

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

## Model Examination [optional]

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

## Technical Specifications [optional]

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

- Model: 6.183.1 "13th Gen Intel(R) Core(TM) i9-13900K"
- GPU: Nvidia RTX 4090 / 24 GB
- Memory: 64 GB

#### Software

- python 3.11.6
- transformers 4.36.2
- torch 2.1.2
- peft 0.7.1
- numpy 1.26.2
- datasets 2.16.0

## Citation [optional]

**BibTeX:**

[More Information Needed]

**APA:**

[More Information Needed]

## Glossary [optional]

[More Information Needed]

## More Information [optional]

[More Information Needed]

## Model Card Authors [optional]

[More Information Needed]

## Model Card Contact

[More Information Needed]