shahidul034/KUET_LLM_Mistral

@@ -3,73 +3,64 @@ library_name: transformers
 tags: []
 ---
-# Model Card for Model ID
-<!-- Provide a quick summary of what the model is/does. -->
 ## Model Details
 ### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
-<!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
-## Uses
-<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
-<!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
 ## How to Get Started with the Model
-Use the code below to get started with the model.
 [More Information Needed]
@@ -77,13 +68,132 @@ Use the code below to get started with the model.
 ### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
 ### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
 #### Preprocessing [optional]
@@ -92,7 +202,17 @@ Use the code below to get started with the model.
 #### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
 #### Speeds, Sizes, Times [optional]
@@ -108,7 +228,7 @@ Use the code below to get started with the model.
 #### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
 [More Information Needed]
@@ -128,15 +248,6 @@ Use the code below to get started with the model.
 [More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
 ## Environmental Impact
@@ -144,58 +255,17 @@ Use the code below to get started with the model.
 Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
 #### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
-<!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-**BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 tags: []
 ---
 ## Model Details
 ### Model Description
+This model was created to answer questions about KUET (Khulna University of Engineering & Technology).
+- **Developed by:** Md. Shahidul Salim
+- **Model type:** Question answering
+- **Language(s) (NLP):** English
+- **Finetuned from model:** mistralai/Mistral-7B-Instruct-v0.1
 ## How to Get Started with the Model
+```
+import torch
+import transformers
+from transformers import AutoTokenizer, pipeline
+model_name="shahidul034/KUET_LLM_Mistral"
+tokenizer = AutoTokenizer.from_pretrained(model_name)
+# set dtype and device placement when the weights are loaded
+model = transformers.AutoModelForCausalLM.from_pretrained(
+    model_name,
+    torch_dtype=torch.bfloat16,
+    device_map="auto",
+)
+pipe = pipeline("text-generation",
+                model=model,  # the model loaded above; `full_output` exists only in the training script
+                tokenizer= tokenizer,
+                max_new_tokens = 512,
+                do_sample=True,
+                top_k=30,
+                num_return_sequences=1,
+                eos_token_id=tokenizer.eos_token_id
+                )
+from langchain import HuggingFacePipeline
+llm = HuggingFacePipeline(pipeline = pipe, model_kwargs = {'temperature':0})
+from langchain.llms import HuggingFaceTextGenInference
+from langchain import PromptTemplate
+from langchain.schema import StrOutputParser
+template = """
+    <s>[INST] <<SYS>>
+    {role}
+    <</SYS>>
+    {text} [/INST]
+"""
+prompt = PromptTemplate(
+    input_variables = [
+        "role",
+        "text"
+    ],
+    template = template,
+)
+role = "You are a KUET authority managed chatbot, help users by answering their queries about KUET."
+chain = prompt | llm | StrOutputParser()
+ques="What is KUET?"
+ans=chain.invoke({"role": role,"text":ques})
+print(ans)
+```
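As a quick check of what the chain sends to the model, the template above can be rendered with plain `str.format`, without loading the model or LangChain. This is only a minimal sketch: `render_prompt` is a hypothetical helper and not part of the model's own code.

```python
# Minimal sketch: render the prompt template from the snippet above.
# `render_prompt` is a hypothetical helper used only for illustration.
TEMPLATE = """
    <s>[INST] <<SYS>>
    {role}
    <</SYS>>
    {text} [/INST]
"""


def render_prompt(role: str, text: str) -> str:
    """Fill the system role and the user question into the template."""
    return TEMPLATE.format(role=role, text=text)


role = "You are a KUET authority managed chatbot, help users by answering their queries about KUET."
prompt_text = render_prompt(role, "What is KUET?")
print(prompt_text)
```

Printing the rendered string makes it easy to confirm that the role text sits between the `<<SYS>>` markers and that the question comes directly before `[/INST]`. The training script formats its examples with `tokenizer.apply_chat_template`, so it may be worth comparing the two formats.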
 [More Information Needed]
 ### Training Data
+Custom dataset collected from the KUET website.
 ### Training Procedure
+```
+import os
+import torch
+from datasets import load_dataset, Dataset
+import pandas as pd
+import transformers
+from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
+from trl import SFTTrainer
+# from peft import AutoPeftModelForCausalLM
+from transformers import GenerationConfig
+from pynvml import *
+import glob
+base_model = "mistralai/Mistral-7B-Instruct-v0.2"
+lora_output = 'models/lora_KUET_LLM_Mistral'
+full_output = 'models/full_KUET_LLM_Mistral'
+DEVICE = 'cuda'
+bnb_config = BitsAndBytesConfig(
+    load_in_8bit= True,
+#     bnb_4bit_quant_type= "nf4",
+#     bnb_4bit_compute_dtype= torch.bfloat16,
+#     bnb_4bit_use_double_quant= False,
+)
+model = AutoModelForCausalLM.from_pretrained(
+        base_model,
+        # load_in_4bit=True,
+        quantization_config=bnb_config,
+        torch_dtype=torch.bfloat16,
+        device_map="auto",
+        trust_remote_code=True,
+)
+model.config.use_cache = False # silence the warnings
+model.config.pretraining_tp = 1
+model.gradient_checkpointing_enable()
+tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
+tokenizer.padding_side = 'right'
+tokenizer.pad_token = tokenizer.eos_token
+tokenizer.add_eos_token = True
+print(tokenizer.add_bos_token, tokenizer.add_eos_token)  # confirm the BOS/EOS settings
+### read Excel file with Prompt, Reply pairs
+data_location = r"/home/sdm/Desktop/shakib/KUET LLM/data/dataset_shakibV2.xlsx" ## replace here
+data_df=pd.read_excel( data_location )
+def formatted_text(x):
+    temp = [
+    # {"role": "system", "content": "Answer as a medical assistant. Respond concisely."},
+    {"role": "user", "content": """Answer the question concisely as a medical assisstant.
+     Question: """ + x["Prompt"]},
+    {"role": "assistant", "content": x["Reply"]}
+    ]
+    return tokenizer.apply_chat_template(temp, add_generation_prompt=False, tokenize=False)
+### set formatting
+data_df["text"] = data_df[["Prompt", "Reply"]].apply(lambda x: formatted_text(x), axis=1) ## replace Prompt and Reply if the collected dataset has different column names
+print(data_df.iloc[0])
+dataset = Dataset.from_pandas(data_df)
+# Set PEFT adapter config (16:32)
+from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
+# target modules were originally selected for a zephyr base model, which shares the Mistral architecture
+config = LoraConfig(
+    r=16,
+    lora_alpha=32,
+    target_modules=["q_proj", "v_proj","k_proj","o_proj","gate_proj","up_proj","down_proj"],   # attach LoRA adapters to all linear projection layers
+    lora_dropout=0.05,
+    bias="none",
+    task_type="CAUSAL_LM")
+# stabilize output layer and layernorms
+model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=True)  # the second argument is a boolean flag, not a bit width
+# Set PEFT adapter on model (Last step)
+model = get_peft_model(model, config)
+# Set Hyperparameters
+MAXLEN=512
+BATCH_SIZE=4
+GRAD_ACC=4
+OPTIMIZER='paged_adamw_8bit' # save memory
+LR=5e-06                      # slightly smaller than pretraining lr | and close to LoRA standard
+# Set training config
+training_config = transformers.TrainingArguments(per_device_train_batch_size=BATCH_SIZE,
+                                                 gradient_accumulation_steps=GRAD_ACC,
+                                                 optim=OPTIMIZER,
+                                                 learning_rate=LR,
+                                                 fp16=True,            # consider compatibility when using bf16
+                                                 logging_steps=10,
+                                                 num_train_epochs = 2,
+                                                 output_dir=lora_output,
+                                                 remove_unused_columns=True,
+                                                 )
+# Set collator
+data_collator = transformers.DataCollatorForLanguageModeling(tokenizer,mlm=False)
+# Setup trainer
+trainer = SFTTrainer(model=model,
+                               train_dataset=dataset,
+                               data_collator=data_collator,
+                               args=training_config,
+                               dataset_text_field="text",
+                            #    callbacks=[early_stop], need to learn, lora easily overfits
+                              )
+trainer.train()
+trainer.save_model(lora_output)
+# Get peft config
+from peft import PeftConfig
+config = PeftConfig.from_pretrained(lora_output)
+# Get base model
+model = transformers.AutoModelForCausalLM.from_pretrained(config.base_model_name_or_path)
+# Load the Lora model
+from peft import PeftModel
+model = PeftModel.from_pretrained(model, lora_output)
+# Get tokenizer
+tokenizer = transformers.AutoTokenizer.from_pretrained(config.base_model_name_or_path)
+merged_model = model.merge_and_unload()
+merged_model.save_pretrained(full_output)
+tokenizer.save_pretrained(full_output)
+```
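To get a feel for how many weights the LoRA configuration above trains, the count can be worked out by hand: a rank-r adapter on a linear layer with `in` inputs and `out` outputs adds r × (in + out) parameters. The sketch below assumes the layer shapes of the published Mistral-7B configuration (hidden size 4096, MLP size 14336, key/value projection width 1024, 32 decoder layers). These shapes are not stated in this model card, and `lora_params` is a hypothetical helper.

```python
# Minimal sketch: count the LoRA parameters for r=16 on all linear projections.
# Layer shapes are ASSUMED from the published Mistral-7B configuration.
R = 16
NUM_LAYERS = 32
HIDDEN, KV_DIM, MLP = 4096, 1024, 14336

# (in_features, out_features) for each targeted module in one decoder layer
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, MLP),
    "up_proj": (HIDDEN, MLP),
    "down_proj": (MLP, HIDDEN),
}


def lora_params(in_features: int, out_features: int, r: int) -> int:
    """Parameters in the two low-rank matrices A (r x in) and B (out x r)."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in SHAPES.values())
total = per_layer * NUM_LAYERS
print(f"per layer: {per_layer:,}  total: {total:,}")
```

Under these assumptions the adapters hold about 42 million trainable parameters, well under 1% of the base model's roughly 7 billion weights.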
 #### Preprocessing [optional]
 #### Training Hyperparameters
+The following hyperparameters were used during training:
+- learning_rate: 0.0002
+- train_batch_size: 24
+- eval_batch_size: 8
+- seed: 42
+- gradient_accumulation_steps: 4
+- total_train_batch_size: 96
+- optimizer: Adam with betas=(0.9,0.999) and epsilon=1e-08
+- lr_scheduler_type: linear
+- num_epochs: 2
+- mixed_precision_training: Native AMP
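The total batch size in the list follows from the per-device batch size and the gradient accumulation steps. A minimal sketch of that arithmetic, assuming a single GPU (the card lists one RTX 4090); `effective_batch_size` is a hypothetical helper:

```python
# Minimal sketch: effective batch size per optimizer step.
# Assumes one device unless num_devices is given.
def effective_batch_size(per_device_batch: int, grad_accum_steps: int, num_devices: int = 1) -> int:
    """Number of samples that contribute to each optimizer update."""
    return per_device_batch * grad_accum_steps * num_devices


listed = effective_batch_size(24, 4)   # values from the hyperparameter list
script = effective_batch_size(4, 4)    # BATCH_SIZE and GRAD_ACC in the training script
print(listed, script)
```

The listed values give 24 × 4 = 96, matching `total_train_batch_size`. The same formula applied to the constants in the training script (`BATCH_SIZE=4`, `GRAD_ACC=4`) gives 16, so the list and the script may describe different runs.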
 #### Speeds, Sizes, Times [optional]
 #### Testing Data
+The test set consists of 194 questions generated by students.
 [More Information Needed]
 [More Information Needed]
 ## Environmental Impact
 Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+- **Hours used:** 2 hours
 #### Hardware
+RTX 4090