sagorsarker committed
Commit b445fdf • 1 Parent(s): 1318029
Update README.md
README.md
CHANGED
@@ -54,7 +54,7 @@ print(response)

 ## Hardware and Software

-**Training Factors:** We used the [llama-factory]() training library, a cloud GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on cloud infrastructure.
+**Training Factors:** We used the [llama-factory](https://github.com/hiyouga/LLaMA-Factory) training library, a cloud GPU cluster, and production infrastructure for pretraining. Fine-tuning, annotation, and evaluation were also performed on cloud infrastructure.


 ## Training Data
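The fixed link points at the LLaMA-Factory training library named in **Training Factors**. For orientation only, the sketch below shows how a continual-pretraining (`stage: pt`) run is typically launched with that library; the config file name, dataset name, output path, and hyperparameters are illustrative assumptions, not values taken from this repository or commit.

```python
# Hypothetical sketch only: launching a continual-pretraining run with LLaMA-Factory.
# Dataset name, output path, and hyperparameters are illustrative assumptions,
# not values taken from this repository or commit.
import subprocess
import yaml

config = {
    "model_name_or_path": "google/gemma-2-2b",  # base model this card builds on
    "stage": "pt",                               # plain causal-LM pretraining stage
    "do_train": True,
    "finetuning_type": "full",                   # full-parameter training rather than LoRA
    "dataset": "bangla_corpus",                  # placeholder; must be registered in data/dataset_info.json
    "cutoff_len": 2048,
    "output_dir": "outputs/titulm-gemma-2-2b",
    "per_device_train_batch_size": 4,
    "gradient_accumulation_steps": 8,
    "learning_rate": 2.0e-5,
    "num_train_epochs": 1.0,
    "bf16": True,
}

# LLaMA-Factory reads a YAML config and is launched through its CLI entry point.
with open("pretrain_gemma2.yaml", "w") as f:
    yaml.safe_dump(config, f)

subprocess.run(["llamafactory-cli", "train", "pretrain_gemma2.yaml"], check=True)
```

Only the overall shape of the run matters here; the actual cluster values can be substituted for any of the keys above.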
@@ -96,7 +96,7 @@ We evaluated the models on the following datasets:

 ### Evaluation Results

-#### Evaluation
+#### Evaluation on Bangla Benchmark Datasets
 - **gemma-2-2b** shows stronger performance in **Bangla MMLU** and **BoolQ BN** in the 0-shot setting.
 - **titulm-gemma-2-2b-v1.1** performs better in **Commonsense QA BN**, **OpenBook QA BN**, and **PIQA BN** across both 0-shot and 5-shot settings.
 - In the 5-shot setting, **titulm-gemma-2-2b-v1.1** leads in **BoolQ BN**, **Commonsense QA BN**, and **OpenBook QA BN**.
@@ -109,9 +109,10 @@ We evaluated the models on the following datasets:
 | titulm-gemma-2-2b-v1.1 | 0-shot | 0.30 | 0.61 | **0.31** | **0.35** | **0.62** |
 | | 5-shot | 0.35 | **0.57** | **0.40** | **0.38** | 0.60 |

-#### Evaluation
+#### Evaluation on English Benchmark Datasets
 - **gemma-2-2b** consistently achieves the highest scores across all tasks in both 0-shot and 5-shot settings, leading in **MMLU**, **BoolQ**, **Commonsense QA**, **OpenBook QA**, and **PIQA**, with a maximum 5-shot score of **0.80** in **PIQA**.
 - **titulm-gemma-2-2b-v1.1** performs well but trails behind **gemma-2-2b**, particularly in **Commonsense QA** and **OpenBook QA**, with its best scores slightly lower across all tasks.
+- This is expected, as we trained it only on Bangla text.

 | Model | Shots | MMLU | BoolQ | Commonsense QA | OpenBook QA | PIQA |
 |--------------------------------------|--------|--------------|------------|--------------------|-----------------|-----------|
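The scores in these tables are 0-shot and 5-shot accuracies. The card does not state which evaluation harness produced them, so the snippet below is only a hedged illustration of how such numbers are commonly obtained with EleutherAI's lm-evaluation-harness; the harness choice, model id, batch size, and task list (Bangla tasks in particular would need custom task configs) are assumptions, not details from this commit.

```python
# Hedged illustration of how 0-shot vs. 5-shot accuracies are commonly produced with
# EleutherAI's lm-evaluation-harness. The card does not say which harness was used,
# so the harness choice, model id, batch size, and task list are all assumptions.
import lm_eval

MODEL_ID = "hishab/titulm-gemma-2-2b-v1.1"  # assumed Hub id; adjust if the repo lives elsewhere

for shots in (0, 5):
    results = lm_eval.simple_evaluate(
        model="hf",
        model_args=f"pretrained={MODEL_ID},dtype=bfloat16",
        tasks=["mmlu", "boolq", "openbookqa", "piqa"],  # English benchmarks from the table
        num_fewshot=shots,
        batch_size=8,
    )
    for task, metrics in results["results"].items():
        # Metric keys vary by harness version (e.g. "acc" vs "acc,none"), so print them all.
        print(f"{shots}-shot {task}: {metrics}")
```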