**Overview:** We have collected a large raw Bangla text dataset from a wide variety of sources. Our collected data so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered with several filtering criteria to ensure data quality. The collected data amounts to roughly 268 GB; from it we separated a __33 GB__ subset, sampling each source in proportion to its share of the full dataset. The total number of trained tokens is __4.4B__.

Data sources summary:

- Web documents: Extracted, cleaned, and filtered Common Crawl data
- Books: Extracted, cleaned, and filtered books data
- Transcribed text: Used an in-house Bangla ASR model to transcribe Bangla audio data
- Translation data: We trained an English-Bangla translation LLM and used it to translate English data to Bangla
- Code-mixed data: We trained an English-Bangla code-mixed LLM and used it to generate code-mixed data
- Transliteration data: We trained a Bangla-English transliteration LLM and used it to generate transliterated data
- Synthetic data: We generated synthetic data using a Bangla LLM
- Others: We scraped data from some selected websites, used open-source data, and used some other data sources

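The ratio-based subset selection mentioned in the overview can be sketched as a simple proportional-quota computation. The per-source sizes below are illustrative assumptions, not the actual dataset manifest; only the 33 GB target and the roughly 268 GB total come from the description above.

```python
# Hypothetical sketch: pick a 33 GB subset from the full corpus so that
# each source keeps its original share of the data mix.
TARGET_GB = 33.0

# Illustrative per-source sizes in GB (assumed, not the real breakdown).
source_sizes_gb = {
    "web_documents": 150.0,
    "books": 40.0,
    "transcribed": 20.0,
    "translated": 25.0,
    "code_mixed": 10.0,
    "transliterated": 8.0,
    "synthetic": 10.0,
    "others": 5.0,
}

total_gb = sum(source_sizes_gb.values())  # ~268 GB for the full corpus

# Each source contributes proportionally to its share of the full corpus.
quota_gb = {
    name: TARGET_GB * size / total_gb
    for name, size in source_sizes_gb.items()
}

for name, quota in sorted(quota_gb.items()):
    print(f"{name}: {quota:.2f} GB")

# By construction, the per-source quotas sum back to the 33 GB target.
assert abs(sum(quota_gb.values()) - TARGET_GB) < 1e-9
```

With quotas in hand, each source would then be subsampled (e.g. by shuffling its documents and taking files until the quota is reached), which preserves the original data mix in the smaller training set.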
## Benchmarks