SarwarShafee committed on
Commit
1318029
1 Parent(s): 72bea82

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -62,14 +62,14 @@ print(response)
  **Overview:** We have collected a large Bangla raw dataset of text data from a wide variety of sources. Our collected data so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered by different filtering criteria to ensure the quality of the data. Our collected data size is roughly around 268 GB. We separated __33GB__ data from that using a ratio of the actual data size. Total trained tokens are __4.4B__ tokens.

  Data sources summary:
- - Web documents: Extracted, clean, and filtered common crawl data
- - Books: Extracted, clean, filtered books data
+ - Web documents: Extracted, cleaned, and filtered common crawl data
+ - Books: Extracted, cleaned, filtered books data
  - Transcribed text: Used in-house Bangla ASR model to transcribe Bangla audio data
  - Translation data: We trained an English-Bangla translation LLM model and used it to translate English data to Bangla
  - Code-mixed data: We trained an English-Bangla code-mixed LLM model and used it to generate code-mixed data
  - Transliteration data: We trained a Bangla-English transliteration LLM model and used it to generate transliterated data
  - Synthetic data: We generated synthetic data using a Bangla LLM model
- - Others: We scrapped some selected website data, used open-source data, and used some other data sources
+ - Others: We scrapped data from some selected websites, used open-source data, and used some other data sources


  ## Benchmarks