**Overview:** We have collected a large raw Bangla text dataset from a wide variety of sources. Our collected data so far includes a mix of web documents, books, translated text, transliterated text, transcribed text, code-mixed text, conversations, and open-source raw data. The dataset is cleaned and filtered with several filtering criteria to ensure data quality. The collected data amounts to roughly 268 GB; from it we separated a __33 GB__ subset, sampling each source in proportion to its share of the full dataset. The total number of trained tokens is __4.4B__.

Data sources summary:

- Web documents: Extracted, cleaned, and filtered Common Crawl data
- Books: Extracted, cleaned, and filtered books data
- Transcribed text: Used an in-house Bangla ASR model to transcribe Bangla audio data
- Translation data: We trained an English-Bangla translation LLM and used it to translate English data to Bangla
- Code-mixed data: We trained an English-Bangla code-mixed LLM and used it to generate code-mixed data
- Transliteration data: We trained a Bangla-English transliteration LLM and used it to generate transliterated data
- Synthetic data: We generated synthetic data using a Bangla LLM
- Others: We scraped data from some selected websites, used open-source data, and used some other data sources

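The ratio-based subset selection mentioned in the overview can be sketched as a simple proportional-quota computation. The per-source sizes below are illustrative assumptions, not the actual dataset manifest; only the 33 GB target and the roughly 268 GB total come from the description above.

```python
# Hypothetical sketch: pick a 33 GB subset from the full corpus so that
# each source keeps its original share of the data mix.
TARGET_GB = 33.0

# Illustrative per-source sizes in GB (assumed, not the real breakdown).
source_sizes_gb = {
    "web_documents": 150.0,
    "books": 40.0,
    "transcribed": 20.0,
    "translated": 25.0,
    "code_mixed": 10.0,
    "transliterated": 8.0,
    "synthetic": 10.0,
    "others": 5.0,
}

total_gb = sum(source_sizes_gb.values())  # ~268 GB for the full corpus

# Each source contributes proportionally to its share of the full corpus.
quota_gb = {
    name: TARGET_GB * size / total_gb
    for name, size in source_sizes_gb.items()
}

for name, quota in sorted(quota_gb.items()):
    print(f"{name}: {quota:.2f} GB")

# By construction, the per-source quotas sum back to the 33 GB target.
assert abs(sum(quota_gb.values()) - TARGET_GB) < 1e-9
```

With quotas in hand, each source would then be subsampled (e.g. by shuffling its documents and taking files until the quota is reached), which preserves the original data mix in the smaller training set.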
## Benchmarks