prajdabre committed
Commit 2a69bba
1 Parent(s): 35d9d28

Update README.md

Files changed (1):
1. README.md +2 -0
README.md CHANGED
@@ -4,6 +4,7 @@ IndicBART is a multilingual, sequence-to-sequence pre-trained model focusing on
   <li>Supported languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Odia, Punjabi, Kannada, Malayalam, Tamil, Telugu and English. Not all of these languages are supported by mBART50 and mT5.</li>
   <li>The model is much smaller than the mBART and mT5(-base) models, so it is less computationally expensive for fine-tuning and decoding.</li>
   <li>Trained on large Indic language corpora (452 million sentences and 9 billion tokens), which also include Indian English content.</li>
+  <li>All languages, except English, have been represented in Devanagari script to encourage transfer learning among the related languages.</li>
   </ul>

   You can read more about IndicBART in this <a href="https://arxiv.org/abs/2109.02903">paper</a>.
@@ -92,6 +93,7 @@ print(decoded_output) # मला ओळखलं पाहिजे
   1. This is compatible with the latest version of transformers, but it was developed with version 4.3.2, so consider using 4.3.2 if possible.
   2. While I have only shown how to get logits and loss and how to generate outputs, you can do pretty much everything the MBartForConditionalGeneration class can do, as described at https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartForConditionalGeneration
   3. Note that the tokenizer I have used is based on sentencepiece and not BPE. Therefore, I used the AlbertTokenizer class and not the MBartTokenizer class.
+  4. If you wish to use any language written in a non-Devanagari script (except English), then you should first convert it to Devanagari using the <a href="https://github.com/anoopkunchukuttan/indic_nlp_library">Indic NLP Library</a>. After you get the output, you should convert it back into the original script.

   # Fine-tuning on a downstream task
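As a reference for note 3 above, here is a minimal loading sketch. The checkpoint id ai4bharat/IndicBART and the tokenizer keyword arguments are assumptions based on typical usage, not details this diff confirms:

```python
from transformers import AlbertTokenizer, MBartForConditionalGeneration

# Sentencepiece-based vocabulary, hence AlbertTokenizer rather than MBartTokenizer.
# Checkpoint id and keyword arguments are assumed, not taken from this diff.
tokenizer = AlbertTokenizer.from_pretrained(
    "ai4bharat/IndicBART", do_lower_case=False, keep_accents=True
)
model = MBartForConditionalGeneration.from_pretrained("ai4bharat/IndicBART")
```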
 
 
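Similarly, a sketch of the round-trip script conversion that new note 4 describes, assuming the Indic NLP Library exposes UnicodeIndicTransliterator.transliterate with two-letter language codes (my reading of that library's API, not something this diff specifies):

```python
from indicnlp.transliterate.unicode_transliterate import UnicodeIndicTransliterator

# Tamil input, transliterated to Devanagari before it is fed to IndicBART.
# The transliterate() signature and the "ta"/"hi" codes are assumptions about
# the Indic NLP Library's API, not details confirmed by this diff.
tamil_input = "நான் வீட்டுக்கு போகிறேன்"
devanagari_input = UnicodeIndicTransliterator.transliterate(tamil_input, "ta", "hi")

# ... run IndicBART on devanagari_input to obtain devanagari_output ...
devanagari_output = devanagari_input  # placeholder for the model's decoded output

# Convert the Devanagari output back into the original Tamil script.
tamil_output = UnicodeIndicTransliterator.transliterate(devanagari_output, "hi", "ta")
```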