How to apply this model on PubMed full-text?
Hi @alvaroalon2, I am trying to apply this model to highlight Disease entities in the full text of a PubMed document. However, with all the default parameters, I noticed that only the disease terms in the Abstract section were highlighted. I understand this model was trained on the ncbi_disease dataset, which is 'a collection of 793 PubMed abstracts'. Is that why it can only highlight entities in the Abstract section? Is there any parameter I can set to make the model work on the full text of a PubMed paper?
Thanks!
Hi! No, this is not the reason. The reason is that BERT, the model on which this one is based, only accepts input sequences of up to 512 tokens. So when you apply it to long documents such as full-text PubMed articles, only the first part of the text (up to that limit) is processed, and everything after it is ignored. To address this limitation, I implemented the following library, which can analyze longer documents: https://github.com/librairy/bio-ner
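To make the 512-token limit concrete, here is a minimal sketch of the general idea of splitting a long token sequence into overlapping windows that each fit the model. This is only an illustration and not necessarily how bio-ner does it. The function name `sliding_windows` is hypothetical, and the defaults are assumptions: 510 leaves room for the two special tokens ([CLS] and [SEP]) that BERT adds, and 128 is an arbitrary overlap.

```python
def sliding_windows(tokens, max_len=510, stride=128):
    """Split `tokens` into overlapping windows of at most `max_len` items.

    `stride` is the number of tokens shared by two consecutive windows.
    Returns a list of (start_index, window) pairs.
    The defaults are assumptions for illustration, not values from the model.
    """
    if not 0 <= stride < max_len:
        raise ValueError("stride must be >= 0 and smaller than max_len")
    if not tokens:
        return []

    step = max_len - stride  # how far each new window advances
    windows = []
    start = 0
    while True:
        windows.append((start, tokens[start:start + max_len]))
        if start + max_len >= len(tokens):
            break  # this window already reaches the end of the document
        start += step
    return windows


if __name__ == "__main__":
    # Stand-in for a tokenized full-text article of 1000 tokens.
    fake_tokens = list(range(1000))
    for start, window in sliding_windows(fake_tokens):
        print(start, len(window))
```

Each window can then be passed to the model separately, and the predictions mapped back to the document using the start index. The overlap is there so that an entity cut at the end of one window still appears whole in the next one.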
You can now use the stride parameter of the token classification pipeline to chunk longer text: https://huggingface.co/docs/transformers/v4.32.0/en/main_classes/pipelines#transformers.TokenClassificationPipeline.stride
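A rough sketch of what that could look like, under a few assumptions: `model_id` is a placeholder for this model's repository id, the stride of 128 is arbitrary, and `build_disease_ner` and `highlight` are hypothetical helper names. According to the linked documentation, stride only works with a fast tokenizer and an aggregation strategy other than "none". I set `model_max_length=512` explicitly in case the tokenizer config does not define it.

```python
def build_disease_ner(model_id, stride=128):
    """Build a token-classification pipeline that chunks long inputs.

    `model_id` is a placeholder: pass this model's repository id.
    transformers is imported lazily so the helper below works without it.
    """
    from transformers import AutoTokenizer, pipeline

    # Assumption: force a 512-token limit so the pipeline knows the chunk size.
    tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=512)
    return pipeline(
        "token-classification",
        model=model_id,
        tokenizer=tokenizer,
        aggregation_strategy="simple",  # required: stride does not work with "none"
        stride=stride,                  # number of tokens overlapping between chunks
    )


def highlight(text, entities, open_mark="[[", close_mark="]]"):
    """Wrap each entity span in markers, using the character offsets
    ("start" and "end") returned by the pipeline for each entity."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip spans overlapping one already emitted
        pieces.append(text[cursor:ent["start"]])
        pieces.append(open_mark + text[ent["start"]:ent["end"]] + close_mark)
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Usage would be something like `ner = build_disease_ner(model_id)` followed by `print(highlight(full_text, ner(full_text)))`, where `full_text` is the article body as a single string. I have not checked the results on a full PubMed article, so treat this as a starting point.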