  • The first step in STriPNet is to run Topic Modeling using the BERTopic library.
  • BERTopic internally uses Sentence Transformer models to convert text to embeddings, then clusters the embeddings and extracts keywords from each cluster.
  • Specifically, since STriPNet is intended to be used with scientific papers, we're using the pretrained SPECTER Sentence Transformer model by Allen AI.
  • The Minimum topic size and N-gram range parameters control the clustering and keyword extraction of BERTopic respectively. Hover over the tooltip of each parameter to get more information about them. STriPNet internally chooses some heuristically tuned parameters depending on the data you've uploaded. Feel free to play around with the parameters until you get good topics.
  • You can visualize the quality of the topic modeling in various ways using the options in the Select Topic Modeling Visualization dropdown menu.
  • Finally, please note that BERTopic results change from run to run, so the extracted topics might differ every time you run STriPNet, even on the same data with the same settings. If your topics look weird, a simple page refresh or a +1 increase to Minimum topic size might fix it!
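
For readers who want to try the same steps outside the app, here is a minimal sketch of how the pieces described above might fit together. It is not STriPNet's actual code. The function names `topic_model_kwargs` and `fit_topics`, the default values, and the validation rules are assumptions made for illustration, and the defaults are placeholders rather than STriPNet's heuristically tuned parameters. The sketch assumes the `bertopic` and `sentence-transformers` packages are installed and that SPECTER is loaded under the Sentence Transformers model name `allenai-specter`.

```python
def topic_model_kwargs(min_topic_size=10, n_gram_range=(1, 2)):
    """Validate the two user-facing settings and return them as BERTopic kwargs.

    The checks below are illustrative assumptions, not STriPNet's own rules.
    """
    low, high = n_gram_range
    if min_topic_size < 2:
        # A cluster needs at least two documents to be meaningful.
        raise ValueError("min_topic_size must be at least 2")
    if not 1 <= low <= high:
        raise ValueError("n_gram_range must satisfy 1 <= low <= high")
    return {
        "min_topic_size": int(min_topic_size),   # controls clustering
        "n_gram_range": (int(low), int(high)),   # controls keyword extraction
    }


def fit_topics(docs, min_topic_size=10, n_gram_range=(1, 2)):
    """Embed docs with SPECTER, cluster them, and extract keywords per cluster."""
    # Imported lazily so the helper above works without these packages installed.
    from bertopic import BERTopic
    from sentence_transformers import SentenceTransformer

    # Assumption: SPECTER via its Sentence Transformers model name.
    embedding_model = SentenceTransformer("allenai-specter")
    topic_model = BERTopic(
        embedding_model=embedding_model,
        **topic_model_kwargs(min_topic_size, n_gram_range),
    )
    # fit_transform returns one topic id per document (plus probabilities).
    topics, _ = topic_model.fit_transform(docs)
    return topic_model, topics
```

Calling `fit_topics(abstracts, min_topic_size=11)` on a list of paper abstracts would mirror the "+1 increase" tip above, and `topic_model.get_topic_info()` then lists the extracted topics with their keywords. Because results vary between runs, expect different topics on repeated calls.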