kritsadaK/ThaiSentenceSimilarityApp

kritsadaK committed on Oct 29, 2024

Commit 7934d70 (1 parent: d49af1e)

Initial commit

app.py +22 -38

@@ -2,8 +2,7 @@ import warnings
 import torchvision
 import torch
 import pandas as pd
-from transformers.pipelines import pipeline
-from transformers import AutoTokenizer, AutoModel
 from sklearn.metrics.pairwise import cosine_similarity
 import streamlit as st
@@ -11,56 +10,41 @@ import streamlit as st
 torchvision.disable_beta_transforms_warning()
 warnings.filterwarnings("ignore", category=UserWarning, module="torchvision")
-# Initialize fill-mask pipeline and model/tokenizer for embedding
-pipe = pipeline("fill-mask", model="airesearch/wangchanberta-base-att-spm-uncased", framework="pt", use_fast=False)
-tokenizer = AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased", use_fast=False)
 model = AutoModel.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
 # Function to generate embeddings for full sentences
 def get_embedding(text):
-    inputs = tokenizer(text, return_tensors="pt")
     with torch.no_grad():
         outputs = model(**inputs)
     return outputs.last_hidden_state[:, 0, :].cpu().numpy()
 # Streamlit app setup
 st.title("Thai Full Sentence Similarity App")
-# Explanation Section
-st.write("""
-### How This App Works
-This app uses a mask-filling model to predict possible words or phrases that could fill in the `<mask>` token in a given sentence. It then calculates the similarity of each prediction with the original sentence to determine the most contextually appropriate completion.
-### Example Sentence
-In this example, we have the following sentence in Thai with a `<mask>` token:
-- **Input**: `"นักท่องเที่ยวจำนวนมากเลือกที่จะไปเยือน <mask> เพื่อสัมผัสธรรมชาติ"`
-- **Translation**: "Many tourists choose to visit `<mask>` to experience nature."
-The `<mask>` token represents a location popular for its natural beauty.
-### Potential Predictions
-Here are some possible predictions the model might generate for `<mask>`:
-1. `"นักท่องเที่ยวจำนวนมากเลือกที่จะไปเยือน เชียงใหม่ เพื่อสัมผัสธรรมชาติ"` - Chiang Mai
-2. `"นักท่องเที่ยวจำนวนมากเลือกที่จะไปเยือน เขาใหญ่ เพื่อสัมผัสธรรมชาติ"` - Khao Yai
-3. `"นักท่องเที่ยวจำนวนมากเลือกที่จะไปเยือน เกาะสมุย เพื่อสัมผัสธรรมชาติ"` - Koh Samui
-4. `"นักท่องเที่ยวจำนวนมากเลือกที่จะไปเยือน ภูเก็ต เพื่อสัมผัสธรรมชาติ"` - Phuket
-### Results Table
-For each prediction, the app calculates:
-- **Similarity Score**: Indicates how similar the predicted sentence is to the original input.
-- **Model Score**: Represents the model's confidence in the predicted word for `<mask>`.
-### Most Similar Prediction
-The app will display the most contextually similar prediction based on the similarity score. For example:
-- **Most Similar Prediction**: `"นักท่องเที่ยวจำนวนมากเลือกที่จะไปเยือน เชียงใหม่ เพื่อสัมผัสธรรมชาติ"`
-- **Similarity Score**: 0.89
-- **Model Score**: 0.16
-Feel free to enter your own sentence with `<mask>` and explore the predictions!
 """)
 # User input box
 st.subheader("Input Text")
-input_text = st.text_input("Enter a sentence with `<mask>` to find similar predictions:", "ผู้ใช้งานท่าอากาศยานนานาชาติ <mask> มีกว่าสามล้านคน")
 # Ensure the input includes a `<mask>`
 if "<mask>" not in input_text:

 import torchvision
 import torch
 import pandas as pd
+from transformers import pipeline, AutoTokenizer, AutoModel
 from sklearn.metrics.pairwise import cosine_similarity
 import streamlit as st
 torchvision.disable_beta_transforms_warning()
 warnings.filterwarnings("ignore", category=UserWarning, module="torchvision")
+# Initialize fill-mask pipeline and model/tokenizer for embedding with slow tokenizer
+pipe = pipeline(
+    "fill-mask",
+    model="airesearch/wangchanberta-base-att-spm-uncased",
+    tokenizer=AutoTokenizer.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased", use_fast=False),
+    framework="pt"
+)
 model = AutoModel.from_pretrained("airesearch/wangchanberta-base-att-spm-uncased")
 # Function to generate embeddings for full sentences
 def get_embedding(text):
+    inputs = pipe.tokenizer(text, return_tensors="pt")
     with torch.no_grad():
         outputs = model(**inputs)
     return outputs.last_hidden_state[:, 0, :].cpu().numpy()
 # Streamlit app setup
 st.title("Thai Full Sentence Similarity App")
+# Explanation of example usage
+st.markdown("""
+### Example Sentence with Mask:
+**Input:** `"นักท่องเที่ยวจำนวนมากเลือกที่จะไปเยือน <mask> เพื่อสัมผัสธรรมชาติ"`
+In this example, the model will replace `<mask>` with possible locations in Thailand, such as:
+- "เชียงใหม่" for "Chiang Mai"
+- "เขาใหญ่" for "Khao Yai"
+- "ภูเก็ต" for "Phuket"
+The app will compute the similarity between the full sentences generated and the baseline sentence without `<mask>`.
 """)
 # User input box
 st.subheader("Input Text")
+input_text = st.text_input("Enter a sentence with `<mask>` to find similar predictions:", "นักท่องเที่ยวจำนวนมากเลือกที่จะไปเยือน <mask> เพื่อสัมผัสธรรมชาติ")
 # Ensure the input includes a `<mask>`
 if "<mask>" not in input_text:
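
Note on the similarity step: the part of app.py that ranks the filled-in sentences is cut off in this diff, so the following is only a minimal sketch of the idea described in the `st.markdown` text, not the app's actual code. It assumes each sentence has already been reduced to one vector (as `get_embedding` does with the first-token hidden state) and uses small made-up vectors in place of real model output. The helper names `cosine_sim` and `rank_candidates` are hypothetical, and the pure-Python cosine stands in for `sklearn.metrics.pairwise.cosine_similarity`, which the app imports.

```python
import math


def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(baseline_vec, candidates):
    """Score each (sentence, vector) pair against the baseline, best first."""
    scored = [(sentence, cosine_sim(baseline_vec, vec)) for sentence, vec in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy stand-ins for sentence embeddings (assumption: real ones come from the model).
baseline = [1.0, 0.0, 1.0]
candidates = [
    ("candidate A", [1.0, 0.0, 1.0]),  # same direction as the baseline
    ("candidate B", [0.0, 1.0, 0.0]),  # orthogonal to the baseline
    ("candidate C", [1.0, 1.0, 1.0]),  # partially aligned
]

ranked = rank_candidates(baseline, candidates)
for sentence, score in ranked:
    print(f"{sentence}: {score:.3f}")
```

In the real app the vectors would be the arrays returned by `get_embedding`, one per sentence produced by the fill-mask pipeline, and the top-ranked entry would be reported as the most similar prediction.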