from adapters import AutoAdapterModel  # Requires the Adapter-Hub "adapters" package
from transformers import AutoTokenizer
import gradio as gr
import onnxruntime as ort
import numpy as np
import string
from huggingface_hub import InferenceClient
import os

# Load base model and adapter (loaded locally; generation itself is
# delegated to the Inference API client below)
BASE_MODEL = "Qwen/Qwen2.5-1.5B-Instruct"  # Replace with the actual base model ID
ADAPTER_NAME = "ystemsrx/Qwen2.5-Sex"  # Replace with the correct adapter name
model = AutoAdapterModel.from_pretrained(BASE_MODEL)
model.load_adapter(ADAPTER_NAME, set_active=True)
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

# Streaming chat-completion client. NOTE: qwen_client was referenced but never
# defined in the original; assuming the adapter repo is served via the
# Hugging Face Inference API.
qwen_client = InferenceClient(ADAPTER_NAME)

# ONNX end-of-utterance (EOU) model setup
ONNX_FILENAME = "model_quantized.onnx"
onnx_session = ort.InferenceSession(ONNX_FILENAME, providers=["CPUExecutionProvider"])

PUNCS = string.punctuation.replace("'", "")  # All punctuation except apostrophes
MAX_HISTORY = 4  # Number of recent messages fed to the EOU model
MAX_HISTORY_TOKENS = 512
EOU_THRESHOLD = 0.5  # Below this probability, wait for more user input


# Numerically stable softmax
def softmax(logits):
    exp_logits = np.exp(logits - np.max(logits))
    return exp_logits / np.sum(exp_logits)


# Lowercase, strip punctuation, and collapse whitespace
def normalize_text(text):
    def strip_puncs(text):
        return text.translate(str.maketrans("", "", PUNCS))

    return " ".join(strip_puncs(text).lower().split())


# Render the chat context with the model's chat template, truncated just
# before the final <|im_end|> so the EOU model can predict whether the
# turn is actually finished.
def format_chat_ctx(chat_ctx):
    new_chat_ctx = []
    for msg in chat_ctx:
        if msg["role"] in ("user", "assistant"):
            content = normalize_text(msg["content"])
            if content:
                # Append a copy rather than mutating the caller's message dicts
                new_chat_ctx.append({"role": msg["role"], "content": content})

    convo_text = tokenizer.apply_chat_template(
        new_chat_ctx,
        add_generation_prompt=False,
        add_special_tokens=False,
        tokenize=False,
    )

    # Drop everything from the last <|im_end|> onward
    ix = convo_text.rfind("<|im_end|>")
    return convo_text[:ix] if ix != -1 else convo_text


# Probability that the user has finished their utterance
def calculate_eou(chat_ctx, session):
    formatted_text = format_chat_ctx(chat_ctx[-MAX_HISTORY:])
    inputs = tokenizer(
        formatted_text,
        return_tensors="np",
        truncation=True,
        max_length=MAX_HISTORY_TOKENS,
    )
    input_ids = np.array(inputs["input_ids"], dtype=np.int64)
    outputs = session.run(["logits"], {"input_ids": input_ids})
    logits = outputs[0][0, -1, :]  # Logits at the final position
    probs = softmax(logits)
    eou_token_id = tokenizer.encode("<|im_end|>")[-1]
    return probs[eou_token_id]


# Streaming chat handler for the Gradio interface
def respond(
    message,
    history: list[tuple[str, str]],
    max_tokens,
    temperature,
    top_p,
):
    # Default to an empty system prompt if CHARACTER_DESC is unset
    messages = [{"role": "system", "content": os.environ.get("CHARACTER_DESC", "")}]

    for user_msg, assistant_msg in history[-10:]:
        if user_msg:
            messages.append({"role": "user", "content": user_msg})
        if assistant_msg:
            messages.append({"role": "assistant", "content": assistant_msg})

    messages.append({"role": "user", "content": message})

    # Only reply if the EOU model judges the user's turn to be complete
    eou_prob = calculate_eou(messages, onnx_session)
    print(f"EOU Probability: {eou_prob}")
    if eou_prob < EOU_THRESHOLD:
        yield "[Waiting for user to continue input...]"
        return

    response = ""
    # Renamed the loop variable from "message" (which shadowed the
    # function parameter) to "chunk"
    for chunk in qwen_client.chat_completion(
        messages,
        max_tokens=max_tokens,
        stream=True,
        temperature=temperature,
        top_p=top_p,
    ):
        token = chunk.choices[0].delta.content
        if token:  # The final stream chunk may carry no content
            response += token
            yield response

    print(f"Generated response: {response}")


# Gradio interface; the sliders supply the extra arguments of respond()
demo = gr.ChatInterface(
    respond,
    additional_inputs=[
        gr.Slider(minimum=1, maximum=2048, value=512, step=1, label="Max new tokens"),
        gr.Slider(minimum=0.1, maximum=2.0, value=0.7, step=0.1, label="Temperature"),
        gr.Slider(minimum=0.1, maximum=1.0, value=0.95, step=0.05, label="Top-p"),
    ],
)

if __name__ == "__main__":
    demo.launch()