---
title: MoE
emoji: π
colorFrom: red
colorTo: red
sdk: gradio
sdk_version: 4.12.0
app_file: app.py
pinned: false
---
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
#######################################################
apply_chat_template in Hugging Face is just great.
#######################################################
Why templates?
An increasingly common use case for LLMs is chat. In a chat context, rather than continuing a single string of text (as is the case with a standard language model), the model instead continues a conversation that consists of one or more messages, each of which includes a role as well as message text.
Chat models have been trained with very different formats for converting conversations into a single tokenizable string. Using a format different from the format a model was trained with will usually cause severe, silent performance degradation, so matching the format used during training is extremely important!
So, the tokenizer's apply_chat_template method applies the chat format the model was trained with. Under the hood, the tokenizer's chat_template attribute contains a Jinja template that converts conversation histories into a correctly formatted string.
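As a minimal illustration (the checkpoint here is the Mixtral instruct model discussed later; any chat model that ships a chat_template behaves the same way):

```python
from transformers import AutoTokenizer

# Illustrative checkpoint; any chat model with a chat_template works the same way.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-Instruct-v0.1")

messages = [
    {"role": "user", "content": "What is a Mixture of Experts model?"},
]

# Render the conversation with the model's own Jinja chat template.
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)
```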
#######################################################
Structure of the history variable
#######################################################
The history variable in your script is structured as a list of dialogue pairs, where each pair consists of a user's message and the assistant's response. It's a sequential record of the conversation. Here is an example of its structure in JSON format:
```json
[
  ["User's first instruction", "Assistant's first response"],
  ["User's second instruction", "Assistant's second response"],
  ["User's third instruction", null]
]
```
Structure within the format_chat_history function
In the format_chat_history function, the messages list is created using dictionaries to format each message. This is where the curly braces {} appear. Each message is a dictionary with two keys: role and content.
```python
messages = []

# All completed turns except the last one (the pending user query).
for dialog in history[:-1]:
    instruction, response = dialog[0], dialog[1]
    messages.extend([
        {"role": "user", "content": instruction},
        {"role": "assistant", "content": response},
    ])

# The last entry only carries the user's new instruction; its response is still None.
new_instruction = history[-1][0].strip()
messages.append({"role": "user", "content": new_instruction})
```
In this snippet:
- Each dialog from history is processed.
- For each dialog, two dictionaries are created and added to the messages list.
- Each dictionary represents a message, where:
  - {"role": "user", "content": instruction} represents the user's message.
  - {"role": "assistant", "content": response} represents the assistant's response.
The curly braces {} are used to create these dictionaries, which are then formatted into a prompt suitable for the Mixtral model. The history variable itself, however, remains a list of lists and does not use curly braces.
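For the example history shown earlier, the snippet above produces the following messages list (the third entry contributes only a user message, since its response is still null):

```python
messages = [
    {"role": "user", "content": "User's first instruction"},
    {"role": "assistant", "content": "Assistant's first response"},
    {"role": "user", "content": "User's second instruction"},
    {"role": "assistant", "content": "Assistant's second response"},
    {"role": "user", "content": "User's third instruction"},
]
```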
==============
A conditional such as if dialog[i % 2] ensures that messages that are None (like the latest assistant response in an ongoing conversation) are not included.
In a typical conversation history, the latest response from the assistant might not exist yet (i.e., it's None) because the user just sent a message and the assistant hasn't replied yet. This check ensures that such None values are not included in the formatted output, as they don't represent actual messages and would not be meaningful input for the model.
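A minimal sketch of that filtering pattern (variable names here are illustrative rather than copied from app.py):

```python
roles = ("user", "assistant")
messages = []
for dialog in history:
    for i, text in enumerate(dialog):
        # Skip None entries, e.g. the assistant reply that does not exist yet.
        if text:
            messages.append({"role": roles[i % 2], "content": text})
```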
#######################################################
format_chat_history(history)
#######################################################
Input Structure: The function expects history as an input, which is a list of dialogues. Each dialogue in the history is a tuple with two elements: the first element is the user's instruction, and the second is the assistant's response. The most recent dialogue will have None as the response, since it's the current user query awaiting a response.
Processing Steps:
- The function iterates over each dialogue in the history.
- For each dialogue, it extracts the user instruction and the assistant's response.
- These are then formatted into a list of dictionaries, where each dictionary represents a message with two keys: role and content. The role can be either "user" or "assistant", indicating who said what, and content is the actual text of the message.
Formatting the Prompt: After processing the history, the function uses the apply_chat_template method of the tokenizer. This method takes the formatted messages and converts them into a prompt string. This string is a properly structured input that the Mixtral model can process.
Tokenization and Generation Prompt: The apply_chat_template method is called with tokenize=False and add_generation_prompt=True. This means that the function will format the messages into a string without tokenizing them (since tokenization is typically handled by the model pipeline later) and will add a generation prompt at the end. This generation prompt is a cue for the model to start generating a response.
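Putting the pieces together, a sketch of what format_chat_history can look like (this mirrors the snippets above; the tokenizer is passed as a parameter here only to keep the sketch self-contained, and the exact code in app.py may differ):

```python
def format_chat_history(history, tokenizer):
    """Convert Gradio-style history (list of [user, assistant] pairs) into a prompt string."""
    messages = []
    for instruction, response in history[:-1]:
        messages.extend([
            {"role": "user", "content": instruction},
            {"role": "assistant", "content": response},
        ])

    # The last pair holds the new user query; its assistant slot is still None.
    messages.append({"role": "user", "content": history[-1][0].strip()})

    # Render with the model's own chat template; keep it as a string and
    # append the generation prompt so the model knows it should answer next.
    return tokenizer.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
```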
#######################################################
thread = Thread(target=pipeline, kwargs=kwargs)
#######################################################
The use of threading in this context is aimed at preventing the UI from freezing during the execution of computationally intensive tasks, though the specific implementation here does raise some questions about its effectiveness in achieving truly asynchronous behavior.
Asynchronous Operation:
- The pipeline function, which generates responses using the Mixtral model, can be time-consuming, especially for complex text generation tasks. Running this function synchronously (i.e., in the main thread) would block the execution of other parts of the code until it completes. This can lead to a frozen or unresponsive user interface, especially in a GUI application like one using Gradio.
Maintaining UI Responsiveness:
- By using a separate thread for the pipeline function, the main thread (which handles the user interface) remains free to respond to user inputs and other events. This way, the application remains responsive even while the model is processing.
Analysis of the Thread Implementation
- Thread(target=pipeline, kwargs=kwargs): This line creates a new thread object, setting the pipeline function as the target to be executed in this thread. The kwargs are the arguments passed to the pipeline function.
- thread.start(): This initiates the execution of the pipeline function in a new thread, separate from the main thread.