proper embedding for llama-2-7b-chat.Q4_K_M.gguf
Hi there, I am trying to get this model working with a LlamaIndex vector store.
I can follow this doc, with modifications to build a local vector store over some markdown files, and get good responses:
https://gpt-index.readthedocs.io/en/latest/examples/llm/llama_2_llama_cpp.html
That uses this model: "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"
with this embedding: "BAAI/bge-small-en-v1.5"
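For context, my working setup follows that doc fairly closely. It looks roughly like this (abbreviated; "./data" is a placeholder for my markdown folder and the query text is made up):

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

# LlamaCPP setup as in the doc, trimmed down, using the 13B Q4_0 model above
llm = LlamaCPP(
    model_url="https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf",
    temperature=0.1,
    max_new_tokens=256,
    context_window=3900,
    model_kwargs={"n_gpu_layers": 1},
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
)

# the embedding model from the doc
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")

# "./data" is a placeholder for my folder of markdown files
documents = SimpleDirectoryReader("./data").load_data()
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
print(query_engine.query("What do my notes say about setup steps?"))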
I see on the model card that llama-2-7b-chat.Q4_K_M.gguf is recommended. However, when I use that model with "BAAI/bge-small-en-v1.5" (or any of its variants), I get embedding dimension mismatch errors.
Example: ValueError: shapes (1536,) and (768,) not aligned: 1536 (dim 0) != 768 (dim 0)
This happens whether I download the model from the URL or load a local copy of it.
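If it helps, this is roughly how one could check what dimension a given embedding model actually produces (untested sketch; the bge-base name is just an example for comparison):

from llama_index.embeddings import HuggingFaceEmbedding

# print the length of the vector each embedding model returns
for name in ["BAAI/bge-small-en-v1.5", "BAAI/bge-base-en-v1.5"]:
    embed_model = HuggingFaceEmbedding(model_name=name)
    vector = embed_model.get_text_embedding("dimension check")
    print(name, len(vector))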
Non-working example:
from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

model_url = "https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf"

llm = LlamaCPP(
    # You can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, you can set the path to a pre-downloaded model instead of model_url
    # model_path="./models/TheBloke/Llama-2-7B-chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf",
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__()
    # set n_gpu_layers to at least 1 to use the GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into the Llama 2 chat prompt format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    # verbose=True,
)
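From here the rest of the run is the same as my working setup, with this llm swapped in (again, "./data" and the query text are placeholders):

from llama_index import SimpleDirectoryReader, ServiceContext, VectorStoreIndex
from llama_index.embeddings import HuggingFaceEmbedding

documents = SimpleDirectoryReader("./data").load_data()
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

query_engine = index.as_query_engine()
response = query_engine.query("What do my notes say about setup steps?")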
Am I right that this is some sort of mismatch between the embedding model, the embeddings stored in the VectorStoreIndex, and the LLM variant being instantiated?
Again, I am able to get the "llama-2-13b-chat.Q4_0.gguf" version to work, but not "llama-2-7b-chat.Q4_K_M.gguf".
Thanks for reading.
How did you solve this?