proper embedding for llama-2-7b-chat.Q4_K_M.gguf


Hi there, I am trying to get this model working with a LlamaIndex vector store.

I can follow this doc, with modifications to build a local vector store over some markdown files, and get good responses:
https://gpt-index.readthedocs.io/en/latest/examples/llm/llama_2_llama_cpp.html
That example uses this model: "https://huggingface.co/TheBloke/Llama-2-13B-chat-GGUF/resolve/main/llama-2-13b-chat.Q4_0.gguf"
with this embedding model: "BAAI/bge-small-en-v1.5"
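
For context, my working setup follows that doc fairly closely. A minimal sketch (the ./data directory of markdown files and the query string are placeholders for my local content):

from llama_index import ServiceContext, VectorStoreIndex, SimpleDirectoryReader
from llama_index.embeddings import HuggingFaceEmbedding

# llm is the LlamaCPP instance from the doc above (the 13B Q4_0 model)
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
service_context = ServiceContext.from_defaults(llm=llm, embed_model=embed_model)

# load the local markdown files and build the index with that embedding model
documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

response = index.as_query_engine().query("What do my notes say about deployment?")
print(response)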

I see on the model card that llama-2-7b-chat.Q4_K_M.gguf is recommended. However, when I use that model with any of the "BAAI/bge-small-en-v1.5" variants, I get dimension mismatch errors. For example:

ValueError: shapes (1536,) and (768,) not aligned: 1536 (dim 0) != 768 (dim 0)
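
For what it's worth, neither dimension in that error matches bge-small, which produces 384-dimensional embeddings (bge-base is 768, bge-large is 1024, and 1536 happens to be the dimension of OpenAI's text-embedding-ada-002, which llama_index falls back to by default). A quick way to check the dimensions directly:

from llama_index.embeddings import HuggingFaceEmbedding

# print the embedding dimension of each bge variant
for name in ["BAAI/bge-small-en-v1.5", "BAAI/bge-base-en-v1.5", "BAAI/bge-large-en-v1.5"]:
    embed_model = HuggingFaceEmbedding(model_name=name)
    vec = embed_model.get_text_embedding("test")
    print(name, len(vec))  # 384, 768, 1024 respectively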

This occurs whether I point model_url at the download URL or load a local copy of the model.
Non-working example:

from llama_index.llms import LlamaCPP
from llama_index.llms.llama_utils import messages_to_prompt, completion_to_prompt

model_url = "https://huggingface.co/TheBloke/Llama-2-7B-chat-GGUF/resolve/main/llama-2-7b-chat.Q4_K_M.gguf"
llm = LlamaCPP(
    # you can pass in the URL to a GGUF model to download it automatically
    model_url=model_url,
    # optionally, set the path to a pre-downloaded model instead of model_url
    # model_path="./models/TheBloke/Llama-2-7B-chat-GGUF/llama-2-7b-chat.Q4_K_M.gguf",
    temperature=0.1,
    max_new_tokens=256,
    # llama2 has a context window of 4096 tokens, but we set it lower to allow for some wiggle room
    context_window=3900,
    # kwargs to pass to __call__()
    generate_kwargs={},
    # kwargs to pass to __init__(); set n_gpu_layers to at least 1 to use the GPU
    model_kwargs={"n_gpu_layers": 1},
    # transform inputs into the Llama 2 chat prompt format
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    # verbose=True,
)

Am I right that this is some sort of mismatch between the embedding model, the vectors stored in the VectorStoreIndex, and the way the LLM is instantiated?
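
One thing I wonder about: if the index was built and persisted with one embedding model and then reloaded without the original settings, llama_index would query it with the default (OpenAI, 1536-dim) embedding, which would produce exactly this kind of shape error. A sketch of what I mean (assuming the index was persisted to ./storage and service_context is the one it was built with):

from llama_index import StorageContext, load_index_from_storage

# reload a persisted index with the SAME service_context it was built with;
# omitting it makes llama_index fall back to the default embedding model
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context, service_context=service_context)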

Again, I am able to get the "llama-2-13b-chat.Q4_0.gguf" version to work, but not "llama-2-7b-chat.Q4_K_M.gguf".

Thanks for reading.

How did you solve this?
