llama-2-7b-chat response has too few tokens?
#9 opened by st01cs
Hi,
I deployed llama-2-7b-chat.ggmlv3.q6_K.bin with llama-cpp-python[server] and tried to access it through the OpenAI-compatible API:
curl -X 'POST' \
  'http://llama07.server.com/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "messages": [
      {
        "content": "You are a helpful assistant.",
        "role": "system"
      },
      {
        "content": "Write a poem for France?",
        "role": "user"
      }
    ]
  }'
The response body:
{
  "id": "chatcmpl-93a635e0-af7a-4b78-8e96-f93c84b59c69",
  "object": "chat.completion",
  "created": 1690286307,
  "model": "/models/llama-2-7b-chat.ggmlv3.q6_K.bin",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Of course! Here is a poem for France:\n\nFrance, the land"
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 26,
    "completion_tokens": 16,
    "total_tokens": 42
  }
}
It always returns only a few tokens; how can I get the full poem in this case?
Thanks a lot for your work!
By the way, I start llama-cpp-python[server] with the following parameters:
-e USE_MLOCK=0 \
-e N_THREADS=64 \
-e N_BATCH=2048 \
-e N_CTX=8192 \
curl -X 'POST' \
  'http://llama07.server.com/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
    "max_tokens": 512,
    "messages": [
      {
        "content": "You are a helpful assistant.",
        "role": "system"
      },
      {
        "content": "Write a poem for France?",
        "role": "user"
      }
    ]
  }'
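The second request differs from the first only in the explicit "max_tokens": 512 field. In the first response, "finish_reason": "length" together with "completion_tokens": 16 indicates the generation was cut off at the server's small default token budget, so raising max_tokens per request appears to be what gets the full poem back. For reference, a minimal Python sketch of the same call (not from the original thread; the host name is taken from the curl commands above):

import requests

# Same chat completion request as the curl above, with max_tokens raised
# so the reply is not truncated at the server's default limit.
resp = requests.post(
    "http://llama07.server.com/v1/chat/completions",
    json={
        "max_tokens": 512,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Write a poem for France?"},
        ],
    },
    timeout=300,
)
data = resp.json()
choice = data["choices"][0]

# "finish_reason" is "stop" when the model finished on its own and "length"
# when it hit the token limit; if it is still "length", raise max_tokens further.
print(choice["finish_reason"], data["usage"])
print(choice["message"]["content"])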
st01cs changed discussion status to closed
Did you ever get a solution to this?