Deployment as server?

#1
by ngxson HF staff - opened

Hi, congrats for the release of the model :)

Just out of curiosity, I tried deploying the model via lmdeploy

The idea is to expose the access as an openai-compatible API. An easy deployment will allow many enterprises to adopt this model, making it even more popular.

pip install lmdeploy timm
pip install flash-attn --no-build-isolation

lmdeploy serve api_server 5CD-AI/Vintern-1B-v3_5 --backend turbomind --server-port 23333

Then access it using openai API:

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://127.0.0.1:23333/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
    model=model_name,
    messages=[{
        'role':
        'user',
        'content': [{
            'type': 'text',
            'text': 'mô tả hình ảnh này',
        }, {
            'type': 'image_url',
            'image_url': {
                'url':
                'https://modelscope.oss-cn-beijing.aliyuncs.com/resource/tiger.jpeg',
            },
        }],
    }],
    temperature=0.8,
    top_p=0.8,
    max_tokens=64)
print(response)

But run into an error:

...
    async for res in result_generator:
  File "/home/user/miniconda/lib/python3.9/site-packages/lmdeploy/serve/async_engine.py", line 528, in generate
    prompt_input = await self._get_prompt_input(prompt,
  File "/home/user/miniconda/lib/python3.9/site-packages/lmdeploy/serve/vl_async_engine.py", line 100, in _get_prompt_input
    results = await self.vl_encoder.wrap_for_turbomind(
  File "/home/user/miniconda/lib/python3.9/site-packages/lmdeploy/vl/engine.py", line 110, in wrap_for_turbomind
    result = self.model.to_turbomind(messages, chat_template, tokenizer,
  File "/home/user/miniconda/lib/python3.9/site-packages/lmdeploy/vl/model/internvl.py", line 289, in to_turbomind
    return self.to_turbomind_aux(messages, prompt, IMAGE_TOKEN, tokenizer,
  File "/home/user/miniconda/lib/python3.9/site-packages/lmdeploy/vl/model/base.py", line 219, in to_turbomind_aux
    segs = prompt.split(IMAGE_TOKEN)
AttributeError: 'NoneType' object has no attribute 'split'

Not sure how to debug this (unfortunately, outside of my knowledge), but let me know if I can help!

ngxson changed discussion title from Deployment? to Deployment as server?
Fifth Civil Defender - 5CD org

Oh, good idea Son ! I will fix it ASAP !

I tried deploying the model with lmdeploy but faced similar issues, even when using Docker and the offline interface. This suggests there might be a deeper issue with the deployment interfaces for these tools. It's interesting to hear that switching to vLLM worked seamlessly for both Docker and offline modes—this could indeed point to a problem specific to lmdeploy's handling of certain configurations.
Now I still try some ways to deploy with lmdeploy

from openai import OpenAI

client = OpenAI(api_key='YOUR_API_KEY', base_url='http://172.18.249.58:8000/v1')
model_name = client.models.list().data[0].id
response = client.chat.completions.create(
model=model_name,
messages=[{
'role':
'user',
'content': [{
'type': 'text',
'text': 'Trích thông tin từ ảnh trả về dạng Json',
}, {
'type': 'image_url',
'image_url': {
'url':
'https://huggingface.co/erax-ai/EraX-VL-7B-V1.5/resolve/main/images/trinhquangduy_front.jpg',
},
}],
}],
temperature=0.8,
top_p=0.8,
max_tokens=64)
print(response)


(Aphrodite) gitlab@AIMACHINE:~/whisperx_test$ python test2.py
ChatCompletion(id='chatcmpl-1cf4eb1850d649eda72c49cc3289c581', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='```json\n{\n "Họ và tên": "TRỊNH QUANG DUY",\n "Ngày sinh": "04/09/1994",\n "Giới tính": "Nam",\n "Quốc tịch": "Việt Nam",\n "Nơi thường trú', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[]), stop_reason=None)], created=1736909146, model='5CD-AI/Vintern-1B-v3_5', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=1327, total_tokens=1391, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)

Fifth Civil Defender - 5CD org

Wow, you surprised me, vLLM team is so handsome !!!!
image.png

Fifth Civil Defender - 5CD org
edited 12 days ago

Good new ! v3.5 model now can work better in MRZ task as I promised @thanhtan2136

User: Liệt kê toàn bộ văn bản có trong ảnh.
Assistant: Đặc điểm nhân dạng / Personal identification:
Sẹo chấm C: 2cm dưới trước
đuôi mắt phải
Ngày, tháng, năm / Date, month, year: 04/08/2022
CỤC TRƯỞNG CỤC CẢNH SÁT
QUẢN LÝ HÀNH CHÍNH VỀ TRẤT TỰ XÃ HỘI
DIRECTOR GENERAL OF THE POLICE DEPARTMENT FOR ADMINISTRATIVE MANAGEMENT和社会 ORDER
Tô Văn Huệ
Ngón trỏ trái / Left index finger
Ngón trỏ phải / Right index finger
IDVNM1890129363001189012936<<4
8911207F2911205VNM<<<<<<<<<<<< NGUYENKIM<ANH<<<<<<<<<<<<

Thank you for your positivity! I'm glad to have your support. Your model has been incredibly helpful in my work and experiments. If I encounter any issues or come up with ideas for improvement, I’ll be sure to share them. Your contributions to the community are greatly appreciated. Wishing you continued success with your projects!

Hi @khang119966 @ngxson ,
I've been testing this model. I downloaded it to my local computer and renamed it to InternVL2_5-1B. While running it with Lmdeploy, it runs normally, but the results are no good. Therefore, I believe the issue lies with Lmdeploy rather than the 5CD-AI/Vintern-1B-v3_5 model itself.

Log:
(deloyllm) gitlab@ai-server:~/models$ lmdeploy serve api_server ./InternVL2_5-1B --server-port 23333
/home/gitlab/miniconda3/envs/deloyllm/lib/python3.11/site-packages/timm/models/layers/init.py:48: FutureWarning: Importing from timm.models.layers is deprecated, please import via timm.layers
warnings.warn(f"Importing from {name} is deprecated, please import via timm.layers", FutureWarning)
Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes, but the current dype in Qwen2ForCausalLM is torch.float32. You should run training or inference using Automatic Mixed-Precision via the with torch.autocast(device_type='torch_device'): decorator, or load the model with the torch_dtype argument. Example: model = AutoModel.from_pretrained("openai/whisper-tiny", attn_implementation="flash_attention_2", torch_dtype=torch.float16)
[WARNING] gemm_config.in is not found; using default GEMM algo
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
HINT: Please open http://0.0.0.0:23333 in a browser for detailed api usage!!!
INFO: Started server process [16361]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:23333 (Press CTRL+C to quit)
INFO: 192.168.1.136:37562 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 192.168.1.136:37562 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 192.168.1.136:33436 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 192.168.1.136:33436 - "POST /v1/chat/completions HTTP/1.1" 200 OK

ChatCompletion(id='1', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content="EE\n{'họEEEE': 'TRỊNH QUANGEE', 'ngàyBS': '04EE91994', 'ngàyquýEE': '04EEEE12351EE', 'ngàyquyếE': '04091994', 'ngàyngBSEEEE': '04091994EE', 'ngàyngBSEEEE': '04091994E', 'ngàyngBSEE': '04EE91994EEE', 'ngàyngBSEEEEEEEE': '040919EE4', 'ngàyngBS': '040919EE4', 'ngàyngBSEE': '04091994E', 'ngàyngBSEEEEEEEE': '04091994E', 'ngàyngBSEE': '040919EE4', 'ngàyngBSEEEEEEEEEEEEEEE': '04091994E', 'ngàyngBSEEEEEE': '040919EE4', 'ngàyngBSEEEEEEEEEEEEE': '040919EE4EEEE', 'ngàyngBSEEEEEEE': '04091994E', 'ngàyngBSEEEEEEEEEEEEEEEEEE': '040919EE4', 'ngàyngBSEEEEE': '040919EEEEEEEE4', 'ngàyngBSEEEEEEEEEEE': '04091994EE4', 'ngàyngBSEEEEEEE': '040919EE4', 'ngàyngBSEEEEEEEEeeEEEEEEE': '040919EE4E', 'ngàyngBSEEEEE': '04091994EEEEEE4', 'ngàyngBSEEEEEEEEEEEEEEEEE': '040919EE4EE4', 'ngàyngBSEEEEEEE': '040919EEEE4EEEEEE', 'ngàyngBSEEEEEEEEeeEeeEeeEeeEee': '04091994E4', 'ngàyngBSEEEEEeeEeeEeeEeeEeeEEE': '04091994E4E', 'ngàyngBSEEEEEEEEE': '040919EE4EE4EE4', 'ngàyngBSEEEEEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEEEEEEE': '04091994E4E4EEE4', 'ngàyngBSEEEEEEEEEE': '04091994E4E4E4', 'ngàyngBSEEEEE': '04EE91994EEEEEE4', 'ngàyngBSEEEEEEEEEEEEEEEEEEEEEEEE': '040919EE4EEE4E4', 'ngàyngBSEEEEEEEEeeEeeEeeEEE': '040919EE4EEE4EE4', 'ngàyngBSEEEEEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEeeEEE', 'ngàyngBSEEEEEEE': '04EE91994E4EEE4EE4EE4', 'ngàyngBSEEEEEEEEEEEEEEEE': '04EEEEEEEE4E4E4E4EE4', 'ngàyngBSEEEEEeeEeeEEE': '04091994E4EEE4EE4EE4E', 'ngàyngBSEEEEEeeEeeEeeEeeEeeEeeEEE': '040919EE", refusal=None, role='assistant', audio=None, function_call=None, tool_calls=None))], created=1736916605, model='./InternVL2_5-1B', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=1025, prompt_tokens=1338, total_tokens=2363, completion_tokens_details=None, prompt_tokens_details=None))

@thanhtan2136 Wow thanks for the hint about vllm, I completely forgot to check if vllm has support for intern-vl.

I tried with vllm and it works magically 😮

Here I'm using HF space (notebook) with L4 GPU

pip install vllm
vllm serve 5CD-AI/Vintern-1B-v3_5

Then send the request to 127.0.0.1:8000

ChatCompletion(id='chatcmpl-b4e6424360454ba7b8dad973112660b0', choices=[Choice(finish_reason='length', index=0, logprobs=None, message=ChatCompletionMessage(content='Bức ảnh chụp một con hổ đứng trên một thảm cỏ xanh mướt. Hổ có màu sắc cam và đen, với những đường vân đen trắng xen kẽ. Hổ đang nhìn thẳng vào ống kính, tạo cảm giác vui vẻ và tự tin. Bức ảnh', refusal=None, role='assistant', audio=None, function_call=None, tool_calls=[]), stop_reason=None)], created=1736932228, model='5CD-AI/Vintern-1B-v3_5', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=64, prompt_tokens=811, total_tokens=875, completion_tokens_details=None, prompt_tokens_details=None), prompt_logprobs=None)

Nice!

@khang119966 It would be nice to mention this somewhere in the README. I believe that many vietnamese enterprises are looking for a solution like this!

Btw with vllm working I think we can ignore lmdeploy for now haha. I tried lmdeploy because that was the only thing I could find online

yeah I can deploy this with vllm but I don't known how to control this.... My testing(with Vllm) not have a good result so hope @khang119966 will have update a solution

Fifth Civil Defender - 5CD org
edited 12 days ago

@thanhtan2136 @ngxson yeah, I will test the vllm and tune the best config. All of my servers are busy today :) . the next few days.

khang119966 changed discussion status to closed
khang119966 changed discussion status to open

Hi @khang119966 ,
I am using this configuration with vLLM, and I have achieved good results. I hope this can help you as well.
I using this config for Scan Id card
GENERATION_CONFIG = {
"temperature": 0.01,
"top_p": 0.1,
"min_p": 0.1,
"top_k": 1,
"max_tokens": 512,
"repetition_penalty": 1.1,
"best_of": 1,
}

Sign up or log in to comment