why "MLA is not supported with awq_marlin quantization. Disabling MLA." with 4090 * 32 (4 node / vllm 0.7.2)

#14 opened by FightLLM

root@stone-2:~# VLLM_WORKER_MULTIPROC_METHOD=spawn python3 -m vllm.entrypoints.openai.api_server --host 0.0.0.0 --max-model-len 65536 --max-num-batched-tokens 65536 --trust-remote-code --tensor-parallel-size 16 --pipeline-parallel-size 2 --gpu-memory-utilization 0.97 --dtype float16 --served-model-name deepseek-reasoner --model /data/DeepSeek-R1-AWQ/
INFO 02-20 13:09:07 __init__.py:190] Automatically detected platform cuda.
INFO 02-20 13:09:16 api_server.py:840] vLLM API server version 0.7.2
INFO 02-20 13:09:16 api_server.py:841] args: Namespace(host='0.0.0.0', port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=['*'], allowed_methods=['*'], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, chat_template_content_format='auto', response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_request_id_headers=False, enable_auto_tool_choice=False, enable_reasoning=False, reasoning_parser=None, tool_call_parser=None, tool_parser_plugin='', model='/data/DeepSeek-R1-AWQ/', task='auto', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=True, allowed_local_media_path=None, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='float16', kv_cache_dtype='auto', max_model_len=65536, guided_decoding_backend='xgrammar', logits_processor_pattern=None, model_impl='auto', distributed_executor_backend=None, pipeline_parallel_size=2, tensor_parallel_size=16, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=None, enable_prefix_caching=None, disable_sliding_window=False, use_v2_block_manager=True, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.97, num_gpu_blocks_override=None, max_num_batched_tokens=65536, max_num_seqs=None, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, hf_overrides=None, enforce_eager=False, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, disable_mm_preprocessor_cache=False, enable_lora=False, enable_lora_bias=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=['deepseek-reasoner'], qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, scheduling_policy='fcfs', override_neuron_config=None, override_pooler_config=None, compilation_config=None, kv_transfer_config=None, worker_cls='auto', generation_config=None, override_generation_config=None, enable_sleep_mode=False, calculate_kv_scales=False, disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, enable_prompt_tokens_details=False)

WARNING 02-20 13:10:23 config.py:993] MLA is not supported with awq_marlin quantization. Disabling MLA.
(the same warning is repeated by every worker)

Does it work with vLLM 0.7.3?

Cognitive Computations org

Please build from source; the latest dev version contains the commit that enables MLA.
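For reference, a minimal sketch of a source build (this is the standard vLLM editable-install flow; the clone location and build settings are assumptions, adjust for your cluster):

# build vLLM from the main branch, which contains the MLA-enabling commit
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .   # compiles the CUDA kernels, so expect a long build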

vLLM 0.7.3 supports this.
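If a released wheel is preferable to a source build, upgrading the package directly should also pick up the fix (a sketch, assuming the 0.7.3 wheel is published for your platform):

# upgrade to the 0.7.3 release, then relaunch the same api_server command
pip install -U vllm==0.7.3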

v2ray changed discussion status to closed
