[AI] vLLM 部署地端模型系統訊息記錄 A100
2025 年八月釋出的 GPT-OSS-20B 是普遍地端採用的模型,一釋出的時候就有 128K 的 max-model-len,2026 年四月 Google Gemma-4 上下文大小更可以到兩倍,本篇因為工作需要,需要嘗試使用 Gemma-4 模型,因此順便記錄不同地端 (GPT-OSS-20B, Gemma-4) LLM 模型在 Nvidia A100 上配合不同參數的系統訊息,藉此讓自己理解不同 LLM Model 與配合 A100 產生的差異,提供一個低成本的解決方案。
GPT-OSS-20B 模型是我們的參考依據:
部署指令參考 vLLM 官方連結:
vllm serve openai/gpt-oss-20b --port 10000 --max-model-len 131072 --gpu-memory-utilization 0.65 --trust-remote-code
系統訊息透露幾個重要訊息:
Model loading took 13.72 GiB memory and 62.336108 seconds
Available KV cache memory: 36.56 GiB
quantization=mxfp4,這個預設值是為什麼 20B 的模型只需要 13.72 GiB 的內存的關鍵。
(APIServer pid=6539) INFO 04-18 20:22:57 [utils.py:299]
(APIServer pid=6539) INFO 04-18 20:22:57 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=6539) INFO 04-18 20:22:57 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=6539) INFO 04-18 20:22:57 [utils.py:299] █▄█▀ █ █ █ █ model openai/gpt-oss-20b
(APIServer pid=6539) INFO 04-18 20:22:57 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=6539) INFO 04-18 20:22:57 [utils.py:299]
(APIServer pid=6539) INFO 04-18 20:22:57 [utils.py:233] non-default args: {'model_tag': 'openai/gpt-oss-20b', 'port': 10000, 'model': 'openai/gpt-oss-20b', 'trust_remote_code': True, 'max_model_len': 131072, 'gpu_memory_utilization': 0.65}
config.json: 1.81kB [00:00, 5.73MB/s]
(APIServer pid=6539) INFO 04-18 20:23:04 [model.py:549] Resolved architecture: GptOssForCausalLM
model.safetensors.index.json: 36.4kB [00:00, 80.0MB/s]
Parse safetensors files: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 12.56it/s]
(APIServer pid=6539) INFO 04-18 20:23:04 [model.py:1678] Using max model len 131072
(APIServer pid=6539) INFO 04-18 20:23:04 [config.py:131] Overriding max cuda graph capture size to 1024 for performance.
(APIServer pid=6539) INFO 04-18 20:23:04 [vllm.py:790] Asynchronous scheduling is enabled.
tokenizer_config.json: 4.20kB [00:00, 13.5MB/s]
tokenizer.json: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 27.9M/27.9M [00:00<00:00, 33.1MB/s]
special_tokens_map.json: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 98.0/98.0 [00:00<00:00, 660kB/s]
chat_template.jinja: 16.7kB [00:00, 60.7MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 177/177 [00:00<00:00, 952kB/s]
(EngineCore pid=6686) INFO 04-18 20:23:14 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='openai/gpt-oss-20b', speculative_config=None, tokenizer='openai/gpt-oss-20b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=openai/gpt-oss-20b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 1024, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=6686) WARNING 04-18 20:23:14 [network_utils.py:36] The environment variable HOST_IP is deprecated and ignored, as it is often used by Docker and other software to interact with the container's network stack. Please use VLLM_HOST_IP instead to set the IP address for vLLM processes to communicate with each other.
(EngineCore pid=6686) INFO 04-18 20:23:14 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.240.1.138:45759 backend=nccl
(EngineCore pid=6686) INFO 04-18 20:23:14 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=6686) INFO 04-18 20:23:15 [gpu_model_runner.py:4735] Starting to load model openai/gpt-oss-20b...
(EngineCore pid=6686) INFO 04-18 20:23:16 [cuda.py:334] Using TRITON_ATTN attention backend out of potential backends: ['TRITON_ATTN'].
(EngineCore pid=6686) INFO 04-18 20:23:16 [mxfp4.py:352] Using 'MARLIN' Mxfp4 MoE backend.
(EngineCore pid=6686) INFO 04-18 20:24:15 [weight_utils.py:581] Time spent downloading weights for openai/gpt-oss-20b: 58.546043 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/3 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 33% Completed | 1/3 [00:00<00:01, 1.71it/s]
Loading safetensors checkpoint shards: 67% Completed | 2/3 [00:01<00:00, 1.50it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 1.55it/s]
Loading safetensors checkpoint shards: 100% Completed | 3/3 [00:01<00:00, 1.56it/s]
(EngineCore pid=6686)
(EngineCore pid=6686) INFO 04-18 20:24:17 [default_loader.py:384] Loading weights took 2.08 seconds
(EngineCore pid=6686) INFO 04-18 20:24:17 [mxfp4.py:836] Using MoEPrepareAndFinalizeNoDPEPModular
(EngineCore pid=6686) INFO 04-18 20:24:18 [gpu_model_runner.py:4820] Model loading took 13.72 GiB memory and 62.336108 seconds
(EngineCore pid=6686) INFO 04-18 20:24:23 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/53309fb7e3/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=6686) INFO 04-18 20:24:23 [backends.py:1111] Dynamo bytecode transform time: 4.82 s
(EngineCore pid=6686) INFO 04-18 20:24:26 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=6686) INFO 04-18 20:24:30 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 6.34 s
(EngineCore pid=6686) INFO 04-18 20:24:31 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/a7c56a287d6bc73f13d9858ad8e60b103c9a539350d9acfe4f98c0a15ebd82e3/rank_0_0/model
(EngineCore pid=6686) INFO 04-18 20:24:31 [monitor.py:48] torch.compile took 12.46 s in total
(EngineCore pid=6686) INFO 04-18 20:24:31 [monitor.py:76] Initial profiling/warmup run took 0.23 s
(EngineCore pid=6686) INFO 04-18 20:24:37 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=1024
(EngineCore pid=6686) INFO 04-18 20:24:38 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=83 (largest=1024), FULL=35 (largest=256)
(EngineCore pid=6686) INFO 04-18 20:24:40 [gpu_model_runner.py:5955] Estimated CUDA graph memory: 0.42 GiB total
(EngineCore pid=6686) INFO 04-18 20:24:40 [gpu_worker.py:436] Available KV cache memory: 36.56 GiB
(EngineCore pid=6686) INFO 04-18 20:24:40 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.6500 to 0.6553 to maintain the same effective KV cache size.
(EngineCore pid=6686) INFO 04-18 20:24:40 [kv_cache_utils.py:1319] GPU KV cache size: 798,624 tokens
(EngineCore pid=6686) INFO 04-18 20:24:40 [kv_cache_utils.py:1324] Maximum concurrency for 131,072 tokens per request: 11.99x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 83/83 [00:05<00:00, 16.07it/s]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:04<00:00, 8.20it/s]
(EngineCore pid=6686) INFO 04-18 20:24:50 [gpu_model_runner.py:6046] Graph capturing finished in 10 secs, took 0.69 GiB
(EngineCore pid=6686) INFO 04-18 20:24:50 [gpu_worker.py:597] CUDA graph pool memory: 0.69 GiB (actual), 0.42 GiB (estimated), difference: 0.26 GiB (38.4%).
(EngineCore pid=6686) INFO 04-18 20:24:50 [core.py:283] init engine (profile, create kv cache, warmup model) took 32.50 seconds
(EngineCore pid=6686) INFO 04-18 20:24:53 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=6539) INFO 04-18 20:24:53 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=6539) WARNING 04-18 20:24:54 [serving.py:233] For gpt-oss, we ignore --enable-auto-tool-choice and always enable tool use.
(APIServer pid=6539) INFO 04-18 20:24:58 [hf.py:314] Detected the chat template content format to be 'string'. You can set `--chat-template-content-format` to override this.
(APIServer pid=6539) INFO 04-18 20:24:58 [api_server.py:594] Starting vLLM server on http://0.0.0.0:10000
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:37] Available routes are:
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=6539) INFO 04-18 20:24:58 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=6539) INFO: Started server process [6539]
(APIServer pid=6539) INFO: Waiting for application startup.
(APIServer pid=6539) INFO: Application startup complete.
(APIServer pid=6539) INFO 04-18 20:25:48 [loggers.py:259] Engine 000: Avg prompt throughput: 8.0 tokens/s, Avg generation throughput: 111.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=6539) INFO: 127.0.0.1:39600 - "POST /v1/responses HTTP/1.1" 200 OK
(APIServer pid=6539) INFO 04-18 20:25:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 164.7 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=6539) INFO 04-18 20:26:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
Gemma-4-26B-A4B
部署指令參考 vLLM 官方網站:
vllm serve google/gemma-4-26B-A4B \
--port 10000 \
--max-model-len 131072 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template /home/vllm/examples/tool_chat_template_gemma4.jinja \
--gpu-memory-utilization 0.9
部署系統訊息:
Model loading took 48.5 GiB memory and 10.090812 seconds
Available KV cache memory: 21.5 GiB
(.venv) root@run-cvvmqstioi240l5-0:~# vllm serve google/gemma-4-26B-A4B --port 10000 --max-model-len 131072 --enforce-eager --enable-chunked-prefill --enable-auto-tool-choice --reasoning-parser gemma4 --tool-call-parser gemma4 --chat-template /home/vllm/examples/tool_chat_template_gemma4.jinja --gpu-memory-utilization 0.9
(APIServer pid=10793) INFO 04-17 14:03:07 [utils.py:299]
(APIServer pid=10793) INFO 04-17 14:03:07 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=10793) INFO 04-17 14:03:07 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=10793) INFO 04-17 14:03:07 [utils.py:299] █▄█▀ █ █ █ █ model google/gemma-4-26B-A4B
(APIServer pid=10793) INFO 04-17 14:03:07 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=10793) INFO 04-17 14:03:07 [utils.py:299]
(APIServer pid=10793) INFO 04-17 14:03:07 [utils.py:233] non-default args: {'model_tag': 'google/gemma-4-26B-A4B', 'chat_template': '/home/vllm/examples/tool_chat_template_gemma4.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'port': 10000, 'model': 'google/gemma-4-26B-A4B', 'max_model_len': 131072, 'enforce_eager': True, 'reasoning_parser': 'gemma4', 'enable_chunked_prefill': True}
(APIServer pid=10793) INFO 04-17 14:03:08 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=10793) INFO 04-17 14:03:08 [model.py:1678] Using max model len 131072
(APIServer pid=10793) INFO 04-17 14:03:08 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=10793) INFO 04-17 14:03:08 [vllm.py:790] Asynchronous scheduling is enabled.
(APIServer pid=10793) WARNING 04-17 14:03:08 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(APIServer pid=10793) WARNING 04-17 14:03:08 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(APIServer pid=10793) INFO 04-17 14:03:08 [vllm.py:1025] Cudagraph is disabled under eager mode
(APIServer pid=10793) INFO 04-17 14:03:08 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=10853) INFO 04-17 14:03:19 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='google/gemma-4-26B-A4B', speculative_config=None, tokenizer='google/gemma-4-26B-A4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='gemma4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=google/gemma-4-26B-A4B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.NONE: 0>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['all'], 'splitting_ops': [], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.NONE: 0>, 'cudagraph_num_of_warmups': 0, 'cudagraph_capture_sizes': [], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': True, 'fuse_act_quant': True, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 0, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=10853) WARNING 04-17 14:03:19 [network_utils.py:36] The environment variable HOST_IP is deprecated and ignored, as it is often used by Docker and other software to interact with the container's network stack. Please use VLLM_HOST_IP instead to set the IP address for vLLM processes to communicate with each other.
(EngineCore pid=10853) INFO 04-17 14:03:22 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.240.1.237:39671 backend=nccl
(EngineCore pid=10853) INFO 04-17 14:03:22 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=10853) INFO 04-17 14:03:23 [gpu_model_runner.py:4735] Starting to load model google/gemma-4-26B-A4B...
(EngineCore pid=10853) INFO 04-17 14:03:23 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=10853) WARNING 04-17 14:03:23 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=10853) WARNING 04-17 14:03:23 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=10853) INFO 04-17 14:03:23 [vllm.py:1025] Cudagraph is disabled under eager mode
(EngineCore pid=10853) INFO 04-17 14:03:23 [compilation.py:290] Enabled custom fusions: norm_quant, act_quant
(EngineCore pid=10853) INFO 04-17 14:03:23 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=10853) INFO 04-17 14:03:23 [unquantized.py:186] Using TRITON backend for Unquantized MoE
(EngineCore pid=10853) INFO 04-17 14:03:23 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:06<00:06, 6.77s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:08<00:00, 3.93s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:08<00:00, 4.36s/it]
(EngineCore pid=10853)
(EngineCore pid=10853) INFO 04-17 14:03:33 [default_loader.py:384] Loading weights took 8.77 seconds
(EngineCore pid=10853) INFO 04-17 14:03:33 [gpu_model_runner.py:4820] Model loading took 48.5 GiB memory and 10.090812 seconds
(EngineCore pid=10853) INFO 04-17 14:03:34 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 2496 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=10853) WARNING 04-17 14:03:50 [fused_moe.py:1090] Using default MoE config. Performance might be sub-optimal! Config file not found at /home/.venv/lib/python3.12/site-packages/vllm/model_executor/layers/fused_moe/configs/E=128,N=704,device_name=NVIDIA_A100_80GB_PCIe.json
(EngineCore pid=10853) INFO 04-17 14:03:51 [gpu_worker.py:436] Available KV cache memory: 21.5 GiB
(EngineCore pid=10853) INFO 04-17 14:03:51 [kv_cache_utils.py:1319] GPU KV cache size: 93,936 tokens
(EngineCore pid=10853) INFO 04-17 14:03:51 [kv_cache_utils.py:1324] Maximum concurrency for 131,072 tokens per request: 6.96x
(EngineCore pid=10853) INFO 04-17 14:03:51 [core.py:283] init engine (profile, create kv cache, warmup model) took 17.49 seconds
(EngineCore pid=10853) WARNING 04-17 14:03:51 [vllm.py:848] Enforce eager set, disabling torch.compile and CUDAGraphs. This is equivalent to setting -cc.mode=none -cc.cudagraph_mode=none
(EngineCore pid=10853) WARNING 04-17 14:03:51 [vllm.py:859] Inductor compilation was disabled by user settings, optimizations settings that are only active during inductor compilation will be ignored.
(EngineCore pid=10853) INFO 04-17 14:03:51 [vllm.py:1025] Cudagraph is disabled under eager mode
(APIServer pid=10793) INFO 04-17 14:03:51 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=10793) INFO 04-17 14:03:51 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=10793) WARNING 04-17 14:03:51 [model.py:1435] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=10793) INFO 04-17 14:03:52 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=10793) INFO 04-17 14:04:07 [base.py:231] Multi-modal warmup completed in 15.030s
(APIServer pid=10793) INFO 04-17 14:04:07 [api_server.py:594] Starting vLLM server on http://0.0.0.0:10000
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:37] Available routes are:
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=10793) INFO 04-17 14:04:07 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=10793) INFO: Started server process [10793]
(APIServer pid=10793) INFO: Waiting for application startup.
(APIServer pid=10793) INFO: Application startup complete.
(APIServer pid=10793) INFO 04-17 14:04:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.8 tokens/s, Avg generation throughput: 1.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:04:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.3%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:04:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:04:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:04:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:05:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:05:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:05:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:05:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:05:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:05:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:06:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:06:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:06:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.1%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:06:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.2%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:06:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.2%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:06:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.2%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:07:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.2%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:07:18 [loggers.py:259] Engine 000: Avg prompt throughput: 2.1 tokens/s, Avg generation throughput: 17.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:07:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.5 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:07:38 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.4%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:07:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.6%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:07:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.3 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.8%, Prefix cache hit rate: 0.0%
(APIServer pid=10793) INFO 04-17 14:08:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.4 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.0%, Prefix cache hit rate: 0.0%
(EngineCore pid=10853) INFO 04-17 14:08:10 [core.py:1210] Shutdown initiated (timeout=0)
(EngineCore pid=10853) INFO 04-17 14:08:10 [core.py:1215] Aborting 1 requests
(EngineCore pid=10853) INFO 04-17 14:08:10 [core.py:1233] Shutdown complete
可以看到部署 Gemma-4-26B-A4B 已經佔據了大部分的記憶體資源非常耗費資源,而且產生 tokens 的速度也不高每秒大約 20 tokens,此時可以考慮使用 quantization fp8 如下圖所示可以降低一半的記憶體使用量,並且將 –max-model-len 設到 256K 可以充分發揮 Gemma 長上下文的優勢。
vllm serve google/gemma-4-26B-A4B \
--port 10000 \
--max-model-len 256K \
--quantization fp8 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template /root/vllm/examples/tool_chat_template_gemma4.jinja \
--gpu-memory-utilization 0.65
觀察系統 Log:
Model loading took 25.7 GiB memory and 15.197657 seconds
Available KV cache memory: 24.45 GiB
(APIServer pid=6048) INFO 04-18 20:14:28 [utils.py:299]
(APIServer pid=6048) INFO 04-18 20:14:28 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=6048) INFO 04-18 20:14:28 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=6048) INFO 04-18 20:14:28 [utils.py:299] █▄█▀ █ █ █ █ model google/gemma-4-26B-A4B
(APIServer pid=6048) INFO 04-18 20:14:28 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=6048) INFO 04-18 20:14:28 [utils.py:299]
(APIServer pid=6048) INFO 04-18 20:14:28 [utils.py:233] non-default args: {'model_tag': 'google/gemma-4-26B-A4B', 'chat_template': '/root/vllm/examples/tool_chat_template_gemma4.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'port': 10000, 'model': 'google/gemma-4-26B-A4B', 'max_model_len': 262144, 'quantization': 'fp8', 'reasoning_parser': 'gemma4', 'gpu_memory_utilization': 0.65}
(APIServer pid=6048) INFO 04-18 20:14:29 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=6048) INFO 04-18 20:14:29 [model.py:1678] Using max model len 262144
(APIServer pid=6048) INFO 04-18 20:14:29 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=6048) INFO 04-18 20:14:29 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=6116) INFO 04-18 20:14:40 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='google/gemma-4-26B-A4B', speculative_config=None, tokenizer='google/gemma-4-26B-A4B', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='gemma4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=google/gemma-4-26B-A4B, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=6116) WARNING 04-18 20:14:40 [network_utils.py:36] The environment variable HOST_IP is deprecated and ignored, as it is often used by Docker and other software to interact with the container's network stack. Please use VLLM_HOST_IP instead to set the IP address for vLLM processes to communicate with each other.
(EngineCore pid=6116) INFO 04-18 20:14:43 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.240.1.138:52415 backend=nccl
(EngineCore pid=6116) INFO 04-18 20:14:43 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0, EPLB rank N/A
(EngineCore pid=6116) INFO 04-18 20:14:43 [gpu_model_runner.py:4735] Starting to load model google/gemma-4-26B-A4B...
(EngineCore pid=6116) INFO 04-18 20:14:44 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=6116) INFO 04-18 20:14:44 [__init__.py:261] Selected MarlinFP8ScaledMMLinearKernel for Fp8OnlineLinearMethod
(EngineCore pid=6116) INFO 04-18 20:14:44 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=6116) INFO 04-18 20:14:44 [fp8.py:396] Using MARLIN Fp8 MoE backend out of potential backends: ['AITER', 'FLASHINFER_TRTLLM', 'FLASHINFER_CUTLASS', 'DEEPGEMM', 'TRITON', 'MARLIN', 'BATCHED_DEEPGEMM', 'BATCHED_TRITON', 'XPU'].
(EngineCore pid=6116) INFO 04-18 20:14:44 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
(EngineCore pid=6116) WARNING 04-18 20:14:46 [marlin_utils_fp8.py:216] Your GPU does not have native support for FP8 computation but FP8 quantization is being used. Weight-only FP8 compression will be used leveraging the Marlin kernel. This may degrade performance for compute-heavy workloads.
(EngineCore pid=6116) INFO 04-18 20:14:46 [fp8.py:560] Using MoEPrepareAndFinalizeNoDPEPModular
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:08<00:08, 8.47s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:13<00:00, 6.36s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:13<00:00, 6.68s/it]
(EngineCore pid=6116)
(EngineCore pid=6116) INFO 04-18 20:14:59 [default_loader.py:384] Loading weights took 13.51 seconds
(EngineCore pid=6116) INFO 04-18 20:14:59 [gpu_model_runner.py:4820] Model loading took 25.7 GiB memory and 15.197657 seconds
(EngineCore pid=6116) INFO 04-18 20:15:00 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 2496 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=6116) INFO 04-18 20:15:22 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/ed28a1082a/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=6116) INFO 04-18 20:15:22 [backends.py:1111] Dynamo bytecode transform time: 3.74 s
(EngineCore pid=6116) INFO 04-18 20:15:24 [backends.py:285] Directly load the compiled graph(s) for compile range (1, 2048) from the cache, took 1.461 s
(EngineCore pid=6116) INFO 04-18 20:15:24 [decorators.py:303] Directly load AOT compilation from path /root/.cache/vllm/torch_compile_cache/torch_aot_compile/cf7df09e7e9c2bd5fb3ccbf0eeff6a2bc7dbbc8d6ea39a535b47ecd6869e8076/rank_0_0/model
(EngineCore pid=6116) INFO 04-18 20:15:24 [monitor.py:48] torch.compile took 5.71 s in total
(EngineCore pid=6116) INFO 04-18 20:15:24 [monitor.py:76] Initial profiling/warmup run took 0.67 s
(EngineCore pid=6116) INFO 04-18 20:15:25 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=6116) INFO 04-18 20:15:25 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=6116) INFO 04-18 20:15:26 [gpu_model_runner.py:5955] Estimated CUDA graph memory: 1.11 GiB total
(EngineCore pid=6116) INFO 04-18 20:15:27 [gpu_worker.py:436] Available KV cache memory: 24.45 GiB
(EngineCore pid=6116) INFO 04-18 20:15:27 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.6500 to 0.6640 to maintain the same effective KV cache size.
(EngineCore pid=6116) INFO 04-18 20:15:27 [kv_cache_utils.py:1319] GPU KV cache size: 106,800 tokens
(EngineCore pid=6116) INFO 04-18 20:15:27 [kv_cache_utils.py:1324] Maximum concurrency for 262,144 tokens per request: 4.37x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:04<00:00, 10.39it/s]
Capturing CUDA graphs (decode, FULL): 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:03<00:00, 9.96it/s]
(EngineCore pid=6116) INFO 04-18 20:15:36 [gpu_model_runner.py:6046] Graph capturing finished in 9 secs, took 0.95 GiB
(EngineCore pid=6116) INFO 04-18 20:15:36 [gpu_worker.py:597] CUDA graph pool memory: 0.95 GiB (actual), 1.11 GiB (estimated), difference: 0.16 GiB (16.6%).
(EngineCore pid=6116) INFO 04-18 20:15:36 [core.py:283] init engine (profile, create kv cache, warmup model) took 36.33 seconds
(APIServer pid=6048) INFO 04-18 20:15:36 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=6048) INFO 04-18 20:15:36 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=6048) WARNING 04-18 20:15:36 [model.py:1435] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=6048) INFO 04-18 20:15:37 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=6048) INFO 04-18 20:15:52 [base.py:231] Multi-modal warmup completed in 15.028s
(APIServer pid=6048) INFO 04-18 20:15:52 [api_server.py:594] Starting vLLM server on http://0.0.0.0:10000
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:37] Available routes are:
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=6048) INFO 04-18 20:15:52 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=6048) INFO: Started server process [6048]
(APIServer pid=6048) INFO: Waiting for application startup.
(APIServer pid=6048) INFO: Application startup complete.
(APIServer pid=6048) INFO 04-18 20:17:42 [loggers.py:259] Engine 000: Avg prompt throughput: 2.5 tokens/s, Avg generation throughput: 14.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.2%, Prefix cache hit rate: 0.0%
(APIServer pid=6048) INFO: 127.0.0.1:50130 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=6048) INFO 04-18 20:17:52 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 80.6 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=6048) INFO 04-18 20:18:02 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
^C(EngineCore pid=6116) INFO 04-18 20:19:20 [core.py:1210] Shutdown initiated (timeout=0)
(EngineCore pid=6116) INFO 04-18 20:19:20 [core.py:1233] Shutdown complete
(APIServer pid=6048) INFO: Shutting down
備註:tokens 產生速率有所提升。
Gemma-4-31B
部署指令:
vllm serve google/gemma-4-31b-it \
--port 10000 \
--max-model-len 131072 \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--chat-template /home/vllm/examples/tool_chat_template_gemma4.jinja \
--gpu-memory-utilization 0.95
以下系統訊息顯示:
Model loading took 58.9 GiB memory and 265.606047 seconds
Available KV cache memory: 15.18 GiB
(APIServer pid=12007) INFO 04-17 14:22:59 [utils.py:299]
(APIServer pid=12007) INFO 04-17 14:22:59 [utils.py:299] █ █ █▄ ▄█
(APIServer pid=12007) INFO 04-17 14:22:59 [utils.py:299] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.19.0
(APIServer pid=12007) INFO 04-17 14:22:59 [utils.py:299] █▄█▀ █ █ █ █ model google/gemma-4-31B-it
(APIServer pid=12007) INFO 04-17 14:22:59 [utils.py:299] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=12007) INFO 04-17 14:22:59 [utils.py:299]
(APIServer pid=12007) INFO 04-17 14:22:59 [utils.py:233] non-default args: {'model_tag': 'google/gemma-4-31B-it', 'chat_template': '/home/vllm/examples/tool_chat_template_gemma4.jinja', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'port': 10000, 'model': 'google/gemma-4-31B-it', 'max_model_len': 131072, 'reasoning_parser': 'gemma4', 'gpu_memory_utilization': 0.95, 'enable_chunked_prefill': True}
(APIServer pid=12007) INFO 04-17 14:23:00 [model.py:549] Resolved architecture: Gemma4ForConditionalGeneration
(APIServer pid=12007) INFO 04-17 14:23:00 [model.py:1678] Using max model len 131072
(APIServer pid=12007) INFO 04-17 14:23:00 [config.py:104] Gemma4 model has heterogeneous head dimensions (head_dim=256, global_head_dim=512). Forcing TRITON_ATTN backend to prevent mixed-backend numerical divergence.
(APIServer pid=12007) INFO 04-17 14:23:00 [vllm.py:790] Asynchronous scheduling is enabled.
tokenizer_config.json: 2.10kB [00:00, 13.9MB/s]
tokenizer.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32.2M/32.2M [00:00<00:00, 38.8MB/s]
chat_template.jinja: 16.4kB [00:00, 60.6MB/s]
generation_config.json: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 208/208 [00:00<00:00, 1.26MB/s]
(EngineCore pid=12103) INFO 04-17 14:23:12 [core.py:105] Initializing a V1 LLM engine (v0.19.0) with config: model='google/gemma-4-31B-it', speculative_config=None, tokenizer='google/gemma-4-31B-it', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=131072, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, decode_context_parallel_size=1, dcp_comm_backend=ag_rs, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, enable_return_routed_experts=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='gemma4', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False, enable_mfu_metrics=False, enable_mm_processor_stats=False, enable_logging_iteration_details=False), seed=0, served_model_name=google/gemma-4-31B-it, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'mode': <CompilationMode.VLLM_COMPILE: 3>, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::olmo_hybrid_gdn_full_forward', 'vllm::kda_attention', 'vllm::sparse_attn_indexer', 'vllm::rocm_aiter_sparse_attn_indexer', 'vllm::unified_kv_cache_update', 'vllm::unified_mla_kv_cache_update'], 'compile_mm_encoder': False, 'cudagraph_mm_encoder': False, 'encoder_cudagraph_token_budgets': [], 'encoder_cudagraph_max_images_per_batch': 0, 'compile_sizes': [], 'compile_ranges_endpoints': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'size_asserts': False, 'alignment_asserts': False, 'scalar_asserts': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': <CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)>, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': <DynamicShapesType.BACKED: 'backed'>, 'evaluate_guards': False, 'assume_32_bit_indexing': False}, 'local_cache_dir': None, 'fast_moe_cold_start': True, 'static_all_moe_layers': []}
(EngineCore pid=12103) WARNING 04-17 14:23:12 [network_utils.py:36] The environment variable HOST_IP is deprecated and ignored, as it is often used by Docker and other software to interact with the container's network stack. Please use VLLM_HOST_IP instead to set the IP address for vLLM processes to communicate with each other.
(EngineCore pid=12103) INFO 04-17 14:23:15 [parallel_state.py:1400] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp://10.240.1.237:44807 backend=nccl
(EngineCore pid=12103) INFO 04-17 14:23:15 [parallel_state.py:1716] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank N/A, EPLB rank N/A
(EngineCore pid=12103) INFO 04-17 14:23:16 [gpu_model_runner.py:4735] Starting to load model google/gemma-4-31B-it...
(EngineCore pid=12103) INFO 04-17 14:23:16 [vllm.py:790] Asynchronous scheduling is enabled.
(EngineCore pid=12103) INFO 04-17 14:23:16 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
(EngineCore pid=12103) INFO 04-17 14:23:16 [cuda.py:274] Using AttentionBackendEnum.TRITON_ATTN backend.
model.safetensors.index.json: 120kB [00:00, 172MB/s]
(EngineCore pid=12103) INFO 04-17 14:27:31 [weight_utils.py:581] Time spent downloading weights for google/gemma-4-31B-it: 254.377387 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/2 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 50% Completed | 1/2 [00:06<00:06, 6.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:09<00:00, 4.42s/it]
Loading safetensors checkpoint shards: 100% Completed | 2/2 [00:09<00:00, 4.66s/it]
(EngineCore pid=12103)
(EngineCore pid=12103) INFO 04-17 14:27:41 [default_loader.py:384] Loading weights took 9.81 seconds
(EngineCore pid=12103) INFO 04-17 14:27:42 [gpu_model_runner.py:4820] Model loading took 58.9 GiB memory and 265.606047 seconds
(EngineCore pid=12103) INFO 04-17 14:27:42 [gpu_model_runner.py:5753] Encoder cache will be initialized with a budget of 2496 tokens, and profiled with 1 video items of the maximum feature size.
(EngineCore pid=12103) INFO 04-17 14:28:12 [backends.py:1051] Using cache directory: /root/.cache/vllm/torch_compile_cache/7adb633e8c/rank_0_0/backbone for vLLM's torch.compile
(EngineCore pid=12103) INFO 04-17 14:28:12 [backends.py:1111] Dynamo bytecode transform time: 13.96 s
(EngineCore pid=12103) INFO 04-17 14:28:20 [backends.py:372] Cache the graph of compile range (1, 2048) for later use
(EngineCore pid=12103) INFO 04-17 14:28:37 [backends.py:390] Compiling a graph for compile range (1, 2048) takes 24.11 s
(EngineCore pid=12103) INFO 04-17 14:28:41 [decorators.py:640] saved AOT compiled function to /root/.cache/vllm/torch_compile_cache/torch_aot_compile/d173ae923688602925c8eea81edef979ba093c96f1721ef4b4e7b1eff5b17f9b/rank_0_0/model
(EngineCore pid=12103) INFO 04-17 14:28:41 [monitor.py:48] torch.compile took 43.47 s in total
(EngineCore pid=12103) INFO 04-17 14:28:43 [monitor.py:76] Initial profiling/warmup run took 1.02 s
(EngineCore pid=12103) INFO 04-17 14:28:43 [kv_cache_utils.py:829] Overriding num_gpu_blocks=0 with num_gpu_blocks_override=512
(EngineCore pid=12103) INFO 04-17 14:28:43 [gpu_model_runner.py:5876] Profiling CUDA graph memory: PIECEWISE=51 (largest=512), FULL=35 (largest=256)
(EngineCore pid=12103) INFO 04-17 14:28:47 [gpu_model_runner.py:5955] Estimated CUDA graph memory: 0.86 GiB total
(EngineCore pid=12103) INFO 04-17 14:28:48 [gpu_worker.py:436] Available KV cache memory: 15.18 GiB
(EngineCore pid=12103) INFO 04-17 14:28:48 [gpu_worker.py:470] In v0.19, CUDA graph memory profiling will be enabled by default (VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1), which more accurately accounts for CUDA graph memory during KV cache allocation. To try it now, set VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1 and increase --gpu-memory-utilization from 0.9500 to 0.9608 to maintain the same effective KV cache size.
(EngineCore pid=12103) INFO 04-17 14:28:48 [kv_cache_utils.py:1319] GPU KV cache size: 16,576 tokens
(EngineCore pid=12103) INFO 04-17 14:28:48 [kv_cache_utils.py:1324] Maximum concurrency for 131,072 tokens per request: 1.23x
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 51/51 [00:06<00:00, 7.97it/s]
Capturing CUDA graphs (decode, FULL): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 35/35 [00:06<00:00, 5.79it/s]
(EngineCore pid=12103) INFO 04-17 14:29:01 [gpu_model_runner.py:6046] Graph capturing finished in 13 secs, took 0.84 GiB
(EngineCore pid=12103) INFO 04-17 14:29:01 [gpu_worker.py:597] CUDA graph pool memory: 0.84 GiB (actual), 0.86 GiB (estimated), difference: 0.02 GiB (2.6%).
(EngineCore pid=12103) INFO 04-17 14:29:01 [core.py:283] init engine (profile, create kv cache, warmup model) took 79.33 seconds
(APIServer pid=12007) INFO 04-17 14:29:02 [api_server.py:590] Supported tasks: ['generate']
(APIServer pid=12007) INFO 04-17 14:29:02 [parser_manager.py:202] "auto" tool choice has been enabled.
(APIServer pid=12007) WARNING 04-17 14:29:02 [model.py:1435] Default vLLM sampling parameters have been overridden by the model's `generation_config.json`: `{'temperature': 1.0, 'top_k': 64, 'top_p': 0.95}`. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=12007) INFO 04-17 14:29:02 [hf.py:314] Detected the chat template content format to be 'openai'. You can set `--chat-template-content-format` to override this.
(APIServer pid=12007) INFO 04-17 14:29:17 [base.py:231] Multi-modal warmup completed in 15.017s
(APIServer pid=12007) INFO 04-17 14:29:18 [api_server.py:594] Starting vLLM server on http://0.0.0.0:10000
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:37] Available routes are:
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /openapi.json, Methods: GET, HEAD
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /docs, Methods: GET, HEAD
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /docs/oauth2-redirect, Methods: GET, HEAD
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /redoc, Methods: GET, HEAD
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /tokenize, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /detokenize, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /load, Methods: GET
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /version, Methods: GET
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /health, Methods: GET
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /metrics, Methods: GET
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /v1/models, Methods: GET
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /ping, Methods: GET
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /ping, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /invocations, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /v1/chat/completions, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /v1/chat/completions/batch, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /v1/responses, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /v1/responses/{response_id}, Methods: GET
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /v1/responses/{response_id}/cancel, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /v1/completions, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /v1/messages, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /v1/messages/count_tokens, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /inference/v1/generate, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /scale_elastic_ep, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /is_scaling_elastic_ep, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /v1/chat/completions/render, Methods: POST
(APIServer pid=12007) INFO 04-17 14:29:18 [launcher.py:46] Route: /v1/completions/render, Methods: POST
(APIServer pid=12007) INFO: Started server process [12007]
(APIServer pid=12007) INFO: Waiting for application startup.
(APIServer pid=12007) INFO: Application startup complete.
(APIServer pid=12007) INFO 04-17 14:29:38 [loggers.py:259] Engine 000: Avg prompt throughput: 2.1 tokens/s, Avg generation throughput: 9.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.7%, Prefix cache hit rate: 0.0%
(APIServer pid=12007) INFO 04-17 14:29:48 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 23.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 1.9%, Prefix cache hit rate: 0.0%
(APIServer pid=12007) INFO 04-17 14:29:58 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 3.3%, Prefix cache hit rate: 0.0%
(APIServer pid=12007) INFO 04-17 14:30:08 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 22.9 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 4.5%, Prefix cache hit rate: 0.0%
(APIServer pid=12007) INFO: 127.0.0.1:42834 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(APIServer pid=12007) INFO 04-17 14:30:18 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.5 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
(APIServer pid=12007) INFO 04-17 14:30:28 [loggers.py:259] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%