{"id":10234,"date":"2025-12-27T17:12:54","date_gmt":"2025-12-27T16:12:54","guid":{"rendered":"https:\/\/myoceane.fr\/?p=10234"},"modified":"2025-12-27T17:13:33","modified_gmt":"2025-12-27T16:13:33","slug":"llm-spark-local-vllm-server","status":"publish","type":"post","link":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/","title":{"rendered":"[LLM] Spark + Local vLLM Server"},"content":{"rendered":"<div id=\"fb-root\"><\/div>\n\n<p style=\"text-align: justify;\">A while ago Nvidia shared this blog post, <a href=\"https:\/\/developer.nvidia.com\/blog\/accelerate-deep-learning-and-llm-inference-with-apache-spark-in-the-cloud\/\">Accelerate Deep Learning and LLM Inference with Apache Spark in the Cloud<\/a>, which got me thinking about combining Spark with Deep Learning\/LLM workloads. Building on some earlier hands-on experience with vLLM, this post records the various pitfalls I ran into while using Spark + a local vLLM server to speed up batch inference.<\/p>\n\n\n\n<p style=\"text-align: justify;\">First, vLLM has to be installed in the cloud Docker image. For installing vllm, see the <a href=\"https:\/\/docs.vllm.ai\/en\/latest\/getting_started\/installation\/gpu\/#build-wheel-from-source\">installation guide<\/a>; it essentially comes down to running the following commands:<\/p>\n<pre class=\"lang:bash\">git clone https:\/\/github.com\/vllm-project\/vllm.git\ncd vllm\nVLLM_USE_PRECOMPILED=1 pip3 install --editable .<\/pre>\n<p>Note: the Python version that pyspark uses across the whole Spark cluster must match the one vLLM runs under, so installing with uv is not recommended.<\/p>\n\n\n\n<h4>Running the Spark Rapids PySpark
 Script<\/h4>\n<p>You can run the <a href=\"https:\/\/github.com\/NVIDIA\/spark-rapids-examples\/tree\/main\/examples\/ML%2BDL-Examples\/Spark-DL\/dl_inference\/vllm\">script provided by Spark Rapids<\/a> to bring a vLLM server up on every machine in the Spark cluster.<\/p>\n\n\n\n<p style=\"text-align: justify;\">The error below occurs because, with spark.dynamicAllocation enabled, there are no executors at startup, which leaves num_executors at zero. The following two Spark conf settings are therefore required:<\/p>\n<pre class=\"lang:bash\">spark.dynamicAllocation.enabled false\nspark.executor.instances 1<\/pre>\n<p style=\"text-align: justify;\">This pins at least one executor; dynamicAllocation cannot be used here.<\/p>\n<pre class=\"lang:bash\">---------------------------------------------------------------------------\nZeroDivisionError                         Traceback (most recent call last)\nCell In[7], line 1\n----&gt; 1 server_manager.start_servers(tensor_parallel_size=tensor_parallel_size,\n      2                              gpu_memory_utilization=0.95,\n      3                              max_model_len=6600,\n      4                              #task=\"generate\",\n      5                              wait_retries=100)\n\nFile \/mnt\/spark-a9c7f16d-f4e5-47d3-8c97-7221d7a3bc2d\/userFiles-96583b4a-6922-45ae-a333-0f3a55dc9b8f\/server_utils.py:566, in VLLMServerManager.start_servers(self, wait_retries, wait_timeout, **kwargs)\n    551 \"\"\"\n    552 Start vLLM OpenAI-compatible servers across the cluster.\n    553 \n   (...)\n    562     Dictionary of hostname -&gt; (server PID, [port])\n    563 \"\"\"\n    564 self._validate_vllm_kwargs(kwargs)\n--&gt; 566 return super().start_servers(\n    567     
start_server_fn=_start_vllm_server_task,\n    568     wait_retries=wait_retries,\n    569     wait_timeout=wait_timeout,\n    570     **kwargs,\n    571 )\n\nFile \/mnt\/spark-a9c7f16d-f4e5-47d3-8c97-7221d7a3bc2d\/userFiles-96583b4a-6922-45ae-a333-0f3a55dc9b8f\/server_utils.py:373, in ServerManager.start_servers(self, start_server_fn, wait_retries, wait_timeout, **kwargs)\n    354 def start_servers(\n    355     self,\n    356     start_server_fn: Callable,\n   (...)\n    359     **kwargs,\n    360 ) -&gt; Dict[str, Tuple[int, List[int]]]:\n    361     \"\"\"\n    362     Start servers across the cluster.\n    363 \n   (...)\n    371         Dictionary of hostname -&gt; (server PID, [ports])\n    372     \"\"\"\n--&gt; 373     node_rdd = self._get_node_rdd()\n    374     model_name = self.model_name\n    375     model_path = self.model_path\n\nFile \/mnt\/spark-a9c7f16d-f4e5-47d3-8c97-7221d7a3bc2d\/userFiles-96583b4a-6922-45ae-a333-0f3a55dc9b8f\/server_utils.py:313, in ServerManager._get_node_rdd(self)\n    311 \"\"\"Create and configure RDD with stage-level scheduling for 1 task per executor.\"\"\"\n    312 sc = self.spark.sparkContext\n--&gt; 313 node_rdd = sc.parallelize(list(range(self.num_executors)), self.num_executors)\n    314 return self._use_stage_level_scheduling(node_rdd)\n\nFile \/usr\/local\/lib\/python3.10\/dist-packages\/pyspark\/context.py:812, in SparkContext.parallelize(self, c, numSlices)\n    809 if \"__len__\" not in dir(c):\n    810     c = list(c)  # Make it a list so we can compute its length\n    811 batchSize = max(\n--&gt; 812     1, min(len(c) \/\/ numSlices, self._batchSize or 1024)  # type: ignore[arg-type]\n    813 )\n    814 serializer = BatchedSerializer(self._unbatched_serializer, batchSize)\n    816 def reader_func(temp_filename: str) -&gt; JavaObject:\n\nZeroDivisionError: integer division or modulo by zero<\/pre>\n<p>Also see the <a href=\"https:\/\/github.com\/NVIDIA\/spark-rapids-examples\/blob\/main\/examples\/ML%2BDL-Examples\/Spark-DL\/dl_inference\/vllm\/qwen-2.5-14b-tensor-parallel_vllm.ipynb\">example notebook<\/a>, which requires setting two more Spark configurations:<\/p>\n<pre class=\"lang:bash\">spark.sql.execution.arrow.pyspark.enabled true\nspark.python.worker.reuse true<\/pre>\n<p style=\"text-align: justify;\">Another runtime error then occurred, shown below. It looks like vLLM itself failed to start; you have to check the executor logs to find out what the underlying problem is.<\/p>\n<pre class=\"lang:bash\">2025-12-14 16:19:03,639 - INFO - Requesting stage-level resources: (cores=24, gpu=1.0)\n2025-12-14 16:19:03,641 - INFO - Starting 1 VLLM servers.\n16:19:28.949 WARN  TaskSetManager - Lost task 0.0 in stage 2.0 (TID 2) (10.0.0.100 executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):\n  File \"\/home\/spark-current\/python\/lib\/pyspark.zip\/pyspark\/worker.py\", line 1247, in main\n    process()\n  File \"\/home\/spark-current\/python\/lib\/pyspark.zip\/pyspark\/worker.py\", line 1237, in process\n    out_iter = func(split_index, iterator)\n  File \"\/usr\/local\/lib\/python3.10\/dist-packages\/pyspark\/rdd.py\", line 5342, in func\n    return f(iterator)\n  File \"\/mnt\/spark-a3a762f1-3585-4202-b24f-68af898fe4d2\/userFiles-4048e0dd-5462-4ead-a31e-4fe2484d3178\/server_utils.py\", line 393, in &lt;lambda&gt;\n    .mapPartitions(lambda _: start_server_fn(**start_args))\n  File \"\/mnt\/spark-current\/work\/app-20251214161818-0015\/0\/server_utils.py\", line 220, in _start_vllm_server_task\n    raise TimeoutError(\nTimeoutError: Failure: vLLM server startup failed or timed out. 
Check the executor logs for more info.\n\n\tat org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:572)\n\tat org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:784)\n\tat org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:766)\n\tat org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:525)\n\tat org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)\n\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)\n\tat scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)\n\tat scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)\n\tat scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)\n\tat scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)\n\tat scala.collection.TraversableOnce.to(TraversableOnce.scala:366)\n\tat scala.collection.TraversableOnce.to$(TraversableOnce.scala:364)\n\tat org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)\n\tat scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:358)\n\tat scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:358)\n\tat org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)\n\tat scala.collection.TraversableOnce.toArray(TraversableOnce.scala:345)\n\tat scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:339)\n\tat org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)\n\tat org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1049)\n\tat org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2433)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:93)\n\tat 
org.apache.spark.TaskContext.runTaskWithListeners(TaskContext.scala:166)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:141)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$4(Executor.scala:620)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally(SparkErrorUtils.scala:64)\n\tat org.apache.spark.util.SparkErrorUtils.tryWithSafeFinally$(SparkErrorUtils.scala:61)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:94)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:623)\n\tat java.base\/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)\n\tat java.base\/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)\n\tat java.base\/java.lang.Thread.run(Thread.java:840)<\/pre>\n\n\n\n<p style=\"text-align: justify;\">The executor log showed the failure was caused by insufficient memory. After adjusting the memory ratio, vLLM started successfully; in the stderr log below you can see the Qwen model coming up.<\/p>\n<pre class=\"lang:bash\">21:11:48.122 INFO  SpillFramework - Initialized SpillFramework. 
Host spill store max size is: 3221225472 B.\n21:11:48.127 INFO  AwsStorageExecutorPlugin - Initializing S3 Plugin on the Executor 0\n21:11:48.139 INFO  ExecutorPluginContainer - Initialized executor component for plugin com.nvidia.spark.SQLPlugin.\n21:11:48.170 INFO  CoarseGrainedExecutorBackend - Got assigned task 0\n21:11:48.177 INFO  Executor - Running task 0.0 in stage 0.0 (TID 0)\n21:11:48.190 INFO  Executor - Fetching spark:\/\/df47c4ef02f441a691ec50ef8a32a916000000.internal.cloudapp.net:35733\/files\/server_utils.py with timestamp 1765746697379\n21:11:48.198 INFO  Utils - Fetching spark:\/\/df47c4ef02f441a691ec50ef8a32a916000000.internal.cloudapp.net:35733\/files\/server_utils.py to \/mnt\/spark-bb1e5efc-b0a5-48f2-b12f-74c52c675d98\/executor-889e4410-90bd-4ccb-91e4-0369221f9874\/spark-43ccec49-e90b-4b65-98f6-2d4e1366de24\/fetchFileTemp1858766223110752632.tmp\n21:11:48.200 INFO  Utils - Copying \/mnt\/spark-bb1e5efc-b0a5-48f2-b12f-74c52c675d98\/executor-889e4410-90bd-4ccb-91e4-0369221f9874\/spark-43ccec49-e90b-4b65-98f6-2d4e1366de24\/21140649291765746697379_cache to \/mnt\/spark-current\/work\/app-20251214211135-0004\/0\/.\/server_utils.py\n21:11:48.411 INFO  SparkResourceAdaptor - startDedicatedTaskThread: threadId: 125108656997952, task id: 0\n21:11:48.419 INFO  TorrentBroadcast - Started reading broadcast variable 0 with 1 pieces (estimated total size 4.0 MiB)\n21:11:48.446 INFO  TransportClientFactory - Successfully created connection to df47c4ef02f441a691ec50ef8a32a916000000.internal.cloudapp.net\/10.0.0.100:37915 after 1 ms (0 ms spent in bootstraps)\n21:11:48.475 INFO  MemoryStore - Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.9 KiB, free 119.8 GiB)\n21:11:48.482 INFO  TorrentBroadcast - Reading broadcast variable 0 took 62 ms\n21:11:48.508 INFO  MemoryStore - Block broadcast_0 stored as values in memory (estimated size 6.1 KiB, free 119.8 GiB)\n21:11:54.544 INFO  PythonRunner - Times: total = 5960, boot = 416, init = 150, 
finish = 5394\n21:11:54.554 INFO  Executor - Finished task 0.0 in stage 0.0 (TID 0). 5605 bytes result sent to driver\n21:11:54.669 INFO  CoarseGrainedExecutorBackend - Got assigned task 1\n21:11:54.669 INFO  Executor - Running task 0.0 in stage 1.0 (TID 1)\n21:11:54.675 INFO  SparkResourceAdaptor - startDedicatedTaskThread: threadId: 125108656997952, task id: 1\n21:11:54.676 INFO  TorrentBroadcast - Started reading broadcast variable 1 with 1 pieces (estimated total size 4.0 MiB)\n21:11:54.680 INFO  MemoryStore - Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.1 KiB, free 119.8 GiB)\n21:11:54.682 INFO  TorrentBroadcast - Reading broadcast variable 1 took 6 ms\n21:11:54.683 INFO  MemoryStore - Block broadcast_1 stored as values in memory (estimated size 6.3 KiB, free 119.8 GiB)\n2025-12-14 21:11:54,885 - INFO - Starting vLLM server with command: \/usr\/bin\/python3.10 -m vllm.entrypoints.openai.api_server --model \/mnt\/models --served-model-name qwen-2.5-14b --port 7000 --tensor_parallel_size 1 --gpu_memory_utilization 0.5 --max_model_len 6600\n(APIServer pid=12109) INFO 12-14 21:12:00 [api_server.py:1351] vLLM API server version 0.13.0rc2.dev139+g9ccbf6b69\n(APIServer pid=12109) INFO 12-14 21:12:00 [utils.py:253] non-default args: {'port': 7000, 'model': '\/mnt\/models', 'max_model_len': 6600, 'served_model_name': ['qwen-2.5-14b'], 'gpu_memory_utilization': 0.5}\n(APIServer pid=12109) INFO 12-14 21:12:00 [model.py:514] Resolved architecture: Qwen2ForCausalLM\n(APIServer pid=12109) INFO 12-14 21:12:00 [model.py:1652] Using max model len 6600\n(APIServer pid=12109) INFO 12-14 21:12:01 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:07 [core.py:93] Initializing a V1 LLM engine (v0.13.0rc2.dev139+g9ccbf6b69) with config: model='\/mnt\/models', speculative_config=None, tokenizer='\/mnt\/models', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, 
tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=6600, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01, cudagraph_metrics=False, enable_layerwise_nvtx_tracing=False), seed=0, served_model_name=qwen-2.5-14b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': &lt;CompilationMode.VLLM_COMPILE: 3&gt;, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'compile_ranges_split_points': [2048], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': &lt;CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)&gt;, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 
256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 512, 'dynamic_shapes_config': {'type': &lt;DynamicShapesType.BACKED: 'backed'&gt;, 'evaluate_guards': False}, 'local_cache_dir': None}\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:08 [parallel_state.py:1203] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp:\/\/10.0.0.100:37021 backend=nccl\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:08 [parallel_state.py:1411] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:09 [gpu_model_runner.py:3562] Starting to load model \/mnt\/models...\n(EngineCore_DP0 pid=12131) \/usr\/local\/lib\/python3.10\/dist-packages\/tvm_ffi\/_optional_torch_c_dlpack.py:174: UserWarning: Failed to JIT torch c dlpack extension, EnvTensorAllocator will not be enabled.\n(EngineCore_DP0 pid=12131) We recommend installing via `pip install torch-c-dlpack-ext`\n(EngineCore_DP0 pid=12131)   warnings.warn(\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:10 [cuda.py:412] Using FLASH_ATTN attention backend out of potential backends: ('FLASH_ATTN', 'FLASHINFER', 'TRITON_ATTN', 'FLEX_ATTENTION')\nLoading safetensors checkpoint shards:   0% Completed | 0\/8 [00:00&lt;?, ?it\/s]\nLoading safetensors checkpoint shards:  12% Completed | 1\/8 [00:00&lt;00:02,  2.49it\/s]\nLoading safetensors checkpoint shards:  25% Completed | 2\/8 [00:00&lt;00:02,  2.16it\/s]\nLoading safetensors checkpoint shards:  38% Completed | 3\/8 [00:01&lt;00:02,  2.07it\/s]\nLoading safetensors checkpoint shards:  50% Completed | 4\/8 [00:01&lt;00:01,  2.03it\/s]\nLoading safetensors 
checkpoint shards:  62% Completed | 5\/8 [00:02&lt;00:01,  2.02it\/s]\nLoading safetensors checkpoint shards:  75% Completed | 6\/8 [00:02&lt;00:00,  2.01it\/s]\nLoading safetensors checkpoint shards:  88% Completed | 7\/8 [00:03&lt;00:00,  2.00it\/s]\nLoading safetensors checkpoint shards: 100% Completed | 8\/8 [00:03&lt;00:00,  2.34it\/s]\nLoading safetensors checkpoint shards: 100% Completed | 8\/8 [00:03&lt;00:00,  2.16it\/s]\n(EngineCore_DP0 pid=12131) \n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:14 [default_loader.py:308] Loading weights took 3.75 seconds\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:15 [gpu_model_runner.py:3659] Model loading took 27.5681 GiB memory and 5.643590 seconds\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:22 [backends.py:643] Using cache directory: \/root\/.cache\/vllm\/torch_compile_cache\/32b5011098\/rank_0_0\/backbone for vLLM's torch.compile\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:22 [backends.py:703] Dynamo bytecode transform time: 7.37 s\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:29 [backends.py:261] Cache the graph of compile range (1, 2048) for later use\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:40 [backends.py:278] Compiling a graph for compile range (1, 2048) takes 13.15 s\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:40 [monitor.py:34] torch.compile takes 20.51 s in total\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:41 [gpu_worker.py:375] Available KV cache memory: 10.48 GiB\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:41 [kv_cache_utils.py:1291] GPU KV cache size: 57,216 tokens\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:41 [kv_cache_utils.py:1296] Maximum concurrency for 6,600 tokens per request: 8.66x\nCapturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 51\/51 [00:03&lt;00:00, 14.71it\/s]\nCapturing CUDA graphs (decode, FULL): 100%|\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588\u2588| 35\/35 [00:01&lt;00:00, 
18.59it\/s]\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:47 [gpu_model_runner.py:4610] Graph capturing finished in 6 secs, took 3.05 GiB\n(EngineCore_DP0 pid=12131) INFO 12-14 21:12:47 [core.py:259] init engine (profile, create kv cache, warmup model) took 32.31 seconds\n(APIServer pid=12109) INFO 12-14 21:12:48 [api_server.py:1099] Supported tasks: ['generate']\n(APIServer pid=12109) WARNING 12-14 21:12:48 [model.py:1478] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.\n(APIServer pid=12109) INFO 12-14 21:12:48 [serving_responses.py:201] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}\n(APIServer pid=12109) INFO 12-14 21:12:48 [serving_chat.py:137] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}\n(APIServer pid=12109) INFO 12-14 21:12:48 [serving_completion.py:77] Using default completion sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}\n(APIServer pid=12109) INFO 12-14 21:12:48 [serving_chat.py:137] Using default chat sampling params from model: {'repetition_penalty': 1.05, 'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}\n(APIServer pid=12109) INFO 12-14 21:12:48 [api_server.py:1425] Starting vLLM API server 0 on http:\/\/0.0.0.0:7000\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:38] Available routes are:\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/openapi.json, Methods: GET, HEAD\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/docs, Methods: GET, HEAD\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/docs\/oauth2-redirect, Methods: GET, HEAD\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/redoc, 
Methods: GET, HEAD\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/scale_elastic_ep, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/is_scaling_elastic_ep, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/tokenize, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/detokenize, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/inference\/v1\/generate, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/pause, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/resume, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/is_paused, Methods: GET\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/metrics, Methods: GET\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/health, Methods: GET\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/load, Methods: GET\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/models, Methods: GET\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/version, Methods: GET\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/responses, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/responses\/{response_id}, Methods: GET\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/responses\/{response_id}\/cancel, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/messages, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/chat\/completions, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/completions, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/audio\/transcriptions, Methods: 
POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/audio\/translations, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/ping, Methods: GET\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/ping, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/invocations, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/classify, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/embeddings, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/score, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/score, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/rerank, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v1\/rerank, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/v2\/rerank, Methods: POST\n(APIServer pid=12109) INFO 12-14 21:12:48 [launcher.py:46] Route: \/pooling, Methods: POST\n(APIServer pid=12109) INFO:     Started server process [12109]\n(APIServer pid=12109) INFO:     Waiting for application startup.\n(APIServer pid=12109) INFO:     Application startup complete.\n(APIServer pid=12109) INFO:     127.0.0.1:47994 - \"GET \/health HTTP\/1.1\" 200 OK\n21:12:49.954 INFO  BarrierTaskContext - Task 1 from Stage 1(Attempt 0) has entered the global sync, current barrier epoch is 0.\n21:12:50.958 INFO  BarrierTaskContext - Task 1 from Stage 1(Attempt 0) finished global sync successfully, waited for 1 seconds, current barrier epoch is 1.\n21:12:50.958 INFO  PythonRunner - Times: total = 56273, boot = 3, init = 146, finish = 56124\n21:12:50.961 INFO  Executor - Finished task 0.0 in stage 1.0 (TID 1). 
1330 bytes result sent to driver\n21:13:07.898 INFO  CoarseGrainedExecutorBackend - Got assigned task 2\n21:13:07.899 INFO  Executor - Running task 0.0 in stage 2.0 (TID 2)\n21:13:07.901 INFO  SparkResourceAdaptor - startDedicatedTaskThread: threadId: 125108656997952, task id: 2\n21:13:07.902 INFO  TorrentBroadcast - Started reading broadcast variable 2 with 1 pieces (estimated total size 4.0 MiB)\n21:13:07.906 INFO  MemoryStore - Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.9 KiB, free 119.8 GiB)\n21:13:07.908 INFO  TorrentBroadcast - Reading broadcast variable 2 took 6 ms\n21:13:07.909 INFO  MemoryStore - Block broadcast_2 stored as values in memory (estimated size 6.1 KiB, free 119.8 GiB)\n21:13:13.435 INFO  PythonRunner - Times: total = 5523, boot = 3, init = 140, finish = 5380\n21:13:13.436 INFO  Executor - Finished task 0.0 in stage 2.0 (TID 2). 5519 bytes result sent to driver\n21:13:13.508 INFO  CoarseGrainedExecutorBackend - Got assigned task 3\n21:13:13.509 INFO  Executor - Running task 0.0 in stage 3.0 (TID 3)\n21:13:13.512 INFO  SparkResourceAdaptor - startDedicatedTaskThread: threadId: 125108656997952, task id: 3\n21:13:13.513 INFO  TorrentBroadcast - Started reading broadcast variable 3 with 1 pieces (estimated total size 4.0 MiB)\n21:13:13.517 INFO  MemoryStore - Block broadcast_3_piece0 stored as bytes in memory (estimated size 4.1 KiB, free 119.8 GiB)\n21:13:13.519 INFO  TorrentBroadcast - Reading broadcast variable 3 took 5 ms\n21:13:13.520 INFO  MemoryStore - Block broadcast_3 stored as values in memory (estimated size 6.3 KiB, free 119.8 GiB)\n2025-12-14 21:13:13,717 - INFO - Starting vLLM server with command: \/usr\/bin\/python3.10 -m vllm.entrypoints.openai.api_server --model \/azure\/models --served-model-name qwen-2.5-14b --port 7001 --tensor_parallel_size 1 --gpu_memory_utilization 0.5 --max_model_len 6600\n(APIServer pid=12548) INFO 12-14 21:13:19 [api_server.py:1351] vLLM API server version 
0.13.0rc2.dev139+g9ccbf6b69\n(APIServer pid=12548) INFO 12-14 21:13:19 [utils.py:253] non-default args: {'port': 7001, 'model': '\/azure\/models', 'max_model_len': 6600, 'served_model_name': ['qwen-2.5-14b'], 'gpu_memory_utilization': 0.5}<\/pre>\n\n\n\n<p style=\"text-align: justify;\">After switching to the openai\/gpt-oss-20b model, a different problem appeared. The root cause was that we had installed vllm by building the wheel from source, which produced a C++ binary mismatch; after reinstalling vllm from the pre-built wheels instead, the problem did not recur. GPT-OSS-20B uses Mixture-of-Experts (MoE), so it is somewhat harder to deploy than Qwen.<\/p>\n<pre class=\"lang:bash\">12:40:32.462 INFO  MemoryStore - Block broadcast_2 stored as values in memory (estimated size 6.3 KiB, free 119.8 GiB)\n2025-12-16 12:40:32,655 - INFO - Starting vLLM server with command: \/usr\/bin\/python3.10 -m vllm.entrypoints.openai.api_server --model \/mnt\/gpt-oss-20b --served-model-name openai\/gpt-oss-20b --port 7000 --tensor_parallel_size 1 --gpu_memory_utilization 0.65 --max_model_len 6600 --task generate\nWARNING 12-16 12:40:38 [argparse_utils.py:82] argument 'task' is deprecated\n(APIServer pid=14588) INFO 12-16 12:40:38 [api_server.py:1772] vLLM API server version 0.12.0\n(APIServer pid=14588) INFO 12-16 12:40:38 [utils.py:253] non-default args: {'port': 7000, 'model': '\/mnt\/gpt-oss-20b', 'task': 'generate', 'max_model_len': 6600, 'served_model_name': ['openai\/gpt-oss-20b'], 'gpu_memory_utilization': 0.65}\n(APIServer pid=14588) INFO 12-16 12:40:38 [model.py:637] Resolved architecture: 
GptOssForCausalLM\n(APIServer pid=14588) ERROR 12-16 12:40:38 [repo_utils.py:65] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace\/repo_name': '\/mnt\/gpt-oss-20b'. Use `repo_type` argument if needed., retrying 1 of 2\n(APIServer pid=14588) ERROR 12-16 12:40:40 [repo_utils.py:63] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace\/repo_name': '\/mnt\/gpt-oss-20b'. Use `repo_type` argument if needed.\n(APIServer pid=14588) INFO 12-16 12:40:40 [model.py:2086] Downcasting torch.float32 to torch.bfloat16.\n(APIServer pid=14588) INFO 12-16 12:40:40 [model.py:1750] Using max model len 6600\n(APIServer pid=14588) INFO 12-16 12:40:41 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.\n(APIServer pid=14588) INFO 12-16 12:40:41 [config.py:274] Overriding max cuda graph capture size to 1024 for performance.\n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:48 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='\/mnt\/gpt-oss-20b', speculative_config=None, tokenizer='\/mnt\/gpt-oss-20b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=6600, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=openai\/gpt-oss-20b, enable_prefix_caching=True, 
enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': &lt;CompilationMode.VLLM_COMPILE: 3&gt;, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': &lt;CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)&gt;, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 1024, 'dynamic_shapes_config': {'type': &lt;DynamicShapesType.BACKED: 'backed'&gt;}, 'local_cache_dir': None}\n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:48 [parallel_state.py:1200] world_size=1 rank=0 local_rank=0 distributed_init_method=tcp:\/\/10.0.0.100:55309 backend=nccl\n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:48 
[parallel_state.py:1408] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, PCP rank 0, TP rank 0, EP rank 0\n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:49 [gpu_model_runner.py:3467] Starting to load model \/mnt\/gpt-oss-20b...\n(EngineCore_DP0 pid=14609) \/usr\/local\/lib\/python3.10\/dist-packages\/tvm_ffi\/_optional_torch_c_dlpack.py:174: UserWarning: Failed to JIT torch c dlpack extension, EnvTensorAllocator will not be enabled.\n(EngineCore_DP0 pid=14609) We recommend installing via `pip install torch-c-dlpack-ext`\n(EngineCore_DP0 pid=14609)   warnings.warn(\n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:51 [cuda.py:411] Using TRITON_ATTN attention backend out of potential backends: ['TRITON_ATTN']\n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:51 [layer.py:379] Enabled separate cuda stream for MoE shared_experts\n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:51 [mxfp4.py:162] Using Marlin backend\nLoading safetensors checkpoint shards:   0% Completed | 0\/3 [00:00&lt;?, ?it\/s]\nLoading safetensors checkpoint shards:  33% Completed | 1\/3 [00:00&lt;00:01,  1.94it\/s]\nLoading safetensors checkpoint shards:  67% Completed | 2\/3 [00:01&lt;00:00,  1.71it\/s]\nLoading safetensors checkpoint shards: 100% Completed | 3\/3 [00:01&lt;00:00,  1.75it\/s]\nLoading safetensors checkpoint shards: 100% Completed | 3\/3 [00:01&lt;00:00,  1.76it\/s]\n(EngineCore_DP0 pid=14609) \n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:53 [default_loader.py:308] Loading weights took 1.82 seconds\n(EngineCore_DP0 pid=14609) WARNING 12-16 12:40:53 [marlin_utils_fp4.py:226] Your GPU does not have native support for FP4 computation but FP4 quantization is being used. Weight-only FP4 compression will be used leveraging the Marlin kernel. 
This may degrade performance for compute-heavy workloads.\n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:53 [gpu_model_runner.py:3549] Model loading took 13.7194 GiB memory and 4.125383 seconds\n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:58 [backends.py:655] Using cache directory: \/root\/.cache\/vllm\/torch_compile_cache\/76ebb93956\/rank_0_0\/backbone for vLLM's torch.compile\n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:58 [backends.py:715] Dynamo bytecode transform time: 3.96 s\n(EngineCore_DP0 pid=14609) INFO 12-16 12:40:58 [backends.py:257] Cache the graph for dynamic shape for later use\n(EngineCore_DP0 pid=14609) INFO 12-16 12:41:01 [backends.py:288] Compiling a graph for dynamic shape takes 3.01 s\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843] EngineCore failed to start.\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843] Traceback (most recent call last):\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/v1\/engine\/core.py\", line 834, in run_engine_core\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     engine_core = EngineCoreProc(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/v1\/engine\/core.py\", line 610, in __init__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     super().__init__(\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/v1\/engine\/core.py\", line 109, in __init__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/v1\/engine\/core.py\", line 235, in _initialize_kv_caches\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     available_gpu_memory = self.model_executor.determine_available_memory()\n(EngineCore_DP0 pid=14609) ERROR 12-16 
12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/v1\/executor\/abstract.py\", line 126, in determine_available_memory\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self.collective_rpc(\"determine_available_memory\")\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/v1\/executor\/uniproc_executor.py\", line 75, in collective_rpc\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     result = run_method(self.driver_worker, method, args, kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/v1\/serial_utils.py\", line 479, in run_method\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return func(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/utils\/_contextlib.py\", line 120, in decorate_context\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return func(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/v1\/worker\/gpu_worker.py\", line 324, in determine_available_memory\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     self.model_runner.profile_run()\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/v1\/worker\/gpu_model_runner.py\", line 4357, in profile_run\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     hidden_states, last_hidden_states = self._dummy_run(\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/utils\/_contextlib.py\", line 120, in decorate_context\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return func(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/v1\/worker\/gpu_model_runner.py\", line 4071, in 
_dummy_run\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     outputs = self.model(\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/compilation\/cuda_graph.py\", line 126, in __call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self.runnable(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/nn\/modules\/module.py\", line 1775, in _wrapped_call_impl\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self._call_impl(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/nn\/modules\/module.py\", line 1786, in _call_impl\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return forward_call(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/model_executor\/models\/gpt_oss.py\", line 723, in forward\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self.model(input_ids, positions, intermediate_tensors, inputs_embeds)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/compilation\/decorators.py\", line 514, in __call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     output = TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/compilation\/wrapper.py\", line 171, in __call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self._compiled_callable(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_dynamo\/eval_frame.py\", line 832, in compile_wrapper\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   
  return fn(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/model_executor\/models\/gpt_oss.py\", line 277, in forward\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     def forward(\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_dynamo\/eval_frame.py\", line 1044, in _fn\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return fn(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/compilation\/caching.py\", line 54, in __call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self.optimized_call(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/fx\/graph_module.py\", line 837, in call_wrapped\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self._wrapped_call(self, *args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/fx\/graph_module.py\", line 413, in __call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     raise e\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/fx\/graph_module.py\", line 400, in __call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/nn\/modules\/module.py\", line 1775, in _wrapped_call_impl\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self._call_impl(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File 
\"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/nn\/modules\/module.py\", line 1786, in _call_impl\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return forward_call(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"&lt;eval_with_key&gt;.50\", line 209, in forward\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     submod_2 = self.submod_2(getitem_3, s72, l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_bias_, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, getitem_4, l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_bias_, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_1_modules_attn_modules_qkv_proj_parameters_bias_, l_positions_, l_self_modules_layers_modules_0_modules_attn_modules_rotary_emb_buffers_cos_sin_cache_);  getitem_3 = l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_bias_ = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = getitem_4 = l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_bias_ = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_1_modules_attn_modules_qkv_proj_parameters_bias_ = None\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/compilation\/cuda_graph.py\", line 126, in 
__call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self.runnable(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/compilation\/piecewise_backend.py\", line 93, in __call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self.compiled_graph_for_general_shape(*args)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_inductor\/standalone_compile.py\", line 63, in __call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self._compiled_fn(*args)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_dynamo\/eval_frame.py\", line 1044, in _fn\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return fn(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_functorch\/aot_autograd.py\", line 1130, in forward\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return compiled_fn(full_args)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_functorch\/_aot_autograd\/runtime_wrappers.py\", line 353, in runtime_wrapper\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     all_outs = call_func_at_runtime_with_args(\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_functorch\/_aot_autograd\/utils.py\", line 129, in call_func_at_runtime_with_args\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     out = normalize_as_list(f(args))\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_functorch\/_aot_autograd\/runtime_wrappers.py\", 
line 526, in wrapper\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return compiled_fn(runtime_args)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_inductor\/output_code.py\", line 613, in __call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self.current_callable(inputs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_inductor\/utils.py\", line 2962, in run\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     out = model(new_inputs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/tmp\/torchinductor_token\/3s\/c3scxjen4xvb6yt77aqsailqhvbfyslfwmlffialkjtwqdnicmpz.py\", line 1014, in call\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     buf5 = torch.ops.vllm.moe_forward.default(buf4, buf3, 'model.layers.0.mlp.experts')\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_ops.py\", line 841, in __call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self._op(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/model_executor\/layers\/fused_moe\/layer.py\", line 2101, in moe_forward\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self.forward_impl(hidden_states, router_logits)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/model_executor\/layers\/fused_moe\/layer.py\", line 1960, in forward_impl\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     final_hidden_states = self.quant_method.apply(\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/model_executor\/layers\/quantization\/mxfp4.py\", line 922, in apply\n(EngineCore_DP0 
pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return fused_marlin_moe(\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/model_executor\/layers\/fused_moe\/fused_marlin_moe.py\", line 318, in fused_marlin_moe\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     sorted_token_ids, expert_ids, num_tokens_post_padded = moe_align_block_size(\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/model_executor\/layers\/fused_moe\/moe_align_block_size.py\", line 79, in moe_align_block_size\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     ops.moe_align_block_size(\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/home\/vllm\/vllm\/_custom_ops.py\", line 1881, in moe_align_block_size\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     torch.ops._moe_C.moe_align_block_size(\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_ops.py\", line 1255, in __call__\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843]     return self._op(*args, **kwargs)\n(EngineCore_DP0 pid=14609) ERROR 12-16 12:41:02 [core.py:843] RuntimeError: _moe_C::moe_align_block_size() is missing value for argument 'maybe_expert_map'. Declaration: _moe_C::moe_align_block_size(Tensor topk_ids, int num_experts, int block_size, Tensor($0! -&gt; ) sorted_token_ids, Tensor($1! -&gt; ) experts_ids, Tensor($2! -&gt; ) num_tokens_post_pad, Tensor? 
maybe_expert_map) -&gt; ()\n(EngineCore_DP0 pid=14609) Process EngineCore_DP0:\n(EngineCore_DP0 pid=14609) Traceback (most recent call last):\n(EngineCore_DP0 pid=14609)   File \"\/usr\/lib\/python3.10\/multiprocessing\/process.py\", line 314, in _bootstrap\n(EngineCore_DP0 pid=14609)     self.run()\n(EngineCore_DP0 pid=14609)   File \"\/usr\/lib\/python3.10\/multiprocessing\/process.py\", line 108, in run\n(EngineCore_DP0 pid=14609)     self._target(*self._args, **self._kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/v1\/engine\/core.py\", line 847, in run_engine_core\n(EngineCore_DP0 pid=14609)     raise e\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/v1\/engine\/core.py\", line 834, in run_engine_core\n(EngineCore_DP0 pid=14609)     engine_core = EngineCoreProc(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/v1\/engine\/core.py\", line 610, in __init__\n(EngineCore_DP0 pid=14609)     super().__init__(\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/v1\/engine\/core.py\", line 109, in __init__\n(EngineCore_DP0 pid=14609)     num_gpu_blocks, num_cpu_blocks, kv_cache_config = self._initialize_kv_caches(\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/v1\/engine\/core.py\", line 235, in _initialize_kv_caches\n(EngineCore_DP0 pid=14609)     available_gpu_memory = self.model_executor.determine_available_memory()\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/v1\/executor\/abstract.py\", line 126, in determine_available_memory\n(EngineCore_DP0 pid=14609)     return self.collective_rpc(\"determine_available_memory\")\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/v1\/executor\/uniproc_executor.py\", line 75, in collective_rpc\n(EngineCore_DP0 pid=14609)     result = run_method(self.driver_worker, method, args, kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/v1\/serial_utils.py\", line 479, in run_method\n(EngineCore_DP0 pid=14609)     return func(*args, 
**kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/utils\/_contextlib.py\", line 120, in decorate_context\n(EngineCore_DP0 pid=14609)     return func(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/v1\/worker\/gpu_worker.py\", line 324, in determine_available_memory\n(EngineCore_DP0 pid=14609)     self.model_runner.profile_run()\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/v1\/worker\/gpu_model_runner.py\", line 4357, in profile_run\n(EngineCore_DP0 pid=14609)     hidden_states, last_hidden_states = self._dummy_run(\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/utils\/_contextlib.py\", line 120, in decorate_context\n(EngineCore_DP0 pid=14609)     return func(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/v1\/worker\/gpu_model_runner.py\", line 4071, in _dummy_run\n(EngineCore_DP0 pid=14609)     outputs = self.model(\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/compilation\/cuda_graph.py\", line 126, in __call__\n(EngineCore_DP0 pid=14609)     return self.runnable(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/nn\/modules\/module.py\", line 1775, in _wrapped_call_impl\n(EngineCore_DP0 pid=14609)     return self._call_impl(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/nn\/modules\/module.py\", line 1786, in _call_impl\n(EngineCore_DP0 pid=14609)     return forward_call(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/model_executor\/models\/gpt_oss.py\", line 723, in forward\n(EngineCore_DP0 pid=14609)     return self.model(input_ids, positions, intermediate_tensors, inputs_embeds)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/compilation\/decorators.py\", line 514, in __call__\n(EngineCore_DP0 pid=14609)     output = 
TorchCompileWithNoGuardsWrapper.__call__(self, *args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/compilation\/wrapper.py\", line 171, in __call__\n(EngineCore_DP0 pid=14609)     return self._compiled_callable(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_dynamo\/eval_frame.py\", line 832, in compile_wrapper\n(EngineCore_DP0 pid=14609)     return fn(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/model_executor\/models\/gpt_oss.py\", line 277, in forward\n(EngineCore_DP0 pid=14609)     def forward(\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_dynamo\/eval_frame.py\", line 1044, in _fn\n(EngineCore_DP0 pid=14609)     return fn(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/compilation\/caching.py\", line 54, in __call__\n(EngineCore_DP0 pid=14609)     return self.optimized_call(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/fx\/graph_module.py\", line 837, in call_wrapped\n(EngineCore_DP0 pid=14609)     return self._wrapped_call(self, *args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/fx\/graph_module.py\", line 413, in __call__\n(EngineCore_DP0 pid=14609)     raise e\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/fx\/graph_module.py\", line 400, in __call__\n(EngineCore_DP0 pid=14609)     return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/nn\/modules\/module.py\", line 1775, in _wrapped_call_impl\n(EngineCore_DP0 pid=14609)     return self._call_impl(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/nn\/modules\/module.py\", line 1786, in 
_call_impl\n(EngineCore_DP0 pid=14609)     return forward_call(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"&lt;eval_with_key&gt;.50\", line 209, in forward\n(EngineCore_DP0 pid=14609)     submod_2 = self.submod_2(getitem_3, s72, l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_weight_, l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_bias_, l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_, getitem_4, l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_weight_, l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_bias_, l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_, l_self_modules_layers_modules_1_modules_attn_modules_qkv_proj_parameters_weight_, l_self_modules_layers_modules_1_modules_attn_modules_qkv_proj_parameters_bias_, l_positions_, l_self_modules_layers_modules_0_modules_attn_modules_rotary_emb_buffers_cos_sin_cache_);  getitem_3 = l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_weight_ = l_self_modules_layers_modules_0_modules_attn_modules_o_proj_parameters_bias_ = l_self_modules_layers_modules_0_modules_post_attention_layernorm_parameters_weight_ = getitem_4 = l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_weight_ = l_self_modules_layers_modules_0_modules_mlp_modules_router_parameters_bias_ = l_self_modules_layers_modules_1_modules_input_layernorm_parameters_weight_ = l_self_modules_layers_modules_1_modules_attn_modules_qkv_proj_parameters_weight_ = l_self_modules_layers_modules_1_modules_attn_modules_qkv_proj_parameters_bias_ = None\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/compilation\/cuda_graph.py\", line 126, in __call__\n(EngineCore_DP0 pid=14609)     return self.runnable(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/compilation\/piecewise_backend.py\", line 93, in __call__\n(EngineCore_DP0 pid=14609)     return 
self.compiled_graph_for_general_shape(*args)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_inductor\/standalone_compile.py\", line 63, in __call__\n(EngineCore_DP0 pid=14609)     return self._compiled_fn(*args)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_dynamo\/eval_frame.py\", line 1044, in _fn\n(EngineCore_DP0 pid=14609)     return fn(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_functorch\/aot_autograd.py\", line 1130, in forward\n(EngineCore_DP0 pid=14609)     return compiled_fn(full_args)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_functorch\/_aot_autograd\/runtime_wrappers.py\", line 353, in runtime_wrapper\n(EngineCore_DP0 pid=14609)     all_outs = call_func_at_runtime_with_args(\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_functorch\/_aot_autograd\/utils.py\", line 129, in call_func_at_runtime_with_args\n(EngineCore_DP0 pid=14609)     out = normalize_as_list(f(args))\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_functorch\/_aot_autograd\/runtime_wrappers.py\", line 526, in wrapper\n(EngineCore_DP0 pid=14609)     return compiled_fn(runtime_args)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_inductor\/output_code.py\", line 613, in __call__\n(EngineCore_DP0 pid=14609)     return self.current_callable(inputs)\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_inductor\/utils.py\", line 2962, in run\n(EngineCore_DP0 pid=14609)     out = model(new_inputs)\n(EngineCore_DP0 pid=14609)   File \"\/tmp\/torchinductor_token\/3s\/c3scxjen4xvb6yt77aqsailqhvbfyslfwmlffialkjtwqdnicmpz.py\", line 1014, in call\n(EngineCore_DP0 pid=14609)     buf5 = torch.ops.vllm.moe_forward.default(buf4, buf3, 
'model.layers.0.mlp.experts')\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_ops.py\", line 841, in __call__\n(EngineCore_DP0 pid=14609)     return self._op(*args, **kwargs)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/model_executor\/layers\/fused_moe\/layer.py\", line 2101, in moe_forward\n(EngineCore_DP0 pid=14609)     return self.forward_impl(hidden_states, router_logits)\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/model_executor\/layers\/fused_moe\/layer.py\", line 1960, in forward_impl\n(EngineCore_DP0 pid=14609)     final_hidden_states = self.quant_method.apply(\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/model_executor\/layers\/quantization\/mxfp4.py\", line 922, in apply\n(EngineCore_DP0 pid=14609)     return fused_marlin_moe(\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/model_executor\/layers\/fused_moe\/fused_marlin_moe.py\", line 318, in fused_marlin_moe\n(EngineCore_DP0 pid=14609)     sorted_token_ids, expert_ids, num_tokens_post_padded = moe_align_block_size(\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/model_executor\/layers\/fused_moe\/moe_align_block_size.py\", line 79, in moe_align_block_size\n(EngineCore_DP0 pid=14609)     ops.moe_align_block_size(\n(EngineCore_DP0 pid=14609)   File \"\/home\/vllm\/vllm\/_custom_ops.py\", line 1881, in moe_align_block_size\n(EngineCore_DP0 pid=14609)     torch.ops._moe_C.moe_align_block_size(\n(EngineCore_DP0 pid=14609)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/_ops.py\", line 1255, in __call__\n(EngineCore_DP0 pid=14609)     return self._op(*args, **kwargs)\n(EngineCore_DP0 pid=14609) RuntimeError: _moe_C::moe_align_block_size() is missing value for argument 'maybe_expert_map'. Declaration: _moe_C::moe_align_block_size(Tensor topk_ids, int num_experts, int block_size, Tensor($0! -&gt; ) sorted_token_ids, Tensor($1! -&gt; ) experts_ids, Tensor($2! -&gt; ) num_tokens_post_pad, Tensor? 
maybe_expert_map) -&gt; ()\n[rank0]:[W1216 12:41:02.145493395 ProcessGroupNCCL.cpp:1524] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https:\/\/pytorch.org\/docs\/stable\/distributed.html#shutdown (function operator())\n(APIServer pid=14588) Traceback (most recent call last):\n(APIServer pid=14588)   File \"\/usr\/lib\/python3.10\/runpy.py\", line 196, in _run_module_as_main\n(APIServer pid=14588)     return _run_code(code, main_globals, None,\n(APIServer pid=14588)   File \"\/usr\/lib\/python3.10\/runpy.py\", line 86, in _run_code\n(APIServer pid=14588)     exec(code, run_globals)\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/entrypoints\/openai\/api_server.py\", line 1891, in &lt;module&gt;\n(APIServer pid=14588)     uvloop.run(run_server(args))\n(APIServer pid=14588)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/uvloop\/__init__.py\", line 69, in run\n(APIServer pid=14588)     return loop.run_until_complete(wrapper())\n(APIServer pid=14588)   File \"uvloop\/loop.pyx\", line 1518, in uvloop.loop.Loop.run_until_complete\n(APIServer pid=14588)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/uvloop\/__init__.py\", line 48, in wrapper\n(APIServer pid=14588)     return await main\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/entrypoints\/openai\/api_server.py\", line 1819, in run_server\n(APIServer pid=14588)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/entrypoints\/openai\/api_server.py\", line 1838, in run_server_worker\n(APIServer pid=14588)     async with build_async_engine_client(\n(APIServer pid=14588)   File \"\/usr\/lib\/python3.10\/contextlib.py\", line 199, in __aenter__\n(APIServer pid=14588)     return await anext(self.gen)\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/entrypoints\/openai\/api_server.py\", line 183, in build_async_engine_client\n(APIServer 
pid=14588)     async with build_async_engine_client_from_engine_args(\n(APIServer pid=14588)   File \"\/usr\/lib\/python3.10\/contextlib.py\", line 199, in __aenter__\n(APIServer pid=14588)     return await anext(self.gen)\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/entrypoints\/openai\/api_server.py\", line 224, in build_async_engine_client_from_engine_args\n(APIServer pid=14588)     async_llm = AsyncLLM.from_vllm_config(\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/v1\/engine\/async_llm.py\", line 223, in from_vllm_config\n(APIServer pid=14588)     return cls(\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/v1\/engine\/async_llm.py\", line 134, in __init__\n(APIServer pid=14588)     self.engine_core = EngineCoreClient.make_async_mp_client(\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/v1\/engine\/core_client.py\", line 121, in make_async_mp_client\n(APIServer pid=14588)     return AsyncMPClient(*client_args)\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/v1\/engine\/core_client.py\", line 810, in __init__\n(APIServer pid=14588)     super().__init__(\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/v1\/engine\/core_client.py\", line 471, in __init__\n(APIServer pid=14588)     with launch_core_engines(vllm_config, executor_class, log_stats) as (\n(APIServer pid=14588)   File \"\/usr\/lib\/python3.10\/contextlib.py\", line 142, in __exit__\n(APIServer pid=14588)     next(self.gen)\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/v1\/engine\/utils.py\", line 903, in launch_core_engines\n(APIServer pid=14588)     wait_for_engine_startup(\n(APIServer pid=14588)   File \"\/home\/vllm\/vllm\/v1\/engine\/utils.py\", line 960, in wait_for_engine_startup\n(APIServer pid=14588)     raise RuntimeError(\n(APIServer pid=14588) RuntimeError: Engine core initialization failed. See root cause above. 
Failed core proc(s): {}<\/pre>\n<p>\u00a0<\/p>\n\n\n\n<p style=\"text-align: justify;\">After the vLLM installation problem was solved, a new error appeared complaining that the CUDA driver was too old. Even after upgrading the NVIDIA driver, the packaged image still could not run directly in the K8S environment, because \u6ab8\u6aac\u7238 had built the GPU environment using <a href=\"https:\/\/myoceane.fr\/index.php\/%e5%9c%a8-k8s-%e4%b8%8a%e7%b0%a1%e5%96%ae%e5%af%a6%e7%8f%be-nvidia-gpu-time-slicing\/\">AKS plus the Device Plugin<\/a>. That is the simpler approach; for a more complete setup, the GPU Operator is worth considering.<\/p>\n<pre class=\"lang:bash\">2025-12-16 14:49:44,345 - INFO - Starting vLLM server with command: \/usr\/bin\/python3.10 -m vllm.entrypoints.openai.api_server --model \/mnt\/gpt-oss-20b --served-model-name openai\/gpt-oss-20b --port 7000 --tensor_parallel_size 1 --gpu_memory_utilization 0.65 --max_model_len 6600 --task generate\nWARNING 12-16 14:49:49 [argparse_utils.py:82] argument 'task' is deprecated\n(APIServer pid=9353) INFO 12-16 14:49:49 [api_server.py:1772] vLLM API server version 0.12.0\n(APIServer pid=9353) INFO 12-16 14:49:49 [utils.py:253] non-default args: {'port': 7000, 'model': '\/mnt\/gpt-oss-20b', 'task': 'generate', 'max_model_len': 6600, 'served_model_name': ['openai\/gpt-oss-20b'], 'gpu_memory_utilization': 0.65}\n(APIServer pid=9353) INFO 12-16 14:49:49 [model.py:637] Resolved architecture: GptOssForCausalLM\n(APIServer pid=9353) ERROR 12-16 14:49:49 [repo_utils.py:65] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace\/repo_name': '\/mnt\/gpt-oss-20b'. 
Use `repo_type` argument if needed., retrying 1 of 2\n(APIServer pid=9353) ERROR 12-16 14:49:51 [repo_utils.py:63] Error retrieving safetensors: Repo id must be in the form 'repo_name' or 'namespace\/repo_name': '\/mnt\/gpt-oss-20b'. Use `repo_type` argument if needed.\n(APIServer pid=9353) INFO 12-16 14:49:51 [model.py:2086] Downcasting torch.float32 to torch.bfloat16.\n(APIServer pid=9353) INFO 12-16 14:49:51 [model.py:1750] Using max model len 6600\n(APIServer pid=9353) INFO 12-16 14:49:53 [scheduler.py:228] Chunked prefill is enabled with max_num_batched_tokens=2048.\n(APIServer pid=9353) INFO 12-16 14:49:53 [config.py:274] Overriding max cuda graph capture size to 1024 for performance.\n(EngineCore_DP0 pid=9378) INFO 12-16 14:49:59 [core.py:93] Initializing a V1 LLM engine (v0.12.0) with config: model='\/mnt\/gpt-oss-20b', speculative_config=None, tokenizer='\/mnt\/gpt-oss-20b', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=6600, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=mxfp4, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser='openai_gptoss', reasoning_parser_plugin='', enable_in_reasoning=False), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None, kv_cache_metrics=False, kv_cache_metrics_sample=0.01), seed=0, served_model_name=openai\/gpt-oss-20b, enable_prefix_caching=True, enable_chunked_prefill=True, pooler_config=None, compilation_config={'level': None, 'mode': &lt;CompilationMode.VLLM_COMPILE: 3&gt;, 'debug_dump_path': None, 'cache_dir': '', 'compile_cache_save_format': 
'binary', 'backend': 'inductor', 'custom_ops': ['none'], 'splitting_ops': ['vllm::unified_attention', 'vllm::unified_attention_with_output', 'vllm::unified_mla_attention', 'vllm::unified_mla_attention_with_output', 'vllm::mamba_mixer2', 'vllm::mamba_mixer', 'vllm::short_conv', 'vllm::linear_attention', 'vllm::plamo2_mamba_mixer', 'vllm::gdn_attention_core', 'vllm::kda_attention', 'vllm::sparse_attn_indexer'], 'compile_mm_encoder': False, 'compile_sizes': [], 'inductor_compile_config': {'enable_auto_functionalized_v2': False, 'combo_kernels': True, 'benchmark_combo_kernel': True}, 'inductor_passes': {}, 'cudagraph_mode': &lt;CUDAGraphMode.FULL_AND_PIECEWISE: (2, 1)&gt;, 'cudagraph_num_of_warmups': 1, 'cudagraph_capture_sizes': [1, 2, 4, 8, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256, 272, 288, 304, 320, 336, 352, 368, 384, 400, 416, 432, 448, 464, 480, 496, 512, 528, 544, 560, 576, 592, 608, 624, 640, 656, 672, 688, 704, 720, 736, 752, 768, 784, 800, 816, 832, 848, 864, 880, 896, 912, 928, 944, 960, 976, 992, 1008, 1024], 'cudagraph_copy_inputs': False, 'cudagraph_specialize_lora': True, 'use_inductor_graph_partition': False, 'pass_config': {'fuse_norm_quant': False, 'fuse_act_quant': False, 'fuse_attn_quant': False, 'eliminate_noops': True, 'enable_sp': False, 'fuse_gemm_comms': False, 'fuse_allreduce_rms': False}, 'max_cudagraph_capture_size': 1024, 'dynamic_shapes_config': {'type': &lt;DynamicShapesType.BACKED: 'backed'&gt;}, 'local_cache_dir': None}\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843] EngineCore failed to start.\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843] Traceback (most recent call last):\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/core.py\", line 834, in run_engine_core\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843] 
    engine_core = EngineCoreProc(*args, **kwargs)\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/core.py\", line 610, in __init__\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     super().__init__(\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/core.py\", line 102, in __init__\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     self.model_executor = executor_class(vllm_config)\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/executor\/abstract.py\", line 101, in __init__\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     self._init_executor()\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/executor\/uniproc_executor.py\", line 46, in _init_executor\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     self.driver_worker.init_worker(all_kwargs=[kwargs])\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/worker\/worker_base.py\", line 255, in init_worker\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     worker_class = resolve_obj_by_qualname(\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/utils\/import_utils.py\", line 122, in resolve_obj_by_qualname\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     module = importlib.import_module(module_name)\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/lib\/python3.10\/importlib\/__init__.py\", line 126, in import_module\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     return _bootstrap._gcd_import(name[level:], 
package, level)\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"&lt;frozen importlib._bootstrap&gt;\", line 1050, in _gcd_import\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"&lt;frozen importlib._bootstrap&gt;\", line 1027, in _find_and_load\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"&lt;frozen importlib._bootstrap&gt;\", line 1006, in _find_and_load_unlocked\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"&lt;frozen importlib._bootstrap&gt;\", line 688, in _load_unlocked\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"&lt;frozen importlib._bootstrap_external&gt;\", line 883, in exec_module\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"&lt;frozen importlib._bootstrap&gt;\", line 241, in _call_with_frames_removed\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/worker\/gpu_worker.py\", line 54, in &lt;module&gt;\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     from vllm.v1.worker.gpu_model_runner import GPUModelRunner\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/worker\/gpu_model_runner.py\", line 140, in &lt;module&gt;\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     from vllm.v1.spec_decode.eagle import EagleProposer\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/spec_decode\/eagle.py\", line 30, in &lt;module&gt;\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     from vllm.v1.attention.backends.flash_attn import FlashAttentionMetadata\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/attention\/backends\/flash_attn.py\", line 230, in 
&lt;module&gt;\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     class FlashAttentionMetadataBuilder(AttentionMetadataBuilder[FlashAttentionMetadata]):\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/attention\/backends\/flash_attn.py\", line 251, in FlashAttentionMetadataBuilder\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     if get_flash_attn_version() == 3\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/attention\/utils\/fa_utils.py\", line 71, in get_flash_attn_version\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     if not is_fa_version_supported(fa_version):\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/vllm_flash_attn\/flash_attn_interface.py\", line 68, in is_fa_version_supported\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     return _is_fa2_supported(device)[0]\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/vllm_flash_attn\/flash_attn_interface.py\", line 43, in _is_fa2_supported\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     if torch.cuda.get_device_capability(device)[0] &lt; 8:\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/cuda\/__init__.py\", line 598, in get_device_capability\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     prop = get_device_properties(device)\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/cuda\/__init__.py\", line 614, in get_device_properties\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     _lazy_init()  # will define _get_device_properties\n(EngineCore_DP0 
pid=9378) ERROR 12-16 14:50:00 [core.py:843]   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/cuda\/__init__.py\", line 410, in _lazy_init\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843]     torch._C._cuda_init()\n(EngineCore_DP0 pid=9378) ERROR 12-16 14:50:00 [core.py:843] RuntimeError: The NVIDIA driver on your system is too old (found version 12020). Please update your GPU driver by downloading and installing a new version from the URL: http:\/\/www.nvidia.com\/Download\/index.aspx Alternatively, go to: https:\/\/pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.\n(EngineCore_DP0 pid=9378) Process EngineCore_DP0:\n(EngineCore_DP0 pid=9378) Traceback (most recent call last):\n(EngineCore_DP0 pid=9378)   File \"\/usr\/lib\/python3.10\/multiprocessing\/process.py\", line 314, in _bootstrap\n(EngineCore_DP0 pid=9378)     self.run()\n(EngineCore_DP0 pid=9378)   File \"\/usr\/lib\/python3.10\/multiprocessing\/process.py\", line 108, in run\n(EngineCore_DP0 pid=9378)     self._target(*self._args, **self._kwargs)\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/core.py\", line 847, in run_engine_core\n(EngineCore_DP0 pid=9378)     raise e\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/core.py\", line 834, in run_engine_core\n(EngineCore_DP0 pid=9378)     engine_core = EngineCoreProc(*args, **kwargs)\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/core.py\", line 610, in __init__\n(EngineCore_DP0 pid=9378)     super().__init__(\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/core.py\", line 102, in __init__\n(EngineCore_DP0 pid=9378)     self.model_executor = executor_class(vllm_config)\n(EngineCore_DP0 pid=9378)   File 
\"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/executor\/abstract.py\", line 101, in __init__\n(EngineCore_DP0 pid=9378)     self._init_executor()\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/executor\/uniproc_executor.py\", line 46, in _init_executor\n(EngineCore_DP0 pid=9378)     self.driver_worker.init_worker(all_kwargs=[kwargs])\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/worker\/worker_base.py\", line 255, in init_worker\n(EngineCore_DP0 pid=9378)     worker_class = resolve_obj_by_qualname(\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/utils\/import_utils.py\", line 122, in resolve_obj_by_qualname\n(EngineCore_DP0 pid=9378)     module = importlib.import_module(module_name)\n(EngineCore_DP0 pid=9378)   File \"\/usr\/lib\/python3.10\/importlib\/__init__.py\", line 126, in import_module\n(EngineCore_DP0 pid=9378)     return _bootstrap._gcd_import(name[level:], package, level)\n(EngineCore_DP0 pid=9378)   File \"&lt;frozen importlib._bootstrap&gt;\", line 1050, in _gcd_import\n(EngineCore_DP0 pid=9378)   File \"&lt;frozen importlib._bootstrap&gt;\", line 1027, in _find_and_load\n(EngineCore_DP0 pid=9378)   File \"&lt;frozen importlib._bootstrap&gt;\", line 1006, in _find_and_load_unlocked\n(EngineCore_DP0 pid=9378)   File \"&lt;frozen importlib._bootstrap&gt;\", line 688, in _load_unlocked\n(EngineCore_DP0 pid=9378)   File \"&lt;frozen importlib._bootstrap_external&gt;\", line 883, in exec_module\n(EngineCore_DP0 pid=9378)   File \"&lt;frozen importlib._bootstrap&gt;\", line 241, in _call_with_frames_removed\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/worker\/gpu_worker.py\", line 54, in &lt;module&gt;\n(EngineCore_DP0 pid=9378)     from vllm.v1.worker.gpu_model_runner import GPUModelRunner\n(EngineCore_DP0 pid=9378)   File 
\"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/worker\/gpu_model_runner.py\", line 140, in &lt;module&gt;\n(EngineCore_DP0 pid=9378)     from vllm.v1.spec_decode.eagle import EagleProposer\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/spec_decode\/eagle.py\", line 30, in &lt;module&gt;\n(EngineCore_DP0 pid=9378)     from vllm.v1.attention.backends.flash_attn import FlashAttentionMetadata\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/attention\/backends\/flash_attn.py\", line 230, in &lt;module&gt;\n(EngineCore_DP0 pid=9378)     class FlashAttentionMetadataBuilder(AttentionMetadataBuilder[FlashAttentionMetadata]):\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/attention\/backends\/flash_attn.py\", line 251, in FlashAttentionMetadataBuilder\n(EngineCore_DP0 pid=9378)     if get_flash_attn_version() == 3\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/attention\/utils\/fa_utils.py\", line 71, in get_flash_attn_version\n(EngineCore_DP0 pid=9378)     if not is_fa_version_supported(fa_version):\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/vllm_flash_attn\/flash_attn_interface.py\", line 68, in is_fa_version_supported\n(EngineCore_DP0 pid=9378)     return _is_fa2_supported(device)[0]\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/vllm_flash_attn\/flash_attn_interface.py\", line 43, in _is_fa2_supported\n(EngineCore_DP0 pid=9378)     if torch.cuda.get_device_capability(device)[0] &lt; 8:\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/cuda\/__init__.py\", line 598, in get_device_capability\n(EngineCore_DP0 pid=9378)     prop = get_device_properties(device)\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/cuda\/__init__.py\", 
line 614, in get_device_properties\n(EngineCore_DP0 pid=9378)     _lazy_init()  # will define _get_device_properties\n(EngineCore_DP0 pid=9378)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/torch\/cuda\/__init__.py\", line 410, in _lazy_init\n(EngineCore_DP0 pid=9378)     torch._C._cuda_init()\n(EngineCore_DP0 pid=9378) RuntimeError: The NVIDIA driver on your system is too old (found version 12020). Please update your GPU driver by downloading and installing a new version from the URL: http:\/\/www.nvidia.com\/Download\/index.aspx Alternatively, go to: https:\/\/pytorch.org to install a PyTorch version that has been compiled with your version of the CUDA driver.\n(APIServer pid=9353) Traceback (most recent call last):\n(APIServer pid=9353)   File \"\/usr\/lib\/python3.10\/runpy.py\", line 196, in _run_module_as_main\n(APIServer pid=9353)     return _run_code(code, main_globals, None,\n(APIServer pid=9353)   File \"\/usr\/lib\/python3.10\/runpy.py\", line 86, in _run_code\n(APIServer pid=9353)     exec(code, run_globals)\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/entrypoints\/openai\/api_server.py\", line 1891, in &lt;module&gt;\n(APIServer pid=9353)     uvloop.run(run_server(args))\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/uvloop\/__init__.py\", line 69, in run\n(APIServer pid=9353)     return loop.run_until_complete(wrapper())\n(APIServer pid=9353)   File \"uvloop\/loop.pyx\", line 1518, in uvloop.loop.Loop.run_until_complete\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/uvloop\/__init__.py\", line 48, in wrapper\n(APIServer pid=9353)     return await main\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/entrypoints\/openai\/api_server.py\", line 1819, in run_server\n(APIServer pid=9353)     await run_server_worker(listen_address, sock, args, **uvicorn_kwargs)\n(APIServer pid=9353)   File 
\"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/entrypoints\/openai\/api_server.py\", line 1838, in run_server_worker\n(APIServer pid=9353)     async with build_async_engine_client(\n(APIServer pid=9353)   File \"\/usr\/lib\/python3.10\/contextlib.py\", line 199, in __aenter__\n(APIServer pid=9353)     return await anext(self.gen)\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/entrypoints\/openai\/api_server.py\", line 183, in build_async_engine_client\n(APIServer pid=9353)     async with build_async_engine_client_from_engine_args(\n(APIServer pid=9353)   File \"\/usr\/lib\/python3.10\/contextlib.py\", line 199, in __aenter__\n(APIServer pid=9353)     return await anext(self.gen)\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/entrypoints\/openai\/api_server.py\", line 224, in build_async_engine_client_from_engine_args\n(APIServer pid=9353)     async_llm = AsyncLLM.from_vllm_config(\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/async_llm.py\", line 223, in from_vllm_config\n(APIServer pid=9353)     return cls(\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/async_llm.py\", line 134, in __init__\n(APIServer pid=9353)     self.engine_core = EngineCoreClient.make_async_mp_client(\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/core_client.py\", line 121, in make_async_mp_client\n(APIServer pid=9353)     return AsyncMPClient(*client_args)\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/core_client.py\", line 810, in __init__\n(APIServer pid=9353)     super().__init__(\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/core_client.py\", line 471, in __init__\n(APIServer pid=9353)     with launch_core_engines(vllm_config, executor_class, log_stats) as (\n(APIServer 
pid=9353)   File \"\/usr\/lib\/python3.10\/contextlib.py\", line 142, in __exit__\n(APIServer pid=9353)     next(self.gen)\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/utils.py\", line 903, in launch_core_engines\n(APIServer pid=9353)     wait_for_engine_startup(\n(APIServer pid=9353)   File \"\/usr\/local\/lib\/python3.10\/dist-packages\/vllm\/v1\/engine\/utils.py\", line 960, in wait_for_engine_startup\n(APIServer pid=9353)     raise RuntimeError(\n(APIServer pid=9353) RuntimeError: Engine core initialization failed. See root cause above. Failed core proc(s): {}<\/pre>\n<p style=\"text-align: justify;\">\u9019\u908a\u727d\u626f\u5230\u74b0\u5883\u7684\u69cb\u5efa\uff0c\u8a73\u7d30\u7814\u7a76\u4e4b\u5f8c\u767c\u73fe\uff0c\u8981\u76f4\u63a5\u9a45\u52d5\u4e00\u500b\u5df2\u7d93\u6709\u88dd CUDA \u7684 image \u9700\u8981\u900f\u904e nvidia-docker2 \u6216\u8005\u662f nvidia-container-cli \u53bb\u958b\u555f docker image\uff0c\u5982\u679c nvidia-container-cli \u592a\u820a\u6216\u662f\u8207 Nvidia Driver \u7248\u672c\u4e0d\u5339\u914d\uff0c\u5c31\u7b97\u5f9e nvidia-smi \u770b\u5230\u5df2\u7d93\u662f\u6700\u65b0\u7684\u7248\u672c\u4e5f\u4e0d\u80fd\u5920\u5c07\u6709 CUDA \u7248\u672c\u7684 Image \u8dd1\u8d77\u4f86\uff0c\u5229\u7528 GPU Operator \u914d\u5408 K8S \u5c07 GPU \u5b89\u88dd\u597d\u4e4b\u5f8c\u5c31\u53ef\u4ee5\u6210\u529f\u57f7\u884c\u4e86\u3002<\/p>\n","protected":false},"excerpt":{"rendered":"<p>\u524d\u9663\u5b50\u63a5\u6536\u5230 Nvidia \u5206\u4eab\u7684\u9019\u7bc7 Blog\uff0c\u00a0Accelerate Deep Learning and LLM Inference with Apache Spark in the Cloud\uff0c\u958b\u555f\u4e86\u6ab8\u6aac\u7238\u5728\u7d50\u5408 Spark \u8207 Deep Learning\/LLM \u7684\u60f3\u50cf\uff0c\u914d\u5408\u4e00\u4e9b\u4e4b\u524d\u5be6\u4f5c\u904e vLLM \u7684\u7d93\u9a57\uff0c\u672c\u7bc7\u7d00\u9304\u5229\u7528 Spark + Local vLLM Server 
\u9054\u6210\u52a0\u901f\u6279\u6b21\u63a8\u8ad6\u7684\u76ee\u7684\u904e\u7a0b\u4e2d\u9047\u5230\u7684\u7a2e\u7a2e\u5751\u3002<\/p>\n","protected":false},"author":1,"featured_media":10276,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[9,14,1848,176],"tags":[1981,19,152,1980],"class_list":["post-10234","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-bigdata-ml","category-it-technology","category-k8s","category-python","tag-inference","tag-python","tag-spark","tag-vllm"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v24.6 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>[LLM] Spark + Local vLLM Server - \u60f3\u65b9\u6d89\u6cd5 - \u91cf\u74f6\u5916\u7684\u5929\u7a7a M-Y-Oceane<\/title>\n<meta name=\"description\" content=\"\u524d\u9663\u5b50\u63a5\u6536\u5230 Nvidia \u5206\u4eab\u7684\u9019\u7bc7 Blog\uff0c\u00a0Accelerate Deep Learning and LLM Inference with Apache Spark in the Cloud\uff0c\u958b\u555f\u4e86\u6ab8\u6aac\u7238\u5728\u7d50\u5408 Spark \u8207 Deep Learning\/LLM \u7684\u60f3\u50cf\uff0c\u914d\u5408\u4e00\u4e9b\u4e4b\u524d\u5be6\u4f5c\u904e vLLM \u7684\u7d93\u9a57\uff0c\u672c\u7bc7\u7d00\u9304\u5229\u7528 Spark + Local vLLM Server \u9054\u6210\u52a0\u901f\u6279\u6b21\u63a8\u8ad6\u7684\u76ee\u7684\u904e\u7a0b\u4e2d\u9047\u5230\u7684\u7a2e\u7a2e\u5751\u3002\" \/>\n<meta name=\"robots\" content=\"index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1\" \/>\n<link rel=\"canonical\" href=\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/\" \/>\n<meta property=\"og:locale\" content=\"en_US\" \/>\n<meta property=\"og:type\" content=\"article\" \/>\n<meta property=\"og:title\" content=\"[LLM] Spark + Local vLLM Server - \u60f3\u65b9\u6d89\u6cd5 - \u91cf\u74f6\u5916\u7684\u5929\u7a7a M-Y-Oceane\" \/>\n<meta property=\"og:description\" 
content=\"\u524d\u9663\u5b50\u63a5\u6536\u5230 Nvidia \u5206\u4eab\u7684\u9019\u7bc7 Blog\uff0c\u00a0Accelerate Deep Learning and LLM Inference with Apache Spark in the Cloud\uff0c\u958b\u555f\u4e86\u6ab8\u6aac\u7238\u5728\u7d50\u5408 Spark \u8207 Deep Learning\/LLM \u7684\u60f3\u50cf\uff0c\u914d\u5408\u4e00\u4e9b\u4e4b\u524d\u5be6\u4f5c\u904e vLLM \u7684\u7d93\u9a57\uff0c\u672c\u7bc7\u7d00\u9304\u5229\u7528 Spark + Local vLLM Server \u9054\u6210\u52a0\u901f\u6279\u6b21\u63a8\u8ad6\u7684\u76ee\u7684\u904e\u7a0b\u4e2d\u9047\u5230\u7684\u7a2e\u7a2e\u5751\u3002\" \/>\n<meta property=\"og:url\" content=\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/\" \/>\n<meta property=\"og:site_name\" content=\"\u60f3\u65b9\u6d89\u6cd5 - \u91cf\u74f6\u5916\u7684\u5929\u7a7a M-Y-Oceane\" \/>\n<meta property=\"article:published_time\" content=\"2025-12-27T16:12:54+00:00\" \/>\n<meta property=\"article:modified_time\" content=\"2025-12-27T16:13:33+00:00\" \/>\n<meta property=\"og:image\" content=\"https:\/\/myoceane.fr\/wp-content\/uploads\/2025\/12\/spark-server-mg.png\" \/>\n\t<meta property=\"og:image:width\" content=\"1368\" \/>\n\t<meta property=\"og:image:height\" content=\"659\" \/>\n\t<meta property=\"og:image:type\" content=\"image\/png\" \/>\n<meta name=\"author\" content=\"\u6ab8\u6aac\u7238\" \/>\n<meta name=\"twitter:card\" content=\"summary_large_image\" \/>\n<meta name=\"twitter:label1\" content=\"Written by\" \/>\n\t<meta name=\"twitter:data1\" content=\"\u6ab8\u6aac\u7238\" \/>\n\t<meta name=\"twitter:label2\" content=\"Est. 
reading time\" \/>\n\t<meta name=\"twitter:data2\" content=\"1 minute\" \/>\n<script type=\"application\/ld+json\" class=\"yoast-schema-graph\">{\"@context\":\"https:\/\/schema.org\",\"@graph\":[{\"@type\":\"Article\",\"@id\":\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#article\",\"isPartOf\":{\"@id\":\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/\"},\"author\":{\"name\":\"\u6ab8\u6aac\u7238\",\"@id\":\"https:\/\/myoceane.fr\/#\/schema\/person\/4a4552fb8c27693083d465e12db7658b\"},\"headline\":\"[LLM] Spark + Local vLLM Server\",\"datePublished\":\"2025-12-27T16:12:54+00:00\",\"dateModified\":\"2025-12-27T16:13:33+00:00\",\"mainEntityOfPage\":{\"@id\":\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/\"},\"wordCount\":116,\"commentCount\":0,\"publisher\":{\"@id\":\"https:\/\/myoceane.fr\/#\/schema\/person\/4a4552fb8c27693083d465e12db7658b\"},\"image\":{\"@id\":\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/myoceane.fr\/wp-content\/uploads\/2025\/12\/spark-server-mg.png\",\"keywords\":[\"Inference\",\"Python\",\"Spark\",\"vLLM\"],\"articleSection\":[\"Big Data &amp; Machine Learning\",\"IT Technology\",\"Kubernetes\",\"Python\"],\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"CommentAction\",\"name\":\"Comment\",\"target\":[\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#respond\"]}]},{\"@type\":\"WebPage\",\"@id\":\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/\",\"url\":\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/\",\"name\":\"[LLM] Spark + Local vLLM Server - \u60f3\u65b9\u6d89\u6cd5 - \u91cf\u74f6\u5916\u7684\u5929\u7a7a 
M-Y-Oceane\",\"isPartOf\":{\"@id\":\"https:\/\/myoceane.fr\/#website\"},\"primaryImageOfPage\":{\"@id\":\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#primaryimage\"},\"image\":{\"@id\":\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#primaryimage\"},\"thumbnailUrl\":\"https:\/\/myoceane.fr\/wp-content\/uploads\/2025\/12\/spark-server-mg.png\",\"datePublished\":\"2025-12-27T16:12:54+00:00\",\"dateModified\":\"2025-12-27T16:13:33+00:00\",\"description\":\"\u524d\u9663\u5b50\u63a5\u6536\u5230 Nvidia \u5206\u4eab\u7684\u9019\u7bc7 Blog\uff0c\u00a0Accelerate Deep Learning and LLM Inference with Apache Spark in the Cloud\uff0c\u958b\u555f\u4e86\u6ab8\u6aac\u7238\u5728\u7d50\u5408 Spark \u8207 Deep Learning\/LLM \u7684\u60f3\u50cf\uff0c\u914d\u5408\u4e00\u4e9b\u4e4b\u524d\u5be6\u4f5c\u904e vLLM \u7684\u7d93\u9a57\uff0c\u672c\u7bc7\u7d00\u9304\u5229\u7528 Spark + Local vLLM Server \u9054\u6210\u52a0\u901f\u6279\u6b21\u63a8\u8ad6\u7684\u76ee\u7684\u904e\u7a0b\u4e2d\u9047\u5230\u7684\u7a2e\u7a2e\u5751\u3002\",\"breadcrumb\":{\"@id\":\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#breadcrumb\"},\"inLanguage\":\"en-US\",\"potentialAction\":[{\"@type\":\"ReadAction\",\"target\":[\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/\"]}]},{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#primaryimage\",\"url\":\"https:\/\/myoceane.fr\/wp-content\/uploads\/2025\/12\/spark-server-mg.png\",\"contentUrl\":\"https:\/\/myoceane.fr\/wp-content\/uploads\/2025\/12\/spark-server-mg.png\",\"width\":1368,\"height\":659},{\"@type\":\"BreadcrumbList\",\"@id\":\"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#breadcrumb\",\"itemListElement\":[{\"@type\":\"ListItem\",\"position\":1,\"name\":\"Home\",\"item\":\"https:\/\/myoceane.fr\/\"},{\"@type\":\"ListItem\",\"position\":2,\"name\":\"[LLM] Spark + Local vLLM 
Server\"}]},{\"@type\":\"WebSite\",\"@id\":\"https:\/\/myoceane.fr\/#website\",\"url\":\"https:\/\/myoceane.fr\/\",\"name\":\"M-Y-Oceane \u60f3\u65b9\u6d89\u6cd5\u3002\u91cf\u74f6\u5916\u7684\u5929\u7a7a\",\"description\":\"\u60f3\u65b9\u6d89\u6cd5, France, Taiwan, Health, Information Technology\",\"publisher\":{\"@id\":\"https:\/\/myoceane.fr\/#\/schema\/person\/4a4552fb8c27693083d465e12db7658b\"},\"potentialAction\":[{\"@type\":\"SearchAction\",\"target\":{\"@type\":\"EntryPoint\",\"urlTemplate\":\"https:\/\/myoceane.fr\/?s={search_term_string}\"},\"query-input\":{\"@type\":\"PropertyValueSpecification\",\"valueRequired\":true,\"valueName\":\"search_term_string\"}}],\"inLanguage\":\"en-US\"},{\"@type\":[\"Person\",\"Organization\"],\"@id\":\"https:\/\/myoceane.fr\/#\/schema\/person\/4a4552fb8c27693083d465e12db7658b\",\"name\":\"\u6ab8\u6aac\u7238\",\"image\":{\"@type\":\"ImageObject\",\"inLanguage\":\"en-US\",\"@id\":\"https:\/\/myoceane.fr\/#\/schema\/person\/image\/\",\"url\":\"https:\/\/secure.gravatar.com\/avatar\/6cc678684664f8ad45a8d56a6630b183?s=96&d=mm&r=g\",\"contentUrl\":\"https:\/\/secure.gravatar.com\/avatar\/6cc678684664f8ad45a8d56a6630b183?s=96&d=mm&r=g\",\"caption\":\"\u6ab8\u6aac\u7238\"},\"logo\":{\"@id\":\"https:\/\/myoceane.fr\/#\/schema\/person\/image\/\"},\"url\":\"https:\/\/myoceane.fr\/index.php\/author\/johnny5584767gmail-com\/\"}]}<\/script>\n<!-- \/ Yoast SEO plugin. 
-->","yoast_head_json":{"title":"[LLM] Spark + Local vLLM Server - \u60f3\u65b9\u6d89\u6cd5 - \u91cf\u74f6\u5916\u7684\u5929\u7a7a M-Y-Oceane","description":"\u524d\u9663\u5b50\u63a5\u6536\u5230 Nvidia \u5206\u4eab\u7684\u9019\u7bc7 Blog\uff0c\u00a0Accelerate Deep Learning and LLM Inference with Apache Spark in the Cloud\uff0c\u958b\u555f\u4e86\u6ab8\u6aac\u7238\u5728\u7d50\u5408 Spark \u8207 Deep Learning\/LLM \u7684\u60f3\u50cf\uff0c\u914d\u5408\u4e00\u4e9b\u4e4b\u524d\u5be6\u4f5c\u904e vLLM \u7684\u7d93\u9a57\uff0c\u672c\u7bc7\u7d00\u9304\u5229\u7528 Spark + Local vLLM Server \u9054\u6210\u52a0\u901f\u6279\u6b21\u63a8\u8ad6\u7684\u76ee\u7684\u904e\u7a0b\u4e2d\u9047\u5230\u7684\u7a2e\u7a2e\u5751\u3002","robots":{"index":"index","follow":"follow","max-snippet":"max-snippet:-1","max-image-preview":"max-image-preview:large","max-video-preview":"max-video-preview:-1"},"canonical":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/","og_locale":"en_US","og_type":"article","og_title":"[LLM] Spark + Local vLLM Server - \u60f3\u65b9\u6d89\u6cd5 - \u91cf\u74f6\u5916\u7684\u5929\u7a7a M-Y-Oceane","og_description":"\u524d\u9663\u5b50\u63a5\u6536\u5230 Nvidia \u5206\u4eab\u7684\u9019\u7bc7 Blog\uff0c\u00a0Accelerate Deep Learning and LLM Inference with Apache Spark in the Cloud\uff0c\u958b\u555f\u4e86\u6ab8\u6aac\u7238\u5728\u7d50\u5408 Spark \u8207 Deep Learning\/LLM \u7684\u60f3\u50cf\uff0c\u914d\u5408\u4e00\u4e9b\u4e4b\u524d\u5be6\u4f5c\u904e vLLM \u7684\u7d93\u9a57\uff0c\u672c\u7bc7\u7d00\u9304\u5229\u7528 Spark + Local vLLM Server \u9054\u6210\u52a0\u901f\u6279\u6b21\u63a8\u8ad6\u7684\u76ee\u7684\u904e\u7a0b\u4e2d\u9047\u5230\u7684\u7a2e\u7a2e\u5751\u3002","og_url":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/","og_site_name":"\u60f3\u65b9\u6d89\u6cd5 - \u91cf\u74f6\u5916\u7684\u5929\u7a7a 
M-Y-Oceane","article_published_time":"2025-12-27T16:12:54+00:00","article_modified_time":"2025-12-27T16:13:33+00:00","og_image":[{"width":1368,"height":659,"url":"https:\/\/myoceane.fr\/wp-content\/uploads\/2025\/12\/spark-server-mg.png","type":"image\/png"}],"author":"\u6ab8\u6aac\u7238","twitter_card":"summary_large_image","twitter_misc":{"Written by":"\u6ab8\u6aac\u7238","Est. reading time":"1 minute"},"schema":{"@context":"https:\/\/schema.org","@graph":[{"@type":"Article","@id":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#article","isPartOf":{"@id":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/"},"author":{"name":"\u6ab8\u6aac\u7238","@id":"https:\/\/myoceane.fr\/#\/schema\/person\/4a4552fb8c27693083d465e12db7658b"},"headline":"[LLM] Spark + Local vLLM Server","datePublished":"2025-12-27T16:12:54+00:00","dateModified":"2025-12-27T16:13:33+00:00","mainEntityOfPage":{"@id":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/"},"wordCount":116,"commentCount":0,"publisher":{"@id":"https:\/\/myoceane.fr\/#\/schema\/person\/4a4552fb8c27693083d465e12db7658b"},"image":{"@id":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#primaryimage"},"thumbnailUrl":"https:\/\/myoceane.fr\/wp-content\/uploads\/2025\/12\/spark-server-mg.png","keywords":["Inference","Python","Spark","vLLM"],"articleSection":["Big Data &amp; Machine Learning","IT Technology","Kubernetes","Python"],"inLanguage":"en-US","potentialAction":[{"@type":"CommentAction","name":"Comment","target":["https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#respond"]}]},{"@type":"WebPage","@id":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/","url":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/","name":"[LLM] Spark + Local vLLM Server - \u60f3\u65b9\u6d89\u6cd5 - \u91cf\u74f6\u5916\u7684\u5929\u7a7a 
M-Y-Oceane","isPartOf":{"@id":"https:\/\/myoceane.fr\/#website"},"primaryImageOfPage":{"@id":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#primaryimage"},"image":{"@id":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#primaryimage"},"thumbnailUrl":"https:\/\/myoceane.fr\/wp-content\/uploads\/2025\/12\/spark-server-mg.png","datePublished":"2025-12-27T16:12:54+00:00","dateModified":"2025-12-27T16:13:33+00:00","description":"\u524d\u9663\u5b50\u63a5\u6536\u5230 Nvidia \u5206\u4eab\u7684\u9019\u7bc7 Blog\uff0c\u00a0Accelerate Deep Learning and LLM Inference with Apache Spark in the Cloud\uff0c\u958b\u555f\u4e86\u6ab8\u6aac\u7238\u5728\u7d50\u5408 Spark \u8207 Deep Learning\/LLM \u7684\u60f3\u50cf\uff0c\u914d\u5408\u4e00\u4e9b\u4e4b\u524d\u5be6\u4f5c\u904e vLLM \u7684\u7d93\u9a57\uff0c\u672c\u7bc7\u7d00\u9304\u5229\u7528 Spark + Local vLLM Server \u9054\u6210\u52a0\u901f\u6279\u6b21\u63a8\u8ad6\u7684\u76ee\u7684\u904e\u7a0b\u4e2d\u9047\u5230\u7684\u7a2e\u7a2e\u5751\u3002","breadcrumb":{"@id":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#breadcrumb"},"inLanguage":"en-US","potentialAction":[{"@type":"ReadAction","target":["https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/"]}]},{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#primaryimage","url":"https:\/\/myoceane.fr\/wp-content\/uploads\/2025\/12\/spark-server-mg.png","contentUrl":"https:\/\/myoceane.fr\/wp-content\/uploads\/2025\/12\/spark-server-mg.png","width":1368,"height":659},{"@type":"BreadcrumbList","@id":"https:\/\/myoceane.fr\/index.php\/llm-spark-local-vllm-server\/#breadcrumb","itemListElement":[{"@type":"ListItem","position":1,"name":"Home","item":"https:\/\/myoceane.fr\/"},{"@type":"ListItem","position":2,"name":"[LLM] Spark + Local vLLM Server"}]},{"@type":"WebSite","@id":"https:\/\/myoceane.fr\/#website","url":"https:\/\/myoceane.fr\/","name":"M-Y-Oceane 
\u60f3\u65b9\u6d89\u6cd5\u3002\u91cf\u74f6\u5916\u7684\u5929\u7a7a","description":"\u60f3\u65b9\u6d89\u6cd5, France, Taiwan, Health, Information Technology","publisher":{"@id":"https:\/\/myoceane.fr\/#\/schema\/person\/4a4552fb8c27693083d465e12db7658b"},"potentialAction":[{"@type":"SearchAction","target":{"@type":"EntryPoint","urlTemplate":"https:\/\/myoceane.fr\/?s={search_term_string}"},"query-input":{"@type":"PropertyValueSpecification","valueRequired":true,"valueName":"search_term_string"}}],"inLanguage":"en-US"},{"@type":["Person","Organization"],"@id":"https:\/\/myoceane.fr\/#\/schema\/person\/4a4552fb8c27693083d465e12db7658b","name":"\u6ab8\u6aac\u7238","image":{"@type":"ImageObject","inLanguage":"en-US","@id":"https:\/\/myoceane.fr\/#\/schema\/person\/image\/","url":"https:\/\/secure.gravatar.com\/avatar\/6cc678684664f8ad45a8d56a6630b183?s=96&d=mm&r=g","contentUrl":"https:\/\/secure.gravatar.com\/avatar\/6cc678684664f8ad45a8d56a6630b183?s=96&d=mm&r=g","caption":"\u6ab8\u6aac\u7238"},"logo":{"@id":"https:\/\/myoceane.fr\/#\/schema\/person\/image\/"},"url":"https:\/\/myoceane.fr\/index.php\/author\/johnny5584767gmail-com\/"}]}},"amp_enabled":false,"_links":{"self":[{"href":"https:\/\/myoceane.fr\/index.php\/wp-json\/wp\/v2\/posts\/10234","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/myoceane.fr\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/myoceane.fr\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/myoceane.fr\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/myoceane.fr\/index.php\/wp-json\/wp\/v2\/comments?post=10234"}],"version-history":[{"count":38,"href":"https:\/\/myoceane.fr\/index.php\/wp-json\/wp\/v2\/posts\/10234\/revisions"}],"predecessor-version":[{"id":10280,"href":"https:\/\/myoceane.fr\/index.php\/wp-json\/wp\/v2\/posts\/10234\/revisions\/10280"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/myoceane.fr\/index.ph
p\/wp-json\/wp\/v2\/media\/10276"}],"wp:attachment":[{"href":"https:\/\/myoceane.fr\/index.php\/wp-json\/wp\/v2\/media?parent=10234"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/myoceane.fr\/index.php\/wp-json\/wp\/v2\/categories?post=10234"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/myoceane.fr\/index.php\/wp-json\/wp\/v2\/tags?post=10234"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}