INFO 09-22 10:52:06 [__init__.py:216] Automatically detected platform cuda.
(APIServer pid=1200988) INFO 09-22 10:52:07 [api_server.py:1801] vLLM API server version 0.10.2rc3.dev236+g38db529f6
(APIServer pid=1200988) INFO 09-22 10:52:07 [utils.py:328] non-default args: {'model_tag': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'port': 11434, 'enable_auto_tool_choice': True, 'tool_call_parser': 'hermes', 'model': 'Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', 'trust_remote_code': True, 'max_model_len': 262144, 'gpu_memory_utilization': 0.92}
(APIServer pid=1200988) The argument `trust_remote_code` is to be used with Auto classes. It has no effect here and is ignored.
(APIServer pid=1200988) INFO 09-22 10:52:27 [__init__.py:710] Resolved architecture: Qwen3NextForCausalLM
(APIServer pid=1200988) `torch_dtype` is deprecated! Use `dtype` instead!
(APIServer pid=1200988) INFO 09-22 10:52:27 [__init__.py:1769] Using max model len 262144
(APIServer pid=1200988) INFO 09-22 10:52:31 [scheduler.py:222] Chunked prefill is enabled with max_num_batched_tokens=8192.
(APIServer pid=1200988) INFO 09-22 10:52:31 [config.py:310] Hybrid or mamba-based model detected: disabling prefix caching since it is not yet supported.
(APIServer pid=1200988) INFO 09-22 10:52:31 [config.py:321] Hybrid or mamba-based model detected: setting cudagraph mode to FULL_AND_PIECEWISE in order to optimize performance.
(APIServer pid=1200988) INFO 09-22 10:52:31 [config.py:390] Setting attention block size to 544 tokens to ensure that attention page size is >= mamba page size.
(APIServer pid=1200988) INFO 09-22 10:52:31 [config.py:411] Padding mamba page size by 1.49% to ensure that mamba page size and attention page size are exactly equal.
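The "non-default args" line above effectively documents how the server was launched. Below is a minimal reconstruction of that launch, sketched with Python's subprocess; the flag spellings are assumed to follow the standard `vllm serve` CLI and may need adjusting for other builds (note that `trust_remote_code` is reported as ignored by the API server a few lines up).

```python
# Hedged reconstruction of the launch implied by the "non-default args" log line.
# Flag names are assumed to match the standard `vllm serve` CLI of this vLLM build.
import subprocess

subprocess.run([
    "vllm", "serve", "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",
    "--port", "11434",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "hermes",
    "--trust-remote-code",               # logged as ignored by the API server, but harmless
    "--max-model-len", "262144",
    "--gpu-memory-utilization", "0.92",
], check=True)
```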
INFO 09-22 10:52:37 [__init__.py:216] Automatically detected platform cuda.
(EngineCore_DP0 pid=1201210) INFO 09-22 10:52:39 [core.py:648] Waiting for init message from front-end.
(EngineCore_DP0 pid=1201210) INFO 09-22 10:52:39 [core.py:75] Initializing a V1 LLM engine (v0.10.2rc3.dev236+g38db529f6) with config: model='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', speculative_config=None, tokenizer='Qwen/Qwen3-Next-80B-A3B-Instruct-FP8', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=262144, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, data_parallel_size=1, disable_custom_all_reduce=False, quantization=fp8, enforce_eager=False, kv_cache_dtype=auto, device_config=cuda, structured_outputs_config=StructuredOutputsConfig(backend='auto', disable_fallback=False, disable_any_whitespace=False, disable_additional_properties=False, reasoning_parser=''), observability_config=ObservabilityConfig(show_hidden_metrics_for_version=None, otlp_traces_endpoint=None, collect_detailed_traces=None), seed=0, served_model_name=Qwen/Qwen3-Next-80B-A3B-Instruct-FP8, enable_prefix_caching=False, chunked_prefill_enabled=True, use_async_output_proc=True, pooler_config=None, compilation_config={"level":3,"debug_dump_path":"","cache_dir":"","backend":"","custom_ops":[],"splitting_ops":["vllm.unified_attention","vllm.unified_attention_with_output","vllm.mamba_mixer2","vllm.mamba_mixer","vllm.short_conv","vllm.linear_attention","vllm.plamo2_mamba_mixer","vllm.gdn_attention"],"use_inductor":true,"compile_sizes":[],"inductor_compile_config":{"enable_auto_functionalized_v2":false},"inductor_passes":{},"cudagraph_mode":[2,1],"use_cudagraph":true,"cudagraph_num_of_warmups":1,"cudagraph_capture_sizes":[512,504,496,488,480,472,464,456,448,440,432,424,416,408,400,392,384,376,368,360,352,344,336,328,320,312,304,296,288,280,272,264,256,248,240,232,224,216,208,200,192,184,176,168,160,152,144,136,128,120,112,104,96,88,80,72,64,56,48,40,32,24,16,8,4,2,1],"cudagraph_copy_inputs":false,"full_cuda_graph":false,"pass_config":{},"max_capture_size":512,"local_cache_dir":null}
(EngineCore_DP0 pid=1201210) W0922 10:52:39.439000 1201210 torch/utils/cpp_extension.py:2425] TORCH_CUDA_ARCH_LIST is not set, all archs for visible cards are included for compilation.
(EngineCore_DP0 pid=1201210) W0922 10:52:39.439000 1201210 torch/utils/cpp_extension.py:2425] If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'] to specific architectures.
[W922 10:52:44.453574974 ProcessGroupNCCL.cpp:981] Warning: TORCH_NCCL_AVOID_RECORD_STREAMS is the default now, this environment variable is thus deprecated. (function operator())
[rank0]:[W922 10:52:44.846799434 ProcessGroupGloo.cpp:514] Warning: Unable to resolve hostname to a (local) address. Using the loopback address as fallback. Manually set the network interface to bind to with GLOO_SOCKET_IFNAME. (function operator())
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
(EngineCore_DP0 pid=1201210) INFO 09-22 10:52:53 [parallel_state.py:1206] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0, EP rank 0
(EngineCore_DP0 pid=1201210) INFO 09-22 10:52:53 [topk_topp_sampler.py:58] Using FlashInfer for top-p & top-k sampling.
(EngineCore_DP0 pid=1201210) INFO 09-22 10:52:53 [gpu_model_runner.py:2434] Starting to load model Qwen/Qwen3-Next-80B-A3B-Instruct-FP8...
(EngineCore_DP0 pid=1201210) INFO 09-22 10:52:53 [gpu_model_runner.py:2466] Loading model from scratch...
(EngineCore_DP0 pid=1201210) `torch_dtype` is deprecated! Use `dtype` instead!
(EngineCore_DP0 pid=1201210) WARNING 09-22 10:52:53 [fp8.py:455] Failed to import DeepGemm kernels.
(EngineCore_DP0 pid=1201210) WARNING 09-22 10:52:53 [fp8.py:478] CutlassBlockScaledGroupedGemm not supported on the current platform.
(EngineCore_DP0 pid=1201210) INFO 09-22 10:52:53 [cuda.py:368] Using Flash Attention backend on V1 engine.
(EngineCore_DP0 pid=1201210) INFO 09-22 10:52:56 [weight_utils.py:348] Using model weights format ['*.safetensors']
(EngineCore_DP0 pid=1201210) INFO 09-22 10:52:58 [weight_utils.py:369] Time spent downloading weights for Qwen/Qwen3-Next-80B-A3B-Instruct-FP8: 2.068584 seconds
Loading safetensors checkpoint shards: 0% Completed | 0/8 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 12% Completed | 1/8 [00:02<00:19, 2.77s/it]
Loading safetensors checkpoint shards: 25% Completed | 2/8 [00:07<00:23, 3.93s/it]
Loading safetensors checkpoint shards: 38% Completed | 3/8 [00:12<00:22, 4.41s/it]
Loading safetensors checkpoint shards: 50% Completed | 4/8 [00:16<00:17, 4.38s/it]
Loading safetensors checkpoint shards: 62% Completed | 5/8 [00:19<00:11, 3.92s/it]
Loading safetensors checkpoint shards: 75% Completed | 6/8 [00:23<00:07, 3.65s/it]
Loading safetensors checkpoint shards: 88% Completed | 7/8 [00:27<00:04, 4.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:32<00:00, 4.29s/it]
Loading safetensors checkpoint shards: 100% Completed | 8/8 [00:32<00:00, 4.09s/it]
(EngineCore_DP0 pid=1201210)
(EngineCore_DP0 pid=1201210) INFO 09-22 10:53:36 [default_loader.py:268] Loading weights took 33.04 seconds
(EngineCore_DP0 pid=1201210) INFO 09-22 10:53:37 [gpu_model_runner.py:2488] Model loading took 74.8852 GiB and 43.043391 seconds
(EngineCore_DP0 pid=1201210) INFO 09-22 10:53:40 [backends.py:539] Using cache directory: /mnt/media/llm-cache/vllm/torch_compile_cache/5150e1ef30/rank_0_0/backbone for vLLM's torch.compile
(EngineCore_DP0 pid=1201210) INFO 09-22 10:53:40 [backends.py:550] Dynamo bytecode transform time: 3.04 s
(EngineCore_DP0 pid=1201210) INFO 09-22 10:53:40 [backends.py:194] Cache the graph for dynamic shape for later use
(EngineCore_DP0 pid=1201210) INFO 09-22 10:53:42 [backends.py:215] Compiling a graph for dynamic shape takes 1.83 s
(EngineCore_DP0 pid=1201210) WARNING 09-22 10:53:43 [fused_moe.py:728] Using default MoE config. Performance might be sub-optimal! Config file not found at ['/home/mdierolf/gitprojects/vllm/vllm/model_executor/layers/fused_moe/configs/E=512,N=512,device_name=NVIDIA_RTX_PRO_6000_Blackwell_Max-Q_Workstation_Edition,dtype=fp8_w8a8,block_shape=[128,128].json']
(EngineCore_DP0 pid=1201210) INFO 09-22 10:53:43 [monitor.py:34] torch.compile takes 4.86 s in total
(EngineCore_DP0 pid=1201210) INFO 09-22 10:53:44 [gpu_worker.py:299] Available KV cache memory: 10.33 GiB
(EngineCore_DP0 pid=1201210) INFO 09-22 10:53:44 [kv_cache_utils.py:1034] GPU KV cache size: 112,608 tokens
(EngineCore_DP0 pid=1201210) INFO 09-22 10:53:44 [kv_cache_utils.py:1038] Maximum concurrency for 262,144 tokens per request: 1.71x
(EngineCore_DP0 pid=1201210) 2025-09-22 10:53:44,836 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
(EngineCore_DP0 pid=1201210) 2025-09-22 10:53:44,999 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
Capturing CUDA graphs (mixed prefill-decode, PIECEWISE): 100%|███████████████████████████████████████████████████████████████████████████| 67/67 [00:03<00:00, 16.80it/s]
Capturing CUDA graphs (decode, FULL): 100%|██████████████████████████████████████████████████████████████████████████████████████████████| 67/67 [00:38<00:00, 1.74it/s]
(EngineCore_DP0 pid=1201210) INFO 09-22 10:54:28 [gpu_model_runner.py:3280] Graph capturing finished in 43 secs, took 0.12 GiB
(EngineCore_DP0 pid=1201210) INFO 09-22 10:54:28 [gpu_worker.py:392] Free memory on device (94.32/94.97 GiB) on startup. Desired GPU memory utilization is (0.92, 87.37 GiB). Actual usage is 74.89 GiB for weight, 2.08 GiB for peak activation, 0.08 GiB for non-torch memory, and 0.12 GiB for CUDAGraph memory. Replace gpu_memory_utilization config with `--kv-cache-memory=10809381580` to fit into requested memory, or `--kv-cache-memory=18271979008` to fully utilize gpu memory. Current kv cache memory in use is 11090399948 bytes.
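The line above suggests pinning the KV cache budget explicitly instead of relying on `gpu_memory_utilization`. A relaunch sketch under the same assumptions as the earlier one; the byte values are taken verbatim from the log message.

```python
# Sketch of the relaunch suggested above: swap --gpu-memory-utilization for an
# explicit --kv-cache-memory budget (values copied from the log line).
import subprocess

subprocess.run([
    "vllm", "serve", "Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",
    "--port", "11434",
    "--enable-auto-tool-choice",
    "--tool-call-parser", "hermes",
    "--max-model-len", "262144",
    "--kv-cache-memory", "10809381580",  # or 18271979008 to fully utilize GPU memory
], check=True)
```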
(EngineCore_DP0 pid=1201210) INFO 09-22 10:54:28 [core.py:214] init engine (profile, create kv cache, warmup model) took 50.83 seconds
(APIServer pid=1200988) INFO 09-22 10:54:29 [loggers.py:142] Engine 000: vllm cache_config_info with initialization after num_gpu_blocks is: 829
(APIServer pid=1200988) INFO 09-22 10:54:29 [api_server.py:1597] Supported_tasks: ['generate']
(APIServer pid=1200988) WARNING 09-22 10:54:29 [__init__.py:1648] Default sampling parameters have been overridden by the model's Hugging Face generation config recommended from the model creator. If this is not intended, please relaunch vLLM instance with `--generation-config vllm`.
(APIServer pid=1200988) INFO 09-22 10:54:29 [serving_responses.py:135] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1200988) INFO 09-22 10:54:29 [serving_responses.py:164] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=1200988) INFO 09-22 10:54:29 [serving_chat.py:97] "auto" tool choice has been enabled please note that while the parallel_tool_calls client option is preset for compatibility reasons, it will be ignored.
(APIServer pid=1200988) INFO 09-22 10:54:29 [serving_chat.py:137] Using default chat sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1200988) INFO 09-22 10:54:30 [serving_completion.py:76] Using default completion sampling params from model: {'temperature': 0.7, 'top_k': 20, 'top_p': 0.8}
(APIServer pid=1200988) INFO 09-22 10:54:30 [api_server.py:1876] Starting vLLM API server 0 on http://0.0.0.0:11434
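Once the server reports it is listening on port 11434, it exposes the OpenAI-compatible API. A minimal client sketch, assuming the `openai` Python package and no real API key requirement; the weather tool is purely illustrative, and `top_k` is passed via `extra_body` because it is a vLLM extension rather than a standard OpenAI parameter. The base URL, model name, and default sampling values (temperature 0.7, top_p 0.8, top_k 20) come from the log above.

```python
# Minimal client sketch against the OpenAI-compatible endpoint started above.
# The get_weather tool is hypothetical; only the URL, model name, and sampling
# defaults are taken from the log.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="EMPTY")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct-FP8",
    messages=[{"role": "user", "content": "What's the weather in Berlin?"}],
    tools=tools,
    tool_choice="auto",          # exercises the enabled auto tool choice + hermes parser
    temperature=0.7,
    top_p=0.8,
    extra_body={"top_k": 20},    # vLLM-specific sampling parameter
)

msg = resp.choices[0].message
print(msg.tool_calls or msg.content)
```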