vllm Qwen3.6-27B-NVFP4 – 三月有晴天

发表于2026年7月4日2026年7月4日作者 isvee
torch-2.11.0+cu130-cp312-cp312-manylinux_2_28_x86_64.whl
torchvision-0.26.0+cu130-cp312-cp312-manylinux_2_28_x86_64.whl
torchaudio-2.11.0+cu130-cp312-cp312-manylinux_2_28_x86_64.whl

https://github.com/vllm-project/vllm/releases/download/v0.24.0/vllm-0.24.0-cp38-abi3-manylinux_2_28_x86_64.whl
不要下载cu129版

出现报错：
ModuleNotFoundError: No module named 'distutils'
退出终端重新进入即可

官方示例：
VLLM_USE_MODELSCOPE=true
vllm serve Qwen/Qwen3.6-27B-FP8 --port 8099 --max-model-len 262144 --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_coder
vllm serve Qwen/Qwen3.6-27B-FP8 --port 8099 --max-model-len 262144 --reasoning-parser qwen3 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}'

实测可用：
vllm serve /data/G/LLM/Qwen3.6-27B-NVFP4_nvidia \
  --served-model-name Qwen3.6-27B-NVFP4 \
  --host 0.0.0.0 \
  --port 8099 \
  --max-model-len 215040 \
  --quantization modelopt \
  --trust-remote-code \
  --gpu-memory-utilization 0.95 \
  --language-model-only \
  --kv-cache-dtype fp8 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --max-num-seqs 4 \
  --max-num-batched-tokens 2048
Avg prompt throughput: 5196.7 tokens/s, Avg generation throughput: 33.8 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.2%, Prefix cache hit rate: 64.0%
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 65.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.2%, Prefix cache hit rate: 64.0%
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 65.7 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.9%, Prefix cache hit rate: 64.0%
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 65.6 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 23.9%, Prefix cache hit rate: 64.0%
Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 52.1 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 7.7%, Prefix cache hit rate: 50.9%

添加 "contextWindowSize" 到~/.qwen/settings.json，例如：
  "modelProviders": {
    "openai": [
      {
        "id": "Qwen3.6-27B-NVFP4",
        "name": "Qwen3.6-27B-NVFP4",
        "baseUrl": "http://127.0.0.1:8099/v1",
        "envKey": "QWEN_CUSTOM_API_KEY_OPENAI_HTTP_127_0_0_1_8099_12956CC0EA0E",
        "generationConfig": {
          "extra_body": {
            "enable_thinking": true
          },
          "contextWindowSize": 215040
        }
      }
    ]
  },

网上有人说这个模型能达到138t/s，没实际测试
--model /models/Huihui-Qwen3.6-27B-abliterated-NVFP4-TEXT-MTP --served-model-name Qwen3.6-27B --trust-remote-code --quantization modelopt --language-model-only --max-model-len 236800 --max-num-seqs 2 --kv-cache-dtype fp8 --gpu-memory-utilization 0.95 --speculative-config {"method":"qwen3_5_mtp","num_speculative_tokens":3} --dtype auto