[Review of the Orion O6 (星睿O6) AI PC development kit] Local model deployment, performance testing, VS Code coding assistance, and using MNN

The O6's NPU delivers roughly 30 TOPS of compute
- Supported precisions include INT4 / INT8 / INT16 / FP16 / BF16 / TF32

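The official software stack currently supports the following features and tools:

- LLM question answering
- AI image recognition and search
- Speech-text multimodal
- Image-text multimodal
- Text-to-image generation
- Image-text multimodal, built with the streamlit web UI library
- Image-text multimodal, built with the gradio web UI library, which supplements LLM text generation with proprietary information the model was not trained on
- Local web chatbot
- VS Code based code generation assistance

This post demonstrates some of these tools and looks at how well the board runs large models for local inference.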

Llama.cpp

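llama.cpp is a high-performance open-source inference framework written in plain C/C++, designed to run large language models efficiently on a wide range of local and cloud hardware, CPUs in particular.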

The official Debian 12 release ships a prebuilt llama.cpp in the cix-llama-cpp package, so it runs out of the box. On other systems you can build it yourself.

Clone the llama.cpp source:

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp

You can run inference on either the CPU or the GPU.

CPU inference

Build with the compile flags tuned for the O6 board:

mkdir build && cd build
cmake \
    -DLLAMA_CURL=OFF \
    -DGGML_LLAMAFILE=OFF \
    -DGGML_VULKAN=OFF \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_SYSTEM_PROCESSOR=armv9-a \
    -DCMAKE_OSX_ARCHITECTURES=arm64 \
    -DGGML_NATIVE=OFF \
    -DGGML_AVX=off \
    -DGGML_AVX2=off \
    -DGGML_AVX512=off \
    -DGGML_FMA=off \
    -DGGML_F16C=off \
    -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod+sve \
    -DGGML_CPU_KLEIDIAI=ON \
    ..

make -j
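Before building with -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod+sve, it can help to confirm that the CPU actually reports those extensions. The sketch below is only an illustration: it assumes a Linux arm64 system whose /proc/cpuinfo has a Features line, and that the kernel's names for the three extensions are i8mm, asimddp (dot product) and sve. The function name missing_features is made up for this example.

```python
from pathlib import Path

# Kernel flag names assumed to correspond to the +i8mm+dotprod+sve build options.
REQUIRED = {"i8mm": "i8mm", "dotprod": "asimddp", "sve": "sve"}


def missing_features(cpuinfo_text: str) -> list[str]:
    """Return the build options whose kernel flag is absent from cpuinfo."""
    flags: set[str] = set()
    for line in cpuinfo_text.splitlines():
        if line.lower().startswith("features") and ":" in line:
            flags.update(line.split(":", 1)[1].split())
    return [opt for opt, flag in REQUIRED.items() if flag not in flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("missing:", missing_features(cpuinfo.read_text()) or "none")
```

If anything is reported missing, dropping the corresponding +option from GGML_CPU_ARM_ARCH is the conservative choice.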

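GPU inference

Build with the compile flags tuned for the O6 board. The command references the PATH_SYSROOT, PATH_ROOT and CROSS_COMPILE variables, which must be set to match your toolchain before running it: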

mkdir build_vulkan && cd build_vulkan
cmake \
    -DLLAMA_CURL=OFF \
    -DGGML_LLAMAFILE=OFF \
    -DGGML_VULKAN=ON \
    -DVulkan_LIBRARY=${PATH_SYSROOT}/usr/lib/aarch64-linux-gnu/libvulkan.so \
    -DVulkan_INCLUDE_DIR=${PATH_SYSROOT}/usr/include \
    -DVulkan_GLSLC_EXECUTABLE=${PATH_ROOT}/build-scripts/bin/vulkan_tools/glslc \
    -DBUILD_SHARED_LIBS=OFF \
    -DCMAKE_SYSROOT=${PATH_SYSROOT} \
    -DCMAKE_C_COMPILER=${CROSS_COMPILE}gcc \
    -DCMAKE_CXX_COMPILER=${CROSS_COMPILE}g++ \
    -DGGML_NATIVE=OFF \
    -DGGML_AVX=off \
    -DGGML_AVX2=off \
    -DGGML_AVX512=off \
    -DGGML_FMA=off \
    -DGGML_F16C=off \
    -DGGML_CPU_ARM_ARCH=armv9-a+sve \
    -DCMAKE_SYSTEM_PROCESSOR=armv9-a \
    -DCMAKE_OSX_ARCHITECTURES=arm64 \
    -DCMAKE_SYSTEM_NAME=Linux \
    ..

make -j

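llama-bench is a tool that benchmarks inference performance under various parameters. We use a quantized model to measure CPU and GPU inference performance, as well as CPU performance when pinned to the big cores.

All tests here use the Qwen3-4B-Q4_K_M.gguf model.

CPU inference, not pinned to the big cores

Results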

➜  ~ llama-bench -m ./Qwen3-4B-Q4_K_M.gguf
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 4B Q4_K - Medium         |   2.32 GiB |     4.02 B | CPU        |      12 |           pp512 |         37.43 ± 0.29 |
| qwen3 4B Q4_K - Medium         |   2.32 GiB |     4.02 B | CPU        |      12 |           tg128 |          6.50 ± 0.02 |

build: 9fbb6c95 (6973)

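Test 1: pp512 (prompt processing)
- t/s: 37.43 ± 0.29. This measures how fast the model processes its input (the prompt): the number of tokens processed per second for a 512-token prompt.
Test 2: tg128 (token generation)
- t/s: 6.50 ± 0.02. This measures how fast the model generates new content: the average rate while generating 128 new tokens after the context has been given.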
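As a rough illustration of how the two numbers combine, the sketch below estimates the wall-clock time of a request from its prompt length and output length. It assumes throughput stays constant at the benchmarked rates, which is a simplification; estimate_seconds is a name made up for this example.

```python
def estimate_seconds(prompt_tokens: int, new_tokens: int,
                     pp_rate: float, tg_rate: float) -> float:
    """Prompt time plus generation time, assuming constant tokens/s rates."""
    return prompt_tokens / pp_rate + new_tokens / tg_rate


if __name__ == "__main__":
    # Rates taken from the unpinned CPU run above.
    total = estimate_seconds(512, 128, pp_rate=37.43, tg_rate=6.50)
    print(f"{total:.1f} s")
```

With short prompts and long answers the tg rate dominates the total, which is why the generation figure matters most for chat-style use.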

CPU inference, pinned to the big cores

Results

➜  ~ taskset -c 0,5,6,7,8,9,10,11 /usr/share/cix/bin/llama-bench -t 8 -m ./Qwen3-4B-Q4_K_M.gguf
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| qwen3 4B Q4_K - Medium         |   2.32 GiB |     4.02 B | CPU        |       8 |           pp512 |         30.06 ± 0.06 |
| qwen3 4B Q4_K - Medium         |   2.32 GiB |     4.02 B | CPU        |       8 |           tg128 |         10.05 ± 0.03 |

build: 9fbb6c95 (6973)
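The pinned run above combines taskset with a matching -t thread count. A small helper can keep the two in sync. This is a sketch, and pinned_command is a hypothetical name. The core list 0,5,6,7,8,9,10,11 is the one used in this post and may differ on other boards or kernels.

```python
import shlex


def pinned_command(binary: str, model: str, cores: list[int]) -> list[str]:
    """Build a taskset invocation whose -t value matches the pinned core count."""
    core_list = ",".join(str(c) for c in cores)
    return ["taskset", "-c", core_list, binary,
            "-t", str(len(cores)), "-m", model]


if __name__ == "__main__":
    argv = pinned_command("/usr/share/cix/bin/llama-bench",
                          "./Qwen3-4B-Q4_K_M.gguf",
                          [0, 5, 6, 7, 8, 9, 10, 11])
    # Print instead of executing; pass argv to subprocess.run to launch it.
    print(shlex.join(argv))
```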

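Configuration differences

Run 1: 12 threads (the default, spread across all CPU cores)
Run 2: 8 threads (pinned with taskset to the big cores 0,5,6,7,8,9,10,11)

Performance comparison

1. Prompt Processing (pp512)

12 threads: 37.43 ± 0.29 t/s
8 threads: 30.06 ± 0.06 t/s
Change: about 19.7% slower

2. Token Generation (tg128)

12 threads: 6.50 ± 0.02 t/s
8 threads: 10.05 ± 0.03 t/s
Change: about 54.6% faster

In practice, if the workload is mostly text generation, pin it to the 8 big cores.

If it frequently processes long prompts, staying with 12 threads is the better choice.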
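The percentages above can be reproduced directly from the llama-bench tables. The sketch below parses the result rows and computes the relative change. parse_bench and pct_change are names invented for this example, and the parser only assumes the pipe-separated layout shown in this post.

```python
def parse_bench(table: str) -> dict[str, float]:
    """Map each test name (e.g. 'pp512') to its mean tokens/s."""
    results = {}
    for line in table.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Data rows end with 'test' and 't/s' cells such as 'pp512', '37.43 ± 0.29'.
        if len(cells) >= 2 and "±" in cells[-1]:
            results[cells[-2]] = float(cells[-1].split("±")[0])
    return results


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


if __name__ == "__main__":
    unpinned = parse_bench("""
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | CPU | 12 | pp512 | 37.43 ± 0.29 |
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | CPU | 12 | tg128 |  6.50 ± 0.02 |
""")
    pinned = parse_bench("""
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | CPU | 8 | pp512 | 30.06 ± 0.06 |
| qwen3 4B Q4_K - Medium | 2.32 GiB | 4.02 B | CPU | 8 | tg128 | 10.05 ± 0.03 |
""")
    for test in unpinned:
        print(test, f"{pct_change(unpinned[test], pinned[test]):+.1f}%")
```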

GPU inference

➜  ~ llama-bench-vulkan -m ./Qwen3-4B-Q4_K_M.gguf
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
| qwen3 4B Q4_K - Medium         |   2.32 GiB |     4.02 B | Vulkan     |  99 |           pp512 |        158.60 ± 0.41 |
| qwen3 4B Q4_K - Medium         |   2.32 GiB |     4.02 B | Vulkan     |  99 |           tg128 |         12.16 ± 0.02 |

build: 9fbb6c95 (6973)

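The results show that the GPU (Mali-G720) is substantially faster than the unpinned 12-thread CPU run. Against the pinned 8-thread CPU run (10.05 t/s), the tg128 advantage narrows to about 1.2x.

Comparison

| Test | CPU (t/s) | Vulkan GPU (t/s) | Speedup | Notes |
| --- | ---: | ---: | --- | --- |
| `pp512` (prompt processing) | 37.43 | 158.60 | about 4.2x | GPU parallelism pays off heavily here: more than 4x faster. |
| `tg128` (token generation) | 6.50 | 12.16 | about 1.9x | Generation is strongly sequential, so GPU acceleration is limited, but it is still nearly twice as fast, which is noticeable in use. |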

Real-world test

CPU inference, not pinned to the big cores

llama-cli -m ./Qwen3-4B-Q4_K_M.gguf -n 128 -c 4096 -p "hello, how are you"
...
user
hello, how are you
assistant
<think>
Okay, the user said "hello, how are you". Let me think about how to respond. First, I should greet them back in a friendly and welcoming manner. Since I'm an AI, I can't have feelings, but I can simulate a positive response.

I need to make sure my reply is in the same language as the user's message, which is English. The user might be testing if I can respond appropriately or just starting a conversation. I should keep the tone cheerful and open to continue the conversation.

I should also consider if there's any cultural context or specific reason they're asking. Maybe they want to
>
llama_perf_sampler_print:    sampling time =      17.15 ms /   141 runs   (    0.12 ms per token,  8219.66 tokens per second)
llama_perf_context_print:        load time =     520.77 ms
llama_perf_context_print: prompt eval time =     605.38 ms /    13 tokens (   46.57 ms per token,    21.47 tokens per second)
llama_perf_context_print:        eval time =   18538.55 ms /   127 runs   (  145.97 ms per token,     6.85 tokens per second)
llama_perf_context_print:       total time =   31963.27 ms /   140 tokens
llama_perf_context_print:    graphs reused =        127
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                 3253 =  2375 +     576 +     301                |

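Reading the performance metrics

1. Sampling time

sampling time = 17.15 ms / 141 runs (0.12 ms per token, 8219.66 tokens per second)

Sampling rate: about 8219 tokens/s

2. Load time

load time = 520.77 ms

Time taken to load the model file from disk into memory
Model load time: about 0.52 s

3. Prompt eval time

prompt eval time = 605.38 ms / 13 tokens (46.57 ms per token, 21.47 tokens per second)

Time to process the 13 input tokens: about 0.6 s
Speed: 21.47 tokens/s
This is the first forward pass (processing the prompt)

4. Eval time (generation)

eval time = 18538.55 ms / 127 runs (145.97 ms per token, 6.85 tokens per second)

Time to generate 127 tokens: 18.54 s
Speed: 6.85 tokens/s

5. Total time

total time = 31963.27 ms / 140 tokens

Total time for 140 tokens (13 + 127): 31.96 s
Average speed: about 4.38 tokens/s
The total is larger than prompt eval plus eval (about 19.14 s), so this average also includes time spent outside those two phases.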
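To pull these figures out of a log automatically, a regular expression over the llama_perf lines is enough. The sketch below assumes the exact line format shown in this post, which may change between llama.cpp builds; parse_perf is a hypothetical helper.

```python
import re
from typing import Optional

# Matches e.g. "prompt eval time =  605.38 ms /  13 tokens" (format assumed from the log above).
PERF_RE = re.compile(
    r"(\w[\w ]*?) time\s*=\s*([\d.]+) ms(?:\s*/\s*(\d+) (?:tokens|runs))?")


def parse_perf(log: str) -> dict[str, tuple[float, Optional[int]]]:
    """Return {phase: (milliseconds, count or None)} for each perf line."""
    out = {}
    for line in log.splitlines():
        if "llama_perf" not in line or ":" not in line:
            continue
        m = PERF_RE.search(line.split(":", 1)[1])
        if m:
            count = int(m.group(3)) if m.group(3) else None
            out[m.group(1).strip()] = (float(m.group(2)), count)
    return out


if __name__ == "__main__":
    log = (
        "llama_perf_context_print: prompt eval time =     605.38 ms /    13 tokens\n"
        "llama_perf_context_print:        eval time =   18538.55 ms /   127 runs\n"
        "llama_perf_context_print:       total time =   31963.27 ms /   140 tokens\n"
    )
    perf = parse_perf(log)
    total_ms, total_tokens = perf["total"]
    print(f"average: {total_tokens / (total_ms / 1000):.2f} tokens/s")
    other_ms = total_ms - perf["prompt eval"][0] - perf["eval"][0]
    print(f"outside prompt eval and eval: {other_ms / 1000:.2f} s")
```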

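6. Graphs reused

graphs reused = 127

Compute graph reuse count: 127
This is characteristic of autoregressive generation: every new token reuses the same compute graph

7. Memory usage

| memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
|   - Host               |                 3253 =  2375 +     576 +     301                |

Total memory: 3253 MiB (about 3.18 GiB)
Model memory: 2375 MiB (about 2.32 GiB), the model parameters
Context memory: 576 MiB, used to store the context
Compute memory: 301 MiB, intermediate results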
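The unit conversions are easy to get wrong because the log reports MiB. A minimal sketch of the arithmetic, using the values from the Host row above (mib_to_gib is just an illustrative helper):

```python
def mib_to_gib(mib: float) -> float:
    """Convert mebibytes to gibibytes (1 GiB = 1024 MiB)."""
    return mib / 1024


if __name__ == "__main__":
    parts = {"model": 2375, "context": 576, "compute": 301}
    for name, mib in parts.items():
        print(f"{name:8s} {mib:5d} MiB = {mib_to_gib(mib):.2f} GiB")
    # The parts sum to 3252 MiB; the log reports 3253, presumably from rounding.
    print("sum", sum(parts.values()), "MiB")
```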

CPU inference, pinned to the big cores

taskset -c 0,5,6,7,8,9,10,11 llama-cli -m ./Qwen3-4B-Q4_K_M.gguf -t 8 -n 128 -c 4096 -p "hello, how are you"
...
user
hello, how are you
assistant
<think>
Okay, the user said, "hello, how are you." That's a friendly greeting. I need to respond in a way that's warm and engaging. Let me start by acknowledging their greeting. I should say hello back and ask how they're doing. But I also want to add a bit more to make the conversation flow. Maybe mention that I'm here to help and offer assistance. I should keep it simple and positive. Avoid any technical jargon. Make sure the response is friendly and open-ended. Let me check for any possible improvements. Maybe add an emoji to keep it friendly, but not too much. Alright
>
llama_perf_sampler_print:    sampling time =      17.69 ms /   141 runs   (    0.13 ms per token,  7972.41 tokens per second)
llama_perf_context_print:        load time =     514.51 ms
llama_perf_context_print: prompt eval time =     709.86 ms /    13 tokens (   54.60 ms per token,    18.31 tokens per second)
llama_perf_context_print:        eval time =   12092.45 ms /   127 runs   (   95.22 ms per token,    10.50 tokens per second)
llama_perf_context_print:       total time =   14657.30 ms /   140 tokens
llama_perf_context_print:    graphs reused =        127
llama_memory_breakdown_print: | memory breakdown [MiB] | total   free    self   model   context   compute    unaccounted |
llama_memory_breakdown_print: |   - Host               |                 3253 =  2375 +     576 +     301                |

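Key metrics compared

| Metric | Unpinned | Pinned to big cores | Change |
| --- | ---: | ---: | --- |
| Total time (`total time`) | 31963.27 ms | 14657.30 ms | 54.1% shorter (about 2.18x as fast) |
| Time per generated token (`eval time` per token) | 145.97 ms | 95.22 ms | 34.8% less time |
| Token generation rate (`eval` tokens/sec) | 6.85 | 10.50 | 53.3% higher |
| Prompt processing speed (`prompt` tokens/sec) | 21.47 | 18.31 | slightly lower (about 14.7%) |
| Model load time (`load time`) | 520.77 ms | 514.51 ms | roughly unchanged |
| Sampling speed (`sampling` tokens/sec) | 8219.66 | 7972.41 | roughly unchanged |
| Total host memory (`total Host`) | 3253 MiB | 3253 MiB | identical |
| Model memory (`model`) | 2375 MiB | 2375 MiB | identical |
| Context memory (`context`) | 576 MiB | 576 MiB | identical |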

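Coding Assistant

DeepSeek_R1 + llamacpp + VSCode + Continue_plugin

Continue is an open-source AI coding assistant extension. It does not ship a model of its own. Instead it acts as a bridge that lets you plug mainstream AI models into IDEs such as VS Code or JetBrains.

We run a local model on the O6 through the llama.cpp server, then use VS Code with the Continue extension on the same board to connect to that model for coding help, chat, and so on.

First, download VS Code for arm64

VS Code website: https://code.visualstudio.com...

Download the arm64 deb package and a local model. A model of 7B parameters or smaller is recommended.

# Install code after downloading
sudo apt install ./code_1.107.0-1765353545_arm64.deb

Open Code, open the Extensions view, search for continue, and install it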

Screenshot from 2025-12-11 15-06-59.png

Choose Local for the configuration

Screenshot from 2025-12-11 15-08-48.png

Create the file ~/.continue/config.json with the following content:

{
  "models": [
    {
      "model": "claude-3-5-sonnet-latest",
      "provider": "anthropic",
      "apiKey": "",
      "title": "Claude 3.5 Sonnet"
    },
    {
      "title": "LLaMA",
      "provider": "llama.cpp",
      "model": "deepseek-7b",
      "apiBase": "http://127.0.0.1:8080"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Codestral",
    "provider": "mistral",
    "model": "codestral-latest",
    "apiKey": ""
  }
}
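A quick way to catch syntax slips such as a stray trailing comma is to run the file through a strict JSON parser before starting VS Code. This is only a sketch: whether Continue itself tolerates such slips is not something this post verifies, and check_config is a made-up helper name.

```python
import json
from pathlib import Path
from typing import Optional


def check_config(text: str) -> Optional[str]:
    """Return None if text is valid strict JSON, else a short error message."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None


if __name__ == "__main__":
    path = Path.home() / ".continue" / "config.json"
    if path.exists():
        print(check_config(path.read_text()) or "config.json parses cleanly")
```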

Then start llama-server with the local model to bring up the server:

taskset -c 0,5,6,7,8,9,10,11 /usr/share/cix/bin/llama-server -m ./DeepSeek-R1-Distill-Qwen-7B-Q4_0.gguf -t 8 -c 4096
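Once the server is up, it can be exercised without VS Code. The sketch below sends one chat request using only the Python standard library. It assumes the server listens on 127.0.0.1:8080 (matching the apiBase above) and exposes the OpenAI-style /v1/chat/completions route that llama-server provides. build_request and ask are hypothetical helpers.

```python
import json
import urllib.request

API_BASE = "http://127.0.0.1:8080"  # same address as apiBase in config.json


def build_request(prompt: str, max_tokens: int = 128) -> urllib.request.Request:
    """Prepare a POST to the OpenAI-compatible chat endpoint."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{API_BASE}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt: str) -> str:
    """Send the prompt and return the assistant's reply text."""
    with urllib.request.urlopen(build_request(prompt), timeout=300) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling ask("Write hello world in C") from a Python shell on the board is a simple smoke test before wiring up the editor. The generous timeout reflects the roughly 10 tokens/s generation rate measured earlier.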

Test

Screenshot from 2025-12-11 15-56-02.png

MNN CLI

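An LLM inference engine built on MNN that supports the mainstream open-source LLMs. It consists of two parts:

Model export: export the torch model to onnx, then convert it to an mnn model; also export the tokenizer file, the embedding, and other files;
Model inference: run inference on the exported model, with support for LLM text generation;

Model export

llm_export is an LLM export tool that can export an LLM to onnx and mnn models.

Usage

Clone the LLM project you want to export to a local directory, for example Qwen2-0.5B-Instruct

git clone https://www.modelscope.cn/qwen/Qwen2-0.5B-Instruct.git

Run llm_export.py to export the model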

cd ./transformers/llm/export
# Export the model, tokenizer and embedding, plus the corresponding mnn model
python llm_export.py \
        --type Qwen2-0_5B-Instruct \
        --path /path/to/Qwen2-0.5B-Instruct \
        --export \
        --export_token \
        --export_embed --embed_bin \
        --export_mnn

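Export artifacts
The export produces:
- embeddings_bf16.bin: binary file of the model's embedding weights, used at inference time;
- llm_config.json: the model's configuration, used at inference time;
- llm.onnx: the model's onnx file, not used at inference time;
- tokenizer.txt: the model's tokenizer file, used at inference time;
- llm.mnn: the model's mnn file, used at inference time;
- llm.mnn.weight: the model's mnn weights, used at inference time;
The directory layout is as follows: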

.
├── onnx
|    ├── embeddings_bf16.bin
|    ├── llm_config.json
|    ├── llm.onnx
|    └── tokenizer.txt
└── mnn
     ├── llm.mnn
     └── llm.mnn.weight

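Features

- Export the whole model as a single onnx model with --export
- Export the model split into several models with --export_split
- Export the model's vocabulary to a text file, one token per line with each token base64-encoded, using --export_token
- Export the model's Embedding layer as an onnx model with --export_embed; bf16 format is also supported with --embed_bf16
- Export the model's blocks layer by layer: --export_blocks exports all layers, --export_block $id exports a single layer
- Export the model's lm_head layer as an onnx model with --export_lm
- Export the visual model of a multimodal model as an onnx model with --export_visual
- Run a chat test against the model: --test $query returns the LLM's reply
- Verify result consistency with onnxruntime after exporting the onnx model with --export_test
- Export the tokenizer as a text file with --export_token
- Convert the exported onnx model to an mnn model with --export_mnn; the default is asymmetric 4-bit quantization
- Set the output paths with --onnx_path and --mnn_path
- onnx-slim is applied to the onnx model by default; skip that step with --skip_slim
- Merge lora weights before exporting; point --lora_path at the lora weight directory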
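To make the base64 vocabulary format concrete, the sketch below writes and reads a token list with one base64-encoded token per line, as described above. It illustrates the encoding only: real tokenizer.txt files produced by llm_export may contain additional header or metadata lines, and both function names are invented here.

```python
import base64


def encode_vocab(tokens: list[bytes]) -> str:
    """One base64-encoded token per line."""
    return "\n".join(base64.b64encode(t).decode("ascii") for t in tokens)


def decode_vocab(text: str) -> list[bytes]:
    """Inverse of encode_vocab; blank lines are ignored."""
    return [base64.b64decode(line) for line in text.splitlines() if line.strip()]


if __name__ == "__main__":
    vocab = ["hello".encode(), " world".encode(), "你好".encode("utf-8")]
    text = encode_vocab(vocab)
    print(text)
    assert decode_vocab(text) == vocab
```

Base64 keeps every token on a single line even when the raw bytes contain spaces, newlines, or partial UTF-8 sequences, which is presumably why a line-oriented text file can hold the vocabulary safely.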

Parameters

usage: llm_export.py [-h] --path PATH
                     [--type {chatglm-6b,chatglm2-6b,chatglm3-6b,codegeex2-6b,Qwen-7B-Chat,Qwen-1_8B-Chat,Qwen-1_8B,Qwen-VL-Chat,Qwen1_5-0_5B-Chat,Qwen1_5-1_8B-Chat,Qwen1_5-4B-Chat,Qwen1_5-7B-Chat,Qwen2-1_5B-Instruct,Baichuan2-7B-Chat,Llama-2-7b-chat-ms,Llama-3-8B-Instruct,internlm-chat-7b,TinyLlama-1_1B-Chat,Yi-6B-Chat,deepseek-llm-7b-chat,phi-2,bge-large-zh,lora}]
                     [--lora_path LORA_PATH] [--onnx_path ONNX_PATH] [--mnn_path MNN_PATH] [--export_mnn] [--export_verbose] [--export_test] [--test TEST] [--export] [--export_split] [--export_token]
                     [--export_embed] [--export_visual] [--export_lm] [--export_block EXPORT_BLOCK] [--export_blocks] [--embed_bin] [--embed_bf16] [--skip_slim]

llm_exporter

options:
  -h, --help            show this help message and exit
  --path PATH           path(`str` or `os.PathLike`):
                        Can be either:
                            - A string, the *model id* of a pretrained model like `THUDM/chatglm-6b`. [TODO]
                            - A path to a *directory* clone from repo like `../chatglm-6b`.
  --type {chatglm-6b,chatglm2-6b,chatglm3-6b,codegeex2-6b,Qwen-7B-Chat,Qwen-1_8B-Chat,Qwen-1_8B,Qwen-VL-Chat,Qwen1_5-0_5B-Chat,Qwen1_5-1_8B-Chat,Qwen1_5-4B-Chat,Qwen1_5-7B-Chat,Qwen2-1_5B-Instruct,Baichuan2-7B-Chat,Llama-2-7b-chat-ms,Llama-3-8B-Instruct,internlm-chat-7b,TinyLlama-1_1B-Chat,Yi-6B-Chat,deepseek-llm-7b-chat,phi-2,bge-large-zh,lora}
                        type(`str`, *optional*):
                            The pretrain llm model type.
  --lora_path LORA_PATH
                        lora path, defaut is `None` mean not apply lora.
  --onnx_path ONNX_PATH
                        export onnx model path, defaut is `./onnx`.
  --mnn_path MNN_PATH   export mnn model path, defaut is `./mnn`.
  --export_mnn          Whether or not to export mnn model after onnx.
  --export_verbose      Whether or not to export onnx with verbose.
  --export_test         Whether or not to export onnx with test using onnxruntime.
  --test TEST           test model inference with query `TEST`.
  --export              export model to an `onnx` model.
  --export_split        export model split to some `onnx` models:
                            - embedding model.
                            - block models.
                            - lm_head model.
  --export_token        export llm tokenizer to a txt file.
  --export_embed        export llm embedding to an `onnx` model.
  --export_visual       export llm visual model to an `onnx` model.
  --export_lm           export llm lm_head to an `onnx` model.
  --export_block EXPORT_BLOCK
                        export llm block [id] to an `onnx` model.
  --export_blocks       export llm all blocks to `onnx` models.
  --embed_bin           export embedding weight as bin file with dtype `bfloat16`
  --embed_bf16          using `bfloat16` replace `float32` in embedding.
  --skip_slim           Whether or not to skip onnx-slim.

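Model inference

Usage

Runtime configuration

Runtime files

Place the export artifacts needed for inference in a single folder and add a config.json that describes the model name and inference parameters. The layout is: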

.
└── model_dir
     ├── config.json
     ├── embeddings_bf16.bin
     ├── llm_config.json
     ├── llm.mnn
     ├── llm.mnn.weight
     └── tokenizer.txt
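Collecting the runtime files can be scripted. The sketch below copies the five inference-time artifacts from the onnx and mnn export folders into one model_dir. It assumes the default export layout shown earlier, and collect_runtime_files is a hypothetical helper. config.json still has to be written separately.

```python
import shutil
from pathlib import Path

# Inference-time artifacts, keyed by the export subfolder they come from.
RUNTIME_FILES = {
    "onnx": ["embeddings_bf16.bin", "llm_config.json", "tokenizer.txt"],
    "mnn": ["llm.mnn", "llm.mnn.weight"],
}


def collect_runtime_files(export_root: Path, model_dir: Path) -> list[Path]:
    """Copy the files needed for inference into model_dir; return the copies."""
    model_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for sub, names in RUNTIME_FILES.items():
        for name in names:
            target = shutil.copy2(export_root / sub / name, model_dir / name)
            copied.append(Path(target))
    return copied
```

For example, collect_runtime_files(Path("."), Path("model_dir")) run from the export directory would gather everything except llm.onnx, which is not used at inference time.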

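Configuration options

The configuration file supports the following options:

Model file information
- base_dir: directory from which model files are loaded; defaults to the directory containing config.json, or the directory containing the model;
- llm_config: the actual path of llm_config.json is base_dir + llm_config; defaults to base_dir + 'config.json'
- llm_model: the actual path of llm.mnn is base_dir + llm_model; defaults to base_dir + 'llm.mnn'
- llm_weight: the actual path of llm.mnn.weight is base_dir + llm_weight; defaults to base_dir + 'llm.mnn.weight'
- block_model: for a split model, the actual path of block_{idx}.mnn is base_dir + block_model; defaults to base_dir + 'block_{idx}.mnn'
- lm_model: for a split model, the actual path of lm.mnn is base_dir + lm_model; defaults to base_dir + 'lm.mnn'
- embedding_model: when the embedding is a model, its actual path is base_dir + embedding_model; defaults to base_dir + 'embedding.mnn'
- embedding_file: when the embedding is a binary file, its actual path is base_dir + embedding_file; defaults to base_dir + 'embeddings_bf16.bin'
- tokenizer_file: the actual path of tokenizer.txt is base_dir + tokenizer_file; defaults to base_dir + 'tokenizer.txt'
- visual_model: when using a VL model, the actual path of the visual model is base_dir + visual_model; defaults to base_dir + 'visual.mnn'
Inference configuration
- max_new_tokens: maximum number of tokens to generate; defaults to 512
Hardware configuration
- backend_type: hardware backend used for inference; defaults to "cpu"
- thread_num: number of hardware threads used for inference; defaults to 4
- precision: precision policy for inference; defaults to "low", which prefers fp16
- memory: memory policy for inference; defaults to "low", which enables runtime quantization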
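The defaulting rules above amount to "join base_dir with the configured name, or with a default name". The sketch below mirrors that rule in Python purely as an illustration. It is not MNN's implementation, it covers only a few of the keys, and resolve_paths is a hypothetical name.

```python
import os

# Default file names as listed above (subset).
DEFAULTS = {
    "llm_model": "llm.mnn",
    "llm_weight": "llm.mnn.weight",
    "tokenizer_file": "tokenizer.txt",
    "embedding_file": "embeddings_bf16.bin",
}


def resolve_paths(config: dict, config_path: str) -> dict:
    """Resolve each file key to base_dir + configured (or default) name."""
    base_dir = config.get("base_dir", os.path.dirname(config_path))
    return {key: os.path.join(base_dir, config.get(key, default))
            for key, default in DEFAULTS.items()}


if __name__ == "__main__":
    cfg = {"llm_model": "qwen2-1.5b-int4.mnn",
           "llm_weight": "qwen2-1.5b-int4.mnn.weight"}
    for key, path in resolve_paths(cfg, "model_dir/config.json").items():
        print(key, "->", path)
```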

Example configuration files

config.json

{
    "llm_model": "qwen2-1.5b-int4.mnn",
    "llm_weight": "qwen2-1.5b-int4.mnn.weight",

    "backend_type": "cpu",
    "thread_num": 4,
    "precision": "low",
    "memory": "low"
}

llm_config.json

{
    "hidden_size": 1536,
    "layer_nums": 28,
    "attention_mask": "float",
    "key_value_shape": [
        2,
        1,
        0,
        2,
        128
    ],
    "prompt_template": "<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n",
    "is_visual": false,
    "is_single": true
}
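The prompt_template field uses a %s placeholder for the user's message. As a rough illustration of the substitution (done here with Python's str.replace; the engine performs its own substitution internally), the template above expands as follows. apply_template is a name made up for this example.

```python
import json

# Only the prompt_template field from the llm_config.json example above.
llm_config = json.loads(r'''
{"prompt_template": "<|im_start|>user\n%s<|im_end|>\n<|im_start|>assistant\n"}
''')


def apply_template(template: str, query: str) -> str:
    """Substitute the user's query for the first %s placeholder."""
    return template.replace("%s", query, 1)


if __name__ == "__main__":
    print(apply_template(llm_config["prompt_template"], "hello, how are you"))
```

The model then continues the text after the final assistant marker, which is how a plain completion engine is turned into a chat turn.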

Inference usage

llm_demo is used as follows:

# With config.json
## Interactive chat
./llm_demo model_dir/config.json
## Reply to each line of the prompt file
./llm_demo model_dir/config.json prompt.txt

# Without config.json, using the default configuration
## Interactive chat
./llm_demo model_dir/llm.mnn
## Reply to each line of the prompt file
./llm_demo model_dir/llm.mnn prompt.txt
