Build and Adaptation

The OpenCL stack on the O6's GPU cannot run llama.cpp's OpenCL operators, so Vulkan is the only usable GPU backend.

Build steps:

sudo apt install cmake libcurl4-openssl-dev clang libomp-dev glslc

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
mkdir build && cd build

cmake .. \
  -DCMAKE_CXX_FLAGS="-march=armv9-a" \
  -DCMAKE_C_COMPILER=/usr/bin/clang \
  -DCMAKE_CXX_COMPILER=/usr/bin/clang++ \
  -DCMAKE_C_FLAGS="-march=armv9-a" \
  -DGGML_CPU_KLEIDIAI=ON \
  -DCMAKE_BUILD_TYPE=Release \
  -DBUILD_SHARED_LIBS=OFF \
  -DGGML_VULKAN=ON
make -j12

wget https://modelscope.cn/models/prithivMLmods/Llama-3.2-3B-Instruct-GGUF/resolve/master/Llama-3.2-3B-Instruct.Q8_0.gguf

llama.cpp must be built from the master branch. Older versions hit a build bug and fail at link time:

clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [examples/batched/CMakeFiles/llama-batched.dir/build.make:107: bin/llama-batched] Error 1
make[1]: *** [CMakeFiles/Makefile2:3029: examples/batched/CMakeFiles/llama-batched.dir/all] Error 2
/usr/bin/ld: ../../ggml/src/ggml-vulkan/libggml-vulkan.a(ggml-vulkan.cpp.o): in function `ggml_vk_load_shaders(std::shared_ptr<vk_device_struct>&)':
ggml-vulkan.cpp:(.text+0x3c310): undefined reference to `rope_norm_f16_rte_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c314): undefined reference to `rope_norm_f16_rte_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c338): undefined reference to `rope_norm_f16_rte_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c358): undefined reference to `rope_norm_f16_rte_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c380): undefined reference to `rope_neox_f16_rte_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c38c): undefined reference to `rope_neox_f16_rte_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c3b4): undefined reference to `rope_neox_f16_rte_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c3d4): undefined reference to `rope_neox_f16_rte_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c5e0): undefined reference to `rope_norm_f16_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c5e4): undefined reference to `rope_norm_f16_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c608): undefined reference to `rope_norm_f16_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c628): undefined reference to `rope_norm_f16_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c650): undefined reference to `rope_neox_f16_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c65c): undefined reference to `rope_neox_f16_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c684): undefined reference to `rope_neox_f16_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c6a4): undefined reference to `rope_neox_f16_data'

The undefined symbols come from a shader that failed to compile earlier in the build:

cannot compile rope_norm_f16

/bin/glslc -fshader-stage=compute --target-env=vulkan1.2 -O /home/cix/llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/rope_norm.comp -o /home/cix/llama.cpp/build/ggml/src/ggml-vulkan/vulkan-shaders.spv/rope_norm_f16.spv -DA_TYPE=float16_t -DD_TYPE=float16_t 

shaderc: internal error: compilation succeeded but failed to optimize: Expected input to have different bit width from Result Type: FConvert
%212 = OpFConvert %half %211

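While compiling the shader to SPIR-V, glslc emits an OpFConvert between two values of the same type (half to half). OpFConvert requires the input and the result to have different bit widths, so the shaderc optimizer rejects it and the build fails.

The quick fix is to turn off optimization for the rope shaders: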

std::string opt_level = (coopmat || name.find("bf16") != std::string::npos || name.find("rope") != std::string::npos) ? "" : "-O";
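
To make the condition easier to read, here is a minimal standalone sketch of the same selection logic. The function name pick_opt_level and the plain bool parameter are assumptions for illustration only; in llama.cpp the expression is written inline as shown above.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper mirroring the one-line workaround above: returns the
// optimization flag to pass to glslc for a given shader name.
// `coopmat` stands in for the cooperative-matrix condition of the original line.
std::string pick_opt_level(const std::string& name, bool coopmat) {
    assert(!name.empty());  // a shader always has a name
    const bool is_bf16 = name.find("bf16") != std::string::npos;
    const bool is_rope = name.find("rope") != std::string::npos;
    // An empty string means "pass no -O flag", i.e. leave the shader unoptimized.
    return (coopmat || is_bf16 || is_rope) ? "" : "-O";
}
```

With this logic every shader whose name contains "rope" loses optimization, whether or not it actually hits the same-type conversion.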

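Corresponding PR: https://github.com/ggml-org/l...

That said, I think this fix is a cop-out (even though rope has little impact on overall performance). Is there a way to fix the problem properly while keeping optimization on?

There is: define a preprocessor macro so the cast is skipped when the two types are the same.

First, revert the cop-out change.

Then go to the shaders that fail to compile: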
ggml/src/ggml-vulkan/vulkan-shaders/rope_neox.comp
ggml/src/ggml-vulkan/vulkan-shaders/rope_norm.comp

Both contain this code:

if (i0 >= p.n_dims) {
  data_d[idst + i0/2 + 0] = data_a[ix + i0/2 + 0];
  data_d[idst + i0/2 + 1] = data_a[ix + i0/2 + 1];

  return;
}

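When A_TYPE and D_TYPE are the same, this produces a conversion between identical data types. Some glslc versions do not accept that, so we need a way around it.

In ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp, check whether the two types are the same: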

    if (name.find("rope") != std::string::npos && defines["D_TYPE"] == defines["A_TYPE"] ) {
        cmd.push_back("-DROPE_SAME_TYPE");
    }
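
One thing to watch in the snippet above: operator[] on a std::map inserts an empty entry when the key is missing, so two missing keys would also compare equal. Below is a sketch of the same check that avoids both effects. It assumes defines is a std::map<std::string, std::string>; the helper name needs_rope_same_type is hypothetical and not part of llama.cpp.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical helper: decides whether -DROPE_SAME_TYPE should be passed.
// Assumes `defines` maps macro names to their values, e.g. A_TYPE -> float16_t.
bool needs_rope_same_type(const std::string& name,
                          const std::map<std::string, std::string>& defines) {
    assert(!name.empty());
    if (name.find("rope") == std::string::npos) {
        return false;  // only the rope shaders are affected
    }
    // find() never inserts, unlike operator[].
    const auto a = defines.find("A_TYPE");
    const auto d = defines.find("D_TYPE");
    if (a == defines.end() || d == defines.end()) {
        return false;  // a missing define is not treated as "same type"
    }
    return a->second == d->second;
}
```

The caller would then push "-DROPE_SAME_TYPE" onto the glslc command line only when this returns true.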

In the shader, check whether ROPE_SAME_TYPE is defined:

        #ifdef ROPE_SAME_TYPE
        data_d[idst + i0/2 + 0] = data_a[ix + i0/2 + 0];
        data_d[idst + i0/2 + 1] = data_a[ix + i0/2 + 1];
        #else
        data_d[idst + i0/2 + 0] = D_TYPE(data_a[ix + i0/2 + 0]);
        data_d[idst + i0/2 + 1] = D_TYPE(data_a[ix + i0/2 + 1]);
        #endif

With this change the build succeeds.

Now for the benchmarks:

cix@cix-localhost:~/llama.cpp/build$ llama-bench -m Llama-3.2-3B-Instruct.Q8_0.gguf -mg 0
| model                          |       size |     params | backend    | threads |          test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | CPU        |      12 |         pp512 |         40.76 ± 0.04 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | CPU        |      12 |         tg128 |          9.99 ± 0.06 |

cix@cix-localhost:~/llama.cpp/build$ bin/llama-bench -m Llama-3.2-3B-Instruct.Q8_0.gguf --device none -t 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl | threads | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------ | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |       1 | none         |           pp512 |          4.79 ± 0.00 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 |       1 | none         |           tg128 |          4.22 ± 0.00 |

cix@cix-localhost:~/llama.cpp/build$ bin/llama-bench -m Llama-3.2-3B-Instruct.Q8_0.gguf --device Vulkan0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------ | --------------: | -------------------: |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 | Vulkan0      |           pp512 |          6.51 ± 0.02 |
| llama 3B Q8_0                  |   3.18 GiB |     3.21 B | Vulkan     |  99 | Vulkan0      |           tg128 |          7.92 ± 0.00 |

cix@cix-localhost:~/llama.cpp/build$ bin/llama-bench -m Llama-3.2-3B-Instruct.F16.gguf --device none   
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------ | --------------: | -------------------: |
| llama 3B F16                   |   5.98 GiB |     3.21 B | Vulkan     |  99 | none         |           pp512 |         31.07 ± 0.20 |
| llama 3B F16                   |   5.98 GiB |     3.21 B | Vulkan     |  99 | none         |           tg128 |          3.90 ± 0.07 |

cix@cix-localhost:~/llama.cpp/build$ bin/llama-bench -m Llama-3.2-3B-Instruct.F16.gguf --device none -t 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl | threads | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------ | --------------: | -------------------: |
| llama 3B F16                   |   5.98 GiB |     3.21 B | Vulkan     |  99 |       1 | none         |           pp512 |          4.56 ± 0.00 |
| llama 3B F16                   |   5.98 GiB |     3.21 B | Vulkan     |  99 |       1 | none         |           tg128 |          3.78 ± 0.00 |

cix@cix-localhost:~/llama.cpp/build$ bin/llama-bench -m Llama-3.2-3B-Instruct.F16.gguf --device Vulkan0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model                          |       size |     params | backend    | ngl | dev          |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------ | --------------: | -------------------: |
| llama 3B F16                   |   5.98 GiB |     3.21 B | Vulkan     |  99 | Vulkan0      |           pp512 |          8.80 ± 0.03 |
| llama 3B F16                   |   5.98 GiB |     3.21 B | Vulkan     |  99 | Vulkan0      |           tg128 |          3.31 ± 0.03 |

Llama 3.2 3B Instruct Performance Report

Test environment:

GPU: Mali-G720-Immortalis
Vulkan support: FP16 (no BF16, no integer dot product, no matrix cores)
Shared memory: 32 KB
Warp size: 16
llama.cpp: local build with Vulkan enabled
Benchmark tool: llama-bench
Model files:
- Llama-3.2-3B-Instruct.Q8_0.gguf (3.18 GiB)
- Llama-3.2-3B-Instruct.F16.gguf (5.98 GiB)

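Benchmark Parameters

| Parameter | Meaning |
| --- | --- |
| `--device none` | Disable the GPU and run inference on the CPU (even when Vulkan is available) |
| `--device Vulkan0` | Use GPU device 0 (Mali-G720) with Vulkan acceleration |
| `-t <n>` | Set the number of CPU threads |
| `pp512` | Prefill test: process a 512-token prompt (prompt processing speed) |
| `tg128` | Token generation test: generate 128 tokens (text generation speed) |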
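
To get a feel for what these throughput figures mean in wall-clock terms, the sketch below converts tokens/s into seconds for a given token count. It assumes the reported t/s is simply the token count divided by elapsed time; the function name seconds_for is made up for this example.

```cpp
#include <cassert>

// Hypothetical helper: time in seconds needed to process `tokens` tokens
// at a sustained rate of `tokens_per_second`.
double seconds_for(int tokens, double tokens_per_second) {
    assert(tokens_per_second > 0.0);
    return tokens / tokens_per_second;
}
```

For example, seconds_for(512, 40.76) gives roughly 12.6 seconds for a pp512 run at 40.76 t/s, and seconds_for(128, 9.99) gives roughly 12.8 seconds for a tg128 run at 9.99 t/s.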

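Benchmark Results

Llama-3.2-3B-Instruct.Q8_0 (8-bit quantized)

| Mode | Backend | Threads | Device | Test | Tokens/s | Notes |
| --- | --- | --: | --- | --- | --: | --- |
| CPU (multi-thread) | CPU | 12 | CPU | pp512 | 40.76 ± 0.04 | Fastest prefill of all runs |
| | CPU | 12 | CPU | tg128 | 9.99 ± 0.06 | Moderate generation rate, still the fastest measured |
| CPU (single thread) | CPU | 1 | none | pp512 | 4.79 ± 0.00 | Clearly limited by thread count |
| | CPU | 1 | none | tg128 | 4.22 ± 0.00 | Noticeable drop from 12 threads |
| GPU (Vulkan) | Vulkan | - | Vulkan0 | pp512 | 6.51 ± 0.02 | Limited GPU speedup, far below 12-thread CPU |
| | Vulkan | - | Vulkan0 | tg128 | 7.92 ± 0.00 | About 1.9x single-thread CPU, below 12-thread CPU |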

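Llama-3.2-3B-Instruct.F16 (half precision)

| Mode | Backend | Threads | Device | Test | Tokens/s | Notes |
| --- | --- | --: | --- | --- | --: | --- |
| CPU (multi-thread) | CPU | 12 | none | pp512 | 31.07 ± 0.20 | Good prefill performance |
| | CPU | 12 | none | tg128 | 3.90 ± 0.07 | Slow generation |
| CPU (single thread) | CPU | 1 | none | pp512 | 4.56 ± 0.00 | Limited by thread count |
| | CPU | 1 | none | tg128 | 3.78 ± 0.00 | Same as above |
| GPU (Vulkan) | Vulkan | - | Vulkan0 | pp512 | 8.80 ± 0.03 | About 1.9x single-thread CPU, but well below 12-thread CPU |
| | Vulkan | - | Vulkan0 | tg128 | 3.31 ± 0.03 | No improvement in generation speed |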

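Performance Analysis

| Item | Conclusion |
| --- | --- |
| Multi-threaded CPU performs best | The Q8_0 model reaches about 41 tokens/s in prefill (pp512) on the CPU with 12 threads, about 6.3x the Vulkan result (6.51 tokens/s). |
| Vulkan speedup is limited | The Mali-G720 has no matrix cores (int8/bf16) and reports no integer dot product support, so the Vulkan backend has limited parallel throughput to offer. |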
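
The ratios behind these conclusions can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: the constants are copied from the llama-bench output earlier in this post, and the names (speedup, kQ8CpuPp512 and so on) are made up for this example.

```cpp
#include <cassert>

// Ratio of one throughput figure to a baseline, both in tokens/s.
double speedup(double tps, double baseline_tps) {
    assert(baseline_tps > 0.0);
    return tps / baseline_tps;
}

// pp512 results copied from the llama-bench tables (tokens/s).
constexpr double kQ8CpuPp512     = 40.76;  // Q8_0, CPU, 12 threads
constexpr double kQ8VulkanPp512  = 6.51;   // Q8_0, Vulkan0
constexpr double kF16CpuPp512    = 31.07;  // F16, CPU, default threads
constexpr double kF16VulkanPp512 = 8.80;   // F16, Vulkan0

// How much faster the multi-threaded CPU is than Vulkan in prefill.
double q8_cpu_over_vulkan()  { return speedup(kQ8CpuPp512, kQ8VulkanPp512); }
double f16_cpu_over_vulkan() { return speedup(kF16CpuPp512, kF16VulkanPp512); }
```

On these numbers the multi-threaded CPU is roughly 6.3x faster than Vulkan in prefill for Q8_0 and roughly 3.5x faster for F16.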

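Overall Conclusions

| Model | Best mode | Configuration | Peak speed, pp512 (t/s) |
| --- | --- | --- | --: |
| Llama-3.2-3B Q8_0 | CPU | 12 threads | 40.76 |
| Llama-3.2-3B F16 | CPU | 12 threads | 31.07 |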
