Compilation and Adaptation
The O6's GPU cannot run llama.cpp's OpenCL kernels through its OpenCL stack, so Vulkan is our only option.
Build steps:
```bash
sudo apt install cmake libcurl4-openssl-dev clang libomp-dev glslc
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
mkdir build && cd build
cmake .. \
    -DCMAKE_C_COMPILER=/usr/bin/clang \
    -DCMAKE_CXX_COMPILER=/usr/bin/clang++ \
    -DCMAKE_C_FLAGS="-march=armv9-a" \
    -DCMAKE_CXX_FLAGS="-march=armv9-a" \
    -DGGML_CPU_KLEIDIAI=ON \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_SHARED_LIBS=OFF \
    -DGGML_VULKAN=ON
make -j12
```
```bash
wget https://modelscope.cn/models/prithivMLmods/Llama-3.2-3B-Instruct-GGUF/resolve/master/Llama-3.2-3B-Instruct.Q8_0.gguf
```

llama.cpp must be built from the master branch; older versions hit a compile bug:
```
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [examples/batched/CMakeFiles/llama-batched.dir/build.make:107: bin/llama-batched] Error 1
make[1]: *** [CMakeFiles/Makefile2:3029: examples/batched/CMakeFiles/llama-batched.dir/all] Error 2
/usr/bin/ld: ../../ggml/src/ggml-vulkan/libggml-vulkan.a(ggml-vulkan.cpp.o): in function `ggml_vk_load_shaders(std::shared_ptr<vk_device_struct>&)':
ggml-vulkan.cpp:(.text+0x3c310): undefined reference to `rope_norm_f16_rte_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c314): undefined reference to `rope_norm_f16_rte_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c338): undefined reference to `rope_norm_f16_rte_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c358): undefined reference to `rope_norm_f16_rte_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c380): undefined reference to `rope_neox_f16_rte_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c38c): undefined reference to `rope_neox_f16_rte_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c3b4): undefined reference to `rope_neox_f16_rte_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c3d4): undefined reference to `rope_neox_f16_rte_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c5e0): undefined reference to `rope_norm_f16_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c5e4): undefined reference to `rope_norm_f16_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c608): undefined reference to `rope_norm_f16_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c628): undefined reference to `rope_norm_f16_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c650): undefined reference to `rope_neox_f16_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c65c): undefined reference to `rope_neox_f16_len'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c684): undefined reference to `rope_neox_f16_data'
/usr/bin/ld: ggml-vulkan.cpp:(.text+0x3c6a4): undefined reference to `rope_neox_f16_data'
```

This is caused by:
```
cannot compile rope_norm_f16

/bin/glslc -fshader-stage=compute --target-env=vulkan1.2 -O /home/cix/llama.cpp/ggml/src/ggml-vulkan/vulkan-shaders/rope_norm.comp -o /home/cix/llama.cpp/build/ggml/src/ggml-vulkan/vulkan-shaders.spv/rope_norm_f16.spv -DA_TYPE=float16_t -DD_TYPE=float16_t

shaderc: internal error: compilation succeeded but failed to optimize: Expected input to have different bit width from Result Type: FConvert
%212 = OpFConvert %half %211
```

While generating the SPIR-V, glslc emits an FConvert between two variables of the same type; shaderc considers this invalid and the compilation fails.
The upstream workaround is simply to disable optimization for the affected shaders during compilation:

```cpp
std::string opt_level = (coopmat || name.find("bf16") != std::string::npos || name.find("rope") != std::string::npos) ? "" : "-O";
```

Corresponding PR: https://github.com/ggml-org/l...
But I find this fix rather lazy (even though rope has little impact on performance anyway). Is there a way to fix the problem for good while keeping optimization enabled?
There is: we just set a preprocessor define so that no cast is performed when the two types are identical.
First, revert the lazy change.
Then locate the source of the compile error:
```
ggml/src/ggml-vulkan/vulkan-shaders/rope_neox.comp
ggml/src/ggml-vulkan/vulkan-shaders/rope_norm.comp
```

Both shaders contain this operation:

```glsl
if (i0 >= p.n_dims) {
    data_d[idst + i0/2 + 0] = D_TYPE(data_a[ix + i0/2 + 0]);
    data_d[idst + i0/2 + 1] = D_TYPE(data_a[ix + i0/2 + 1]);
    return;
}
```

When A_TYPE and D_TYPE are identical, the D_TYPE() constructor converts between the same data type; some glslc builds reject that operation, so we need to work around it.
- In ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp, detect whether the two types are identical:

```cpp
if (name.find("rope") != std::string::npos && defines["D_TYPE"] == defines["A_TYPE"]) {
    cmd.push_back("-DROPE_SAME_TYPE");
}
```

- In the shader, branch on whether ROPE_SAME_TYPE is defined:

```glsl
#ifdef ROPE_SAME_TYPE
    data_d[idst + i0/2 + 0] = data_a[ix + i0/2 + 0];
    data_d[idst + i0/2 + 1] = data_a[ix + i0/2 + 1];
#else
    data_d[idst + i0/2 + 0] = D_TYPE(data_a[ix + i0/2 + 0]);
    data_d[idst + i0/2 + 1] = D_TYPE(data_a[ix + i0/2 + 1]);
#endif
```

After this change the build succeeds.
Performance testing
```
cix@cix-localhost:~/llama.cpp/build$ llama-bench -m Llama-3.2-3B-Instruct.Q8_0.gguf -mg 0
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | CPU | 12 | pp512 | 40.76 ± 0.04 |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | CPU | 12 | tg128 | 9.99 ± 0.06 |

cix@cix-localhost:~/llama.cpp/build$ bin/llama-bench -m Llama-3.2-3B-Instruct.Q8_0.gguf --device none -t 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | threads | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------ | --------------: | -------------------: |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | none | pp512 | 4.79 ± 0.00 |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | 1 | none | tg128 | 4.22 ± 0.00 |

cix@cix-localhost:~/llama.cpp/build$ bin/llama-bench -m Llama-3.2-3B-Instruct.Q8_0.gguf --device Vulkan0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------ | --------------: | -------------------: |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | Vulkan0 | pp512 | 6.51 ± 0.02 |
| llama 3B Q8_0 | 3.18 GiB | 3.21 B | Vulkan | 99 | Vulkan0 | tg128 | 7.92 ± 0.00 |

cix@cix-localhost:~/llama.cpp/build$ bin/llama-bench -m Llama-3.2-3B-Instruct.F16.gguf --device none
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------ | --------------: | -------------------: |
| llama 3B F16 | 5.98 GiB | 3.21 B | Vulkan | 99 | none | pp512 | 31.07 ± 0.20 |
| llama 3B F16 | 5.98 GiB | 3.21 B | Vulkan | 99 | none | tg128 | 3.90 ± 0.07 |

cix@cix-localhost:~/llama.cpp/build$ bin/llama-bench -m Llama-3.2-3B-Instruct.F16.gguf --device none -t 1
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | threads | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------ | --------------: | -------------------: |
| llama 3B F16 | 5.98 GiB | 3.21 B | Vulkan | 99 | 1 | none | pp512 | 4.56 ± 0.00 |
| llama 3B F16 | 5.98 GiB | 3.21 B | Vulkan | 99 | 1 | none | tg128 | 3.78 ± 0.00 |

cix@cix-localhost:~/llama.cpp/build$ bin/llama-bench -m Llama-3.2-3B-Instruct.F16.gguf --device Vulkan0
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = Mali-G720-Immortalis (Mali-G720-Immortalis) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 16 | shared memory: 32768 | int dot: 0 | matrix cores: none
| model | size | params | backend | ngl | dev | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------------ | --------------: | -------------------: |
| llama 3B F16 | 5.98 GiB | 3.21 B | Vulkan | 99 | Vulkan0 | pp512 | 8.80 ± 0.03 |
| llama 3B F16 | 5.98 GiB | 3.21 B | Vulkan | 99 | Vulkan0 | tg128 | 3.31 ± 0.03 |
```
Llama 3.2 3B Instruct Performance Report
Test environment:
- GPU: Mali-G720-Immortalis
- Vulkan support: FP16
- Shared memory: 32 KB
- Warp size: 16
- llama.cpp: local build (Vulkan enabled)
- Benchmark tool: llama-bench
- Model files: Llama-3.2-3B-Instruct.Q8_0.gguf (3.18 GiB), Llama-3.2-3B-Instruct.F16.gguf (5.98 GiB)
Benchmark option reference

| Option | Meaning |
|---|---|
| `--device none` | Disable the GPU and run inference on the CPU (even when Vulkan is available) |
| `--device Vulkan0` | Run on GPU device 0 (Mali-G720) with Vulkan acceleration |
| `-t <n>` | Number of CPU threads |
| `pp512` | Prefill test (prompt-ingestion speed) |
| `tg128` | Token-generation test (text-generation speed) |
Benchmark results
Llama-3.2-3B-Instruct.Q8_0 (8-bit quantized)
| Mode | Backend | Threads | Device | Test | Tokens/s | Notes |
|---|---|---|---|---|---|---|
| CPU (multi-threaded) | CPU | 12 | CPU | pp512 | 40.76 ± 0.04 | Very fast prefill |
| CPU (multi-threaded) | CPU | 12 | CPU | tg128 | 9.99 ± 0.06 | Moderate generation rate |
| CPU (single-threaded) | CPU | 1 | none | pp512 | 4.79 ± 0.00 | Clearly limited by thread count |
| CPU (single-threaded) | CPU | 1 | none | tg128 | 4.22 ± 0.00 | Marked performance drop |
| GPU (Vulkan) | Vulkan | - | Vulkan0 | pp512 | 6.51 ± 0.02 | Limited GPU speedup |
| GPU (Vulkan) | Vulkan | - | Vulkan0 | tg128 | 7.92 ± 0.00 | Slightly better than single-threaded CPU |
Llama-3.2-3B-Instruct.F16 (half-precision)

| Mode | Backend | Threads | Device | Test | Tokens/s | Notes |
|---|---|---|---|---|---|---|
| CPU (multi-threaded) | CPU | 12 | none | pp512 | 31.07 ± 0.20 | Decent prefill performance |
| CPU (multi-threaded) | CPU | 12 | none | tg128 | 3.90 ± 0.07 | Slow generation |
| CPU (single-threaded) | CPU | 1 | none | pp512 | 4.56 ± 0.00 | Thread-limited |
| CPU (single-threaded) | CPU | 1 | none | tg128 | 3.78 ± 0.00 | Same as above |
| GPU (Vulkan) | Vulkan | - | Vulkan0 | pp512 | 8.80 ± 0.03 | Meaningful prefill speedup over single-threaded CPU |
| GPU (Vulkan) | Vulkan | - | Vulkan0 | tg128 | 3.31 ± 0.03 | No generation speedup |
Performance analysis

| Finding | Conclusion |
|---|---|
| CPU multi-threading performs best | The Q8_0 model reaches ≈ 41 tokens/s on the CPU (12 threads), far ahead of the GPU path. |
| Vulkan speedup is limited | The Mali-G720 has no matrix cores and no int8 dot-product support, so Vulkan parallelism is limited. |
Overall conclusion

| Model | Best mode | Configuration | Peak speed (t/s) |
|---|---|---|---|
| Llama-3.2-3B Q8_0 | CPU | 12 threads | 40.76 |
| Llama-3.2-3B F16 | CPU | 12 threads | 31.07 |