Deploying the MiniCPM-o 2.6 Multimodal Model on the CIX P1

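MiniCPM-o 2.6 is an on-device omni-modal AI model released by ModelBest (面壁智能), billed as the first of its kind worldwide. With only 8 billion parameters (8B), its performance is positioned against top models such as GPT-4o and Claude-3.5-Sonnet. Its main highlight is local deployment: it supports real-time multimodal interaction on mobile devices such as the iPad, marking a significant shift of AI from the cloud to end devices.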

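Technical Architecture and Innovations

End-to-end omni-modal streaming architecture
A modular design processes text, image, audio, and video streams in a unified way. Low-latency modality concurrency and time-division multiplexing give millisecond-level response times and smoother interaction.
Efficient model compression and optimization
Algorithmic optimization keeps the parameter count at 8B without sacrificing performance. Compared with traditional large models (for example GPT-3, with 175 billion parameters), this significantly reduces hardware requirements, so complex AI tasks can run locally on mobile devices.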
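
To make the hardware argument concrete, the following is a rough back-of-the-envelope sketch of weight storage at different precisions. It assumes memory is simply parameters times bits per weight, and it ignores activations, the KV cache, and quantization metadata, so real GGUF files will be somewhat larger. The helper name `weight_gib` is illustrative only.

```python
# Rough weight-memory estimate: parameters x bits per weight.
# Assumption: activations, KV cache, and quantization metadata are ignored.

def weight_gib(params: float, bits_per_weight: float) -> float:
    """Return approximate weight storage in GiB."""
    return params * bits_per_weight / 8 / 2**30

MODELS = {"MiniCPM-o 2.6 (8B)": 8e9, "GPT-3 (175B)": 175e9}
PRECISIONS = {"fp16": 16, "int4": 4}

for name, params in MODELS.items():
    for label, bits in PRECISIONS.items():
        print(f"{name:20s} {label}: {weight_gib(params, bits):7.1f} GiB")
```

Under these assumptions an 8B model needs roughly 15 GiB at fp16 and under 4 GiB at a nominal 4 bits per weight, which is why the quantized model is the practical choice on a single-board computer.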

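Core Capabilities

Multimodal perception and interaction

Visual understanding: tracks dynamic scenes in real time (for example card positions in a memory card game, or the trajectory of a moving ball), and supports streaming video analysis and complex scene reasoning.
Audio processing: parses the audio waveform directly, recognizing background sounds (paper tearing, running water) and emotional tone, going beyond traditional models that depend on text transcription.
Speech generation: supports bilingual Chinese-English conversation with configurable emotion, speed, and style, plus voice cloning and role play.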

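Performance

On the OpenCompass multimodal benchmark it scores an average of 70.2, and its single-image understanding surpasses closed-source models such as GPT-4o-202405 and Gemini 1.5 Pro.
It supports the widest range of modalities among open-source models, and its speech understanding and generation reach state-of-the-art (SOTA) level.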

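Application Scenarios

Mobile devices: terminals such as the iPad can run a GPT-4o-class model locally for low-latency real-time interaction (for example voice assistants and in-game assistance).
Smart hardware: fits smart-home devices, autonomous vehicles, and humanoid robots, strengthening environment perception and decision making.
Content creation: supports creative tools such as AI painting and copywriting, using multimodal input to improve creative efficiency.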

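Open Source and Ecosystem

Open-source release: code and model weights are published on GitHub (OpenBMB/MiniCPM-o) and Hugging Face (openbmb/MiniCPM-o-2_6).
Developer ecosystem: more than 4 million cumulative downloads, attracting hardware manufacturers to collaborate on embodied-intelligence devices (such as robots and smart cars).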

[Figure: minicpm-o2.6_perf.jpg]

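MiniCPM-o 2.6 can be used in several ways: (1) llama.cpp, for efficient CPU inference on local devices; (2) quantized models in int4 and GGUF formats, in 16 sizes; (3) vLLM, for high-throughput, memory-efficient inference; (4) fine-tuning for new domains and tasks with the LLaMA-Factory framework; (5) a quick local WebUI demo with Gradio; (6) the online demo. This article explains in detail how to run MiniCPM-o 2.6 inference efficiently with llama.cpp on the Radxa Orion O6 development board, which is built around the CIX P1 SoC.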

Building llama.cpp

Clone the repository

git clone https://github.com/ggml-org/llama.cpp.git

Install the build tools

sudo apt install cmake gcc g++

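Build the project

The Radxa Orion O6 has an Armv9 CPU, so you can pass an armv9-a architecture string at configure time to enable hardware features such as SVE, i8mm, and dot-product acceleration.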

cmake -B build -DGGML_NATIVE=OFF -DGGML_CPU_ARM_ARCH=armv9-a+i8mm+dotprod+sve+sve2+fp16
cmake --build build --config Release -j8
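
Before hard-coding the architecture string, it can help to confirm that the kernel actually reports those extensions. The sketch below parses `/proc/cpuinfo` and lists any extension from the flags above that is not advertised on every core. The mapping from compiler extension names to Linux feature names (`asimddp` for dotprod, `fphp` for fp16) is an assumption based on the arm64 hwcap naming, and the helper names are hypothetical.

```python
from pathlib import Path

# Assumed mapping: -march extension name -> arm64 /proc/cpuinfo feature flag.
REQUIRED = {
    "i8mm": "i8mm",
    "dotprod": "asimddp",
    "sve": "sve",
    "sve2": "sve2",
    "fp16": "fphp",
}

def parse_features(cpuinfo_text: str) -> set[str]:
    """Return the flags present on every core's 'Features' line."""
    per_core = []
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Features":
            per_core.append(set(value.split()))
    # Intersect so a flag missing on any core counts as missing.
    return set.intersection(*per_core) if per_core else set()

def missing_extensions(cpuinfo_text: str) -> list[str]:
    """List the extensions from REQUIRED that the CPU does not advertise."""
    flags = parse_features(cpuinfo_text)
    return [ext for ext, flag in REQUIRED.items() if flag not in flags]

if __name__ == "__main__":
    path = Path("/proc/cpuinfo")
    text = path.read_text() if path.exists() else ""
    missing = missing_extensions(text)
    print("missing:", ", ".join(missing) if missing else "none")
```

If anything is reported missing, dropping the corresponding `+ext` suffix from `-DGGML_CPU_ARM_ARCH` is the conservative choice, since a binary built for an unsupported extension may fail with an illegal-instruction error.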

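Model Preparation and Deployment

Download the Hugging Face model

Download the MiniCPM-o-2_6 PyTorch model from Hugging Face, then convert it to GGUF format. As the paths suggest, the commands below are run from the root of the llama.cpp checkout, with the model downloaded to ../MiniCPM-o-2_6.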

python ./examples/llava/minicpmv-surgery.py -m ../MiniCPM-o-2_6
python ./examples/llava/minicpmv-convert-image-encoder-to-gguf.py -m ../MiniCPM-o-2_6 --minicpmv-projector ../MiniCPM-o-2_6/minicpmv.projector --output-dir ../MiniCPM-o-2_6/ --image-mean 0.5 0.5 0.5 --image-std 0.5 0.5 0.5 --minicpmv_version 4
python ./convert_hf_to_gguf.py ../MiniCPM-o-2_6/model

# quantize int4 version
./build/bin/llama-quantize ../MiniCPM-o-2_6/model/ggml-model-f16.gguf ../MiniCPM-o-2_6/model/ggml-model-Q4_K_M.gguf Q4_K_M

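Alternatively, you can download a pre-converted GGUF model directly.

On-Board Inference

This experiment mainly verifies the model's image understanding: given an input image, the model describes its content.

Use the llama-minicpmv-cli binary produced by the llama.cpp build, running either the floating-point (f16) model or the int4-quantized model: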

# run f16 version
./build/bin/llama-minicpmv-cli -m ../MiniCPM-o-2_6/model/ggml-model-f16.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg -p "What is in the image?"

# run quantized int4 version
./build/bin/llama-minicpmv-cli -m ../MiniCPM-o-2_6/model/ggml-model-Q4_K_M.gguf --mmproj ../MiniCPM-o-2_6/mmproj-model-f16.gguf -c 4096 --temp 0.7 --top-p 0.8 --top-k 100 --repeat-penalty 1.05 --image xx.jpg  -p "What is in the image?"
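
The two invocations above differ only in the weights file, so they are easy to wrap in a small script. The following is a minimal sketch that rebuilds the same command line using only the flags shown above. The function name `build_command` and the `quantized` switch are illustrative, and the paths assume the directory layout used in this article.

```python
import shlex
import subprocess
from pathlib import Path

# Paths follow the layout used in the commands above (an assumption).
MODEL_DIR = Path("../MiniCPM-o-2_6")
CLI = Path("./build/bin/llama-minicpmv-cli")

def build_command(image: str, prompt: str, quantized: bool = True) -> list[str]:
    """Assemble the llama-minicpmv-cli arguments for the f16 or Q4_K_M model."""
    weights = "ggml-model-Q4_K_M.gguf" if quantized else "ggml-model-f16.gguf"
    return [
        str(CLI),
        "-m", str(MODEL_DIR / "model" / weights),
        "--mmproj", str(MODEL_DIR / "mmproj-model-f16.gguf"),
        "-c", "4096",
        "--temp", "0.7", "--top-p", "0.8", "--top-k", "100",
        "--repeat-penalty", "1.05",
        "--image", image,
        "-p", prompt,
    ]

if __name__ == "__main__":
    cmd = build_command("xx.jpg", "What is in the image?")
    print(shlex.join(cmd))
    # Only run when the binary has actually been built.
    if CLI.exists():
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes.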

The model gives a detailed description of the image:

<user>What is in the image?
<assistant>
The image shows three young boys playing soccer on a field. One boy is wearing an orange shirt, while the other two are dressed in blue shirts with numbers 7 and 11 printed on them respectively. They appear to be engaged in active gameplay, possibly competing for control of the ball or strategizing their next move during what seems like an organized match given the presence of spectators in the background. The scene captures a moment typical of youth sports activities – teamwork, competition, physical activity, and social interaction among peers under adult supervision.

The Orion O6's CPU handles multimodal inference well, reaching a decode rate of about 7.7 tokens/s.

llama_perf_context_print:        load time =    3397.87 ms
llama_perf_context_print: prompt eval time =    2052.77 ms /    79 tokens (   25.98 ms per token,    38.48 tokens per second)
llama_perf_context_print:        eval time =   13815.58 ms /   107 runs   (  129.12 ms per token,     7.74 tokens per second)
llama_perf_context_print:       total time =   17473.33 ms /   186 tokens
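
The throughput figures can be recomputed from the raw timings as a sanity check. The sketch below parses the log lines shown above and divides the token count by the elapsed time for each stage. The regular expression is written against this exact log layout and is an assumption about the format, which may differ between llama.cpp versions.

```python
import re

LOG = """\
llama_perf_context_print:        load time =    3397.87 ms
llama_perf_context_print: prompt eval time =    2052.77 ms /    79 tokens (   25.98 ms per token,    38.48 tokens per second)
llama_perf_context_print:        eval time =   13815.58 ms /   107 runs   (  129.12 ms per token,     7.74 tokens per second)
llama_perf_context_print:       total time =   17473.33 ms /   186 tokens
"""

# Matches the "prompt eval" (prefill) and "eval" (decode) timing lines.
PATTERN = re.compile(
    r"(prompt eval|eval) time =\s*([\d.]+) ms /\s*(\d+) (?:tokens|runs)"
)

def throughput(log: str) -> dict[str, float]:
    """Return tokens per second for each stage found in the log."""
    rates = {}
    for stage, ms, count in PATTERN.findall(log):
        rates[stage] = int(count) / (float(ms) / 1000.0)
    return rates

for stage, rate in throughput(LOG).items():
    print(f"{stage}: {rate:.2f} tokens/s")
```

Here "prompt eval" is the prefill stage over the 79 prompt tokens and "eval" is the decode stage over the 107 generated tokens; the decode rate is the one quoted in the text.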

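Summary

With its lightweight architecture and omni-modal capabilities, MiniCPM-o 2.6 broadens what on-device AI can do. It lowers the barrier to AI deployment and pushes smart terminals toward more natural, emotionally aware interaction. As the open-source ecosystem grows and hardware improves, the model is well placed to help make AI more widely accessible.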
