Deploying Qwen2 on the AX650N

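Qwen is the family of large language models and large multimodal models developed by the Qwen team at Alibaba Group. The language models have now been upgraded to Qwen2. Both the language models and the multimodal models are pretrained on large-scale multilingual and multimodal data, then fine-tuned on high-quality data to align with human preferences. Qwen's capabilities include natural language understanding, text generation, visual understanding, audio understanding, tool use, role play, and acting as an AI agent.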

Official documentation: https://qwen.readthedocs.io/z...
GitHub project: https://github.com/QwenLM/Qwen2

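The latest release, Qwen2, has the following features:

Five model sizes: 0.5B, 1.5B, 7B, 57B-A14B, and 72B;
A base model and an instruction-tuned model for each size, with the instruction-tuned models aligned to human preferences;
Multilingual support in both the base and instruction-tuned models;
Stable support for a 32K context length across all models;
Support for tool calling, RAG (retrieval-augmented generation), role play, AI agents, and more.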

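About the Axera AX650N

The AX650N is Axera's third-generation high-efficiency intelligent vision chip. It integrates an octa-core Cortex-A55 CPU, a high-efficiency NPU, an ISP supporting 8K@30fps, and a VPU with H.264 and H.265 encoding and decoding. For interfaces, the AX650N supports 64-bit LPDDR4x, multiple MIPI inputs, Gigabit Ethernet, USB, and HDMI 2.0b output, and it can decode 32 channels of 1080p@30fps video. With its built-in compute capacity and strong codec capability, it meets the industry's demand for high-performance edge AI computing. A range of built-in deep learning algorithms enables applications such as visual structuring, behavior analysis, and state detection, and the chip efficiently supports Transformer-based large vision models and large language models. Extensive development documentation is provided to make secondary development easier.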

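Compiling the LLM

Download the ax-llm-build project

This assumes you have already installed the docker image as described in the "Development Environment Preparation" chapter of the Pulsar2 v3.0-temp documentation and have entered the pulsar2 docker environment.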

git clone https://github.com/AXERA-TECH/ax-llm-build.git

Download Qwen2-0.5B-Instruct

cd ax-llm-build
pip install -U huggingface_hub
huggingface-cli download --resume-download Qwen/Qwen2-0.5B-Instruct --local-dir Qwen/Qwen2-0.5B-Instruct

Run the build

pulsar2 llm_build --input_path Qwen/Qwen2-0.5B-Instruct/ --output_path Qwen/Qwen2-0.5B-w8a16/ --kv_cache_len 1023 --model_config config/qwen2-0.5B.json --hidden_state_type bf16 --weight_type s8
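
When the build is driven from a script, the same invocation can be assembled as an argument list. The sketch below is only a thin wrapper: the helper name `llm_build_command` is hypothetical, and it uses exactly the flags shown in the command above, with no other options assumed. The output directory name `w8a16` appears to correspond to `--weight_type s8` together with `--hidden_state_type bf16`.

```python
import shlex


def llm_build_command(input_path, output_path, model_config,
                      kv_cache_len=1023, hidden_state_type="bf16",
                      weight_type="s8"):
    """Return the `pulsar2 llm_build` invocation as an argument list.

    Hypothetical helper: it only reuses the flags shown in this article.
    """
    return [
        "pulsar2", "llm_build",
        "--input_path", input_path,
        "--output_path", output_path,
        "--kv_cache_len", str(kv_cache_len),
        "--model_config", model_config,
        "--hidden_state_type", hidden_state_type,
        "--weight_type", weight_type,
    ]


cmd = llm_build_command(
    "Qwen/Qwen2-0.5B-Instruct/",
    "Qwen/Qwen2-0.5B-w8a16/",
    "config/qwen2-0.5B.json",
)
# Print the command so it can be compared with the one typed by hand.
print(shlex.join(cmd))
```

Inside the pulsar2 docker environment, the list can then be passed to `subprocess.run(cmd, check=True)`.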

Reference log output

root@gpux2:/data/ax-llm-build# pulsar2 llm_build --input_path Qwen/Qwen2-0.5B-Instruct/ --output_path Qwen/Qwen2-0.5B-w8a16/ --kv_cache_len 1023 --model_config config/qwen2-0.5B.json --hidden_state_type bf16 --weight_type s8
Config(    
    model_name='Qwen/Qwen2-0.5B-Instruct',    
    model_type='qwen',    
    num_hidden_layers=24,    
    num_attention_heads=14,    
    num_key_value_heads=2,    
    hidden_size=896,    
    intermediate_size=4864,    
    vocab_size=151936,    
    rope_theta_base=1000000.0,    
    max_position_embedings=32768,    
    rope_partial_factor=1.0,    
    norm_eps=1e-06,    
    norm_type='rms_norm',    
    hidden_act='silu')
2024-07-01 11:17:08.009 | SUCCESS  | yamain.command.llm_build:llm_build:85 - prepare llm model done!
building llm decode layers   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24/24 0:03:59
building llm post layer   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1/1 0:01:24
2024-07-01 11:22:31.941 | SUCCESS  | yamain.command.llm_build:llm_build:128 - build llm model done!
2024-07-01 11:22:56.925 | SUCCESS  | yamain.command.llm_build:llm_build:277 - check llm model done!

Extracting and optimizing the embedding

python tools/extract_embed.py --input_path Qwen/Qwen2-0.5B-Instruct/ --output_path Qwen/Qwen2-0.5B-w8a16/
python tools/embed-process.py --input Qwen/Qwen2-0.5B-w8a16/model.embed_tokens.weight.npy --output Qwen/Qwen2-0.5B-w8a16/model.embed_tokens.weight.float32.bin
chmod +x ./tools/fp32_to_bf16
./tools/fp32_to_bf16 Qwen/Qwen2-0.5B-w8a16/model.embed_tokens.weight.float32.bin Qwen/Qwen2-0.5B-w8a16/model.embed_tokens.weight.bfloat16.bin
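
The last command stores the embedding table as bfloat16. The internals of `tools/fp32_to_bf16` are not shown here, so what follows is only a sketch of the general float32-to-bfloat16 conversion: bfloat16 keeps the upper 16 bits of a float32 (sign, 8 exponent bits, 7 mantissa bits). The function names and the round-to-nearest-even choice are assumptions; the actual tool may truncate instead or treat special values differently.

```python
import struct


def fp32_to_bf16_bits(value):
    """Return the bfloat16 bit pattern (0..0xFFFF) for a finite float.

    Sketch only: NaN handling is omitted.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # Round to nearest even: add 0x7FFF plus the lowest bit that is kept.
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF


def bf16_bits_to_fp32(bits16):
    """Expand a bfloat16 bit pattern back to a Python float for inspection."""
    (value,) = struct.unpack("<f", struct.pack("<I", bits16 << 16))
    return value


for x in (1.0, -2.5, 0.1):
    packed = fp32_to_bf16_bits(x)
    print(x, hex(packed), bf16_bits_to_fp32(packed))
```

Halving the element size this way halves the embedding file relative to the float32 version, at the cost of mantissa precision (0.1 comes back as roughly 0.1001).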

Output files

root@xxx:/data/ax-llm-build# tree Qwen/Qwen2-0.5B-w8a16
Qwen/Qwen2-0.5B-w8a16
├── model.embed_tokens.weight.bfloat16.bin
├── model.embed_tokens.weight.float32.bin
├── model.embed_tokens.weight.npy
├── qwen_l0.axmodel
├── qwen_l10.axmodel
├── qwen_l11.axmodel
├── qwen_l12.axmodel
├── qwen_l13.axmodel
......
├── qwen_l7.axmodel
├── qwen_l8.axmodel
├── qwen_l9.axmodel
└── qwen_post.axmodel

Of these, model.embed_tokens.weight.bfloat16.bin, qwen_l0.axmodel through qwen_l23.axmodel, and qwen_post.axmodel are the files needed to run on the board.
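
Before copying the output to the board, it can help to confirm that every required file is present. The sketch below builds the list of file names given above (one embedding file, 24 layer models, one post model) and reports any that are missing. The helper names are hypothetical, and the temporary directory only stands in for a real build output.

```python
import tempfile
from pathlib import Path


def required_files(num_layers=24):
    """File names the article lists as needed on the board."""
    names = ["model.embed_tokens.weight.bfloat16.bin"]
    names += [f"qwen_l{i}.axmodel" for i in range(num_layers)]
    names.append("qwen_post.axmodel")
    return names


def missing_files(model_dir, num_layers=24):
    """Return the required file names that do not exist in model_dir."""
    root = Path(model_dir)
    return [n for n in required_files(num_layers) if not (root / n).is_file()]


# Demo on a throwaway directory that deliberately lacks the post model.
with tempfile.TemporaryDirectory() as tmp:
    for name in required_files()[:-1]:
        (Path(tmp) / name).touch()
    demo_missing = missing_files(tmp)

print(len(required_files()), "files required; missing in demo:", demo_missing)
```

On a real build, the call would be `missing_files("Qwen/Qwen2-0.5B-w8a16")`, and an empty list means the directory is complete.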

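Running on the development board

The ax-llm project

The ax-llm project explores the feasibility and capability limits of running commonly used LLMs (Large Language Models) on AXERA's existing chip platforms, so that community developers can evaluate them quickly and build their own LLM applications.

We also provide some precompiled LLM examples for the AX650N and AX630C platforms on a network drive.

Execution (on an AX650N development board)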

root@ax650:/mnt/qtang/llama_axera_cpp# ./run_qwen2_0.5B.sh
[I][                            Init][  71]: LLM init start
  3% | ██                                |   1 /  27 [0.28s<7.48s, 3.61 count/s] tokenizer init ok
[I][                            Init][  26]: LLaMaEmbedSelector use mmap
100% | ████████████████████████████████ |  27 /  27 [7.40s<7.40s, 3.65 count/s] init post axmodel okremain_cmm(11583 MB)
[I][                            Init][ 180]: max_token_len : 1023
[I][                            Init][ 185]: kv_cache_size : 128, kv_cache_num: 1023
[I][                            Init][ 199]: LLM init ok
Type "q" to exit, Ctrl+c to stop current running
>> who are you?
I am a large language model created by Alibaba Cloud. I am called Qwen.
[N][                             Run][ 388]: hit eos,avg 24.51 token/s
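
The `kv_cache_size : 128` value in this log is consistent with the model configuration printed at build time: a hidden size of 896 over 14 attention heads gives 64 dimensions per head, and 2 key/value heads give 2 × 64 = 128. The sketch below reproduces that arithmetic and derives a rough KV cache footprint. Two things are assumptions rather than documented facts: that `kv_cache_size` is the per-token width of K (or V) in one layer, and that each cached element takes 2 bytes (bf16, following `--hidden_state_type bf16`).

```python
# Values taken from the Config(...) block in the build log.
hidden_size = 896
num_attention_heads = 14
num_key_value_heads = 2
num_hidden_layers = 24
kv_cache_len = 1023          # from --kv_cache_len
bytes_per_element = 2        # assumption: cache stored as bf16

head_dim = hidden_size // num_attention_heads
kv_width = num_key_value_heads * head_dim

# K and V are both cached, for every layer and every cached position.
total_bytes = num_hidden_layers * 2 * kv_cache_len * kv_width * bytes_per_element

print("head_dim:", head_dim)
print("kv width per token:", kv_width)
print(f"estimated KV cache: {total_bytes / 2**20:.2f} MiB")
```

Under these assumptions the cache comes to about 12 MiB, which is small next to the `remain_cmm(11583 MB)` reported after initialization.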

Performance

Both the AX650N and AX630C currently use the W8A16 quantization scheme (performance optimization is ongoing).

Model	Parameters	Speed (token/s)
TinyLlama-1.1	1.1B	16.5
Qwen2.0	0.5B	29.0
Qwen2.0	1.5B	11.2
MiniCPM	2.4B	6.0
Phi3	3.8B	5.0
Llama3	8B	2.5
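
Throughput figures are easier to weigh against a latency budget once converted to time per token. The sketch below does that for three rows of the table. The 100-token response length is an arbitrary illustrative value, and the estimate covers decoding only, ignoring prompt processing time.

```python
# (model, parameters, tokens per second) copied from the table above.
results = [
    ("Qwen2.0", "0.5B", 29.0),
    ("Qwen2.0", "1.5B", 11.2),
    ("Llama3", "8B", 2.5),
]


def ms_per_token(tokens_per_second):
    """Average decode latency per generated token, in milliseconds."""
    return 1000.0 / tokens_per_second


def decode_seconds(num_tokens, tokens_per_second):
    """Time to generate num_tokens at a steady decode rate."""
    return num_tokens / tokens_per_second


response_tokens = 100  # illustrative assumption
for name, params, tps in results:
    print(f"{name} {params}: {ms_per_token(tps):.1f} ms/token, "
          f"{decode_seconds(response_tokens, tps):.1f} s per {response_tokens} tokens")
```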

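Closing remarks

As large language models rapidly shrink in size, more and more interesting multimodal AI applications will gradually move from cloud services to edge and on-device hardware. We will keep pace with the latest industry developments, so stay tuned.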
