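The title has been shortened; the original title was:
[“星睿O6” AI PC Development Kit Review] RVM portrait segmentation: the full torch ➡️ pnnx ➡️ cix quantization ➡️ o6-NPU and ncnn-CPU/GPU deployment process
Arm China (安谋科技), CIX Technology (此芯科技), and Radxa (瑞莎计算机) jointly built the “星睿O6” development kit for AI PC, edge, robotics, and other scenarios. The kit heterogeneously integrates Arm®v9 CPU cores, an Arm Immortalis™ GPU, and Arm China's “周易” (Zhouyi) NPU.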
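Introduction to RVM portrait segmentation
Robust Video Matting (RVM) is a powerful real-time video portrait matting technique developed by a ByteDance project team. Unlike existing neural networks that process each frame as an independent image, RVM uses a recurrent neural network, so it keeps temporal memory while processing a video stream. RVM can perform real-time high-resolution matting on arbitrary video, reaching 4K at 76 FPS and HD at 104 FPS on an Nvidia GTX 1080Ti.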
https://github.com/PeterL1n/R...
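torch model ➡️ pnnx ➡️ onnx
The official release provides two backbones, mobilenetv3 and resnet50. Here we choose the resnet50 model, which is more accurate and also a better fit for the NPU.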
https://github.com/PeterL1n/R...
pip install torch torchvision
git clone https://github.com/PeterL1n/RobustVideoMatting
cd RobustVideoMatting
pip install -r requirements_inference.txt
wget https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_resnet50.pth
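Once the runtime environment and the model are ready, use the pnnx tool to optimize the torch model, then export it to ncnn and onnx.
PNNX (PyTorch Neural Network Exchange) is a new way to deploy PyTorch models. It bypasses ONNX as a middleman and exports relatively clean high-level ops.
The pnnx tool performs powerful graph optimization, and the converted model is usually better suited to inference deployment. It can also export torch code, ncnn models, and onnx models.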
pip install pnnx
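Create export_pnnx.py in the RobustVideoMatting directory
and write a simple inference and export script.
Input files:
- rvm_resnet50.pth: original RVM model weights
- 512.png: portrait image used for testing
Output files:
- rvm_resnet50_pnnx.py: Python code of the optimized RVM model
- rvm_resnet50.pnnx.param/pnnx.bin: pnnx structure definition and weights of the optimized RVM model
- rvm_resnet50.ncnn.param/ncnn.bin: ncnn model exported after optimization
- rvm_resnet50.onnx: optimized onnx model
- in.npy.npz: the preprocessed tensor of 512.png plus the 4 feature tensors obtained from first-frame inference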
import torch
from torch import nn
from model import MattingNetwork
from PIL import Image
from torchvision import transforms
import numpy
import pnnx

class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.rvm = MattingNetwork('resnet50').eval()
        self.rvm.load_state_dict(torch.load('rvm_resnet50.pth'))

    def forward_first_frame(self, src):
        return self.rvm(src)

    def forward(self, src, r1, r2, r3, r4):
        return self.rvm(src, r1, r2, r3, r4)

model = Model().eval()

# load the test image and convert it to a 1x3xHxW float tensor in 0~1
with Image.open('512.png') as img:
    img.load()
transform = transforms.ToTensor()
x = transform(img).unsqueeze(0)

# first-frame inference yields the 4 recurrent features used as export inputs
fgr, pha, r1, r2, r3, r4 = model.forward_first_frame(x)

# optimize with pnnx (subsequent-frame signature), then export the optimized module to onnx
opt_model = pnnx.export(model, "rvm_resnet50.pt", (x, r1, r2, r3, r4))
torch.onnx.export(opt_model, (x, r1, r2, r3, r4), "rvm_resnet50.onnx", export_params=True, opset_version=13, input_names=['in0', 'in1', 'in2', 'in3', 'in4'], output_names=['out0', 'out1', 'out2', 'out3', 'out4', 'out5'])

# save all 5 inputs as calibration data; numpy appends .npz, giving in.npy.npz
numpy.savez('in.npy', in0=x.numpy(), in1=r1.detach().numpy(), in2=r2.detach().numpy(), in3=r3.detach().numpy(), in4=r4.detach().numpy())
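Code note 1: model export
RVM supports two inference modes, first frame and subsequent frames:
- First frame: 512x512 RGB image + 4 features initialized to all zeros ➡️ foreground RGB image + segmentation mask + 4 features of the current frame
- Subsequent frames: 512x512 RGB image + 4 features from the previous frame ➡️ foreground RGB image + segmentation mask + 4 features of the current frame
If the model is exported in first-frame mode, it will contain a large number of zero-valued features, which easily causes anomalies in the later optimization and quantization steps. The actual export should use subsequent-frame mode.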
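The two modes differ only in where the four features come from. Below is a minimal sketch of threading the recurrent state through a sequence of frames. `fake_rvm` and `run_sequence` are hypothetical stand-ins, not part of RVM, so the snippet runs without torch; only the calling pattern is meant to carry over to the real network.

```python
def fake_rvm(src, r1, r2, r3, r4):
    """Hypothetical stand-in for the RVM network.

    Returns (fgr, pha, r1, r2, r3, r4). Each "feature" here is just a
    counter of how many frames it has seen.
    """
    fgr, pha = src, 1.0
    return fgr, pha, r1 + 1, r2 + 1, r3 + 1, r4 + 1


def run_sequence(frames, model=fake_rvm):
    # First frame: the four recurrent features start as zeros.
    rec = [0, 0, 0, 0]
    alphas = []
    for src in frames:
        # Subsequent frames: feed back the features from the previous frame.
        fgr, pha, *rec = model(src, *rec)
        alphas.append(pha)
    return alphas, rec


alphas, rec = run_sequence(["frame0", "frame1", "frame2"])
print(rec)
```

With the real model, the zero counters would be zero tensors of the shapes listed in the cfg file, and `rec` would hold the four feature tensors returned for the current frame.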
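Code note 2: pnnx optimization
opt_model = pnnx.export(model, "rvm_resnet50.pt", (x, r1, r2, r3, r4))
This line optimizes the original RVM model with pnnx and returns the optimized model. The optimized model is a standard torch nn.Module, so it can be used directly in the following steps, such as exporting onnx.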
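Code note 3: preparing calibration data
numpy.savez('in.npy', in0=x.numpy(), in1=r1.detach().numpy(), in2=r2.detach().numpy(), in3=r3.detach().numpy(), in4=r4.detach().numpy())
The subsequent-frame model takes the image plus 4 features, 5 inputs in total. For a single-input model it is usually enough to save one npy file with numpy.save('in0.npy', in0.numpy()), but multiple inputs must be saved in the packed npz format, which the cix quantization tool needs later.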
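As a quick sanity check of the packed format, the sketch below saves five small dummy arrays with the same key names and reads them back. The tiny shapes and the temporary directory are assumptions for illustration. It also shows why the file ends up named in.npy.npz: numpy.savez appends .npz when the name does not already end with it.

```python
import os
import tempfile

import numpy as np

# Small dummy arrays standing in for x, r1..r4 (the real tensors are much larger).
inputs = {f"in{i}": np.zeros((1, 2, 4, 4), dtype=np.float32) for i in range(5)}

with tempfile.TemporaryDirectory() as tmp:
    base = os.path.join(tmp, "in.npy")
    np.savez(base, **inputs)  # numpy appends ".npz", producing "in.npy.npz"
    saved_path = base + ".npz"
    saved_exists = os.path.exists(saved_path)
    with np.load(saved_path) as data:
        keys = sorted(data.files)
        shapes = {k: data[k].shape for k in keys}

print(saved_exists, keys)
```

Running the same np.load check on the real in.npy.npz is a cheap way to confirm that the key names line up with in0 to in4 before handing the file to the quantizer.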
cix quantization
Write a configuration file named rvm_resnet50.cfg:
[Common]
mode = build
[Parser]
model_type = onnx
model_name = rvm_resnet50
detection_postprocess =
model_domain = object_detection
input_model = ./rvm_resnet50.onnx
output_dir = ./
input_shape = [1, 3, 512, 512],[1, 16, 256, 256],[1, 32, 128, 128],[1, 64, 64, 64],[1, 128, 32, 32]
input = in0,in1,in2,in3,in4
[Optimizer]
calibration_data = in.npy.npz
calibration_batch_size = 1
metric_batch_size = 1
output_dir = ./
dataset = numpymultiinputdataset
save_statistic_info = True
cast_dtypes_for_lib = True
quantize_method_for_activation = per_tensor_asymmetric
quantize_method_for_weight = per_channel_symmetric_restricted_range
[GBuilder]
target = X2_1204MP3
outputs = rvm_resnet50.cix
profile = True
tiling = fps
Config note 1: how to write multiple inputs
input_shape = [1, 3, 512, 512],[1, 16, 256, 256],[1, 32, 128, 128],[1, 64, 64, 64],[1, 128, 32, 32]
input = in0,in1,in2,in3,in4
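Separate the shapes and the input node names with commas. The shape information can be found directly in the last section of the rvm_resnet50_pnnx.py generated above, and the node names must match the input_names set in the torch.onnx.export call.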
def test_inference():
    net = Model()
    net.float()
    net.eval()
    torch.manual_seed(0)
    v_0 = torch.rand(1, 3, 512, 512, dtype=torch.float)
    v_1 = torch.rand(1, 16, 256, 256, dtype=torch.float)
    v_2 = torch.rand(1, 32, 128, 128, dtype=torch.float)
    v_3 = torch.rand(1, 64, 64, 64, dtype=torch.float)
    v_4 = torch.rand(1, 128, 32, 32, dtype=torch.float)
    return net(v_0, v_1, v_2, v_3, v_4)

if __name__ == "__main__":
    print(test_inference())
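To avoid typing the five shapes by hand, the two cfg lines can be generated. This is a sketch only: the channel counts (16, 32, 64, 128) and strides (2, 4, 8, 16) are read off the 512x512 shapes above, and assuming they carry over unchanged to other resolutions is my assumption, not something the generated file confirms. `rvm_state_shapes` and `cfg_input_lines` are hypothetical helper names.

```python
def rvm_state_shapes(height=512, width=512,
                     channels=(16, 32, 64, 128), strides=(2, 4, 8, 16)):
    """Shapes of the image input followed by r1..r4 (NCHW, batch 1)."""
    shapes = [[1, 3, height, width]]
    for c, s in zip(channels, strides):
        shapes.append([1, c, height // s, width // s])
    return shapes


def cfg_input_lines(shapes):
    """Format the input_shape and input lines of the [Parser] section."""
    input_shape = ",".join(str(s) for s in shapes)
    names = ",".join(f"in{i}" for i in range(len(shapes)))
    return f"input_shape = {input_shape}\ninput = {names}"


print(cfg_input_lines(rvm_state_shapes()))
```

For the default 512x512 input this reproduces the two lines used in rvm_resnet50.cfg.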
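Config note 2: calibration data for multiple inputs
The cix tool supports the numpymultiinputdataset data format.
The quantize_method_for_activation and quantize_method_for_weight settings can improve the accuracy of the quantized model.
The resnet50 structure uses ReLU activations heavily, and the asymmetric activation mode happens to save the ReLU computation.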
calibration_data = in.npy.npz
dataset = numpymultiinputdataset
quantize_method_for_activation = per_tensor_asymmetric
quantize_method_for_weight = per_channel_symmetric_restricted_range
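One way to read the ReLU remark: when an activation is stored as unsigned 8-bit with zero point 0, the clamp to 0~255 already discards negative values, so a separate ReLU step adds nothing. The sketch below uses a generic affine quantization formula, not the Compass optimizer's actual implementation; `quantize_u8`, the scale, and the [0, 6] activation range are assumptions chosen for illustration.

```python
def quantize_u8(x, scale, zero_point=0):
    """Generic asymmetric uint8 quantization: clamp(round(x * scale) + zp, 0, 255)."""
    q = round(x * scale) + zero_point
    return max(0, min(255, q))


def relu(x):
    return max(0.0, x)


scale = 255.0 / 6.0  # assumed activation range [0, 6]
pre_activation = [-3.0, -0.5, 0.0, 1.5, 6.0]

# Quantizing the pre-activation directly already clamps negatives to 0,
direct = [quantize_u8(v, scale) for v in pre_activation]
# so it matches quantizing the explicit ReLU output.
after_relu = [quantize_u8(relu(v), scale) for v in pre_activation]
print(direct, after_relu)
```

A symmetric signed 8-bit scheme would spend half of its codes on negative values that never occur after ReLU, which is the usual argument for asymmetric activations in ReLU-heavy networks.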
Fixing the libaipu_simulator_x2.so error
After updating to CixBuilder-6.1.3119.3-py3-none-linux_x86_64.whl,
cixbuild fails because it cannot find libaipu_simulator_x2.so:
cixbuild rvm_resnet50.cfg
Traceback (most recent call last):
File "/home/nihui/.local/bin/cixbuild", line 8, in <module>
sys.exit(build_main())
File "/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/build_cix.py", line 321, in build_main
main()
File "AIPUBuilder/main.pyx", line 89, in AIPUBuilder.main.main
File "/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/model/__init__.py", line 7, in <module>
from .aipullm import AIPULLM
File "AIPUBuilder/model/aipullm.pyx", line 9, in init AIPUBuilder.model.aipullm
File "/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/cruntime/__init__.py", line 1, in <module>
from AIPUBuilder._C.CompassLLM import (
ImportError: libaipu_simulator_x2.so: cannot open shared object file: No such file or directory
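Investigation showed that the previous version did not have this dependency; it is new in this version. Locate the directory that contains the .so and add it to LD_LIBRARY_PATH: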
export LD_LIBRARY_PATH=/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
cixbuild rvm_resnet50.cfg
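The path above is specific to one user and Python version. Below is a small sketch for locating the same directory on another machine, assuming the wheel keeps the simulator-lib folder directly under the AIPUBuilder package as in the path shown. `simulator_lib_dir` is a hypothetical helper; it uses find_spec so the package itself is not imported, since importing is what fails.

```python
import importlib.util
import os


def simulator_lib_dir(package="AIPUBuilder", subdir="simulator-lib"):
    """Return <package dir>/<subdir>, or None if the package is not installed."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return None
    return os.path.join(list(spec.submodule_search_locations)[0], subdir)


path = simulator_lib_dir()
if path is None:
    print("AIPUBuilder is not installed in this environment")
else:
    print(f"export LD_LIBRARY_PATH={path}")
```

The printed line can be pasted into the shell before running cixbuild.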
The output is shown below; the build finally produces rvm_resnet50.cix.
nihui@nihui-pc:~/dev/o6-test$ export LD_LIBRARY_PATH=/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
nihui@nihui-pc:~/dev/o6-test$ cixbuild rvm_resnet50.cfg
[I] Build with version 6.1.3119
[I] Parsing model....
[I] [Parser]: Begin to parse onnx model rvm_resnet50...
2025-04-15 22:47:46.411728: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/nihui/.local/lib/python3.8/site-packages/cv2/../../lib64:/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib:/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
2025-04-15 22:47:46.411753: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2025-04-15 22:47:47.518775: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/nihui/.local/lib/python3.8/site-packages/cv2/../../lib64:/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib:/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
2025-04-15 22:47:47.518799: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2025-04-15 22:47:47.518820: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (nihui-pc): /proc/driver/nvidia/version does not exist
2025-04-15 22:47:48.197359: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[I] [Parser]: The input tensor(s) is/are: in0_0,in1_0,in2_0,in3_0,in4_0
[I] [Parser]: Input in0 from cfg is shown as tensor in0_0 in IR!
[I] [Parser]: Input in1 from cfg is shown as tensor in1_0 in IR!
[I] [Parser]: Input in2 from cfg is shown as tensor in2_0 in IR!
[I] [Parser]: Input in3 from cfg is shown as tensor in3_0 in IR!
[I] [Parser]: Input in4 from cfg is shown as tensor in4_0 in IR!
[I] [Parser]: 0 error(s), 0 warning(s) generated.
[I] [Parser]: Parser done!
[I] Parse model complete
[I] Simplifying float model.
[I] [IRChecker] Start to check IR: /home/nihui/dev/o6-test/internal_2025_4_15_22_47_45_xk63i/rvm_resnet50.txt
[I] [IRChecker] model_name: rvm_resnet50
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1600] loading graph weight: /home/nihui/dev/o6-test/./internal_2025_4_15_22_47_45_xk63i/rvm_resnet50.bin size: 0x667a290
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[I] GSim simplified result:
------------------------------------------------------------------------
OpType.Transpose: -12
OpType.Eltwise: -1
OpType.Mul: +1
OpType.Reshape: +1
OpType.Tile: -1
------------------------------------------------------------------------
[I] Simplify Done.
[I] Simplify float model Done.
[I] Optimizing model....
[I] [OPT] [22:47:54]: [arg_parser] is running.
[I] [OPT] [22:47:54]: tool name: Compass-Optimizer, version: 1.3.3119, use cuda: False, running device: cpu
[I] [OPT] [22:47:54]: [quantization config Info][model name]: rvm_resnet50, [quantization method for weight]: per_channel_symmetric_restricted_range, [quantization method for activation]: per_tensor_asymmetric, [calibation strategy for weight]: extrema, [calibation strategy for activation]: mean, [quantization precision]: activation_bits=8, weight_bits=8, bias_bits=32, lut_items_in_bits=8
[I] [OPT] [22:47:54]: Suggest using "aipuchecker" to validate the IR firstly if you are not sure about its validity.
[I] [OPT] [22:47:54]: IR loaded.
Building graph: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:00<00:00, 1466.19it/s]
[I] [OPT] [22:47:54]: Begin to load weights.
[I] [OPT] [22:47:54]: Weights loaded.
Deserializing bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 7408.10it/s]
[I] [OPT] [22:47:54]: Successfully parsed IR with python API.
[I] [OPT] [22:47:54]: init graph by forwarding one sample filled with zeros
forward_to: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:01<00:00, 180.11it/s]
[I] [OPT] [22:47:55]: [graph_optimize_stage1] is running.
[I] [OPT] [22:47:55]: [statistic] is running.
statistic batch: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00, 3.27s/it]
[I] [OPT] [22:47:59]: [graph_optimize_stage2] is running.
[I] [OPT] [22:47:59]: applying calibration strategy based on statistic info
calibration: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:00<00:00, 25190.19it/s]
[I] [OPT] [22:47:59]: [quantize] is running.
update_tensor_quantization_attrs: 0%| | 0/184 [00:00<?, ?it/s][W] [OPT] [22:47:59]: due to hardware limitations, it is actually doing per-2-channel quantization, which may cause accuracy dropping: layer_id=7, type=OpType.Convolution, name=/convbn2d_0/Conv_clone_, rescale values differ sharpely whithin channels,
.... (many warnings omitted)
quantize each layer: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 193/193 [00:00<00:00, 1909.74it/s]
[I] [OPT] [22:48:00]: collecting per-layer similarity infomation between float graph and quanted graph by forwarding 1 sample on both of them
[I] [OPT] [22:48:03]: [graph_optimize_stage3] is running.
[I] [OPT] [22:48:03]: [serialize] is running.
[I] [OPT] [22:48:03]: check the final graph by forwarding one sample filled with zeros
forward_to: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:01<00:00, 103.22it/s]
[I] [OPT] [22:48:05]: Begin to serialzie IR
Writing IR: 184it [00:00, 5817.17it/s]
Serializing bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 593/593 [00:00<00:00, 54544.35it/s]
[I] [OPT] [22:48:05]: IR has been saved into /home/nihui/dev/o6-test/./internal_2025_4_15_22_47_45_xk63i
[I] [OPT] [22:48:05]: Compass-Optimizer has done at [serialize] period.
[I] [OPT] [22:48:05]: [Done]cost time: 20s, and [qinfos(scale, zp, dtype)]: out: [[255.0, 0, UINT8], [255.0, 0, UINT8], [127.50077819824219, 1, INT8], [127.5284652709961, 0, INT8], [127.5, 0, INT8], [127.50006103515625, 1, INT8]] in: [[255.0, 0, UINT8], [127.51888275146484, 1, INT8], [127.72821044921875, 0, INT8], [127.50000762939453, 1, INT8], [127.50524139404297, 0, INT8]] [output tensors cosine]: [0.9980738479792883, 0.9999214568249806, 0.9992704966369025, 0.9984071475614615, 0.9973715154248336, 0.994706231139195][output tensors MSE]: [0.0005080156843177974, 4.108408393221907e-05, 0.0007104792748577893, 0.0008333915611729026, 0.001465060980990529, 0.0011862480314448476]
[I] Optimizing model complete
[I] Simplifying quant model...
[I] [IRChecker] Start to check IR: /home/nihui/dev/o6-test/internal_2025_4_15_22_47_45_xk63i/rvm_resnet50_quant.txt
[I] [IRChecker] model_name: rvm_resnet50
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1600] loading graph weight: /home/nihui/dev/o6-test/./internal_2025_4_15_22_47_45_xk63i/rvm_resnet50_quant.bin size: 0x1a1cad0
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[I] GSim simplified result:
------------------------------------------------------------------------
OpType.Transpose: -16
OpType.Activation: -1
------------------------------------------------------------------------
[I] Simplify Done.
[I] Simplify quant model Done.
[I] Building ...
[I] [IRChecker] Start to check IR: /home/nihui/dev/o6-test/internal_2025_4_15_22_47_45_xk63i/rvm_resnet50_quant_s.txt
[I] [IRChecker] model_name: rvm_resnet50
[I] [IRChecker] IRChecker: All IR pass
[I] [tools.cpp : 352] BuildTool version: 6.1.3119. Build for target X2_1204MP3 PID: 17449
[I] [tools.cpp : 372] using default profile events to profile default
[I] [tools.cpp : 834] global cwd: /tmp/b098c1a7ef02ba7d25d0b238c21a27d9a521a043171e820632d2419e6fde
[I] [graph.cpp :1600] loading graph weight: /home/nihui/dev/o6-test/./internal_2025_4_15_22_47_45_xk63i/rvm_resnet50_quant_s.bin size: 0x1a1c9ac
[I] [tiling.cpp:4500] Auto tiling now, please wait ...
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_0 maybe this merge point could be removed
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_1 maybe this merge point could be removed
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_2 maybe this merge point could be removed
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_3 maybe this merge point could be removed
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_4 maybe this merge point could be removed
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_5 maybe this merge point could be removed
.... (many warnings omitted)
[I] [layoutconvertor.cpp: 276] Building /Concat_15/tile/out/9/pad/layout/NCHWC32T8...
[I] [aipu_plugin_tpc.cpp: 173] LayoutConvertor(/Concat_15/tile/out/11/pad/layout/NCHWC32T8)uses tensor-process-lib
[I] [aipu_plugin_tpc.cpp: 173] LayoutConvertor(/Concat_15/tile/out/7/pad/layout/NCHWC32T8)uses tensor-process-lib
[I] [aipu_plugin_tpc.cpp: 173] LayoutConvertor(/Concat_15/tile/out/8/pad/layout/NCHWC32T8)uses tensor-process-lib
[I] [aipu_plugin_tpc.cpp: 173] LayoutConvertor(/Concat_15/tile/out/9/pad/layout/NCHWC32T8)uses tensor-process-lib
[I] [builder.cpp:1938] The graph DDR Footprint requirement(estimation) of feature maps:
[I] [builder.cpp:1939] Read and Write:676.86MB
[I] [builder.cpp:1080] Reduce constants memory size: 42.164MB
[W] [ar_reader.cpp: 142] name offset not found
[W] [ar_reader.cpp: 142] name offset not found
[W] [ar_reader.cpp: 142] name offset not found
[W] [ar_reader.cpp: 142] name offset not found
[W] [ar_reader.cpp: 63] /usr/bin//../lib//libmcheck.ais not a archive file.
[I] [builder.cpp:2411] memory statistics for this graph (rvm_resnet50)
[I] [builder.cpp: 585] Total memory : 0x020ee424 Bytes ( 32.931MB)
[I] [builder.cpp: 585] Text section: 0x000ce550 Bytes ( 0.806MB)
[I] [builder.cpp: 585] RO section: 0x00031a00 Bytes ( 0.194MB)
[I] [builder.cpp: 585] Desc section: 0x00112a00 Bytes ( 1.073MB)
[I] [builder.cpp: 585] Data section: 0x01e4dc20 Bytes ( 30.304MB)
[I] [builder.cpp: 585] BSS section: 0x0004dab4 Bytes ( 0.303MB)
[I] [builder.cpp: 585] Stack : 0x00040400 Bytes ( 0.251MB)
[I] [builder.cpp: 585] Workspace(BSS) : 0x00100000 Bytes ( 1.000MB)
[I] [builder.cpp:2427]
[I] [tools.cpp :1181] - compile time: 2.914 s
[I] [tools.cpp :1087] With GM optimization, DDR Footprint stastic(estimation):
[I] [tools.cpp :1094] Read and Write:717.02MB
[I] [tools.cpp :1137] - draw graph time: 0.083 s
[I] [tools.cpp :1954] remove global cwd: /tmp/b098c1a7ef02ba7d25d0b238c21a27d9a521a043171e820632d2419e6fde
build success.......
Total errors: 0, warnings: 128
o6 ncnn CPU/GPU C++ deployment
Let's quickly build the ncnn library!
git clone https://github.com/Tencent/ncnn.git
cd ncnn
git submodule update --init --recursive --depth 1
mkdir build
cd build
cmake -DNCNN_VULKAN=ON ..
make -j12
make install
Then set up a CMake project configuration:
cmake_minimum_required(VERSION 3.10)
project(rvm)
set(CMAKE_BUILD_TYPE Release)
find_package(OpenCV REQUIRED)
set(ncnn_DIR "/home/radxa/ncnn/build/install/lib/cmake/ncnn")
find_package(ncnn REQUIRED)
add_executable(rvm rvm.cpp)
target_link_libraries(rvm ncnn ${OpenCV_LIBS})
Write the ncnn RVM inference code. It takes a BGR image as input, outputs the composited BGR image, and supports both CPU and GPU.
#include "net.h"
#include <opencv2/core/core.hpp>

class RVM_ncnn
{
public:
    void load(bool use_gpu = false)
    {
        net.opt.use_vulkan_compute = use_gpu;
        net.load_param("/home/radxa/rvm/rvm_resnet50.ncnn.param");
        net.load_model("/home/radxa/rvm/rvm_resnet50.ncnn.bin");

        // recurrent features start as zeros (ncnn::Mat takes w, h, c)
        r1 = ncnn::Mat(256, 256, 16);
        r2 = ncnn::Mat(128, 128, 32);
        r3 = ncnn::Mat(64, 64, 64);
        r4 = ncnn::Mat(32, 32, 128);
        r1.fill(0.0f);
        r2.fill(0.0f);
        r3.fill(0.0f);
        r4.fill(0.0f);
    }

    void run(const cv::Mat& bgr, cv::Mat& out)
    {
        ncnn::Extractor ex = net.create_extractor();

        // BGR 0~255 to RGB float 0~1
        ncnn::Mat in0 = ncnn::Mat::from_pixels(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, 512, 512);
        const float mean_vals[3] = {0, 0, 0};
        const float norm_vals[3] = {1 / 255.0, 1 / 255.0, 1 / 255.0};
        in0.substract_mean_normalize(mean_vals, norm_vals);

        ex.input("in0", in0);
        ex.input("in1", r1);
        ex.input("in2", r2);
        ex.input("in3", r3);
        ex.input("in4", r4);

        ncnn::Mat fgr;
        ncnn::Mat pha;
        ex.extract("out0", fgr);
        ex.extract("out1", pha);

        // foreground float 0~1 back to BGR 0~255
        const float demean_vals[3] = {0, 0, 0};
        const float denorm_vals[3] = {255.0, 255.0, 255.0};
        fgr.substract_mean_normalize(demean_vals, denorm_vals);
        fgr.to_pixels(out.data, ncnn::Mat::PIXEL_RGB2BGR);

        // composite
        for (int y = 0; y < 512; y++)
        {
            unsigned char* p = (unsigned char*)out.data + y * 512 * 3;
            const float* ppha = (const float*)pha.data + y * 512;
            for (int x = 0; x < 512; x++)
            {
                // alpha is a float; blend onto a constant background color
                float alpha = *ppha++;
                p[0] = p[0] * alpha + (1 - alpha) * 155;
                p[1] = p[1] * alpha + (1 - alpha) * 255;
                p[2] = p[2] * alpha + (1 - alpha) * 120;
                p += 3;
            }
        }
    }

private:
    ncnn::Net net;
    ncnn::Mat r1;
    ncnn::Mat r2;
    ncnn::Mat r3;
    ncnn::Mat r4;
};
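The compositing loop at the end of run() is plain alpha blending onto a fixed background color. Here it is as a per-pixel worked example, written in Python for brevity; `composite_pixel` is a hypothetical helper, and the clamp to 0~255 is an addition for safety that the C++ loop does not have.

```python
BACKGROUND_BGR = (155, 255, 120)  # same constant background as the C++ loop


def composite_pixel(fg_bgr, alpha, bg_bgr=BACKGROUND_BGR):
    """Blend one pixel: out = fg * alpha + bg * (1 - alpha), truncated to an integer."""
    out = []
    for fg, bg in zip(fg_bgr, bg_bgr):
        value = fg * alpha + (1.0 - alpha) * bg
        out.append(max(0, min(255, int(value))))
    return tuple(out)


print(composite_pixel((10, 20, 30), 1.0))      # fully foreground
print(composite_pixel((10, 20, 30), 0.0))      # fully background
print(composite_pixel((100, 100, 100), 0.5))   # half and half
```

An alpha of 1 keeps the foreground pixel untouched and an alpha of 0 yields the background color, so soft edges such as hair come out as a mix of the two.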
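o6 NPU C++ deployment
Write the cix NPU RVM inference code. It takes a BGR image as input and outputs the composited BGR image.
The int8/uint8 scale factors of the inputs and outputs can be found in internal/rvm_resnet50_quant_s.txt, which cixbuild generates.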
#include <stdio.h>
#include <algorithm>
#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include <npu/cix_noe_standard_api.h>

class RVM_noe
{
public:
    RVM_noe()
    {
        ctx = 0;
        graph_id = 0;
        job_id = 0;
        noe_init_context(&ctx);
    }

    ~RVM_noe()
    {
        noe_clean_job(ctx, job_id);
        noe_unload_graph(ctx, graph_id);
        noe_deinit_context(ctx);
    }

    void load()
    {
        noe_load_graph(ctx, "/home/radxa/rvm/rvm_resnet50.cix", &graph_id);

        noe_dynshape_param_t dynshape = {0, 0};
        job_config_npu_t job_cfg_npu;
        job_cfg_npu.partition_id = 0;
        job_cfg_npu.dbg_dispatch = 0;
        job_cfg_npu.dbg_core_id = 0;
        job_cfg_npu.fm_idxes = 0;
        job_cfg_npu.fm_idxes_cnt = 0;
        job_cfg_npu.dynshape = &dynshape;
        job_config_t job_cfg = {&job_cfg_npu};
        noe_create_job(ctx, graph_id, &job_id, &job_cfg);

        // quantized recurrent features, initialized to zero
        r1 = cv::Mat({16, 256, 256}, CV_8UC1);
        r2 = cv::Mat({32, 128, 128}, CV_8UC1);
        r3 = cv::Mat({64, 64, 64}, CV_8UC1);
        r4 = cv::Mat({128, 32, 32}, CV_8UC1);
        r1 = cv::Scalar(0);
        r2 = cv::Scalar(0);
        r3 = cv::Scalar(0);
        r4 = cv::Scalar(0);
    }

    void run(const cv::Mat& bgr, cv::Mat& out)
    {
        // interleaved BGR to planar RGB
        cv::Mat rgb({3, 512, 512}, CV_8UC1);
        for (int y = 0; y < 512; y++)
        {
            const unsigned char* p = (const unsigned char*)bgr.data + y * 512 * 3;
            signed char* pr = (signed char*)rgb.data + y * 512;
            signed char* pg = pr + 512 * 512;
            signed char* pb = pg + 512 * 512;
            for (int x = 0; x < 512; x++)
            {
                // 0~255 to 0~127; the source pixel is stored B,G,R, so p[2] is red
                *pr++ = p[2] * 127 / 255;
                *pg++ = p[1] * 127 / 255;
                *pb++ = p[0] * 127 / 255;
                p += 3;
            }
        }

        noe_load_tensor(ctx, job_id, 0, rgb.data);
        noe_load_tensor(ctx, job_id, 1, r1.data);
        noe_load_tensor(ctx, job_id, 2, r2.data);
        noe_load_tensor(ctx, job_id, 3, r3.data);
        noe_load_tensor(ctx, job_id, 4, r4.data);

        noe_job_infer_sync(ctx, job_id, 2000);

        cv::Mat fgr({3, 512, 512}, CV_8UC1);
        cv::Mat pha({1, 512, 512}, CV_8UC1);
        noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 0, fgr.data);
        noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 1, pha.data);
        // noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 2, r1.data);
        // noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 3, r2.data);
        // noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 4, r3.data);
        // noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 5, r4.data);

        // planar RGB foreground to interleaved BGR, composited onto a constant background
        for (int y = 0; y < 512; y++)
        {
            unsigned char* p = (unsigned char*)out.data + y * 512 * 3;
            const unsigned char* pr = (const unsigned char*)fgr.data + y * 512;
            const unsigned char* pg = pr + 512 * 512;
            const unsigned char* pb = pg + 512 * 512;
            const unsigned char* ppha = (const unsigned char*)pha.data + y * 512;
            for (int x = 0; x < 512; x++)
            {
                float alpha = *ppha++ / 255.f;
                // 0~127 to 0~255
                p[0] = std::min((int)*pb++, 127) * 255 / 127 * alpha + (1 - alpha) * 155;
                p[1] = std::min((int)*pg++, 127) * 255 / 127 * alpha + (1 - alpha) * 255;
                p[2] = std::min((int)*pr++, 127) * 255 / 127 * alpha + (1 - alpha) * 120;
                p += 3;
            }
        }
    }

public:
    context_handler_t* ctx;
    uint64_t graph_id;
    uint64_t job_id;
    cv::Mat r1;
    cv::Mat r2;
    cv::Mat r3;
    cv::Mat r4;
};
int main()
{
    RVM_ncnn rvm_cpu;
    rvm_cpu.load(false);
    RVM_ncnn rvm_gpu;
    rvm_gpu.load(true);
    RVM_noe rvm_npu;
    rvm_npu.load();

    cv::Mat bgr = cv::imread("/home/radxa/rvm/512.png", 1);
    cv::Mat out_cpu(512, 512, CV_8UC3);
    cv::Mat out_gpu(512, 512, CV_8UC3);
    cv::Mat out_npu(512, 512, CV_8UC3);

    rvm_cpu.run(bgr, out_cpu);
    rvm_gpu.run(bgr, out_gpu);
    rvm_npu.run(bgr, out_npu);

    cv::imwrite("/home/radxa/rvm/out_cpu.png", out_cpu);
    cv::imwrite("/home/radxa/rvm/out_gpu.png", out_gpu);
    cv::imwrite("/home/radxa/rvm/out_npu.png", out_npu);
    return 0;
}
Add the noe header include path to the CMake file and link against noe:
cmake_minimum_required(VERSION 3.10)
project(rvm)
set(CMAKE_BUILD_TYPE Release)
find_package(OpenCV REQUIRED)
set(ncnn_DIR "/home/radxa/ncnn/build/install/lib/cmake/ncnn")
find_package(ncnn REQUIRED)
include_directories("/usr/share/cix/include")
link_directories("/usr/share/cix/lib")
add_executable(rvm rvm.cpp)
target_link_libraries(rvm ncnn ${OpenCV_LIBS} noe)
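Result comparison
Input image
ncnn CPU result
ncnn GPU result
o6 NPU result
The CPU and GPU results are identical, while the NPU result is degraded in some regions. This is related to the accuracy loss caused by model quantization.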
RVM CPU/GPU/NPU performance comparison
Comparing the minimum run() time over 20 loop iterations, the NPU is 3.75 times as fast as the CPU.
Mode | Time (ms) |
---|---|
CPU | 240 |
GPU | 180 |
NPU | 64 |
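The 3.75x figure is simply 240 / 64. Below is a small sketch of the measurement method and the arithmetic, written in Python for brevity. `min_time_ms` is a hypothetical helper that mirrors "loop 20 times, keep the minimum"; the post does not show the actual timing code, which was presumably part of the C++ program.

```python
import time


def min_time_ms(fn, repeats=20):
    """Call fn() `repeats` times and return the fastest run in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


# Minimum run() times from the table above, in ms.
timings_ms = {"CPU": 240, "GPU": 180, "NPU": 64}
speedup_vs_cpu = {k: timings_ms["CPU"] / v for k, v in timings_ms.items()}
fps = {k: 1000.0 / v for k, v in timings_ms.items()}
print(speedup_vs_cpu)
print(fps)
```

Taking the minimum rather than the mean filters out warm-up and scheduling noise, which matters most for the first few GPU and NPU runs.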