nihui · 3 days ago

[“星睿O6” Review] RVM portrait segmentation: the full deployment process from torch ➡️ ncnn CPU/GPU and o6 NPU

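The title was shortened. The original title was:
[“星睿O6” AI PC Development Kit Review] RVM portrait segmentation: the full deployment process from torch ➡️ pnnx ➡️ cix quantization ➡️ o6 NPU and ncnn CPU/GPU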

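Arm China (安谋科技), CIX Technology (此芯科技), and Radxa (瑞莎计算机) jointly built the “星睿O6” development kit for AI PC, edge, robotics, and other scenarios. The kit heterogeneously integrates Arm®v9 CPU cores, an Arm Immortalis™ GPU, and Arm China's “周易” (Zhouyi) NPU.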

Introduction to RVM portrait segmentation

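Robust Video Matting (RVM) is a real-time video portrait matting technique developed by a team at ByteDance. Unlike existing neural networks that process each frame as an independent image, RVM uses a recurrent neural network, so it keeps temporal memory while processing a video stream. RVM can perform real-time high-resolution matting on arbitrary videos, reaching 4K at 76 FPS and HD at 104 FPS on an Nvidia GTX 1080 Ti.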

https://github.com/PeterL1n/R...
showreel.gif

torch model ➡️ pnnx ➡️ onnx

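The official release provides two variants, mobilenetv3 and resnet50. Here I choose the resnet50 model, which is more accurate and also better suited to the NPU.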

https://github.com/PeterL1n/R...

pip install torch torchvision
git clone https://github.com/PeterL1n/RobustVideoMatting
cd RobustVideoMatting
pip install -r requirements_inference.txt
wget https://github.com/PeterL1n/RobustVideoMatting/releases/download/v1.0.0/rvm_resnet50.pth

Once the runtime environment and the model are ready, use the pnnx tool to optimize the torch model, then export it to ncnn and onnx.

PNNX (PyTorch Neural Network Exchange) is a new way to deploy PyTorch models. It bypasses the ONNX middleman and exports fairly clean high-level ops.

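The pnnx tool performs powerful graph optimization, so the converted model is usually better suited to inference deployment. It can also export torch code, ncnn models, and onnx models.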

https://github.com/pnnx/pnnx

pip install pnnx

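In the RobustVideoMatting directory, create export_pnnx.py containing a simple inference pass and the export code.

Input files:

  • rvm_resnet50.pth: the original RVM model weights
  • 512.png: the portrait image used for testing

Generated files:

  • rvm_resnet50_pnnx.py: Python code of the optimized RVM model
  • rvm_resnet50.pnnx.param/pnnx.bin: pnnx structure definition and weights of the optimized RVM model
  • rvm_resnet50.ncnn.param/ncnn.bin: the ncnn model exported after optimization
  • rvm_resnet50.onnx: the optimized onnx model
  • in.npy.npz: the preprocessed tensor of 512.png plus the 4 feature tensors obtained from first-frame inference
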
import torch
from torch import nn
from model import MattingNetwork
from PIL import Image
from torchvision import transforms
import numpy
import pnnx

class Model(nn.Module):
    def __init__(self):
        super().__init__()

        self.rvm = MattingNetwork('resnet50').eval()
        self.rvm.load_state_dict(torch.load('rvm_resnet50.pth'))

    def forward_first_frame(self, src):
        return self.rvm(src)

    def forward(self, src, r1, r2, r3, r4):
        return self.rvm(src, r1, r2, r3, r4)

model = Model().eval()

with Image.open('512.png') as img:
    img.load()

transform = transforms.ToTensor()

x = transform(img).unsqueeze(0)

fgr, pha, r1, r2, r3, r4 = model.forward_first_frame(x)

opt_model = pnnx.export(model, "rvm_resnet50.pt", (x, r1, r2, r3, r4))

torch.onnx.export(opt_model, (x, r1, r2, r3, r4), "rvm_resnet50.onnx", export_params=True, opset_version=13, input_names=['in0', 'in1', 'in2', 'in3', 'in4'], output_names=['out0', 'out1', 'out2', 'out3', 'out4', 'out5'])

numpy.savez('in.npy', in0=x.numpy(), in1=r1.detach().numpy(), in2=r2.detach().numpy(), in3=r3.detach().numpy(), in4=r4.detach().numpy())

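Code note 1: model export

RVM supports two inference modes, first frame and subsequent frames:

  • First frame: 512x512 RGB image + 4 features initialized to all zeros ➡️ foreground RGB image + segmentation mask + 4 features of the current frame
  • Subsequent frames: 512x512 RGB image + 4 features from the previous frame ➡️ foreground RGB image + segmentation mask + 4 features of the current frame

If the model is exported in first-frame mode, it will contain a large number of all-zero features, which can easily cause problems in later optimization and quantization. The actual export should use the subsequent-frame mode.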
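
To make the two modes concrete, here is a minimal sketch of how the four recurrent features are threaded from frame to frame. The function `fake_rvm_step` is a hypothetical stand-in for the real network (it only returns arrays of the right shape), and `run_video` is an illustrative name; the state shapes are the ones used for the 512x512 export in this article.

```python
import numpy as np

# Recurrent feature shapes for a 512x512 input (same shapes as in the cix config).
STATE_SHAPES = [(1, 16, 256, 256), (1, 32, 128, 128), (1, 64, 64, 64), (1, 128, 32, 32)]


def fake_rvm_step(src, r1, r2, r3, r4):
    """Hypothetical stand-in for the network: returns fgr, pha and 4 new features."""
    fgr = src.copy()
    pha = np.ones((1, 1) + src.shape[2:], dtype=np.float32)
    # Pretend the features accumulate some information from the current frame.
    new_state = [r + src.mean() for r in (r1, r2, r3, r4)]
    return (fgr, pha, *new_state)


def run_video(frames):
    # First frame: the four features start as all zeros.
    state = [np.zeros(s, dtype=np.float32) for s in STATE_SHAPES]
    alphas = []
    for src in frames:
        # Subsequent frames reuse the features produced by the previous frame.
        fgr, pha, *state = fake_rvm_step(src, *state)
        alphas.append(pha)
    return alphas, state


frames = [np.full((1, 3, 512, 512), 0.5, dtype=np.float32) for _ in range(3)]
alphas, final_state = run_video(frames)
print(len(alphas), [s.shape for s in final_state])
```

Exporting with the non-zero `r1..r4` from a real first-frame pass, as the script above does, corresponds to the second and later iterations of this loop.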

Code note 2: pnnx optimization

opt_model = pnnx.export(model, "rvm_resnet50.pt", (x, r1, r2, r3, r4))

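This line runs the original RVM model through pnnx optimization and returns the optimized model. The optimized model is a standard torch nn.Module, so it can be used directly in later steps such as exporting to onnx.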

Code note 3: preparing calibration data

numpy.savez('in.npy', in0=x.numpy(), in1=r1.detach().numpy(), in2=r2.detach().numpy(), in3=r3.detach().numpy(), in4=r4.detach().numpy())

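The subsequent-frame model needs the image plus 4 features, 5 inputs in total. For a single-input model it is usually enough to save one npy file with numpy.save('in0.npy', in0.numpy()), but multiple inputs have to be saved in the packed npz format, which the cix quantization tool needs later.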
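
As a quick sanity check of the packed format, the sketch below saves five small placeholder arrays (not the real tensors) in the same way and reads them back. It also shows why the file ends up named in.npy.npz: numpy.savez appends ".npz" when the given name does not already end with it. The temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

import numpy as np

# Small placeholder arrays; the real script stores the 512x512 image tensor and r1..r4.
inputs = {f"in{i}": np.full((1, 2, 2), i, dtype=np.float32) for i in range(5)}

workdir = tempfile.mkdtemp()
np.savez(os.path.join(workdir, "in.npy"), **inputs)

# numpy.savez appends ".npz" when the name does not already end with it.
npz_path = os.path.join(workdir, "in.npy.npz")
print(os.path.exists(npz_path))

# The keys are the keyword names passed to savez, one entry per model input.
with np.load(npz_path) as data:
    keys = sorted(data.files)
    shapes = {k: data[k].shape for k in keys}
print(keys)
```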

cix quantization

Write a configuration file named rvm_resnet50.cfg:

[Common]
mode = build

[Parser]
model_type = onnx
model_name = rvm_resnet50
detection_postprocess =
model_domain = object_detection
input_model = ./rvm_resnet50.onnx
output_dir = ./
input_shape = [1, 3, 512, 512],[1, 16, 256, 256],[1, 32, 128, 128],[1, 64, 64, 64],[1, 128, 32, 32]
input = in0,in1,in2,in3,in4

[Optimizer]
calibration_data = in.npy.npz
calibration_batch_size = 1
metric_batch_size = 1
output_dir = ./
dataset = numpymultiinputdataset
save_statistic_info = True
cast_dtypes_for_lib = True
quantize_method_for_activation = per_tensor_asymmetric
quantize_method_for_weight = per_channel_symmetric_restricted_range

[GBuilder]
target = X2_1204MP3
outputs = rvm_resnet50.cix
profile = True
tiling = fps

Config note 1: specifying multiple inputs

input_shape = [1, 3, 512, 512],[1, 16, 256, 256],[1, 32, 128, 128],[1, 64, 64, 64],[1, 128, 32, 32]
input = in0,in1,in2,in3,in4

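Separate the shapes and the input node names with commas. The shapes can be read directly from the last part of the generated rvm_resnet50_pnnx.py, and the node names should match the input_names set in torch.onnx.export.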

def test_inference():
    net = Model()
    net.float()
    net.eval()

    torch.manual_seed(0)
    v_0 = torch.rand(1, 3, 512, 512, dtype=torch.float)
    v_1 = torch.rand(1, 16, 256, 256, dtype=torch.float)
    v_2 = torch.rand(1, 32, 128, 128, dtype=torch.float)
    v_3 = torch.rand(1, 64, 64, 64, dtype=torch.float)
    v_4 = torch.rand(1, 128, 32, 32, dtype=torch.float)

    return net(v_0, v_1, v_2, v_3, v_4)

if __name__ == "__main__":
    print(test_inference())
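
Since the two config lines have to stay in sync with the export script, one option is to generate them from a single list instead of typing them by hand. This is only a convenience sketch; the variable names are illustrative and nothing here is part of the cix tooling.

```python
# Shapes as printed at the end of rvm_resnet50_pnnx.py, in input order.
shapes = [
    [1, 3, 512, 512],
    [1, 16, 256, 256],
    [1, 32, 128, 128],
    [1, 64, 64, 64],
    [1, 128, 32, 32],
]
# Must match the input_names passed to torch.onnx.export.
names = ["in0", "in1", "in2", "in3", "in4"]

# Comma-separated lists, one entry per model input.
input_shape_line = "input_shape = " + ",".join(str(s) for s in shapes)
input_line = "input = " + ",".join(names)

print(input_shape_line)
print(input_line)
```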

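Config note 2: calibration data for multiple inputs

The cix tool supports the numpymultiinputdataset data format.

The quantize_method_for_activation and quantize_method_for_weight settings improve the accuracy of the quantized model.

The resnet50 structure uses ReLU activations heavily, and the asymmetric activation mode conveniently saves the ReLU computation.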

calibration_data = in.npy.npz
dataset = numpymultiinputdataset
quantize_method_for_activation = per_tensor_asymmetric
quantize_method_for_weight = per_channel_symmetric_restricted_range
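
One way to see why an asymmetric scheme suits non-negative activations (an image in [0, 1], or a ReLU output) is to compare it with a symmetric int8 mapping on a few values. The sketch below is illustrative only: the convention q = clamp(round(x * scale) - zero_point) is an assumption made for this example, not a statement about how the cix optimizer implements per_tensor_asymmetric.

```python
def quantize(values, scale, zero_point, qmin, qmax):
    """Assumed convention: q = clamp(round(x * scale) - zero_point, qmin, qmax)."""
    return [max(qmin, min(qmax, round(v * scale) - zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Inverse of the assumed convention above."""
    return [(q + zero_point) / scale for q in qvalues]


# Non-negative activations, such as an image in [0, 1] or a ReLU output.
acts = [0.0, 0.1, 0.25, 0.5, 0.999, 1.0]

# Asymmetric uint8: the whole 0..255 range covers [0, 1].
q_asym = quantize(acts, scale=255.0, zero_point=0, qmin=0, qmax=255)
# Symmetric int8: the negative half of the range is never used for these values.
q_sym = quantize(acts, scale=127.0, zero_point=0, qmin=-127, qmax=127)

err_asym = max(abs(a - b) for a, b in zip(acts, dequantize(q_asym, 255.0, 0)))
err_sym = max(abs(a - b) for a, b in zip(acts, dequantize(q_sym, 127.0, 0)))
print(q_asym, q_sym)
print(err_asym < err_sym)
```

With the same 8 bits, the asymmetric mapping has roughly twice as many levels available for non-negative data, which is where the smaller round-trip error comes from.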

Fixing the libaipu_simulator_x2.so error

After updating to CixBuilder-6.1.3119.3-py3-none-linux_x86_64.whl, running cixbuild fails because libaipu_simulator_x2.so cannot be found:

cixbuild rvm_resnet50.cfg
Traceback (most recent call last):
  File "/home/nihui/.local/bin/cixbuild", line 8, in <module>
    sys.exit(build_main())
  File "/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/build_cix.py", line 321, in build_main
    main()
  File "AIPUBuilder/main.pyx", line 89, in AIPUBuilder.main.main
  File "/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/model/__init__.py", line 7, in <module>
    from .aipullm import AIPULLM
  File "AIPUBuilder/model/aipullm.pyx", line 9, in init AIPUBuilder.model.aipullm
  File "/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/cruntime/__init__.py", line 1, in <module>
    from AIPUBuilder._C.CompassLLM import (
ImportError: libaipu_simulator_x2.so: cannot open shared object file: No such file or directory

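Investigation shows that the previous version did not have this dependency; it is new in this version. Locate the directory containing the .so and add it to LD_LIBRARY_PATH: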

export LD_LIBRARY_PATH=/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
cixbuild rvm_resnet50.cfg
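
If the install location differs from the path above, the directory can be located by searching for the library file. The helper below is a hypothetical convenience function, demonstrated on a throwaway directory tree that mimics the layout; on a real machine, `root` would be the site-packages directory of the Python environment in use.

```python
import os
import tempfile


def find_library_dir(root, lib_name):
    """Return the first directory under root that contains lib_name, or None."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if lib_name in filenames:
            return dirpath
    return None


# Demo on a temporary tree that mimics the AIPUBuilder/simulator-lib layout.
root = tempfile.mkdtemp()
lib_dir = os.path.join(root, "AIPUBuilder", "simulator-lib")
os.makedirs(lib_dir)
open(os.path.join(lib_dir, "libaipu_simulator_x2.so"), "w").close()

found = find_library_dir(root, "libaipu_simulator_x2.so")
print(f"export LD_LIBRARY_PATH={found}")
```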

The output is shown below; the final result is rvm_resnet50.cix.

nihui@nihui-pc:~/dev/o6-test$ export LD_LIBRARY_PATH=/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
nihui@nihui-pc:~/dev/o6-test$ cixbuild rvm_resnet50.cfg
[I] Build with version 6.1.3119
[I] Parsing model....
[I] [Parser]: Begin to parse onnx model rvm_resnet50...
2025-04-15 22:47:46.411728: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/nihui/.local/lib/python3.8/site-packages/cv2/../../lib64:/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib:/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
2025-04-15 22:47:46.411753: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2025-04-15 22:47:47.518775: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/nihui/.local/lib/python3.8/site-packages/cv2/../../lib64:/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib:/home/nihui/.local/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
2025-04-15 22:47:47.518799: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2025-04-15 22:47:47.518820: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (nihui-pc): /proc/driver/nvidia/version does not exist
2025-04-15 22:47:48.197359: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[I] [Parser]: The input tensor(s) is/are: in0_0,in1_0,in2_0,in3_0,in4_0
[I] [Parser]: Input in0 from cfg is shown as tensor in0_0 in IR!
[I] [Parser]: Input in1 from cfg is shown as tensor in1_0 in IR!
[I] [Parser]: Input in2 from cfg is shown as tensor in2_0 in IR!
[I] [Parser]: Input in3 from cfg is shown as tensor in3_0 in IR!
[I] [Parser]: Input in4 from cfg is shown as tensor in4_0 in IR!
[I] [Parser]: 0 error(s), 0 warning(s) generated.
[I] [Parser]: Parser done!
[I] Parse model complete
[I] Simplifying float model.
[I] [IRChecker] Start to check IR: /home/nihui/dev/o6-test/internal_2025_4_15_22_47_45_xk63i/rvm_resnet50.txt
[I] [IRChecker] model_name: rvm_resnet50
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1600] loading graph weight: /home/nihui/dev/o6-test/./internal_2025_4_15_22_47_45_xk63i/rvm_resnet50.bin size: 0x667a290
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[I] GSim simplified result:
------------------------------------------------------------------------
            OpType.Transpose:  -12
              OpType.Eltwise:   -1
                  OpType.Mul:   +1
              OpType.Reshape:   +1
                 OpType.Tile:   -1
------------------------------------------------------------------------
[I] Simplify Done.
[I] Simplify float model Done.
[I] Optimizing model....
[I] [OPT] [22:47:54]: [arg_parser] is running.
[I] [OPT] [22:47:54]: tool name: Compass-Optimizer, version: 1.3.3119, use cuda: False, running device: cpu
[I] [OPT] [22:47:54]: [quantization config Info][model name]: rvm_resnet50, [quantization method for weight]: per_channel_symmetric_restricted_range, [quantization method for activation]: per_tensor_asymmetric, [calibation strategy for weight]: extrema, [calibation strategy for activation]: mean, [quantization precision]: activation_bits=8, weight_bits=8, bias_bits=32, lut_items_in_bits=8

[I] [OPT] [22:47:54]: Suggest using "aipuchecker" to validate the IR firstly if you are not sure about its validity.
[I] [OPT] [22:47:54]: IR loaded.
Building graph: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:00<00:00, 1466.19it/s]
[I] [OPT] [22:47:54]: Begin to load weights.
[I] [OPT] [22:47:54]: Weights loaded.
Deserializing bin: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 146/146 [00:00<00:00, 7408.10it/s]
[I] [OPT] [22:47:54]: Successfully parsed IR with python API.
[I] [OPT] [22:47:54]: init graph by forwarding one sample filled with zeros
forward_to: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:01<00:00, 180.11it/s]
[I] [OPT] [22:47:55]: [graph_optimize_stage1] is running.
[I] [OPT] [22:47:55]: [statistic] is running.
statistic batch: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:03<00:00,  3.27s/it]
[I] [OPT] [22:47:59]: [graph_optimize_stage2] is running.
[I] [OPT] [22:47:59]: applying calibration strategy based on statistic info
calibration: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:00<00:00, 25190.19it/s]
[I] [OPT] [22:47:59]: [quantize] is running.
update_tensor_quantization_attrs:   0%|                                                                                                                                                                                                                                                             | 0/184 [00:00<?, ?it/s][W] [OPT] [22:47:59]: due to hardware limitations, it is actually doing per-2-channel quantization, which may cause accuracy dropping: layer_id=7, type=OpType.Convolution, name=/convbn2d_0/Conv_clone_, rescale values differ sharpely whithin channels,

.... many warnings omitted

quantize each layer: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 193/193 [00:00<00:00, 1909.74it/s]
[I] [OPT] [22:48:00]: collecting per-layer similarity infomation between float graph and quanted graph by forwarding 1 sample on both of them
[I] [OPT] [22:48:03]: [graph_optimize_stage3] is running.
[I] [OPT] [22:48:03]: [serialize] is running.
[I] [OPT] [22:48:03]: check the final graph by forwarding one sample filled with zeros
forward_to: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 184/184 [00:01<00:00, 103.22it/s]
[I] [OPT] [22:48:05]: Begin to serialzie IR
Writing IR: 184it [00:00, 5817.17it/s]
Serializing bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 593/593 [00:00<00:00, 54544.35it/s]
[I] [OPT] [22:48:05]: IR has been saved into /home/nihui/dev/o6-test/./internal_2025_4_15_22_47_45_xk63i
[I] [OPT] [22:48:05]: Compass-Optimizer has done at [serialize] period.
[I] [OPT] [22:48:05]: [Done]cost time: 20s, and [qinfos(scale, zp, dtype)]: out: [[255.0, 0, UINT8], [255.0, 0, UINT8], [127.50077819824219, 1, INT8], [127.5284652709961, 0, INT8], [127.5, 0, INT8], [127.50006103515625, 1, INT8]] in: [[255.0, 0, UINT8], [127.51888275146484, 1, INT8], [127.72821044921875, 0, INT8], [127.50000762939453, 1, INT8], [127.50524139404297, 0, INT8]] [output tensors cosine]: [0.9980738479792883, 0.9999214568249806, 0.9992704966369025, 0.9984071475614615, 0.9973715154248336, 0.994706231139195][output tensors MSE]: [0.0005080156843177974, 4.108408393221907e-05, 0.0007104792748577893, 0.0008333915611729026, 0.001465060980990529, 0.0011862480314448476]
[I] Optimizing model complete
[I] Simplifying quant model...
[I] [IRChecker] Start to check IR: /home/nihui/dev/o6-test/internal_2025_4_15_22_47_45_xk63i/rvm_resnet50_quant.txt
[I] [IRChecker] model_name: rvm_resnet50
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1600] loading graph weight: /home/nihui/dev/o6-test/./internal_2025_4_15_22_47_45_xk63i/rvm_resnet50_quant.bin size: 0x1a1cad0
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[I] GSim simplified result:
------------------------------------------------------------------------
            OpType.Transpose:  -16
           OpType.Activation:   -1
------------------------------------------------------------------------
[I] Simplify Done.
[I] Simplify quant model Done.
[I] Building ...
[I] [IRChecker] Start to check IR: /home/nihui/dev/o6-test/internal_2025_4_15_22_47_45_xk63i/rvm_resnet50_quant_s.txt
[I] [IRChecker] model_name: rvm_resnet50
[I] [IRChecker] IRChecker: All IR pass
[I] [tools.cpp : 352] BuildTool version: 6.1.3119. Build for target X2_1204MP3 PID: 17449
[I] [tools.cpp : 372] using default profile events to profile default
[I] [tools.cpp : 834] global cwd: /tmp/b098c1a7ef02ba7d25d0b238c21a27d9a521a043171e820632d2419e6fde
[I] [graph.cpp :1600] loading graph weight: /home/nihui/dev/o6-test/./internal_2025_4_15_22_47_45_xk63i/rvm_resnet50_quant_s.bin size: 0x1a1c9ac
[I] [tiling.cpp:4500] Auto tiling now, please wait ...
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_0 maybe this merge point could be removed
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_1 maybe this merge point could be removed
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_2 maybe this merge point could be removed
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_3 maybe this merge point could be removed
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_4 maybe this merge point could be removed
[W] [tiling.cpp:3918] merge point: /Add_2/tile/concat have a tiled child: /Concat_7/tile/in_crop_pre/1_5 maybe this merge point could be removed

.... many warnings omitted

[I] [layoutconvertor.cpp: 276] Building /Concat_15/tile/out/9/pad/layout/NCHWC32T8...
[I] [aipu_plugin_tpc.cpp: 173] LayoutConvertor(/Concat_15/tile/out/11/pad/layout/NCHWC32T8)uses tensor-process-lib
[I] [aipu_plugin_tpc.cpp: 173] LayoutConvertor(/Concat_15/tile/out/7/pad/layout/NCHWC32T8)uses tensor-process-lib
[I] [aipu_plugin_tpc.cpp: 173] LayoutConvertor(/Concat_15/tile/out/8/pad/layout/NCHWC32T8)uses tensor-process-lib
[I] [aipu_plugin_tpc.cpp: 173] LayoutConvertor(/Concat_15/tile/out/9/pad/layout/NCHWC32T8)uses tensor-process-lib
[I] [builder.cpp:1938] The graph DDR Footprint requirement(estimation) of feature maps:
[I] [builder.cpp:1939]     Read and Write:676.86MB
[I] [builder.cpp:1080] Reduce constants memory size: 42.164MB
[W] [ar_reader.cpp: 142] name offset not found
[W] [ar_reader.cpp: 142] name offset not found
[W] [ar_reader.cpp: 142] name offset not found
[W] [ar_reader.cpp: 142] name offset not found
[W] [ar_reader.cpp:  63] /usr/bin//../lib//libmcheck.ais not a archive file.
[I] [builder.cpp:2411] memory statistics for this graph (rvm_resnet50)
[I] [builder.cpp: 585] Total memory     :       0x020ee424 Bytes ( 32.931MB)
[I] [builder.cpp: 585] Text      section:       0x000ce550 Bytes (  0.806MB)
[I] [builder.cpp: 585] RO        section:       0x00031a00 Bytes (  0.194MB)
[I] [builder.cpp: 585] Desc      section:       0x00112a00 Bytes (  1.073MB)
[I] [builder.cpp: 585] Data      section:       0x01e4dc20 Bytes ( 30.304MB)
[I] [builder.cpp: 585] BSS       section:       0x0004dab4 Bytes (  0.303MB)
[I] [builder.cpp: 585] Stack            :       0x00040400 Bytes (  0.251MB)
[I] [builder.cpp: 585] Workspace(BSS)   :       0x00100000 Bytes (  1.000MB)
[I] [builder.cpp:2427]
[I] [tools.cpp :1181]  -  compile time: 2.914 s
[I] [tools.cpp :1087] With GM optimization, DDR Footprint stastic(estimation):
[I] [tools.cpp :1094]     Read and Write:717.02MB
[I] [tools.cpp :1137]  -  draw graph time: 0.083 s
[I] [tools.cpp :1954] remove global cwd: /tmp/b098c1a7ef02ba7d25d0b238c21a27d9a521a043171e820632d2419e6fde
build success.......
Total errors: 0,  warnings: 128

o6 ncnn CPU/GPU C++ deployment

Let's quickly build the ncnn library!

git clone https://github.com/Tencent/ncnn.git
cd ncnn
git submodule update --init --recursive --depth 1
mkdir build
cd build
cmake -DNCNN_VULKAN=ON ..
make -j12
make install

Next, set up a cmake project:

cmake_minimum_required(VERSION 3.10)
project(rvm)

set(CMAKE_BUILD_TYPE Release)

find_package(OpenCV REQUIRED)

set(ncnn_DIR "/home/radxa/ncnn/build/install/lib/cmake/ncnn")
find_package(ncnn REQUIRED)

add_executable(rvm rvm.cpp)
target_link_libraries(rvm ncnn ${OpenCV_LIBS})

Write the ncnn RVM inference code. It takes BGR input, outputs the composited BGR image, and supports CPU/GPU.

#include <opencv2/core/core.hpp> // cv::Mat

#include "net.h"

class RVM_ncnn
{
public:
    void load(bool use_gpu = false)
    {
        net.opt.use_vulkan_compute = use_gpu;
        net.load_param("/home/radxa/rvm/rvm_resnet50.ncnn.param");
        net.load_model("/home/radxa/rvm/rvm_resnet50.ncnn.bin");

        r1 = ncnn::Mat(256, 256, 16);
        r2 = ncnn::Mat(128, 128, 32);
        r3 = ncnn::Mat(64, 64, 64);
        r4 = ncnn::Mat(32, 32, 128);
        r1.fill(0.0f);
        r2.fill(0.0f);
        r3.fill(0.0f);
        r4.fill(0.0f);
    }

    void run(const cv::Mat& bgr, cv::Mat& out)
    {
        ncnn::Extractor ex = net.create_extractor();

        ncnn::Mat in0 = ncnn::Mat::from_pixels(bgr.data, ncnn::Mat::PIXEL_BGR2RGB, 512, 512);

        const float mean_vals[3] = {0, 0, 0};
        const float norm_vals[3] = {1 / 255.0, 1 / 255.0, 1 / 255.0};
        in0.substract_mean_normalize(mean_vals, norm_vals);

        ex.input("in0", in0);
        ex.input("in1", r1);
        ex.input("in2", r2);
        ex.input("in3", r3);
        ex.input("in4", r4);

        ncnn::Mat fgr;
        ncnn::Mat pha;
        ex.extract("out0", fgr);
        ex.extract("out1", pha);

        const float demean_vals[3] = {0, 0, 0};
        const float denorm_vals[3] = {255.0, 255.0, 255.0};
        fgr.substract_mean_normalize(demean_vals, denorm_vals);

        fgr.to_pixels(out.data, ncnn::Mat::PIXEL_RGB2BGR);

        // composite
        for (int y = 0; y < 512; y++)
        {
            unsigned char* p = (unsigned char*)out.data + y * 512 * 3;
            const float* ppha = (const float*)pha.data + y * 512;
            for (int x = 0; x < 512; x++)
            {
                float alpha = *ppha++;

                // blend the foreground over a solid background color
                p[0] = p[0] * alpha + (1 - alpha) * 155;
                p[1] = p[1] * alpha + (1 - alpha) * 255;
                p[2] = p[2] * alpha + (1 - alpha) * 120;
                p += 3;
            }
        }
    }

private:
    ncnn::Net net;

    ncnn::Mat r1;
    ncnn::Mat r2;
    ncnn::Mat r3;
    ncnn::Mat r4;
};
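
The compositing loop in run() is a plain alpha blend of the predicted foreground over a solid background color. For reference, the same per-pixel arithmetic can be sketched in Python; `composite_pixel` is an illustrative name, and the default background is the (155, 255, 120) triple used in the C++ code.

```python
def composite_pixel(fg, alpha, bg=(155, 255, 120)):
    """Blend one foreground pixel over a solid background; alpha is in [0, 1]."""
    # int() truncates, like the float to unsigned char conversion in the C++ loop.
    return tuple(int(f * alpha + (1.0 - alpha) * b) for f, b in zip(fg, bg))


opaque = composite_pixel((10, 20, 30), 1.0)       # foreground only
transparent = composite_pixel((10, 20, 30), 0.0)  # background only
half = composite_pixel((0, 0, 0), 0.5)            # half-way blend with black
print(opaque, transparent, half)
```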

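o6 NPU C++ deployment

Write the cix NPU RVM inference code. It takes BGR input and outputs the composited BGR image.

The int8/uint8 scale factors for the inputs and outputs can be found in internal/rvm_resnet50_quant_s.txt generated by cixbuild.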

#include <stdio.h>

#include <algorithm> // std::min

#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <opencv2/imgproc/imgproc.hpp>

#include <npu/cix_noe_standard_api.h>

class RVM_noe
{
public:
    RVM_noe()
    {
        ctx = 0;
        graph_id = 0;
        job_id = 0;

        noe_init_context(&ctx);
    }

    ~RVM_noe()
    {
        noe_clean_job(ctx, job_id);
        noe_unload_graph(ctx, graph_id);
        noe_deinit_context(ctx);
    }

    void load()
    {
        noe_load_graph(ctx, "/home/radxa/rvm/rvm_resnet50.cix", &graph_id);

        noe_dynshape_param_t dynshape = {0, 0};

        job_config_npu_t job_cfg_npu;
        job_cfg_npu.partition_id = 0;
        job_cfg_npu.dbg_dispatch = 0;
        job_cfg_npu.dbg_core_id = 0;
        job_cfg_npu.fm_idxes = 0;
        job_cfg_npu.fm_idxes_cnt = 0;
        job_cfg_npu.dynshape = &dynshape;

        job_config_t job_cfg = {&job_cfg_npu};

        noe_create_job(ctx, graph_id, &job_id, &job_cfg);

        r1 = cv::Mat({16, 256, 256}, CV_8UC1);
        r2 = cv::Mat({32, 128, 128}, CV_8UC1);
        r3 = cv::Mat({64, 64, 64}, CV_8UC1);
        r4 = cv::Mat({128, 32, 32}, CV_8UC1);
        r1 = cv::Scalar(0);
        r2 = cv::Scalar(0);
        r3 = cv::Scalar(0);
        r4 = cv::Scalar(0);
    }

    void run(const cv::Mat& bgr, cv::Mat& out)
    {
        cv::Mat rgb({3, 512, 512}, CV_8UC1);

        // bgr is interleaved B,G,R; the model expects planar R,G,B
        for (int y = 0; y < 512; y++)
        {
            const unsigned char* p = (const unsigned char*)bgr.data + y * 512 * 3;
            signed char* pr = (signed char*)rgb.data + y * 512;
            signed char* pg = pr + 512 * 512;
            signed char* pb = pg + 512 * 512;
            for (int x = 0; x < 512; x++)
            {
                // 0~255 to 0~127
                *pr++ = p[2] * 127 / 255;
                *pg++ = p[1] * 127 / 255;
                *pb++ = p[0] * 127 / 255;
                p += 3;
            }
        }

        noe_load_tensor(ctx, job_id, 0, rgb.data);
        noe_load_tensor(ctx, job_id, 1, r1.data);
        noe_load_tensor(ctx, job_id, 2, r2.data);
        noe_load_tensor(ctx, job_id, 3, r3.data);
        noe_load_tensor(ctx, job_id, 4, r4.data);

        noe_job_infer_sync(ctx, job_id, 2000);

        cv::Mat fgr({3, 512, 512}, CV_8UC1);
        cv::Mat pha({1, 512, 512}, CV_8UC1);

        noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 0, fgr.data);
        noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 1, pha.data);
        // noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 2, r1.data);
        // noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 3, r2.data);
        // noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 4, r3.data);
        // noe_get_tensor(ctx, job_id, NOE_TENSOR_TYPE_OUTPUT, 5, r4.data);

        // fgr is planar R,G,B; out is interleaved B,G,R
        for (int y = 0; y < 512; y++)
        {
            unsigned char* p = (unsigned char*)out.data + y * 512 * 3;
            const unsigned char* pr = (const unsigned char*)fgr.data + y * 512;
            const unsigned char* pg = pr + 512 * 512;
            const unsigned char* pb = pg + 512 * 512;
            const unsigned char* ppha = (const unsigned char*)pha.data + y * 512;
            for (int x = 0; x < 512; x++)
            {
                float alpha = *ppha++ / 255.f;

                // 0~127 to 0~255, then blend over a solid background color
                p[0] = std::min((int)*pb++, 127) * 255 / 127 * alpha + (1 - alpha) * 155;
                p[1] = std::min((int)*pg++, 127) * 255 / 127 * alpha + (1 - alpha) * 255;
                p[2] = std::min((int)*pr++, 127) * 255 / 127 * alpha + (1 - alpha) * 120;
                p += 3;
            }
        }
    }

public:
    context_handler_t* ctx;
    uint64_t graph_id;
    uint64_t job_id;

    cv::Mat r1;
    cv::Mat r2;
    cv::Mat r3;
    cv::Mat r4;
};

int main()
{
    RVM_ncnn rvm_cpu;
    rvm_cpu.load(false);

    RVM_ncnn rvm_gpu;
    rvm_gpu.load(true);

    RVM_noe rvm_npu;
    rvm_npu.load();

    cv::Mat bgr = cv::imread("/home/radxa/rvm/512.png", 1);

    cv::Mat out_cpu(512, 512, CV_8UC3);
    cv::Mat out_gpu(512, 512, CV_8UC3);
    cv::Mat out_npu(512, 512, CV_8UC3);

    rvm_cpu.run(bgr, out_cpu);
    rvm_gpu.run(bgr, out_gpu);
    rvm_npu.run(bgr, out_npu);

    cv::imwrite("/home/radxa/rvm/out_cpu.png", out_cpu);
    cv::imwrite("/home/radxa/rvm/out_gpu.png", out_gpu);
    cv::imwrite("/home/radxa/rvm/out_npu.png", out_npu);

    return 0;
}

Add the noe header include path and link against noe in cmake:

cmake_minimum_required(VERSION 3.10)
project(rvm)

set(CMAKE_BUILD_TYPE Release)

find_package(OpenCV REQUIRED)

set(ncnn_DIR "/home/radxa/ncnn/build/install/lib/cmake/ncnn")
find_package(ncnn REQUIRED)

include_directories("/usr/share/cix/include")
link_directories("/usr/share/cix/lib")

add_executable(rvm rvm.cpp)
target_link_libraries(rvm ncnn ${OpenCV_LIBS} noe)

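Results comparison

Input image

512.png

ncnn CPU result

out_cpu.png

ncnn GPU result

out_gpu.png

o6 NPU result

out_npu.png

The CPU and GPU results are identical, while the NPU result is degraded in some regions. This is related to the accuracy loss caused by model quantization.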
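
The cixbuild log reports a cosine similarity per output tensor between the float and quantized graphs. To get a feel for that metric, the sketch below rounds a toy alpha ramp to 8-bit levels and compares it with the float original. `fake_quantize` is a simplified illustration, not the NPU's actual quantization path, and the ramp is synthetic data rather than a real matte.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def fake_quantize(values, scale=255.0):
    """Round to 8-bit levels and map back to float (illustrative only)."""
    return [min(255, max(0, round(v * scale))) / scale for v in values]


# A toy alpha ramp standing in for one row of a float matte.
float_alpha = [i / 999 for i in range(1000)]
quant_alpha = fake_quantize(float_alpha)

similarity = cosine_similarity(float_alpha, quant_alpha)
max_error = max(abs(a - b) for a, b in zip(float_alpha, quant_alpha))
print(similarity, max_error)
```

A single rounding step like this keeps the similarity very close to 1. In a real network the error accumulates across many quantized layers, so the per-output values reported by the tool are lower.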

RVM CPU/GPU/NPU performance comparison

Comparing the minimum run() time over 20 loop iterations, the NPU is 3.75x faster than the CPU.

Mode    Time (ms)
CPU     240
GPU     180
NPU     64
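
The measurement method (run 20 times, keep the minimum) and the speedup arithmetic can be sketched as follows. `min_time_ms` is an illustrative helper, not the code used for the article's numbers; the `reported` values are copied from the table above.

```python
import time


def min_time_ms(fn, repeats=20):
    """Call fn repeatedly and return the fastest wall-clock time in milliseconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, (time.perf_counter() - start) * 1000.0)
    return best


# Minimum run() times reported in the table above (ms).
reported = {"CPU": 240, "GPU": 180, "NPU": 64}
speedup_vs_cpu = {k: reported["CPU"] / v for k, v in reported.items()}
fps = {k: 1000.0 / v for k, v in reported.items()}

print(speedup_vs_cpu)
print(fps)

# Example use of the timing helper on a trivial workload.
elapsed = min_time_ms(lambda: sum(range(1000)))
print(elapsed >= 0.0)
```

Taking the minimum rather than the mean filters out one-off delays such as cache warm-up or scheduler noise.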

o6.png
