[“星睿O6” AI PC Development Kit Review] + Deploying a tensorflow Model to the NPU

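The previous article covered setting up and testing the tensorflow environment. This article covers deploying models to the NPU and a simple performance comparison, in two parts:

Walking through the deployment flow for the official onnx_resnet_v1_50 model, with an NPU vs. CPU comparison
Deploying my own mnist model to the “星睿O6” NPU and comparing it against the CPU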

Official model

Through the early-bird program, download the NeuralOne AI SDK and install CixBuilder-6.1.3119.3-py3-none-linux_x86_64.whl on a PC.
1. Here I used conda on my desktop to create a python3.8 virtual environment:
```
conda create --name radxa_red python=3.8
conda activate radxa_red
```
2. Install the dependencies inside the virtual environment:
```
pip3 install -r requirements.txt 
```
3. Check the cixbuild version to confirm that cixbuild runs correctly:
```
▸ cixbuild -v
/home/red/.conda/envs/radxa_red/bin/cixbuild   version: 6.1.3119
```
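Once cixbuild runs correctly, clone the ai_model_hub repository and go to models/ComputeVision/Image_Classification/onnx_resnet_v1_50. Following the readme, download the resnet50-v1-12.onnx model from https://github.com/onnx/model... into the model directory (which has to be created first).
1. Install onnxsim and simplify the model: onnxsim model/resnet50-v1-12.onnx model/resnet50-v1-12-sim.onnx
2. Build the cix model: cixbuild cfg/onnx_resnet_v1_50build.cfg. According to the config file, the datasets directory already contains the calibration data calib_data.npy, so I used it as-is, but the build failed with ValueError: Cannot load file containing pickled data when allow_pickle=False. The calib_data.npy that ships with the repository could not be loaded as a plain array, so I used the code from the readme to store the images under test_data as a numpy binary file. Running that code complained that imageio was missing, which pip3 install imageio fixed.
3. After that, the build went through and produced the resnet_v1_50.cix model file.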
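
The ValueError in step 2 comes from np.load being handed a file it cannot read as a plain array. As a quick check before building, the sketch below tests whether a file begins with the magic bytes that every .npy file starts with. The helper name looks_like_npy is hypothetical, only the standard library is used, and this is a rough sanity check rather than part of the cixbuild tooling.

```python
import os
import tempfile

NPY_MAGIC = b"\x93NUMPY"  # every .npy file starts with these six bytes


def looks_like_npy(path):
    """Return True if the file at `path` starts with the .npy magic bytes."""
    with open(path, "rb") as f:
        return f.read(len(NPY_MAGIC)) == NPY_MAGIC


def demo():
    """Create one file with the .npy magic and one without, then check both."""
    with tempfile.TemporaryDirectory() as tmp:
        good = os.path.join(tmp, "good.npy")
        bad = os.path.join(tmp, "bad.npy")
        with open(good, "wb") as f:
            f.write(NPY_MAGIC + b"\x01\x00")  # magic plus version bytes only
        with open(bad, "wb") as f:
            f.write(b"not an array file\n")
        return looks_like_npy(good), looks_like_npy(bad)


print(demo())
```

On the PC this could be called as looks_like_npy("datasets/calib_data.npy") before running cixbuild; a False result means the calibration file needs to be regenerated.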

Testing the original onnx model and the cix model on the PC

Original onnx model test, average time: 0.013521631558736166 s

image path : test_data/ILSVRC2012_val_00002899.JPEG
rock python, rock snake, Python sebae
image path : test_data/ILSVRC2012_val_00004704.JPEG
plunger, plumber's helper
image path : test_data/ILSVRC2012_val_00021564.JPEG
coucal
image path : test_data/ILSVRC2012_val_00024154.JPEG
Ibizan hound, Ibizan Podenco
image path : test_data/ILSVRC2012_val_00037133.JPEG
ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus
image path : test_data/ILSVRC2012_val_00045790.JPEG
Yorkshire terrier
[0.013648509979248047, 0.012284517288208008, 0.015871047973632812, 0.011651992797851562, 0.01418757438659668, 0.013486146926879883]
0.013521631558736166

Testing the original onnx model on the CPU and the cix model on the NPU, both on the O6

CPU running the original onnx model: average time per inference 0.14705955982208252 s

image path : test_data/ILSVRC2012_val_00004704.JPEG
plunger, plumber's helper
image path : test_data/ILSVRC2012_val_00021564.JPEG
coucal
image path : test_data/ILSVRC2012_val_00024154.JPEG
Ibizan hound, Ibizan Podenco
image path : test_data/ILSVRC2012_val_00037133.JPEG
ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus
image path : test_data/ILSVRC2012_val_00002899.JPEG
rock python, rock snake, Python sebae
image path : test_data/ILSVRC2012_val_00045790.JPEG
Yorkshire terrier
[0.16982293128967285, 0.14194154739379883, 0.1459333896636963, 0.14214777946472168, 0.14238333702087402, 0.14012837409973145]
0.14705955982208252

NPU running the cix model: average time per inference 0.0043375492095947266 s

▸ python3 inference_npu.py --images test_data --model_path resnet_v1_50.cix
npu: noe_init_context success
npu: noe_load_graph success
Input tensor count is 1.
Output tensor count is 1.
npu: noe_create_job success
image path : test_data/ILSVRC2012_val_00004704.JPEG
plunger, plumber's helper
image path : test_data/ILSVRC2012_val_00021564.JPEG
coucal
image path : test_data/ILSVRC2012_val_00024154.JPEG
Ibizan hound, Ibizan Podenco
image path : test_data/ILSVRC2012_val_00037133.JPEG
ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus
image path : test_data/ILSVRC2012_val_00002899.JPEG
rock python, rock snake, Python sebae
image path : test_data/ILSVRC2012_val_00045790.JPEG
Yorkshire terrier
[0.00441288948059082, 0.004355669021606445, 0.004000186920166016, 0.0040247440338134766, 0.004605293273925781, 0.00462651252746582]
0.0043375492095947266
npu: noe_clean_job success
npu: noe_unload_graph success
npu: noe_deinit_context success
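
To put the three runs side by side, the sketch below recomputes the means from the per-inference times printed above and derives the speedup ratios. The helper name speedup is hypothetical; the numbers are copied from the outputs, and six samples per run is too few for more than a rough comparison.

```python
from statistics import mean

# Per-inference times in seconds, copied from the three runs above
pc_cpu = [0.013648509979248047, 0.012284517288208008, 0.015871047973632812,
          0.011651992797851562, 0.01418757438659668, 0.013486146926879883]
o6_cpu = [0.16982293128967285, 0.14194154739379883, 0.1459333896636963,
          0.14214777946472168, 0.14238333702087402, 0.14012837409973145]
o6_npu = [0.00441288948059082, 0.004355669021606445, 0.004000186920166016,
          0.0040247440338134766, 0.004605293273925781, 0.00462651252746582]


def speedup(baseline, candidate):
    """How many times faster `candidate` is than `baseline`, by mean time."""
    return mean(baseline) / mean(candidate)


for name, times in [("PC CPU", pc_cpu), ("O6 CPU", o6_cpu), ("O6 NPU", o6_npu)]:
    print(f"{name}: {mean(times) * 1000:.2f} ms per inference")

print(f"O6 NPU vs O6 CPU: {speedup(o6_cpu, o6_npu):.1f}x")
print(f"O6 NPU vs PC CPU: {speedup(pc_cpu, o6_npu):.1f}x")
```

By these figures the NPU is roughly 34 times faster than the O6 CPU and about 3 times faster than the PC CPU on resnet50.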

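My own mnist model

Having adapted the official model, I had a basic picture of the whole deployment flow. So how do you adapt your own model? This part records what I found along the way.

According to the official NPU SDK User Guide, TensorFlow 1.0–2.6 models are supported, yet the CixBuilder-6.1.3119.3-py3-none-linux_x86_64 build tool depends on tensorflow 2.7.0. My attempts to build the tensorflow model directly kept failing, and a search through ai_model_hub turned up no example that builds a tensorflow model directly. So I tried converting the tensorflow model to onnx with tf2onnx and building from that, which turned out to work. Below is how I got my own tensorflow mnist model running on the “星睿O6” NPU.
The tf2onnx repository states support for TensorFlow 2.9-2.15 and Python 3.7-3.10, so I created another python 3.8 virtual environment with conda. (PS: that makes two python virtual environments so far: the cixbuild one with python3.8 and tensorflow2.7, used to build cix models, and one with python3.8 and tensorflow2.9, used to convert tensorflow models to onnx.)

In the new conda virtual environment, the following code converts the mnist model built earlier: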

import tensorflow as tf
import tf2onnx

# Load your TensorFlow Keras model or SavedModel
model = tf.keras.models.load_model('model') # Or tf.saved_model.load('your_saved_model')

# Define the input signature
# Pin the input shape explicitly. In my tests, shape=(None,1,28,28) makes the converted
# onnx model end up with layer_top_shape=[[0,1,28,28]], which fails, so the shape is fixed here
input_signature = [tf.TensorSpec(shape=(1, 1, 28, 28), dtype=tf.float32, name='input_1')]

# Convert the model to ONNX
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=input_signature, opset=15)

# Save the ONNX model
with open("model00.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())

The code above converts the tensorflow saved model in the model directory into model00.onnx. After the conversion, check the input and output nodes and their names with the following code:

#!/home/Red/.conda/envs/radxa_red/bin/python3

import onnx
import sys
from os.path import isfile

# Load the ONNX model
model_path = "your_model.onnx"  # Replace with the actual path to your ONNX model

# A path given on the command line overrides the default
if len(sys.argv) > 1:
    print(sys.argv[1])
    model_path = sys.argv[1]

if isfile(model_path):
    onnx_model = onnx.load(model_path)
else:
    print("no such file ", model_path)
    sys.exit(1)

# Get the input and output information
inputs = onnx_model.graph.input
outputs = onnx_model.graph.output

for output in outputs:
    print(f"Output Name: {output.name}")

print("Input shapes of the ONNX model:")
for graph_input in inputs:
    print(f"  Name: {graph_input.name}")
    if graph_input.type.HasField("tensor_type"):
        # A fixed dimension carries dim_value; a dynamic one carries a symbolic dim_param
        shape = [
            d.dim_value if d.HasField("dim_value") else d.dim_param
            for d in graph_input.type.tensor_type.shape.dim
        ]
        print(f"    Shape: {shape}")
    else:
        print("    Shape: Unknown")

Running it gives the following result:

▸ ./get_onnx_inputshape.py ~/Samba/ai_code_pc/model00.onnx
/home/Red/Samba/ai_code_pc/model00.onnx
Output Name: dense_1
Input shapes of the ONNX model:
 Name: input_1
 Shape: [1, 1, 28, 28]
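
The shape list printed by the script holds integers for fixed dimensions and strings (the dim_param name) for symbolic ones. Since a non-fixed batch dimension is what triggered the layer_top_shape=[[0,1,28,28]] failure noted in the conversion script, a small check like the following could flag such shapes before running cixbuild. This is only a sketch: the helper name dynamic_dims is hypothetical, and treating 0 as "dynamic" is an assumption based on that symptom.

```python
def dynamic_dims(shape):
    """Return the indices of dimensions that are not fixed positive integers."""
    return [
        i for i, d in enumerate(shape)
        if not (isinstance(d, int) and d > 0)
    ]


print(dynamic_dims([1, 1, 28, 28]))         # fully fixed shape
print(dynamic_dims(["unk__0", 1, 28, 28]))  # symbolic batch dimension
```

An empty result means every dimension is pinned, which is the case for the [1, 1, 28, 28] shape reported above.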

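As shown, the input tensor is named input_1, the output is dense_1, and the input shape is [1,1,28,28]. The next step is to write the tensorflow_build.cfg file.

Based on the official model's config file, the config for the mnist model looks like this: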

[Common]
mode = build

[Parser]
model_type = onnx
model_name = tensorflow_build
detection_postprocess =
model_domain = image_classification
input_model = model00.onnx
output = dense_1
output_dir = ./
input_shape = [1, 1, 28, 28]
input = flatten_input

[Optimizer]
output_dir = ./
calibration_data = datasets/mnist_red.npy
calibration_batch_size = 1
dataset = numpydataset
save_statistic_info = True
cast_dtypes_for_lib = True
# global_calibration = adaround[10, 10, 32, 0.01]

[GBuilder]
target = X2_1204MP3
outputs = mnist.cix
tiling = fps
profile = True
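
Because the [Parser] section has to agree with the names inside the onnx file, it can help to cross-check the two before building. The sketch below does that with Python's standard configparser, assuming the cfg file is plain INI syntax that configparser accepts; the helper name check_parser_section is hypothetical, and the tensor names input_1 and dense_1 are the ones printed by the inspection script earlier.

```python
import configparser
import json

# The [Parser] fields relevant to the check, copied from the config above
CFG_TEXT = """
[Parser]
model_type = onnx
input_model = model00.onnx
output = dense_1
input_shape = [1, 1, 28, 28]
input = flatten_input
"""


def check_parser_section(cfg_text, onnx_inputs, onnx_outputs):
    """Compare [Parser] input/output names and shape with the ONNX model's."""
    cfg = configparser.ConfigParser()
    cfg.read_string(cfg_text)
    parser = cfg["Parser"]
    problems = []
    if parser["input"] not in onnx_inputs:
        problems.append(f"input '{parser['input']}' not in {sorted(onnx_inputs)}")
    if parser["output"] not in onnx_outputs:
        problems.append(f"output '{parser['output']}' not in {sorted(onnx_outputs)}")
    shape = json.loads(parser["input_shape"])
    if any(not isinstance(d, int) or d <= 0 for d in shape):
        problems.append(f"input_shape {shape} has a non-fixed dimension")
    return problems


problems = check_parser_section(CFG_TEXT, {"input_1"}, {"dense_1"})
print(problems)
```

For this config the check reports that flatten_input is not an input of the onnx model, which matches the parser warning that appears in the build log.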

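The key item here is calibration_data. When building the tensorflow model, I saved the test input data in npy format, and that file is what this field points to. Note also that input is set to flatten_input, whereas the onnx model's input tensor is input_1; the parser warns about this in the build log further down and uses input_1 instead. The code that saves the data is part of the model-building script, which looks like this: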

import tensorflow as tf
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import time
from os.path import isfile
from os.path import isdir

print("TensorFlow version:", tf.__version__)


def save_grayscale_array_to_image(gray_array, filename="grayscale_image.png"):
    """
    Saves a 2D NumPy array (representing grayscale data) to an image file.

    Args:
        gray_array (numpy.ndarray): A 2D NumPy array where each element
                                     represents the intensity of a pixel (0-255).
        filename (str): The name of the file to save the image to.
                        Common formats are 'png', 'jpg', 'bmp', etc.
    """
    try:
        if gray_array.dtype != np.uint8:
            gray_array = gray_array.astype(np.uint8)

        img = Image.fromarray(gray_array, mode="L")

        img.save(filename)
        print(f"Grayscale image saved successfully as '{filename}'")

    except Exception as e:
        print(f"Error saving grayscale image: {e}")

def plt_predict(x, y, ya):
    # Placeholder plotting helper: creates an empty figure and saves it
    fig = plt.figure()
    ax = fig.add_subplot(1, 1, 1)
    fig.savefig("predict.png", dpi=300)


mnist = tf.keras.datasets.mnist

x = np.linspace(0, 9, 10)

(x_train, y_train), (x_test, y_test) = mnist.load_data()

x_train, x_test = x_train / 255.0, x_test / 255.0

list_x_train = []
for item in x_train:
    a = item.reshape(1, 28, 28)
    list_x_train.append(a)

np_x_train = np.array(list_x_train)

list_x_test = []
for item in x_test:
    a = item.reshape(1, 28, 28)
    list_x_test.append(a)

np_x_test = np.array(list_x_test)

if isdir("model/"):
    print("exist model/ dir")
    model = tf.keras.models.load_model("model/")
    print("load model from model/ dir success")
    print(model.summary())
else:
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Flatten(input_shape=(1, 28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10),
        ]
    )

    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

    tf_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs")
    model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
    model.fit(np_x_train, y_train, epochs=5, callbacks=[tf_callback], verbose=1)
    model.save("model/")
    print("save to model/ dir success")

model.evaluate(np_x_test, y_test, verbose=2)
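# Save the test set to a file (the datasets/ directory must already exist);
# this file is the calibration data referenced in the cfg
np.save("datasets/mnist_red.npy", np_x_test)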

With everything in place, build the cix model file. The build runs as follows:

▸ cixbuild cfg/tensorflow_build.cfg
[I] Build with version 6.1.3119
[I] Parsing model....
[I] [Parser]: Begin to parse onnx model tensorflow_build...
2025-04-26 00:59:37.604835: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No su
ch file or directory; LD_LIBRARY_PATH: /home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/cv2/../../lib64:/home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib/:/home
/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
2025-04-26 00:59:37.604858: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2025-04-26 00:59:38.201630: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or
 directory; LD_LIBRARY_PATH: /home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/cv2/../../lib64:/home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib/:/home/Red/.con
da/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
2025-04-26 00:59:38.201651: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2025-04-26 00:59:38.201661: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (linux): /proc/driver/nvidia/version does not exist
[W] [Parser]: Input name (flatten_input) does not exit in node names or tensor names! Will ignore it!
[W] [Parser]: The output name dense_1 is not a node but a tensor. However, we will use the node sequential/dense_1/MatMul_Gemm__7 as output node.
2025-04-26 00:59:38.375478: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in pe
rformance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
[W] [Parser]: the input node(s) has changed or not set, please check the IR to confirm the input tensors order.
[I] [Parser]: The input tensor(s) is/are: input_1_0
[I] [Parser]: Input flatten_input from cfg is removed!
[I] [Parser]: Output dense_1 from cfg is shown as tensor sequential/dense_1/MatMul_Gemm__7_0 in IR!
[I] [Parser]: 0 error(s), 3 warning(s) generated.
[I] [Parser]: Parser done!
[I] Parse model complete
[I] Simplifying float model.
[I] [IRChecker] Start to check IR: /home/Red/Samba/ai_code_pc/internal_2025_4_26_0_59_37_y1j6w/tensorflow_build.txt
[I] [IRChecker] model_name: tensorflow_build
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1600] loading graph weight: /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w/tensorflow_build.bin size: 0x63628
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[I] Simplify Done.
[I] Simplify float model Done.
[I] Optimizing model....
[I] [OPT] [00:59:38]: [arg_parser] is running.
[I] [OPT] [00:59:38]: tool name: Compass-Optimizer, version: 1.3.3119, use cuda: False, running device: cpu
[I] [OPT] [00:59:38]: [quantization config Info][model name]: tensorflow_build, [quantization method for weight]: per_tensor_symmetric_restricted_range, [quantization method for activation]: per_tensor_symmetr
ic_full_range, [calibation strategy for weight]: extrema, [calibation strategy for activation]: mean, [quantization precision]: activation_bits=8, weight_bits=8, bias_bits=32, lut_items_in_bits=8

[I] [OPT] [00:59:38]: Suggest using "aipuchecker" to validate the IR firstly if you are not sure about its validity.
[I] [OPT] [00:59:38]: IR loaded.
Building graph: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1604.55it/s]
[I] [OPT] [00:59:38]: Begin to load weights.
[I] [OPT] [00:59:38]: Weights loaded.
Deserializing bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4282.09it/s]
[I] [OPT] [00:59:38]: Successfully parsed IR with python API.
[I] [OPT] [00:59:38]: init graph by forwarding one sample filled with zeros
forward_to: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 2720.92it/s]
[I] [OPT] [00:59:38]: [graph_optimize_stage1] is running.
[I] [OPT] [00:59:38]: [statistic] is running.
statistic batch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:09<00:00, 1068.02it/s]
[I] [OPT] [00:59:48]: [graph_optimize_stage2] is running.
[I] [OPT] [00:59:48]: applying calibration strategy based on statistic info
calibration: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 9516.29it/s]
[I] [OPT] [00:59:48]: [quantize] is running.
[I] [OPT] [00:59:48]: These OPs will automatically cast dtypes to adapt to lib's dtypes' spec (may cause model accuracy loss due to corresponding spec's restriction): {'OpType.FullyConnected', 'OpType.Reshape'
, 'OpType.Input'}
quantize each layer: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 2486.62it/s]
[I] [OPT] [00:59:48]: collecting per-layer similarity infomation between float graph and quanted graph by forwarding 1 sample on both of them
[I] [OPT] [00:59:48]: [graph_optimize_stage3] is running.
[I] [OPT] [00:59:48]: [serialize] is running.
[I] [OPT] [00:59:48]: check the final graph by forwarding one sample filled with zeros
forward_to: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5839.62it/s]
[I] [OPT] [00:59:48]: Begin to serialzie IR
Writing IR: 4it [00:00, 10803.10it/s]
Serializing bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 11724.12it/s]
[I] [OPT] [00:59:48]: IR has been saved into /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w
[I] [OPT] [00:59:48]: Compass-Optimizer has done at [serialize] period.
[I] [OPT] [00:59:48]: [Done]cost time: 10s, and [qinfos(scale, zp, dtype)]: out: [[8.502482414245605, 0, INT8]] in: [[255.21998596191406, 0, UINT8]] [output tensors cosine]: [0.9847341069051015][output tensors
 MSE]: [4.2683868408203125]
[I] Optimizing model complete
[I] Simplifying quant model...
[I] [IRChecker] Start to check IR: /home/Red/Samba/ai_code_pc/internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant.txt
[I] [IRChecker] model_name: tensorflow_build
[I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
[I] [graph.cpp :1600] loading graph weight: /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant.bin size: 0x18f28
[I] Start to simplify the graph...
[I] Using fixed-point full optimization, it may take long long time ....
[I] Simplify Done.
[I] Simplify quant model Done.
[I] Building ...
[I] [IRChecker] Start to check IR: /home/Red/Samba/ai_code_pc/internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant_s.txt
[I] [IRChecker] model_name: tensorflow_build
[I] [IRChecker] IRChecker: All IR pass
[I] [tools.cpp : 352] BuildTool version: 6.1.3119. Build for target X2_1204MP3 PID: 1702658
[I] [tools.cpp : 372] using default profile events to profile default
[I] [tools.cpp : 834] global cwd: /tmp/89163cf320c276cdaccbb8563d21615d3e20b41dfb1c06cf6339d430d3296
[I] [graph.cpp :1600] loading graph weight: /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant_s.bin size: 0x18f28
[I] [tiling.cpp:4500] Auto tiling now, please wait ...
[I] [aipu_plugin.cpp: 344] FullyConnected(sequential/dense/MatMul_Gemm__6) uses performance-lib
[I] [aipu_plugin.cpp: 344] FullyConnected(sequential/dense_1/MatMul_Gemm__7) uses performance-lib
[I] [actg.cpp  : 471] new sgnode with actg: 0
[I] [datalayout_schedule2.cpp:1067] Layout loss: 0
[I] [datalayout_schedule2.cpp:1068] Layout scheduling ...
[I] [datalayout_schedule2.cpp:1071] The layout loss for graph tensorflow_build: 0
[I] [datalayout_schedule.cpp:1072] The graph tensorflow_build post optimized score:0
[I] [datalayout_schedule.cpp:1076] layout schedule costs: 0.064558ms
[I] [IRChecker] Start to check IR:
[I] [IRChecker] model_name: cost_model
[I] [IRChecker] IRChecker: All IR pass
[I] [load_balancer.cpp:2384] enable multicore schedule optimization for load balance strategy 0 it may degrade performance on single core targets.
[I] [load_balancer.cpp:1439] ----------------------------------------------
[I] [load_balancer.cpp:1440] Scheduler Optimization Performance Evaluation:
[I] [load_balancer.cpp:1482] level: 0 cycles: 0 utils:
[I] [load_balancer.cpp:1482] level: 1 cycles: 5245 utils: 1
[I] [load_balancer.cpp:1488] total cycles: 5245
[I] [load_balancer.cpp:1489] ----------------------------------------------
[I] [load_balancer.cpp: 142] schedule level: done
[I] [load_balancer.cpp: 145] [level 0]
[I] [load_balancer.cpp:  94] subgraph_input_1
[I] [load_balancer.cpp: 105] -*-[real]input_1
[I] [load_balancer.cpp: 149] [load] 0
[I] [load_balancer.cpp: 145] [level 1]
[I] [load_balancer.cpp:  94] subgraph_subgraph_sequential/flatten/Reshape
[I] [load_balancer.cpp: 105] -*-[real]subgraph_sequential/flatten/Reshape_sg_input_0
[I] [load_balancer.cpp: 105] -*-[real]sequential/flatten/Reshape
[I] [load_balancer.cpp:  94] -*-subgraph_sequential/dense/MatMul_Gemm__6
[I] [load_balancer.cpp: 105] -*--*-[real]sequential/dense/MatMul_Gemm__6
[I] [load_balancer.cpp:  94] -*-subgraph_sequential/dense_1/MatMul_Gemm__7
[I] [load_balancer.cpp: 105] -*--*-[real]sequential/dense_1/MatMul_Gemm__7
[I] [load_balancer.cpp: 149] [load] 5245
[I] [load_balancer.cpp: 152] schedule level: done done
[I] [soms_scheduler.cpp: 186] [EVAL] init time t1: 33 ms
[I] [soms_scheduler.cpp: 192] [EVAL] unsafe check time t2: 7 ms
[I] [soms_scheduler.cpp: 866] not found!
[I] [soms_scheduler.cpp: 891] get max in loop 0
[I] [soms_scheduler.cpp: 205] [EVAL] mem assignment time t3: 18 ms
[I] [builder.cpp:1819] [EVAL] duration time of mem allocation: 67 ms
[I] [builder.cpp:1938] The graph DDR Footprint requirement(estimation) of feature maps:
[I] [builder.cpp:1939]     Read and Write:1.03KB
[I] [builder.cpp:1080] Reduce constants memory size: 12B
[I] [builder.cpp:2411] memory statistics for this graph (tensorflow_build)
[I] [builder.cpp: 585] Total memory     :       0x0005c764 Bytes (369.848KB)
[I] [builder.cpp: 585] Text      section:       0x00004a70 Bytes ( 18.609KB)
[I] [builder.cpp: 585] RO        section:       0x00000500 Bytes (  1.250KB)
[I] [builder.cpp: 585] Desc      section:       0x00000c00 Bytes (  3.000KB)
[I] [builder.cpp: 585] Data      section:       0x00016300 Bytes ( 88.750KB)
[I] [builder.cpp: 585] BSS       section:       0x000004f4 Bytes (  1.238KB)
[I] [builder.cpp: 585] Stack            :       0x00040400 Bytes (257.000KB)
[I] [builder.cpp: 585] Workspace(BSS)   :       0x00000000 Bytes (  0.000KB)
[I] [builder.cpp:2427]
[I] [tools.cpp :1181]  -  compile time: 0.156 s
[I] [tools.cpp :1087] With GM optimization, DDR Footprint stastic(estimation):
[I] [tools.cpp :1094]     Read and Write:89.43KB
[I] [tools.cpp :1137]  -  draw graph time: 0 s
[I] [tools.cpp :1954] remove global cwd: /tmp/89163cf320c276cdaccbb8563d21615d3e20b41dfb1c06cf6339d430d3296
build success.......
Total errors: 0,  warnings: 0
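
The optimizer's summary line in the log reports the quantization parameters as [qinfos(scale, zp, dtype)]: in: [[255.21998596191406, 0, UINT8]] and out: [[8.502482414245605, 0, INT8]]. As a rough illustration of what such parameters mean, the sketch below assumes the convention q = clamp(round(x * scale) - zp). That convention is an assumption on my part, chosen because it maps inputs in [0, 1] onto the UINT8 range with the logged scale; the SDK documentation is the authority on the exact formula. The function names are hypothetical.

```python
def quantize(x, scale, zp, qmin, qmax):
    """Map a float to an integer code, ASSUMING q = round(x * scale) - zp."""
    q = round(x * scale) - zp
    return max(qmin, min(qmax, q))


def dequantize(q, scale, zp):
    """Inverse mapping under the same assumed convention."""
    return (q + zp) / scale


IN_SCALE, IN_ZP = 255.21998596191406, 0   # from the log, UINT8 input
OUT_SCALE, OUT_ZP = 8.502482414245605, 0  # from the log, INT8 output

pixel = 0.5  # a normalized mnist pixel value
code = quantize(pixel, IN_SCALE, IN_ZP, 0, 255)
print(code, dequantize(code, IN_SCALE, IN_ZP))
```

Under this assumption, one INT8 step at the output corresponds to roughly 0.12 in logit units (1 / 8.5), which gives a feel for why a few borderline predictions can flip after quantization.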

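At this point there are two model files, model00.onnx and mnist.cix, plus the dataset mnist_red.npy. Send these three files to the “星睿O6”, and then compare the CPU against the NPU to see which one wins.

In the ai_model_hub repository on the “星睿O6”, create a new directory mnist modeled on the onnx_resnet_v1_50 directory, put the two model files and the dataset file into it, and then write the CPU and NPU inference scripts. The CPU inference script: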

import os
import sys
import argparse
import numpy as np
import onnxruntime as ort

# Define the absolute path to the utils package by going up four directory levels from the current file location
_abs_path = os.path.join(os.getcwd(), "../../../../")
# Append the utils package path to the system path, making it accessible for imports
sys.path.append(_abs_path)
from utils.label.imagenet_classes import id2class
from utils.image_process import imagenet_preprocess_method1
from utils.tools import get_file_list
from time import time


def get_args():
    parser = argparse.ArgumentParser()
    # Argument for the path to the image or directory containing images
    parser.add_argument(
        "--images",
        # default="./test_data/ILSVRC2012_val_00002899.JPEG",
        default="./test_data/",
        help="path to the image file path or dir path.\
            eg. images=./test_data/ILSVRC2012_val_00002899.JPEG or \
                images=./test_data/",
    )
    # Argument for the path to the ONNX model file
    parser.add_argument(
        "--onnx_path",
        default="model00-sim.onnx",
        help="path to the model file",
    )
    parser.add_argument(
        "--benchmark",
        action="store_true",
        help="benchmark on ILSVRC2012 val dataset.",
    )
    parser.add_argument(
        "--sel_images",
        type=int,
        default=1000,
        help="number of images selected for the benchmark",
    )
    args = parser.parse_args()
    return args 
predict=[]

def main():
    args = get_args()

    waste_time=[]
    # Load the ONNX model & Get the input and output names for the model
    session = ort.InferenceSession(args.onnx_path)

    if args.benchmark:
        from utils.evaluate.imagenet_metric import ImageNet_Metric

        image_metric = ImageNet_Metric(
            model=session, model_type="onnx", sel_imgs=args.sel_images
        )
        image_metric.run(input_size=224, data_type="np")

    else:
        input_name = session.get_inputs()[0].name
        output_name = session.get_outputs()[0].name
        print(input_name, output_name)
        images_list = np.load('mnist_red.npy').astype(np.float32)
        print(type(images_list), images_list.shape, images_list.dtype)
        for image_path in images_list:
            input = image_path.reshape(1,1,28,28)
            tick=time()
            outputs = session.run([output_name], {input_name: input})[0]
            waste_time.append(time()-tick)
            predict.append(np.argmax(outputs))

    waste_time_np=np.array(waste_time)
    print(waste_time_np.mean())
    np.save("cpu.npy", np.array(predict))

if __name__ == "__main__":
    main()

NPU inference script:

import os
import sys
import numpy as np
import argparse
from time import time

# Define the absolute path to the utils package by going up four directory levels from the current file location
_abs_path = os.path.join(os.getcwd(), "../../../../")
# Append the utils package path to the system path, making it accessible for imports
sys.path.append(_abs_path)
from utils.label.imagenet_classes import id2class
from utils.image_process import imagenet_preprocess_method1
from utils.tools import get_file_list
from utils.NOE_Engine import EngineInfer


def get_args():
    parser = argparse.ArgumentParser()
    # Argument for the path to the image or directory containing images
    parser.add_argument(
        "--images",
        # default="./test_data/ILSVRC2012_val_00002899.JPEG",
        default="./test_data/",
        help="path to the image file path or dir path.\
            eg. images=./test_data/ILSVRC2012_val_00002899.JPEG or \
                images=./test_data/",
    )
    # Argument for the path to the cix binary model file
    parser.add_argument(
        "--model_path",
        default="./mnist.cix",
        help="path to the model file",
    )
    parser.add_argument(
        "--benchmark",
        action="store_true",
        help="benchmark on ILSVRC2012 val dataset.",
    )
    parser.add_argument(
        "--sel_images",
        type=int,
        default=1000,
        help="number of images selected for the benchmark",
    )
    args = parser.parse_args()
    return args  

predict=[]
def main():
    args = get_args()
    model = EngineInfer(args.model_path)
    images_list = np.load('mnist_red.npy').astype(np.float32)

    waste_time=[]
    if args.benchmark:
        from utils.evaluate.imagenet_metric import ImageNet_Metric

        image_metric = ImageNet_Metric(
            model=model, model_type="cix", sel_imgs=args.sel_images
        )
        image_metric.run(input_size=224, data_type="np")
    else:
        for image_path in images_list:
            input = image_path.reshape(1,1,28,28)

            tick=time()
            outputs = model.forward(input)[0]
            waste_time.append(time()-tick)
            predict.append(np.argmax(outputs))

        waste_time_np=np.array(waste_time)
        print(waste_time_np.mean())
        np.save("npu.npy", np.array(predict))
        model.clean()

if __name__ == "__main__":
    main()

Now let's look at the average time per inference on the CPU and on the NPU:

▸ python3 inference_onnx.py
[UMD ERR] /home/alezhe02/project/Compass_Runtime_Midware_release/aipulib_build/umd/src/device/aipu/aipu.cpp:55:aipu_ll_status_t aipudrv::Aipu::init(): query capability [fail]
[ERROR][init:28][UMD].AIPU UMD API input argument(s) contain NULL pointer.
input_1 dense_1
<class 'numpy.ndarray'> (10000, 1, 28, 28) float32
3.538997173309326e-05
▸ python3 inference_npu.py
npu: noe_init_context success
npu: noe_load_graph success
Input tensor count is 1.
Output tensor count is 1.
npu: noe_create_job success
0.0003013821840286255
npu: noe_clean_job success
npu: noe_unload_graph success
npu: noe_deinit_context success
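
Both scripts average every call, including the very first ones. A possible refinement, sketched below, is to discard a few warm-up calls and report the median next to the mean so that outliers are visible. The helper name benchmark and the warm-up count are hypothetical, and the workload here is a stand-in; on the board the callable would wrap session.run(...) or model.forward(...).

```python
from statistics import mean, median
from time import perf_counter


def benchmark(fn, inputs, warmup=10):
    """Time fn(x) for each x in inputs, discarding the first `warmup` calls."""
    timings = []
    for i, x in enumerate(inputs):
        start = perf_counter()
        fn(x)
        elapsed = perf_counter() - start
        if i >= warmup:
            timings.append(elapsed)
    return {"runs": len(timings), "mean": mean(timings), "median": median(timings)}


# Stand-in workload: 110 calls, of which the first 10 are treated as warm-up
stats = benchmark(lambda n: sum(range(n)), [1000] * 110, warmup=10)
print(stats)
```

perf_counter is used instead of time() because it is a monotonic, high-resolution clock, which matters when a single inference takes only tens of microseconds.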

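As shown, the NPU time (0.0003013821840286255 s) is higher than the CPU time (3.538997173309326e-05 s), by a factor of about 8.5. Each run also writes out cpu.npy or npu.npy, the predictions for the same dataset. Diffing the two with the imhex tool showed 11 differing predictions out of 10000 test samples, a deviation of 0.11%, which is basically negligible. The open question is why the NPU takes longer than the CPU on this mnist model, which is a bit odd.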
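
The prediction comparison can also be done in a few lines of Python rather than with a hex diff. The sketch below counts differing positions between two prediction sequences; the helper name count_mismatches is hypothetical and the sample values are made up. With numpy installed, the real inputs would be np.load("cpu.npy") and np.load("npu.npy").

```python
def count_mismatches(pred_a, pred_b):
    """Count positions where two equally long prediction sequences differ."""
    if len(pred_a) != len(pred_b):
        raise ValueError("prediction sequences must have the same length")
    return sum(1 for a, b in zip(pred_a, pred_b) if int(a) != int(b))


# Tiny made-up example, not the real predictions
cpu_pred = [7, 2, 1, 0, 4]
npu_pred = [7, 2, 1, 0, 9]
diff = count_mismatches(cpu_pred, npu_pred)
print(f"{diff} of {len(cpu_pred)} differ ({100 * diff / len(cpu_pred):.2f}%)")
```

This measures agreement between the two backends, not accuracy against the labels; checking accuracy would require comparing each prediction file with y_test.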

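Summary

Here are a few small lessons from deploying a tensorflow model to the “星睿O6”:

Keeping several python virtual environments in conda works really well;
When converting a tensorflow model to onnx, leaving the input tensor's shape unspecified will very likely produce an error like the layer_top_shape=[[0,1,28,28]] failure described earlier. Pinning every dimension of the input tensor's shape fixes it.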
