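The previous article covered setting up and testing the TensorFlow environment. This one describes how to deploy a model to the NPU and runs a simple performance comparison. It has two parts:
- Walking through the deployment flow with the official onnx_resnet_v1_50 model, and comparing NPU and CPU performance
- Deploying my own mnist model to the “星睿O6” NPU and comparing it with the CPU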
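Official model
Through the early-bird program, download the NeuralOne AI SDK and install CixBuilder-6.1.3119.3-py3-none-linux_x86_64.whl on the PC.
Here I used conda on my desktop machine to create a Python 3.8 virtual environment:
conda create --name radxa_red python=3.8
conda activate radxa_red
Install the dependencies inside the virtual environment: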
pip3 install -r requirements.txt
Check the cixbuild version to confirm that cixbuild runs correctly:
▸ cixbuild -v /home/red/.conda/envs/radxa_red/bin/cixbuild version: 6.1.3119
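Once cixbuild runs correctly, clone the ai_model_hub repository and enter the directory models/ComputeVision/Image_Classification/onnx_resnet_v1_50. Following the readme, download the resnet50-v1-12.onnx model from https://github.com/onnx/model... into the model directory (which has to be created first).
- Install onnxsim and simplify the model: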
onnxsim model/resnet50-v1-12.onnx model/resnet50-v1-12-sim.onnx
- Build the cix model:
cixbuild cfg/onnx_resnet_v1_50build.cfg
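The config file references calibration data, and the datasets directory already contained calib_data.npy, so I ran the build as is. It failed with ValueError: Cannot load file containing pickled data when allow_pickle=False. As far as I can tell, the cause is that the npy file shipped with the repository cannot be loaded without pickle support. Using the code from the readme, I stored the images under test_data as a NumPy binary file. Running that code complained that imageio was missing, so I installed it with pip3 install imageio.
After that, the rebuild went through and produced the resnet_v1_50.cix model file.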
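The ValueError above is what np.load raises when it cannot read a file as a plain array. Below is a minimal sketch of how a calibration file could be written and sanity-checked before running cixbuild. The helper names, the random stand-in data, and the (N, 3, 224, 224) layout are my own assumptions for illustration, not code from the readme:

```python
import os
import tempfile

import numpy as np


def save_calibration(samples, path):
    """Stack preprocessed samples into one float32 array and save it as .npy."""
    data = np.stack([np.asarray(s, dtype=np.float32) for s in samples])
    np.save(path, data)  # a plain numeric array needs no pickle support
    return data.shape


def loads_without_pickle(path):
    """Return True if np.load can read the file with allow_pickle=False."""
    try:
        np.load(path, allow_pickle=False)
        return True
    except ValueError:
        return False


# Stand-in data: four random "images" in an assumed NCHW layout.
rng = np.random.default_rng(0)
fake_images = [rng.random((3, 224, 224)) for _ in range(4)]

workdir = tempfile.mkdtemp()
good_path = os.path.join(workdir, "calib_data.npy")
saved_shape = save_calibration(fake_images, good_path)

# A file that is not a real .npy array triggers the same kind of ValueError.
bad_path = os.path.join(workdir, "not_an_array.npy")
with open(bad_path, "w") as f:
    f.write("this is not a NumPy array\n")

print(saved_shape, loads_without_pickle(good_path), loads_without_pickle(bad_path))
```

Running a check like loads_without_pickle on datasets/calib_data.npy before the build would show right away whether the file needs to be regenerated.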
- On the PC, test the original ONNX model and the cix model
Original ONNX model test: average time 0.013521631558736166 seconds
image path : test_data/ILSVRC2012_val_00002899.JPEG
rock python, rock snake, Python sebae
image path : test_data/ILSVRC2012_val_00004704.JPEG
plunger, plumber's helper
image path : test_data/ILSVRC2012_val_00021564.JPEG
coucal
image path : test_data/ILSVRC2012_val_00024154.JPEG
Ibizan hound, Ibizan Podenco
image path : test_data/ILSVRC2012_val_00037133.JPEG
ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus
image path : test_data/ILSVRC2012_val_00045790.JPEG
Yorkshire terrier
[0.013648509979248047, 0.012284517288208008, 0.015871047973632812, 0.011651992797851562, 0.01418757438659668, 0.013486146926879883]
0.013521631558736166
On the O6, test the original ONNX model on the CPU and the cix model on the NPU
CPU running the original ONNX model: average time per inference 0.14705955982208252 seconds
image path : test_data/ILSVRC2012_val_00004704.JPEG
plunger, plumber's helper
image path : test_data/ILSVRC2012_val_00021564.JPEG
coucal
image path : test_data/ILSVRC2012_val_00024154.JPEG
Ibizan hound, Ibizan Podenco
image path : test_data/ILSVRC2012_val_00037133.JPEG
ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus
image path : test_data/ILSVRC2012_val_00002899.JPEG
rock python, rock snake, Python sebae
image path : test_data/ILSVRC2012_val_00045790.JPEG
Yorkshire terrier
[0.16982293128967285, 0.14194154739379883, 0.1459333896636963, 0.14214777946472168, 0.14238333702087402, 0.14012837409973145]
0.14705955982208252
NPU running the cix model: average time per inference 0.0043375492095947266 seconds
▸ python3 inference_npu.py --images test_data --model_path resnet_v1_50.cix
npu: noe_init_context success
npu: noe_load_graph success
Input tensor count is 1.
Output tensor count is 1.
npu: noe_create_job success
image path : test_data/ILSVRC2012_val_00004704.JPEG
plunger, plumber's helper
image path : test_data/ILSVRC2012_val_00021564.JPEG
coucal
image path : test_data/ILSVRC2012_val_00024154.JPEG
Ibizan hound, Ibizan Podenco
image path : test_data/ILSVRC2012_val_00037133.JPEG
ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus
image path : test_data/ILSVRC2012_val_00002899.JPEG
rock python, rock snake, Python sebae
image path : test_data/ILSVRC2012_val_00045790.JPEG
Yorkshire terrier
[0.00441288948059082, 0.004355669021606445, 0.004000186920166016, 0.0040247440338134766, 0.004605293273925781, 0.00462651252746582]
0.0043375492095947266
npu: noe_clean_job success
npu: noe_unload_graph success
npu: noe_deinit_context success
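To put the three runs side by side, the short calculation below turns the per-image timings printed above into speedup factors. The numbers are copied from the logs. The variable and function names are mine, and with only six images per run the ratios are rough indications rather than a benchmark:

```python
from statistics import mean

# Per-image timings in seconds, copied from the three runs above.
pc_onnx = [0.013648509979248047, 0.012284517288208008, 0.015871047973632812,
           0.011651992797851562, 0.01418757438659668, 0.013486146926879883]
o6_cpu_onnx = [0.16982293128967285, 0.14194154739379883, 0.1459333896636963,
               0.14214777946472168, 0.14238333702087402, 0.14012837409973145]
o6_npu_cix = [0.00441288948059082, 0.004355669021606445, 0.004000186920166016,
              0.0040247440338134766, 0.004605293273925781, 0.00462651252746582]


def speedup(baseline, candidate):
    """How many times faster the candidate is than the baseline, by mean time."""
    return mean(baseline) / mean(candidate)


npu_vs_o6_cpu = speedup(o6_cpu_onnx, o6_npu_cix)
npu_vs_pc = speedup(pc_onnx, o6_npu_cix)

print(f"O6 NPU vs O6 CPU: {npu_vs_o6_cpu:.1f}x")
print(f"O6 NPU vs PC CPU: {npu_vs_pc:.1f}x")
```

For resnet50, the NPU comes out roughly 34 times faster than the O6 CPU and roughly 3 times faster than the desktop CPU.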
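My own mnist model
With the official model working, I had a basic understanding of the whole deployment flow. So how do you adapt your own model? This part records how I worked it out.
- According to the official document "NPU SDK User Guide", TensorFlow 1.0 to 2.6 models are supported, yet the CixBuilder-6.1.3119.3-py3-none-linux_x86_64 build tool depends on tensorflow 2.7.0. I tried to compile the TensorFlow model directly and kept getting errors. Searching ai_model_hub, I found no example that builds a TensorFlow model directly either. As a fallback, I used tf2onnx to convert the TensorFlow model to ONNX and built from that, which turned out to work. Below I record how I deployed my TensorFlow mnist model to run on the “星睿O6” NPU.
- First, the tf2onnx repository states that it supports TensorFlow 2.9-2.15 and Python 3.7-3.10, so I used conda to create another Python 3.8 virtual environment. (PS: that makes two Python virtual environments so far. The cixbuild one has Python 3.8 and tensorflow 2.7 and is used to build cix models; the other has Python 3.8 and tensorflow 2.9 and is used to convert TensorFlow models to ONNX.) In the new conda environment, the following code converts the mnist model built earlier: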
import tensorflow as tf
import tf2onnx

# Load your TensorFlow Keras model or SavedModel
model = tf.keras.models.load_model('model')  # Or tf.saved_model.load('your_saved_model')

# Define the input signature
# Pin the input shape explicitly. In my tests, shape=(None,1,28,28) produces
# layer_top_shape=[[0,1,28,28]] after conversion to ONNX, which then fails,
# so the shape is fixed here.
input_signature = [tf.TensorSpec(shape=(1, 1, 28, 28), dtype=tf.float32, name='input_1')]

# Convert the model to ONNX
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=input_signature, opset=15)

# Save the ONNX model
with open("model00.onnx", "wb") as f:
    f.write(onnx_model.SerializeToString())
The code above converts the TensorFlow SavedModel in the model directory into model00.onnx. After the conversion, check the input and output nodes and their names with the following code:
#!/home/Red/.conda/envs/radxa_red/bin/python3
import onnx
import sys
from os.path import isfile

# Load the ONNX model
model_path = "your_model.onnx"  # Replace with the actual path to your ONNX model
if len(sys.argv) > 1:
    print(sys.argv[1])
    model_path = sys.argv[1]

if isfile(model_path):
    onnx_model = onnx.load(model_path)
else:
    print("no such file ", model_path)
    exit(-1)

# Get the input information
inputs = onnx_model.graph.input
outputs = onnx_model.graph.output

for output in outputs:
    print(f"Output Name: {output.name}")

print("Input shapes of the ONNX model:")
for input in inputs:
    print(f" Name: {input.name}")
    if input.type.HasField("tensor_type"):
        shape = [d.dim_value if d.HasField("dim_value") else d.dim_param
                 for d in input.type.tensor_type.shape.dim]
        print(f" Shape: {shape}")
    else:
        print(" Shape: Unknown")
Running it gives the following result:
▸ ./get_onnx_inputshape.py ~/Samba/ai_code_pc/model00.onnx
/home/Red/Samba/ai_code_pc/model00.onnx
Output Name: dense_1
Input shapes of the ONNX model:
 Name: input_1
 Shape: [1, 1, 28, 28]
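As shown, the input tensor is named input_1, the output is dense_1, and the input shape is [1,1,28,28]. The next step is to write the tensorflow_build.cfg file.
Modeled on the official model's config file, the config for the mnist model looks like this: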
[Common]
mode = build

[Parser]
model_type = onnx
model_name = tensorflow_build
detection_postprocess =
model_domain = image_classification
input_model = model00.onnx
output = dense_1
output_dir = ./
input_shape = [1, 1, 28, 28]
input = flatten_input

[Optimizer]
output_dir = ./
calibration_data = datasets/mnist_red.npy
calibration_batch_size = 1
dataset = numpydataset
save_statistic_info = True
cast_dtypes_for_lib = True
# global_calibration = adaround[10, 10, 32, 0.01]

[GBuilder]
target = X2_1204MP3
outputs = mnist.cix
tiling = fps
profile = True
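Before building, it can help to check the tensor names in the [Parser] section against the names reported for the ONNX graph. The sketch below does that with Python's standard configparser. The function name and the trimmed-down cfg text are my own, and the graph names are passed in as plain lists (as printed by the inspection script) so the snippet runs without onnx installed:

```python
import configparser

# A trimmed copy of the [Parser] section from the config above.
CFG_TEXT = """
[Parser]
model_type = onnx
input_model = model00.onnx
input = flatten_input
output = dense_1
input_shape = [1, 1, 28, 28]
"""


def check_cfg_names(cfg_text, graph_inputs, graph_outputs):
    """Return a list of cfg input/output names missing from the ONNX graph."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["Parser"]

    problems = []
    for key, known in (("input", graph_inputs), ("output", graph_outputs)):
        # The cfg value may hold several comma-separated names.
        for name in section.get(key, "").split(","):
            name = name.strip()
            if name and name not in known:
                problems.append(f"{key} '{name}' not found in the ONNX graph")
    return problems


# Names reported by the inspection script for model00.onnx.
problems = check_cfg_names(CFG_TEXT, ["input_1"], ["dense_1"])
for line in problems:
    print(line)
```

With the values used here, the check flags flatten_input, which does not match the ONNX input name input_1.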
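The key item here is calibration_data. When building the TensorFlow model, I saved the test input data in npy format, and that file is what the dataset field here points to. The code for this is part of the model-building script, which looks like this: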
import tensorflow as tf
from PIL import Image
import numpy as np
import matplotlib.pyplot as plt
import time
from os.path import isfile
from os.path import isdir

print("TensorFlow version:", tf.__version__)


def save_grayscale_array_to_image(gray_array, filename="grayscale_image.png"):
    """
    Saves a 2D NumPy array (representing grayscale data) to an image file.

    Args:
        gray_array (numpy.ndarray): A 2D NumPy array where each element
            represents the intensity of a pixel (0-255).
        filename (str): The name of the file to save the image to.
            Common formats are 'png', 'jpg', 'bmp', etc.
    """
    try:
        if gray_array.dtype != np.uint8:
            gray_array = gray_array.astype(np.uint8)
        img = Image.fromarray(gray_array, mode="L")
        img.save(filename)
        print(f"Grayscale image saved successfully as '{filename}'")
    except Exception as e:
        print(f"Error saving grayscale image: {e}")


def plt_predict(x, y, ya):
    fig = plt.figure()  # figure to draw on
    ax = fig.add_subplot(1, 1, 1)
    plt.savefig("predict.png", dpi=300)


mnist = tf.keras.datasets.mnist
x = np.linspace(0, 9, 10)

(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

# Reshape every image to (1, 28, 28) so the arrays become (N, 1, 28, 28)
list_x_train = []
for item in x_train:
    a = item.reshape(1, 28, 28)
    list_x_train.append(a)
np_x_train = np.array(list_x_train)

list_x_test = []
for item in x_test:
    a = item.reshape(1, 28, 28)
    list_x_test.append(a)
np_x_test = np.array(list_x_test)

if isdir("model/"):
    print("exist model/ dir")
    model = tf.keras.models.load_model("model/")
    print("load model from model/ dir success")
    print(model.summary())
else:
    model = tf.keras.models.Sequential(
        [
            tf.keras.layers.Flatten(input_shape=(1, 28, 28)),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dropout(0.2),
            tf.keras.layers.Dense(10),
        ]
    )
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    tf_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs")
    model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
    model.fit(np_x_train, y_train, epochs=5, callbacks=[tf_callback], verbose=1)
    model.save("model/")
    print("save to model/ dir success")

model.evaluate(np_x_test, y_test, verbose=2)

# Save the test dataset to a file
np.save("datasets/mnist_red.npy", np_x_test)
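The script above saves all 10000 test images as calibration data, and the build log shows the optimizer's statistic pass iterating over all of them. Below is a minimal sketch of two optional tweaks: reshaping without a Python loop, and saving only a subset as float32. The function names and the subset size are my own choices. Whether a smaller subset still calibrates well, and whether cixbuild cares about float32 versus float64, are things I have not verified:

```python
import numpy as np


def to_nchw(images):
    """Reshape (N, 28, 28) images to (N, 1, 28, 28) without a Python loop."""
    return images.reshape(-1, 1, 28, 28)


def pick_calibration_subset(data, count, seed=0):
    """Pick `count` distinct samples (in original order) and cast to float32."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(data), size=count, replace=False)
    return data[np.sort(idx)].astype(np.float32)


# Stand-in for x_test / 255.0: 100 random 28x28 images with values in [0, 1).
fake_test = np.random.default_rng(1).random((100, 28, 28))

# Same result as the per-image loop in the script above.
looped = np.array([img.reshape(1, 28, 28) for img in fake_test])
vectorized = to_nchw(fake_test)

subset = pick_calibration_subset(vectorized, count=16)
print(vectorized.shape, subset.shape, subset.dtype)
```

In the real script, the subset would then be written with np.save("datasets/mnist_red.npy", subset).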
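With everything ready, build the cix model file. Note that the cfg sets input = flatten_input while the ONNX input is named input_1; in the log the parser warns about this, ignores the name, and uses input_1_0 instead. The build runs as follows: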
▸ cixbuild cfg/tensorflow_build.cfg [I] Build with version 6.1.3119 [I] Parsing model.... [I] [Parser]: Begin to parse onnx model tensorflow_build... 2025-04-26 00:59:37.604835: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No su ch file or directory; LD_LIBRARY_PATH: /home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/cv2/../../lib64:/home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib/:/home /Red/.conda/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib 2025-04-26 00:59:37.604858: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine. 2025-04-26 00:59:38.201630: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/cv2/../../lib64:/home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib/:/home/Red/.con da/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib 2025-04-26 00:59:38.201651: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303) 2025-04-26 00:59:38.201661: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (linux): /proc/driver/nvidia/version does not exist [W] [Parser]: Input name (flatten_input) does not exit in node names or tensor names! Will ignore it! [W] [Parser]: The output name dense_1 is not a node but a tensor. However, we will use the node sequential/dense_1/MatMul_Gemm__7 as output node. 
2025-04-26 00:59:38.375478: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in pe rformance-critical operations: AVX2 FMA To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags. [W] [Parser]: the input node(s) has changed or not set, please check the IR to confirm the input tensors order. [I] [Parser]: The input tensor(s) is/are: input_1_0 [I] [Parser]: Input flatten_input from cfg is removed! [I] [Parser]: Output dense_1 from cfg is shown as tensor sequential/dense_1/MatMul_Gemm__7_0 in IR! [I] [Parser]: 0 error(s), 3 warning(s) generated. [I] [Parser]: Parser done! [I] Parse model complete [I] Simplifying float model. [I] [IRChecker] Start to check IR: /home/Red/Samba/ai_code_pc/internal_2025_4_26_0_59_37_y1j6w/tensorflow_build.txt [I] [IRChecker] model_name: tensorflow_build [I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled) [I] [graph.cpp :1600] loading graph weight: /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w/tensorflow_build.bin size: 0x63628 [I] Start to simplify the graph... [I] Using fixed-point full optimization, it may take long long time .... [I] Simplify Done. [I] Simplify float model Done. [I] Optimizing model.... [I] [OPT] [00:59:38]: [arg_parser] is running. 
[I] [OPT] [00:59:38]: tool name: Compass-Optimizer, version: 1.3.3119, use cuda: False, running device: cpu [I] [OPT] [00:59:38]: [quantization config Info][model name]: tensorflow_build, [quantization method for weight]: per_tensor_symmetric_restricted_range, [quantization method for activation]: per_tensor_symmetr ic_full_range, [calibation strategy for weight]: extrema, [calibation strategy for activation]: mean, [quantization precision]: activation_bits=8, weight_bits=8, bias_bits=32, lut_items_in_bits=8 [I] [OPT] [00:59:38]: Suggest using "aipuchecker" to validate the IR firstly if you are not sure about its validity. [I] [OPT] [00:59:38]: IR loaded. Building graph: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1604.55it/s] [I] [OPT] [00:59:38]: Begin to load weights. [I] [OPT] [00:59:38]: Weights loaded. Deserializing bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4282.09it/s] [I] [OPT] [00:59:38]: Successfully parsed IR with python API. [I] [OPT] [00:59:38]: init graph by forwarding one sample filled with zeros forward_to: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 2720.92it/s] [I] [OPT] [00:59:38]: [graph_optimize_stage1] is running. [I] [OPT] [00:59:38]: [statistic] is running. statistic batch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:09<00:00, 1068.02it/s] [I] [OPT] [00:59:48]: [graph_optimize_stage2] is running. 
[I] [OPT] [00:59:48]: applying calibration strategy based on statistic info calibration: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 9516.29it/s] [I] [OPT] [00:59:48]: [quantize] is running. [I] [OPT] [00:59:48]: These OPs will automatically cast dtypes to adapt to lib's dtypes' spec (may cause model accuracy loss due to corresponding spec's restriction): {'OpType.FullyConnected', 'OpType.Reshape' , 'OpType.Input'} quantize each layer: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 2486.62it/s] [I] [OPT] [00:59:48]: collecting per-layer similarity infomation between float graph and quanted graph by forwarding 1 sample on both of them [I] [OPT] [00:59:48]: [graph_optimize_stage3] is running. [I] [OPT] [00:59:48]: [serialize] is running. [I] [OPT] [00:59:48]: check the final graph by forwarding one sample filled with zeros forward_to: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5839.62it/s] [I] [OPT] [00:59:48]: Begin to serialzie IR Writing IR: 4it [00:00, 10803.10it/s] Serializing bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 11724.12it/s] [I] [OPT] [00:59:48]: IR has been saved into /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w [I] [OPT] [00:59:48]: Compass-Optimizer has done at [serialize] period. 
[I] [OPT] [00:59:48]: [Done]cost time: 10s, and [qinfos(scale, zp, dtype)]: out: [[8.502482414245605, 0, INT8]] in: [[255.21998596191406, 0, UINT8]] [output tensors cosine]: [0.9847341069051015][output tensors MSE]: [4.2683868408203125] [I] Optimizing model complete [I] Simplifying quant model... [I] [IRChecker] Start to check IR: /home/Red/Samba/ai_code_pc/internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant.txt [I] [IRChecker] model_name: tensorflow_build [I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled) [I] [graph.cpp :1600] loading graph weight: /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant.bin size: 0x18f28 [I] Start to simplify the graph... [I] Using fixed-point full optimization, it may take long long time .... [I] Simplify Done. [I] Simplify quant model Done. [I] Building ... [I] [IRChecker] Start to check IR: /home/Red/Samba/ai_code_pc/internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant_s.txt [I] [IRChecker] model_name: tensorflow_build [I] [IRChecker] IRChecker: All IR pass [I] [tools.cpp : 352] BuildTool version: 6.1.3119. Build for target X2_1204MP3 PID: 1702658 [I] [tools.cpp : 372] using default profile events to profile default [I] [tools.cpp : 834] global cwd: /tmp/89163cf320c276cdaccbb8563d21615d3e20b41dfb1c06cf6339d430d3296 [I] [graph.cpp :1600] loading graph weight: /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant_s.bin size: 0x18f28 [I] [tiling.cpp:4500] Auto tiling now, please wait ... [I] [aipu_plugin.cpp: 344] FullyConnected(sequential/dense/MatMul_Gemm__6) uses performance-lib [I] [aipu_plugin.cpp: 344] FullyConnected(sequential/dense_1/MatMul_Gemm__7) uses performance-lib [I] [actg.cpp : 471] new sgnode with actg: 0 [I] [datalayout_schedule2.cpp:1067] Layout loss: 0 [I] [datalayout_schedule2.cpp:1068] Layout scheduling ... 
[I] [datalayout_schedule2.cpp:1071] The layout loss for graph tensorflow_build: 0 [I] [datalayout_schedule.cpp:1072] The graph tensorflow_build post optimized score:0 [I] [datalayout_schedule.cpp:1076] layout schedule costs: 0.064558ms [I] [IRChecker] Start to check IR: [I] [IRChecker] model_name: cost_model [I] [IRChecker] IRChecker: All IR pass [I] [load_balancer.cpp:2384] enable multicore schedule optimization for load balance strategy 0 it may degrade performance on single core targets. [I] [load_balancer.cpp:1439] ---------------------------------------------- [I] [load_balancer.cpp:1440] Scheduler Optimization Performance Evaluation: [I] [load_balancer.cpp:1482] level: 0 cycles: 0 utils: [I] [load_balancer.cpp:1482] level: 1 cycles: 5245 utils: 1 [I] [load_balancer.cpp:1488] total cycles: 5245 [I] [load_balancer.cpp:1489] ---------------------------------------------- [I] [load_balancer.cpp: 142] schedule level: done [I] [load_balancer.cpp: 145] [level 0] [I] [load_balancer.cpp: 94] subgraph_input_1 [I] [load_balancer.cpp: 105] -*-[real]input_1 [I] [load_balancer.cpp: 149] [load] 0 [I] [load_balancer.cpp: 145] [level 1] [I] [load_balancer.cpp: 94] subgraph_subgraph_sequential/flatten/Reshape [I] [load_balancer.cpp: 105] -*-[real]subgraph_sequential/flatten/Reshape_sg_input_0 [I] [load_balancer.cpp: 105] -*-[real]sequential/flatten/Reshape [I] [load_balancer.cpp: 94] -*-subgraph_sequential/dense/MatMul_Gemm__6 [I] [load_balancer.cpp: 105] -*--*-[real]sequential/dense/MatMul_Gemm__6 [I] [load_balancer.cpp: 94] -*-subgraph_sequential/dense_1/MatMul_Gemm__7 [I] [load_balancer.cpp: 105] -*--*-[real]sequential/dense_1/MatMul_Gemm__7 [I] [load_balancer.cpp: 149] [load] 5245 [I] [load_balancer.cpp: 152] schedule level: done done [I] [soms_scheduler.cpp: 186] [EVAL] init time t1: 33 ms [I] [soms_scheduler.cpp: 192] [EVAL] unsafe check time t2: 7 ms [I] [soms_scheduler.cpp: 866] not found! 
[I] [soms_scheduler.cpp: 891] get max in loop 0 [I] [soms_scheduler.cpp: 205] [EVAL] mem assignment time t3: 18 ms [I] [builder.cpp:1819] [EVAL] duration time of mem allocation: 67 ms [I] [builder.cpp:1938] The graph DDR Footprint requirement(estimation) of feature maps: [I] [builder.cpp:1939] Read and Write:1.03KB [I] [builder.cpp:1080] Reduce constants memory size: 12B [I] [builder.cpp:2411] memory statistics for this graph (tensorflow_build) [I] [builder.cpp: 585] Total memory : 0x0005c764 Bytes (369.848KB) [I] [builder.cpp: 585] Text section: 0x00004a70 Bytes ( 18.609KB) [I] [builder.cpp: 585] RO section: 0x00000500 Bytes ( 1.250KB) [I] [builder.cpp: 585] Desc section: 0x00000c00 Bytes ( 3.000KB) [I] [builder.cpp: 585] Data section: 0x00016300 Bytes ( 88.750KB) [I] [builder.cpp: 585] BSS section: 0x000004f4 Bytes ( 1.238KB) [I] [builder.cpp: 585] Stack : 0x00040400 Bytes (257.000KB) [I] [builder.cpp: 585] Workspace(BSS) : 0x00000000 Bytes ( 0.000KB) [I] [builder.cpp:2427] [I] [tools.cpp :1181] - compile time: 0.156 s [I] [tools.cpp :1087] With GM optimization, DDR Footprint stastic(estimation): [I] [tools.cpp :1094] Read and Write:89.43KB [I] [tools.cpp :1137] - draw graph time: 0 s [I] [tools.cpp :1954] remove global cwd: /tmp/89163cf320c276cdaccbb8563d21615d3e20b41dfb1c06cf6339d430d3296 build success....... Total errors: 0, warnings: 0
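The optimizer log reports an input quantization of scale 255.21998596191406, zero point 0, dtype UINT8, and an output cosine similarity of about 0.985 between the float and quantized graphs. The sketch below shows what such numbers could mean. The convention q = round(x * scale) + zero_point is my assumption based on the values in the log (inputs in [0, 1] mapping onto 0..255), not something I confirmed in the SDK documentation, and the function names are mine:

```python
import numpy as np


def quantize(x, scale, zero_point, dtype):
    """Assumed convention: q = round(x * scale) + zero_point, clipped to dtype."""
    info = np.iinfo(dtype)
    q = np.round(x * scale) + zero_point
    return np.clip(q, info.min, info.max).astype(dtype)


def dequantize(q, scale, zero_point):
    """Inverse of quantize under the same assumed convention."""
    return (q.astype(np.float32) - zero_point) / scale


def cosine_similarity(a, b):
    """Cosine of the angle between two tensors, flattened to vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


IN_SCALE = 255.21998596191406  # input scale taken from the build log

# Stand-in for one normalized 28x28 image with values spanning [0, 1].
pixels = np.linspace(0.0, 1.0, 28 * 28, dtype=np.float32)

q = quantize(pixels, IN_SCALE, 0, np.uint8)
restored = dequantize(q, IN_SCALE, 0)
max_error = float(np.max(np.abs(pixels - restored)))
similarity = cosine_similarity(pixels, restored)

print(q.min(), q.max(), max_error, similarity)
```

Under this reading, quantizing the input alone costs very little accuracy, so the 0.985 figure in the log would mostly reflect the 8-bit weights and activations inside the network.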
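At this point there are two model files, model00.onnx and mnist.cix, plus the dataset mnist_red.npy. Send these three files to the “星睿O6”, then compare CPU and NPU and see which one wins.
In the ai_model_hub repository on the “星睿O6”, create a new directory mnist modeled on the onnx_resnet_v1_50 directory, put the two model files and the dataset file into it, and write the CPU and NPU inference scripts. The CPU script: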
import os
import sys
import argparse
import numpy as np
import onnxruntime as ort

# Define the absolute path to the utils package by going up four directory levels from the current file location
_abs_path = os.path.join(os.getcwd(), "../../../../")
# Append the utils package path to the system path, making it accessible for imports
sys.path.append(_abs_path)
from utils.label.imagenet_classes import id2class
from utils.image_process import imagenet_preprocess_method1
from utils.tools import get_file_list
from time import time


def get_args():
    parser = argparse.ArgumentParser()
    # Argument for the path to the image or directory containing images
    parser.add_argument(
        "--images",
        # default="./test_data/ILSVRC2012_val_00002899.JPEG",
        default="./test_data/",
        help="path to the image file path or dir path.\
            eg. images=./test_data/ILSVRC2012_val_00002899.JPEG or \
            images=./test_data/",
    )
    # Argument for the path to the ONNX model file
    parser.add_argument(
        "--onnx_path",
        default="model00-sim.onnx",
        help="path to the model file",
    )
    parser.add_argument(
        "--benchmark",
        default=False,
        help="benchmark on ILSVRC2012 val dataset.",
    )
    parser.add_argument(
        "--sel_images",
        type=int,
        default=1000,
        help="path to the model file",
    )
    args = parser.parse_args()
    return args


predict = []


def main():
    args = get_args()
    waste_time = []
    # Load the ONNX model & Get the input and output names for the model
    session = ort.InferenceSession(args.onnx_path)
    if args.benchmark:
        from utils.evaluate.imagenet_metric import ImageNet_Metric

        image_metric = ImageNet_Metric(
            model=session, model_type="onnx", sel_imgs=args.sel_images
        )
        image_metric.run(input_size=224, data_type="np")
    else:
        input_name = session.get_inputs()[0].name
        output_name = session.get_outputs()[0].name
        print(input_name, output_name)
        images_list = np.load('mnist_red.npy').astype(np.float32)
        print(type(images_list), images_list.shape, images_list.dtype)
        for image_path in images_list:
            input = image_path.reshape(1, 1, 28, 28)
            tick = time()
            outputs = session.run([output_name], {input_name: input})[0]
            waste_time.append(time() - tick)
            predict.append(np.argmax(outputs))
        waste_time_np = np.array(waste_time)
        print(waste_time_np.mean())
        np.save("cpu.npy", np.array(predict))


if __name__ == "__main__":
    main()
The NPU inference script:
import os
import sys
import numpy as np
import argparse
from time import time

# Define the absolute path to the utils package by going up four directory levels from the current file location
_abs_path = os.path.join(os.getcwd(), "../../../../")
# Append the utils package path to the system path, making it accessible for imports
sys.path.append(_abs_path)
from utils.label.imagenet_classes import id2class
from utils.image_process import imagenet_preprocess_method1
from utils.tools import get_file_list
from utils.NOE_Engine import EngineInfer


def get_args():
    parser = argparse.ArgumentParser()
    # Argument for the path to the image or directory containing images
    parser.add_argument(
        "--images",
        # default="./test_data/ILSVRC2012_val_00002899.JPEG",
        default="./test_data/",
        help="path to the image file path or dir path.\
            eg. images=./test_data/ILSVRC2012_val_00002899.JPEG or \
            images=./test_data/",
    )
    # Argument for the path to the cix binary model file
    parser.add_argument(
        "--model_path",
        default="./mnist.cix",
        help="path to the model file",
    )
    parser.add_argument(
        "--benchmark",
        default=False,
        help="benchmark on ILSVRC2012 val dataset.",
    )
    parser.add_argument(
        "--sel_images",
        type=int,
        default=1000,
        help="path to the model file",
    )
    args = parser.parse_args()
    return args


predict = []


def main():
    args = get_args()
    model = EngineInfer(args.model_path)
    images_list = np.load('mnist_red.npy').astype(np.float32)
    waste_time = []
    if args.benchmark:
        from utils.evaluate.imagenet_metric import ImageNet_Metric

        image_metric = ImageNet_Metric(
            model=model, model_type="cix", sel_imgs=args.sel_images
        )
        image_metric.run(input_size=224, data_type="np")
    else:
        for image_path in images_list:
            input = image_path.reshape(1, 1, 28, 28)
            tick = time()
            outputs = model.forward(input)[0]
            waste_time.append(time() - tick)
            predict.append(np.argmax(outputs))
        waste_time_np = np.array(waste_time)
        print(waste_time_np.mean())
        np.save("npu.npy", np.array(predict))
    model.clean()


if __name__ == "__main__":
    main()
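Both scripts time each call with time() and report only the mean. The sketch below is a slightly more defensive timing harness: it runs a few untimed warm-up calls, uses perf_counter, and reports the median and extremes alongside the mean. The benchmark function and the fake workload are hypothetical stand-ins of mine, and I have not checked whether measuring this way changes the CPU/NPU picture:

```python
import statistics
from time import perf_counter


def benchmark(fn, inputs, warmup=10):
    """Time fn(item) for every item, after a few untimed warm-up calls."""
    for item in inputs[:warmup]:
        fn(item)

    timings = []
    for item in inputs:
        start = perf_counter()
        fn(item)
        timings.append(perf_counter() - start)

    return {
        "runs": len(timings),
        "mean": statistics.mean(timings),
        "median": statistics.median(timings),
        "min": min(timings),
        "max": max(timings),
    }


def fake_infer(sample):
    """Stand-in workload; replace with a real inference call on the board."""
    return sum(sample)


samples = [[1.0] * 784 for _ in range(200)]
stats = benchmark(fake_infer, samples)
print(stats)
```

In the scripts above, fn would wrap session.run([output_name], {input_name: x}) for the CPU case and model.forward(x) for the NPU case.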
Now let's look at the average inference time on the CPU and on the NPU:
▸ python3 inference_onnx.py
[UMD ERR] /home/alezhe02/project/Compass_Runtime_Midware_release/aipulib_build/umd/src/device/aipu/aipu.cpp:55:aipu_ll_status_t aipudrv::Aipu::init(): query capability [fail]
[ERROR][init:28][UMD].AIPU UMD API input argument(s) contain NULL pointer.
input_1 dense_1
<class 'numpy.ndarray'> (10000, 1, 28, 28) float32
3.538997173309326e-05
▸ python3 inference_npu.py
npu: noe_init_context success
npu: noe_load_graph success
Input tensor count is 1.
Output tensor count is 1.
npu: noe_create_job success
0.0003013821840286255
npu: noe_clean_job success
npu: noe_unload_graph success
npu: noe_deinit_context success
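As you can see, the NPU time (0.0003013821840286255 seconds per inference) is higher than the CPU time (3.538997173309326e-05 seconds), by a factor of about 8.5. After the runs finish, the two scripts write cpu.npy and npu.npy respectively, which hold the predictions for the same dataset. I diffed them with the imhex tool and found 11 mismatched predictions out of 10000 test samples, a deviation of 0.11%, which is basically negligible. The open question is why the NPU takes longer than the CPU for this mnist model. That is a bit strange.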
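The comparison of cpu.npy and npu.npy can also be done directly in NumPy instead of a hex diff. A minimal sketch, with a helper name of my own and synthetic stand-in arrays that reproduce the 11-in-10000 case described above:

```python
import numpy as np


def compare_predictions(cpu_pred, npu_pred):
    """Count positions where two prediction arrays disagree."""
    cpu_pred = np.asarray(cpu_pred)
    npu_pred = np.asarray(npu_pred)
    if cpu_pred.shape != npu_pred.shape:
        raise ValueError("prediction arrays must have the same shape")

    mismatched = np.flatnonzero(cpu_pred != npu_pred)
    return {
        "total": int(cpu_pred.size),
        "mismatches": int(mismatched.size),
        "rate": mismatched.size / cpu_pred.size,
        "indices": mismatched,
    }


# Synthetic stand-ins: 10000 class ids, with the first 11 changed on the "NPU" side.
cpu = np.arange(10000) % 10
npu = cpu.copy()
npu[:11] = (npu[:11] + 1) % 10

report = compare_predictions(cpu, npu)
print(f"{report['mismatches']} of {report['total']} differ ({report['rate']:.2%})")
```

On the board, the call would be compare_predictions(np.load("cpu.npy"), np.load("npu.npy")), and the returned indices point at the samples worth inspecting.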
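Summary
From my experience deploying a TensorFlow model to the “星睿O6”, here are a few small takeaways:
- Preparing several Python virtual environments with conda makes life much easier;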
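- When converting a TensorFlow model to ONNX, if the input tensor's shape is not fully specified, you will very likely hit an error like the layer_top_shape=[[0,1,28,28]] one noted in the conversion script.
Specifying every dimension of the input tensor's shape explicitly solves the problem.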
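The shape list printed by the ONNX inspection script can be checked for unspecified dimensions automatically. A minimal sketch, with a helper name of my own, that treats anything other than a positive integer (for example 0, or a symbolic name such as the hypothetical "unk__0") as dynamic:

```python
def find_dynamic_dims(shape):
    """Return the indices of dimensions that are not fixed positive integers."""
    dynamic = []
    for index, dim in enumerate(shape):
        is_fixed = isinstance(dim, int) and not isinstance(dim, bool) and dim > 0
        if not is_fixed:
            dynamic.append(index)
    return dynamic


print(find_dynamic_dims([1, 1, 28, 28]))         # fully specified
print(find_dynamic_dims([0, 1, 28, 28]))         # batch dimension left at 0
print(find_dynamic_dims(["unk__0", 1, 28, 28]))  # symbolic batch dimension
```

An empty result means every dimension is pinned, which is the state the converted mnist model needed before it would build.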