Red · 14 hours ago

【“星睿O6” AI PC Development Kit Review】+ Deploying a TensorFlow Model to the NPU

The previous article covered setting up and testing the TensorFlow environment. This article covers how to deploy a model to the NPU, along with some simple performance comparisons, and is split into the following parts:

  • Walk through the official onnx_resnet_v1_50 model deployment flow, with an NPU vs. CPU comparison
  • Deploy my own MNIST model to the “星睿O6” NPU and compare it against the CPU

Official model

  1. Through the early-bird program, download the NeuralOne AI SDK and install CixBuilder-6.1.3119.3-py3-none-linux_x86_64.whl on a PC.

    1. Here I used conda to create a Python 3.8 virtual environment on my desktop machine:

      conda create --name radxa_red python=3.8
      conda activate radxa_red
    2. Install the dependencies inside the virtual environment:

      pip3 install -r requirements.txt 
    3. Check the cixbuild version to confirm that cixbuild runs correctly:

      ▸ cixbuild -v
      /home/red/.conda/envs/radxa_red/bin/cixbuild   version: 6.1.3119
  2. Once the cixbuild tool runs correctly, clone the ai_model_hub repository, go into models/ComputeVision/Image_Classification/onnx_resnet_v1_50 and, following the readme, download the resnet50-v1-12.onnx model from https://github.com/onnx/model... into the model directory (which needs to be created first).

    1. Install onnxsim and simplify the model: onnxsim model/resnet50-v1-12.onnx model/resnet50-v1-12-sim.onnx
    2. Start building the cix model: cixbuild cfg/onnx_resnet_v1_50build.cfg. The datasets directory already ships with the calib_data.npy calibration data referenced by the config file, yet the build failed with ValueError: Cannot load file containing pickled data when allow_pickle=False, because the npy file bundled in the repository is stored in a pickled format that the loader refuses by default. Using the code from the readme, I regenerated the calibration data from the images under test_data and saved it as a numpy binary file (roughly as sketched after the screenshot below); running that code complained about a missing imageio, fixed with pip3 install imageio.
    3. Rebuilding after that went through cleanly and produced the resnet_v1_50.cix model file.

      (screenshot: cixbuild output showing the generated resnet_v1_50.cix)
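
    The readme's conversion script is not reproduced here; roughly, it reads the JPEGs under test_data and stacks them into a single numpy binary that the calibration step can load. A minimal sketch of that idea (my paraphrase, not the exact readme code; the real script also applies the model's ImageNet preprocessing, which is omitted here):

      # Rough sketch (paraphrase of the readme's calibration-data step):
      # read the JPEGs under test_data and save them as one .npy file.
      import glob
      import imageio
      import numpy as np
      from PIL import Image

      samples = []
      for path in sorted(glob.glob("test_data/*.JPEG")):
          img = imageio.imread(path)
          # Force RGB and the ResNet-50 input size; the repo's own code also
          # applies mean/std normalization and channel reordering.
          img = Image.fromarray(img).convert("RGB").resize((224, 224))
          samples.append(np.asarray(img, dtype=np.float32))

      np.save("datasets/calib_data.npy", np.stack(samples))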

  3. Test the original ONNX model and the cix model on the PC

    1. Original ONNX model test: average latency 0.013521631558736166 s

      image path : test_data/ILSVRC2012_val_00002899.JPEG
      rock python, rock snake, Python sebae
      image path : test_data/ILSVRC2012_val_00004704.JPEG
      plunger, plumber's helper
      image path : test_data/ILSVRC2012_val_00021564.JPEG
      coucal
      image path : test_data/ILSVRC2012_val_00024154.JPEG
      Ibizan hound, Ibizan Podenco
      image path : test_data/ILSVRC2012_val_00037133.JPEG
      ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus
      image path : test_data/ILSVRC2012_val_00045790.JPEG
      Yorkshire terrier
      [0.013648509979248047, 0.012284517288208008, 0.015871047973632812, 0.011651992797851562, 0.01418757438659668, 0.013486146926879883]
      0.013521631558736166
  4. On the O6, test the original ONNX model on the CPU and the cix model on the NPU

    1. CPU running the original ONNX model: average per-inference latency 0.14705955982208252 s

      image path : test_data/ILSVRC2012_val_00004704.JPEG
      plunger, plumber's helper
      image path : test_data/ILSVRC2012_val_00021564.JPEG
      coucal
      image path : test_data/ILSVRC2012_val_00024154.JPEG
      Ibizan hound, Ibizan Podenco
      image path : test_data/ILSVRC2012_val_00037133.JPEG
      ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus
      image path : test_data/ILSVRC2012_val_00002899.JPEG
      rock python, rock snake, Python sebae
      image path : test_data/ILSVRC2012_val_00045790.JPEG
      Yorkshire terrier
      [0.16982293128967285, 0.14194154739379883, 0.1459333896636963, 0.14214777946472168, 0.14238333702087402, 0.14012837409973145]
      0.14705955982208252
    2. NPU running the cix model: average per-inference latency 0.0043375492095947266 s, roughly 34× faster than the CPU ONNX run above

      ▸ python3 inference_npu.py --images test_data --model_path resnet_v1_50.cix
      npu: noe_init_context success
      npu: noe_load_graph success
      Input tensor count is 1.
      Output tensor count is 1.
      npu: noe_create_job success
      image path : test_data/ILSVRC2012_val_00004704.JPEG
      plunger, plumber's helper
      image path : test_data/ILSVRC2012_val_00021564.JPEG
      coucal
      image path : test_data/ILSVRC2012_val_00024154.JPEG
      Ibizan hound, Ibizan Podenco
      image path : test_data/ILSVRC2012_val_00037133.JPEG
      ice bear, polar bear, Ursus Maritimus, Thalarctos maritimus
      image path : test_data/ILSVRC2012_val_00002899.JPEG
      rock python, rock snake, Python sebae
      image path : test_data/ILSVRC2012_val_00045790.JPEG
      Yorkshire terrier
      [0.00441288948059082, 0.004355669021606445, 0.004000186920166016, 0.0040247440338134766, 0.004605293273925781, 0.00462651252746582]
      0.0043375492095947266
      npu: noe_clean_job success
      npu: noe_unload_graph success
      npu: noe_deinit_context success

My own MNIST model

With the official model as a foundation, I had a basic grasp of the whole deployment flow. So how do you adapt your own model? This part records my exploration.

  1. According to the official NPU SDK User Guide, TensorFlow 1.0–2.6 models are supported, yet the CixBuilder-6.1.3119.3-py3-none-linux_x86_64 build tool itself depends on TensorFlow 2.7.0. My attempts to build a TensorFlow model directly kept failing, and searching ai_model_hub turned up no example that builds a TensorFlow model directly either. So I tried converting the TensorFlow model to ONNX with tf2onnx and building from that, which later proved to work. Below is a record of how I got my own TensorFlow MNIST model running on the “星睿O6” NPU.
  2. The tf2onnx repository states that TensorFlow 2.9–2.15 and Python 3.7–3.10 are supported, so I used conda to create another Python 3.8 virtual environment. (PS: at this point there are two Python virtual environments: the cixbuild one with Python 3.8 + TensorFlow 2.7, used to build cix models, and one with Python 3.8 + TensorFlow 2.9, used to convert TensorFlow models to ONNX.)
  3. In the newly created conda virtual environment, use the following code to convert the MNIST model built earlier:

    import tensorflow as tf
    import tf2onnx

    # Load your TensorFlow Keras model or SavedModel
    model = tf.keras.models.load_model('model')  # Or tf.saved_model.load('your_saved_model')

    # Define the input signature.
    # Pin the input shape explicitly: in my tests, shape=(None, 1, 28, 28) led to
    # layer_top_shape=[[0,1,28,28]] and an error during the ONNX conversion,
    # so the shape is fixed here.
    input_signature = [tf.TensorSpec(shape=(1, 1, 28, 28), dtype=tf.float32, name='input_1')]

    # Convert the model to ONNX
    onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=input_signature, opset=15)

    # Save the ONNX model
    with open("model00.onnx", "wb") as f:
        f.write(onnx_model.SerializeToString())

    Running the code above converts the TensorFlow SavedModel in the model directory into model00.onnx. After converting, inspect the input and output nodes and their names with the following code:

    #!/home/Red/.conda/envs/radxa_red/bin/python3

    import onnx
    import sys
    from os.path import isfile

    # Load the ONNX model
    model_path = "your_model.onnx"  # Replace with the actual path to your ONNX model

    if len(sys.argv) > 1:
        print(sys.argv[1])
        model_path = sys.argv[1]

    if isfile(model_path):
        onnx_model = onnx.load(model_path)
    else:
        print("no such file ", model_path)
        exit(-1)

    # Get the input information
    inputs = onnx_model.graph.input
    outputs = onnx_model.graph.output

    for output in outputs:
        print(f"Output Name: {output.name}")

    print("Input shapes of the ONNX model:")
    for input in inputs:
        print(f"  Name: {input.name}")
        if input.type.HasField("tensor_type"):
            shape = [d.dim_value if d.HasField("dim_value") else d.dim_param for d in input.type.tensor_type.shape.dim]
            print(f"    Shape: {shape}")
        else:
            print("    Shape: Unknown")

    Running the script gives the following result:

    ▸ ./get_onnx_inputshape.py ~/Samba/ai_code_pc/model00.onnx
    /home/Red/Samba/ai_code_pc/model00.onnx
    Output Name: dense_1
    Input shapes of the ONNX model:
     Name: input_1
     Shape: [1, 1, 28, 28]

    As you can see, the input tensor is named input_1, the output is dense_1, and the input shape is [1, 1, 28, 28]. The next step is to write the tensorflow_build.cfg file.
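
    Before moving on, a quick numerical sanity check that the exported ONNX model matches the Keras model can save debugging later. A minimal sketch (my addition, assuming onnxruntime is installed in the tf2onnx environment, and using the model directory and model00.onnx from above):

    # Sanity check (my addition): compare Keras and ONNX outputs on one input.
    import numpy as np
    import onnxruntime as ort
    import tensorflow as tf

    model = tf.keras.models.load_model("model")
    sess = ort.InferenceSession("model00.onnx")

    x = np.random.rand(1, 1, 28, 28).astype(np.float32)
    keras_out = model(x).numpy()
    onnx_out = sess.run(["dense_1"], {"input_1": x})[0]
    print(np.max(np.abs(keras_out - onnx_out)))  # should be tiny, e.g. ~1e-6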

  4. Referring to the official model's config file, the config file for my MNIST model looks like this:

    [Common]
    mode = build
    
    [Parser]
    model_type = onnx
    model_name = tensorflow_build
    detection_postprocess =
    model_domain = image_classification
    input_model = model00.onnx
    output = dense_1
    output_dir = ./
    input_shape = [1, 1, 28, 28]
    input = flatten_input
    
    [Optimizer]
    output_dir = ./
    calibration_data = datasets/mnist_red.npy
    calibration_batch_size = 1
    dataset = numpydataset
    save_statistic_info = True
    cast_dtypes_for_lib = True
    # global_calibration = adaround[10, 10, 32, 0.01]
    
    [GBuilder]
    target = X2_1204MP3
    outputs = mnist.cix
    tiling = fps
    profile = True

    The key point here is calibration_data: when building the TensorFlow model I saved part of the test input data in npy format, and that file is what the dataset field points to. Note also that input = flatten_input in the [Parser] section does not match the ONNX input name input_1 determined above; as the build log further down shows, the parser warns about this and falls back to input_1 on its own. The code that saves the calibration data sits in the model-building script, which looks like this:

    import tensorflow as tf
    from PIL import Image
    import numpy as np
    import matplotlib.pyplot as plt
    import time
    from os.path import isfile
    from os.path import isdir
    
    print("TensorFlow version:", tf.__version__)
    
    
    def save_grayscale_array_to_image(gray_array, filename="grayscale_image.png"):
        """
        Saves a 2D NumPy array (representing grayscale data) to an image file.
    
        Args:
            gray_array (numpy.ndarray): A 2D NumPy array where each element
                                         represents the intensity of a pixel (0-255).
            filename (str): The name of the file to save the image to.
                            Common formats are 'png', 'jpg', 'bmp', etc.
        """
        try:
            if gray_array.dtype != np.uint8:
                gray_array = gray_array.astype(np.uint8)
    
            img = Image.fromarray(gray_array, mode="L")
    
            img.save(filename)
            print(f"Grayscale image saved successfully as '{filename}'")
    
        except Exception as e:
            print(f"Error saving grayscale image: {e}")
    
    def plt_predict(x, y, ya):
        # Plotting helper from earlier experiments; the original relied on a
        # global `fig`, so the figure is created here to keep it self-contained.
        fig = plt.figure()
        ax = fig.add_subplot(1, 1, 1)
        plt.savefig("predict.png", dpi=300)
    
    
    mnist = tf.keras.datasets.mnist
    
    x = np.linspace(0, 9, 10)
    
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    
    x_train, x_test = x_train / 255.0, x_test / 255.0
    
    list_x_train = []
    for item in x_train:
        a = item.reshape(1, 28, 28)
        list_x_train.append(a)
    
    np_x_train = np.array(list_x_train)
    
    list_x_test = []
    for item in x_test:
        a = item.reshape(1, 28, 28)
        list_x_test.append(a)
    
    np_x_test = np.array(list_x_test)
    
    if isdir("model/"):
        print("exist model/ dir")
        model = tf.keras.models.load_model("model/")
        print("load model from model/ dir success")
        print(model.summary())
    else:
        model = tf.keras.models.Sequential(
            [
                tf.keras.layers.Flatten(input_shape=(1, 28, 28)),
                tf.keras.layers.Dense(128, activation="relu"),
                tf.keras.layers.Dropout(0.2),
                tf.keras.layers.Dense(10),
            ]
        )
    
        loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    
        tf_callback = tf.keras.callbacks.TensorBoard(log_dir="./logs")
        model.compile(optimizer="adam", loss=loss_fn, metrics=["accuracy"])
        model.fit(np_x_train, y_train, epochs=5, callbacks=[tf_callback], verbose=1)
        model.save("model/")
        print("save to model/ dir success")
    
    model.evaluate(np_x_test, y_test, verbose=2)
    # Save the test dataset to a file; it also serves as the calibration data for cixbuild
    np.save("datasets/mnist_red.npy", np_x_test)
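
    Since the [Optimizer] section points at datasets/mnist_red.npy with calibration_batch_size = 1, it is worth a quick check (my addition) that the saved file has one (1, 28, 28) sample per row, matching input_shape in the cfg:

    # Quick check (my addition): one (1, 28, 28) sample per row is expected.
    import numpy as np

    calib = np.load("datasets/mnist_red.npy")
    print(calib.shape, calib.dtype)  # here: (10000, 1, 28, 28)
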
  5. With everything ready, build the cix model file. The build runs as follows:

    ▸ cixbuild cfg/tensorflow_build.cfg
    [I] Build with version 6.1.3119
    [I] Parsing model....
    [I] [Parser]: Begin to parse onnx model tensorflow_build...
    2025-04-26 00:59:37.604835: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcudart.so.11.0'; dlerror: libcudart.so.11.0: cannot open shared object file: No su
    ch file or directory; LD_LIBRARY_PATH: /home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/cv2/../../lib64:/home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib/:/home
    /Red/.conda/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
    2025-04-26 00:59:37.604858: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
    2025-04-26 00:59:38.201630: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or
     directory; LD_LIBRARY_PATH: /home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/cv2/../../lib64:/home/Red/.conda/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib/:/home/Red/.con
    da/envs/radxa_red/lib/python3.8/site-packages/AIPUBuilder/simulator-lib
    2025-04-26 00:59:38.201651: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
    2025-04-26 00:59:38.201661: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (linux): /proc/driver/nvidia/version does not exist
    [W] [Parser]: Input name (flatten_input) does not exit in node names or tensor names! Will ignore it!
    [W] [Parser]: The output name dense_1 is not a node but a tensor. However, we will use the node sequential/dense_1/MatMul_Gemm__7 as output node.
    2025-04-26 00:59:38.375478: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in pe
    rformance-critical operations:  AVX2 FMA
    To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
    [W] [Parser]: the input node(s) has changed or not set, please check the IR to confirm the input tensors order.
    [I] [Parser]: The input tensor(s) is/are: input_1_0
    [I] [Parser]: Input flatten_input from cfg is removed!
    [I] [Parser]: Output dense_1 from cfg is shown as tensor sequential/dense_1/MatMul_Gemm__7_0 in IR!
    [I] [Parser]: 0 error(s), 3 warning(s) generated.
    [I] [Parser]: Parser done!
    [I] Parse model complete
    [I] Simplifying float model.
    [I] [IRChecker] Start to check IR: /home/Red/Samba/ai_code_pc/internal_2025_4_26_0_59_37_y1j6w/tensorflow_build.txt
    [I] [IRChecker] model_name: tensorflow_build
    [I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
    [I] [graph.cpp :1600] loading graph weight: /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w/tensorflow_build.bin size: 0x63628
    [I] Start to simplify the graph...
    [I] Using fixed-point full optimization, it may take long long time ....
    [I] Simplify Done.
    [I] Simplify float model Done.
    [I] Optimizing model....
    [I] [OPT] [00:59:38]: [arg_parser] is running.
    [I] [OPT] [00:59:38]: tool name: Compass-Optimizer, version: 1.3.3119, use cuda: False, running device: cpu
    [I] [OPT] [00:59:38]: [quantization config Info][model name]: tensorflow_build, [quantization method for weight]: per_tensor_symmetric_restricted_range, [quantization method for activation]: per_tensor_symmetr
    ic_full_range, [calibation strategy for weight]: extrema, [calibation strategy for activation]: mean, [quantization precision]: activation_bits=8, weight_bits=8, bias_bits=32, lut_items_in_bits=8
    
    [I] [OPT] [00:59:38]: Suggest using "aipuchecker" to validate the IR firstly if you are not sure about its validity.
    [I] [OPT] [00:59:38]: IR loaded.
    Building graph: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 1604.55it/s]
    [I] [OPT] [00:59:38]: Begin to load weights.
    [I] [OPT] [00:59:38]: Weights loaded.
    Deserializing bin: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 4282.09it/s]
    [I] [OPT] [00:59:38]: Successfully parsed IR with python API.
    [I] [OPT] [00:59:38]: init graph by forwarding one sample filled with zeros
    forward_to: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 2720.92it/s]
    [I] [OPT] [00:59:38]: [graph_optimize_stage1] is running.
    [I] [OPT] [00:59:38]: [statistic] is running.
    statistic batch: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 10000/10000 [00:09<00:00, 1068.02it/s]
    [I] [OPT] [00:59:48]: [graph_optimize_stage2] is running.
    [I] [OPT] [00:59:48]: applying calibration strategy based on statistic info
    calibration: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 9516.29it/s]
    [I] [OPT] [00:59:48]: [quantize] is running.
    [I] [OPT] [00:59:48]: These OPs will automatically cast dtypes to adapt to lib's dtypes' spec (may cause model accuracy loss due to corresponding spec's restriction): {'OpType.FullyConnected', 'OpType.Reshape'
    , 'OpType.Input'}
    quantize each layer: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 2486.62it/s]
    [I] [OPT] [00:59:48]: collecting per-layer similarity infomation between float graph and quanted graph by forwarding 1 sample on both of them
    [I] [OPT] [00:59:48]: [graph_optimize_stage3] is running.
    [I] [OPT] [00:59:48]: [serialize] is running.
    [I] [OPT] [00:59:48]: check the final graph by forwarding one sample filled with zeros
    forward_to: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 5839.62it/s]
    [I] [OPT] [00:59:48]: Begin to serialzie IR
    Writing IR: 4it [00:00, 10803.10it/s]
    Serializing bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 11724.12it/s]
    [I] [OPT] [00:59:48]: IR has been saved into /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w
    [I] [OPT] [00:59:48]: Compass-Optimizer has done at [serialize] period.
    [I] [OPT] [00:59:48]: [Done]cost time: 10s, and [qinfos(scale, zp, dtype)]: out: [[8.502482414245605, 0, INT8]] in: [[255.21998596191406, 0, UINT8]] [output tensors cosine]: [0.9847341069051015][output tensors
     MSE]: [4.2683868408203125]
    [I] Optimizing model complete
    [I] Simplifying quant model...
    [I] [IRChecker] Start to check IR: /home/Red/Samba/ai_code_pc/internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant.txt
    [I] [IRChecker] model_name: tensorflow_build
    [I] [IRChecker] IRChecker: All IR pass (Checker Plugin disabled)
    [I] [graph.cpp :1600] loading graph weight: /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant.bin size: 0x18f28
    [I] Start to simplify the graph...
    [I] Using fixed-point full optimization, it may take long long time ....
    [I] Simplify Done.
    [I] Simplify quant model Done.
    [I] Building ...
    [I] [IRChecker] Start to check IR: /home/Red/Samba/ai_code_pc/internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant_s.txt
    [I] [IRChecker] model_name: tensorflow_build
    [I] [IRChecker] IRChecker: All IR pass
    [I] [tools.cpp : 352] BuildTool version: 6.1.3119. Build for target X2_1204MP3 PID: 1702658
    [I] [tools.cpp : 372] using default profile events to profile default
    [I] [tools.cpp : 834] global cwd: /tmp/89163cf320c276cdaccbb8563d21615d3e20b41dfb1c06cf6339d430d3296
    [I] [graph.cpp :1600] loading graph weight: /home/Red/Samba/ai_code_pc/./internal_2025_4_26_0_59_37_y1j6w/tensorflow_build_quant_s.bin size: 0x18f28
    [I] [tiling.cpp:4500] Auto tiling now, please wait ...
    [I] [aipu_plugin.cpp: 344] FullyConnected(sequential/dense/MatMul_Gemm__6) uses performance-lib
    [I] [aipu_plugin.cpp: 344] FullyConnected(sequential/dense_1/MatMul_Gemm__7) uses performance-lib
    [I] [actg.cpp  : 471] new sgnode with actg: 0
    [I] [datalayout_schedule2.cpp:1067] Layout loss: 0
    [I] [datalayout_schedule2.cpp:1068] Layout scheduling ...
    [I] [datalayout_schedule2.cpp:1071] The layout loss for graph tensorflow_build: 0
    [I] [datalayout_schedule.cpp:1072] The graph tensorflow_build post optimized score:0
    [I] [datalayout_schedule.cpp:1076] layout schedule costs: 0.064558ms
    [I] [IRChecker] Start to check IR:
    [I] [IRChecker] model_name: cost_model
    [I] [IRChecker] IRChecker: All IR pass
    [I] [load_balancer.cpp:2384] enable multicore schedule optimization for load balance strategy 0 it may degrade performance on single core targets.
    [I] [load_balancer.cpp:1439] ----------------------------------------------
    [I] [load_balancer.cpp:1440] Scheduler Optimization Performance Evaluation:
    [I] [load_balancer.cpp:1482] level: 0 cycles: 0 utils:
    [I] [load_balancer.cpp:1482] level: 1 cycles: 5245 utils: 1
    [I] [load_balancer.cpp:1488] total cycles: 5245
    [I] [load_balancer.cpp:1489] ----------------------------------------------
    [I] [load_balancer.cpp: 142] schedule level: done
    [I] [load_balancer.cpp: 145] [level 0]
    [I] [load_balancer.cpp:  94] subgraph_input_1
    [I] [load_balancer.cpp: 105] -*-[real]input_1
    [I] [load_balancer.cpp: 149] [load] 0
    [I] [load_balancer.cpp: 145] [level 1]
    [I] [load_balancer.cpp:  94] subgraph_subgraph_sequential/flatten/Reshape
    [I] [load_balancer.cpp: 105] -*-[real]subgraph_sequential/flatten/Reshape_sg_input_0
    [I] [load_balancer.cpp: 105] -*-[real]sequential/flatten/Reshape
    [I] [load_balancer.cpp:  94] -*-subgraph_sequential/dense/MatMul_Gemm__6
    [I] [load_balancer.cpp: 105] -*--*-[real]sequential/dense/MatMul_Gemm__6
    [I] [load_balancer.cpp:  94] -*-subgraph_sequential/dense_1/MatMul_Gemm__7
    [I] [load_balancer.cpp: 105] -*--*-[real]sequential/dense_1/MatMul_Gemm__7
    [I] [load_balancer.cpp: 149] [load] 5245
    [I] [load_balancer.cpp: 152] schedule level: done done
    [I] [soms_scheduler.cpp: 186] [EVAL] init time t1: 33 ms
    [I] [soms_scheduler.cpp: 192] [EVAL] unsafe check time t2: 7 ms
    [I] [soms_scheduler.cpp: 866] not found!
    [I] [soms_scheduler.cpp: 891] get max in loop 0
    [I] [soms_scheduler.cpp: 205] [EVAL] mem assignment time t3: 18 ms
    [I] [builder.cpp:1819] [EVAL] duration time of mem allocation: 67 ms
    [I] [builder.cpp:1938] The graph DDR Footprint requirement(estimation) of feature maps:
    [I] [builder.cpp:1939]     Read and Write:1.03KB
    [I] [builder.cpp:1080] Reduce constants memory size: 12B
    [I] [builder.cpp:2411] memory statistics for this graph (tensorflow_build)
    [I] [builder.cpp: 585] Total memory     :       0x0005c764 Bytes (369.848KB)
    [I] [builder.cpp: 585] Text      section:       0x00004a70 Bytes ( 18.609KB)
    [I] [builder.cpp: 585] RO        section:       0x00000500 Bytes (  1.250KB)
    [I] [builder.cpp: 585] Desc      section:       0x00000c00 Bytes (  3.000KB)
    [I] [builder.cpp: 585] Data      section:       0x00016300 Bytes ( 88.750KB)
    [I] [builder.cpp: 585] BSS       section:       0x000004f4 Bytes (  1.238KB)
    [I] [builder.cpp: 585] Stack            :       0x00040400 Bytes (257.000KB)
    [I] [builder.cpp: 585] Workspace(BSS)   :       0x00000000 Bytes (  0.000KB)
    [I] [builder.cpp:2427]
    [I] [tools.cpp :1181]  -  compile time: 0.156 s
    [I] [tools.cpp :1087] With GM optimization, DDR Footprint stastic(estimation):
    [I] [tools.cpp :1094]     Read and Write:89.43KB
    [I] [tools.cpp :1137]  -  draw graph time: 0 s
    [I] [tools.cpp :1954] remove global cwd: /tmp/89163cf320c276cdaccbb8563d21615d3e20b41dfb1c06cf6339d430d3296
    build success.......
    Total errors: 0,  warnings: 0

    At this point there are two model files, model00.onnx and mnist.cix, plus the dataset mnist_red.npy. Copy these three files over to the “星睿O6”; next we benchmark the CPU against the NPU and see which one wins. (One thing worth noting in the optimizer log above: the reported output-tensor cosine similarity between the float and quantized graphs is about 0.985, which already hints that a few predictions may change after quantization.)

  6. In the ai_model_hub repository on the “星睿O6”, following the layout of the onnx_resnet_v1_50 directory, create a new directory mnist, put the two model files and the dataset file into it, and then write the CPU and NPU inference scripts. Their contents are:

    import os
    import sys
    import argparse
    import numpy as np
    import onnxruntime as ort
    
    # Define the absolute path to the utils package by going up four directory levels from the current file location
    _abs_path = os.path.join(os.getcwd(), "../../../../")
    # Append the utils package path to the system path, making it accessible for imports
    sys.path.append(_abs_path)
    from utils.label.imagenet_classes import id2class
    from utils.image_process import imagenet_preprocess_method1
    from utils.tools import get_file_list
    from time import time
    
    
    def get_args():
        parser = argparse.ArgumentParser()
        # Argument for the path to the image or directory containing images
        parser.add_argument(
            "--images",
            # default="./test_data/ILSVRC2012_val_00002899.JPEG",
            default="./test_data/",
            help="path to the image file path or dir path.\
                eg. images=./test_data/ILSVRC2012_val_00002899.JPEG or \
                    images=./test_data/",
        )
        # Argument for the path to the ONNX model file
        parser.add_argument(
            "--onnx_path",
            default="model00-sim.onnx",
            help="path to the model file",
        )
        parser.add_argument(
            "--benchmark",
            default=False,
            help="benchmark on ILSVRC2012 val dataset.",
        )
        parser.add_argument(
            "--sel_images",
            type=int,
            default=1000,
            help="path to the model file",
        )
        args = parser.parse_args()
        return args 
    predict=[]
    
    def main():
        args = get_args()
    
        waste_time=[]
        # Load the ONNX model & Get the input and output names for the model
        session = ort.InferenceSession(args.onnx_path)
    
        if args.benchmark:
            from utils.evaluate.imagenet_metric import ImageNet_Metric
    
            image_metric = ImageNet_Metric(
                model=session, model_type="onnx", sel_imgs=args.sel_images
            )
            image_metric.run(input_size=224, data_type="np")
    
        else:
            input_name = session.get_inputs()[0].name
            output_name = session.get_outputs()[0].name
            print(input_name, output_name)
            images_list = np.load('mnist_red.npy').astype(np.float32)
            print(type(images_list), images_list.shape, images_list.dtype)
            for image_path in images_list:
                input = image_path.reshape(1,1,28,28)
                tick=time()
                outputs = session.run([output_name], {input_name: input})[0]
                waste_time.append(time()-tick)
                predict.append(np.argmax(outputs))
    
        waste_time_np=np.array(waste_time)
        print(waste_time_np.mean())
        np.save("cpu.npy", np.array(predict))
    
    if __name__ == "__main__":
        main()

    The NPU inference script:

    import os
    import sys
    import numpy as np
    import argparse
    from time import time
    
    # Define the absolute path to the utils package by going up four directory levels from the current file location
    _abs_path = os.path.join(os.getcwd(), "../../../../")
    # Append the utils package path to the system path, making it accessible for imports
    sys.path.append(_abs_path)
    from utils.label.imagenet_classes import id2class
    from utils.image_process import imagenet_preprocess_method1
    from utils.tools import get_file_list
    from utils.NOE_Engine import EngineInfer
    
    
    def get_args():
        parser = argparse.ArgumentParser()
        # Argument for the path to the image or directory containing images
        parser.add_argument(
            "--images",
            # default="./test_data/ILSVRC2012_val_00002899.JPEG",
            default="./test_data/",
            help="path to the image file path or dir path.\
                eg. images=./test_data/ILSVRC2012_val_00002899.JPEG or \
                    images=./test_data/",
        )
        # Argument for the path to the cix binary model file
        parser.add_argument(
            "--model_path",
            default="./mnist.cix",
            help="path to the model file",
        )
        parser.add_argument(
            "--benchmark",
            default=False,
            help="benchmark on ILSVRC2012 val dataset.",
        )
        parser.add_argument(
            "--sel_images",
            type=int,
            default=1000,
            help="path to the model file",
        )
        args = parser.parse_args()
        return args  
    
    predict=[]
    def main():
        args = get_args()
        model = EngineInfer(args.model_path)
        images_list = np.load('mnist_red.npy').astype(np.float32)
    
        waste_time=[]
        if args.benchmark:
            from utils.evaluate.imagenet_metric import ImageNet_Metric
    
            image_metric = ImageNet_Metric(
                model=model, model_type="cix", sel_imgs=args.sel_images
            )
            image_metric.run(input_size=224, data_type="np")
        else:
            for image_path in images_list:
                input = image_path.reshape(1,1,28,28)
    
                tick=time()
                outputs = model.forward(input)[0]
                waste_time.append(time()-tick)
                predict.append(np.argmax(outputs))
    
            waste_time_np=np.array(waste_time)
            print(waste_time_np.mean())
            np.save("npu.npy", np.array(predict))
            model.clean()
    
    if __name__ == "__main__":
        main()
  7. Now let's look at the average inference latency of the CPU run and the NPU run:

    ▸ python3 inference_onnx.py
    [UMD ERR] /home/alezhe02/project/Compass_Runtime_Midware_release/aipulib_build/umd/src/device/aipu/aipu.cpp:55:aipu_ll_status_t aipudrv::Aipu::init(): query capability [fail]
    [ERROR][init:28][UMD].AIPU UMD API input argument(s) contain NULL pointer.
    input_1 dense_1
    <class 'numpy.ndarray'> (10000, 1, 28, 28) float32
    3.538997173309326e-05
    ▸ python3 inference_npu.py
    npu: noe_init_context success
    npu: noe_load_graph success
    Input tensor count is 1.
    Output tensor count is 1.
    npu: noe_create_job success
    0.0003013821840286255
    npu: noe_clean_job success
    npu: noe_unload_graph success
    npu: noe_deinit_context success

As you can see, the NPU latency (0.0003013821840286255 s) is actually higher than the CPU latency (3.538997173309326e-05 s). Each run also writes out cpu.npy and npu.npy, i.e. the predictions over the same dataset; diffing them with ImHex showed 11 mismatched predictions out of the 10000 test samples, a deviation of 0.11%, which is basically negligible. The question remains: why does this MNIST model take longer on the NPU than on the CPU? That is a bit odd. A plausible explanation is that the model is so tiny that the fixed per-call cost of submitting a job to the NPU and moving data in and out dominates, while the actual compute is trivial for the CPU.

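Instead of diffing the two .npy files in a hex editor, the mismatch count can also be computed directly with numpy. A small sketch, using the cpu.npy and npu.npy files written by the two scripts above:

    # Count prediction mismatches between the CPU and NPU runs (my addition).
    import numpy as np

    cpu = np.load("cpu.npy")
    npu = np.load("npu.npy")
    diff = int((cpu != npu).sum())
    print(f"{diff} / {len(cpu)} predictions differ ({diff / len(cpu):.2%})")
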
Summary

Having deployed a TensorFlow model to the “星睿O6”, a few small takeaways:

  • Using conda to keep several separate Python virtual environments makes the workflow much smoother;
  • When converting a TensorFlow model to ONNX, if the shape of the input tensor is not specified explicitly, you will very likely hit an error like the one below:

(screenshot of the error: layer_top_shape=[[0,1,28,28]] is reported when the batch dimension is left as None)

Explicitly specifying the input tensor's shape resolves this.
