Watch Star Wars in HD with Super-Resolution, TVM and OpenCL

https://github.com/henriwoodcock/star-wars-super-resolution/tree/main/blog

This blog explains how to use TVM with OpenCL support to run a super-resolution model on an Arm Mali GPU.

Happy May the 4th!

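On this Star Wars Day, let's avoid every argument about which release is the best version or cut of the original Star Wars. Some people prefer the original, some prefer the Blu-ray, but one thing we can all agree on is that our TVs and monitors keep getting better. So much better that we can now clearly see the difference in quality of some of the older releases.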

So instead of arguing, why don't we upscale the old versions to HD?

This blog post shows how to build a super-resolution model in PyTorch, convert it to ONNX, and then use TVM to compile it and deploy it to an OpenCL device.

Contents

What you will need
Quick install and setup
Super-resolution
TVM

What you will need

A low-resolution copy of any Star Wars film (or any other film, for those of you reading who aren't Star Wars fans).

A device with an OpenCL-capable GPU. I'm using an Odroid N2+, which has an Arm Mali-G52 GPU.

Quick install and setup

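If you're just here to build the application and start upscaling your old copies of Star Wars, head to my GitHub repository and follow the installation steps in README.md.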

Super-resolution

Super-resolution (SR) is a newer approach to upscaling images.

Traditionally, two main techniques have been used to upscale images: nearest neighbour and interpolation.

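Nearest neighbour enlarges an image by filling each pixel of the output with the value of the nearest pixel in the input. This means the final image contains blocks of identical pixels.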
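A minimal NumPy sketch of nearest-neighbour upscaling for a greyscale image (the function name `nearest_neighbour_upscale` is just for illustration). Each input pixel becomes a `factor x factor` block of identical values:

```python
import numpy as np

def nearest_neighbour_upscale(image: np.ndarray, factor: int) -> np.ndarray:
    """Upscale a 2-D greyscale image by repeating each pixel factor x factor times."""
    # repeat every row, then every column: each input pixel becomes a block
    return np.repeat(np.repeat(image, factor, axis=0), factor, axis=1)

small = np.array([[10, 200],
                  [50, 120]], dtype=np.uint8)
big = nearest_neighbour_upscale(small, 3)
print(big.shape)    # (6, 6)
print(big[:3, :3])  # a 3x3 block of identical 10s
```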

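Alternatively, interpolation techniques work by interpolating between pixels, producing a smooth blend of colours between neighbouring pixels when upscaling. The problem is that this reduces the contrast in the image, which produces a blurring effect.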
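To see where the blur comes from, here is a small sketch that upscales a single row containing a hard black-to-white edge, using `np.interp` for linear interpolation and `np.repeat` for nearest neighbour as a comparison. The helper name is illustrative:

```python
import numpy as np

def linear_upscale_row(row: np.ndarray, factor: int) -> np.ndarray:
    """Upscale a 1-D row of pixels with linear interpolation."""
    n = len(row)
    src = np.arange(n)                       # positions of the original pixels
    dst = np.linspace(0, n - 1, n * factor)  # denser positions for the output pixels
    return np.interp(dst, src, row)

edge = np.array([0.0, 0.0, 255.0, 255.0])  # a hard black-to-white edge
linear = linear_upscale_row(edge, 3)
nearest = np.repeat(edge, 3)

print(np.unique(nearest))  # only black and white
print(np.round(linear, 1))  # grey values appear across the edge
```

Nearest neighbour keeps only 0 and 255 (blocky but sharp), while linear interpolation introduces intermediate grey values across the edge, which is the loss of contrast described above.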

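SR algorithms are neural network models trained on pairs of images, one used as the input and one as the target output. The input image is a low-resolution copy and the output image is the high-resolution original. The neural network then learns a non-linear mapping that lets it fill in the detail missing between the low- and high-resolution images.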
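A sketch of how such training pairs can be produced from a single high-resolution image. The degradation used here (averaging each `factor x factor` block) is an assumption for illustration; training pipelines often use bicubic resizing instead, and `make_training_pair` is a hypothetical helper:

```python
import numpy as np

def make_training_pair(hr: np.ndarray, factor: int):
    """Return a (low-res input, high-res target) pair from one greyscale image.

    Assumption: the low-res copy is made by block averaging (area downsampling).
    """
    h, w = hr.shape
    # crop so both dimensions divide evenly by the upscale factor
    h, w = h - h % factor, w - w % factor
    hr = hr[:h, :w].astype(np.float32)
    # average every factor x factor block to produce the low-res input
    lr = hr.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))
    return lr, hr

rng = np.random.default_rng(0)
high_res = rng.integers(0, 256, size=(1081, 1920))
low_res, target = make_training_pair(high_res, 3)
print(low_res.shape, target.shape)  # (360, 640) (1080, 1920)
```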

Interpolation vs. an SR algorithm. The image on the right has sharper edges.

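One of the main drawbacks of SR algorithms is the compute power they require. The more you want to upscale an image, the more compute you need. Similarly, larger input images need more compute, and more memory to store the input and output.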
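To put rough numbers on this, the sketch below counts multiply-accumulates (MACs) and float32 tensor sizes for the SuperResolutionNet built later in this post, assuming a 640x360 input and an upscale factor of 3. It ignores biases, activations and any optimisations the compiler may apply, so treat it as an estimate:

```python
# Conv layers of SuperResolutionNet as (in_channels, out_channels, kernel_size).
# Padding keeps the spatial size, so every layer computes one output per input pixel.
def sr_layers(upscale_factor):
    return [(1, 64, 5), (64, 64, 3), (64, 32, 3), (32, upscale_factor ** 2, 3)]

def macs_per_frame(width, height, upscale_factor):
    """Multiply-accumulates for one frame (biases and ReLUs ignored)."""
    per_pixel = sum(cin * cout * k * k for cin, cout, k in sr_layers(upscale_factor))
    return per_pixel * width * height

def float32_megabytes(*shape):
    """Size in MB of a float32 tensor with the given shape."""
    n = 1
    for dim in shape:
        n *= dim
    return n * 4 / 1e6

print(macs_per_frame(640, 360, 3))          # 13706035200, about 13.7 GMACs per frame
print(float32_megabytes(1, 1, 640, 360))    # input tensor: 0.9216
print(float32_megabytes(1, 64, 640, 360))   # one 64-channel activation: 58.9824
print(float32_megabytes(1, 1, 1920, 1080))  # output tensor at 3x: 8.2944
```

For this architecture the upscale factor only changes the last convolution (it outputs `upscale_factor ** 2` channels), so most of the cost comes from the number of input pixels.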

TVM

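For this model, we'll use TVM to deploy the network. TVM is an open-source framework that compiles neural networks for CPUs, GPUs and machine-learning accelerators.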

Comparison of inference times with TVM on Cortex-A CPUs and Mali GPUs. Statistics are from the TVM GitHub wiki.

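TVM lets you deploy models both locally and remotely. This gives you the freedom to keep improving your model without worrying about how you will update it on the device, because you can update it remotely.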

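However, the real advantage comes from AutoTVM, which generates and tunes optimised code for your model based on its measured performance on the device.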

TVM also supports a wide range of machine-learning frameworks, including TensorFlow, TFLite, PyTorch and ONNX, so it fits easily into your workflow whichever framework you use.

Building the model

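I've found that the easiest way to build a super-resolution model is with PyTorch. I'll use my development machine for this section, because PyTorch can take a while to install and there may not be a wheel available for your device.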

First, install PyTorch, ONNX and NumPy:

pip install torch onnx numpy

Build the model. This is taken from the PyTorch examples.

import torch.nn as nn
import torch.nn.init as init


class SuperResolutionNet(nn.Module):
  def __init__(self, upscale_factor, inplace=False):
    super(SuperResolutionNet, self).__init__()

    self.relu = nn.ReLU(inplace=inplace)
    self.conv1 = nn.Conv2d(1, 64, (5, 5), (1, 1), (2, 2))
    self.conv2 = nn.Conv2d(64, 64, (3, 3), (1, 1), (1, 1))
    self.conv3 = nn.Conv2d(64, 32, (3, 3), (1, 1), (1, 1))
    self.conv4 = nn.Conv2d(32, upscale_factor ** 2, (3, 3), (1, 1), (1, 1))
    self.pixel_shuffle = nn.PixelShuffle(upscale_factor)

    self._initialize_weights()

  def forward(self, x):
    x = self.relu(self.conv1(x))
    x = self.relu(self.conv2(x))
    x = self.relu(self.conv3(x))
    x = self.pixel_shuffle(self.conv4(x))
    return x

  def _initialize_weights(self):
    init.orthogonal_(self.conv1.weight, init.calculate_gain('relu'))
    init.orthogonal_(self.conv2.weight, init.calculate_gain('relu'))
    init.orthogonal_(self.conv3.weight, init.calculate_gain('relu'))
    init.orthogonal_(self.conv4.weight)
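The final `PixelShuffle` layer is what performs the upscaling: it rearranges the `upscale_factor ** 2` channels produced by `conv4` into a single channel that is `upscale_factor` times larger in each dimension. Below is a NumPy re-implementation for one unbatched tensor, intended to follow the rearrangement described in the PyTorch documentation for `nn.PixelShuffle`; the function is illustrative and not part of the model code:

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r)."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    # split channels into (C, r, r), then interleave the r x r sub-pixels:
    # out[c, h*r + i, w*r + j] = x[c*r*r + i*r + j, h, w]
    return x.reshape(c, r, r, h, w).transpose(0, 3, 1, 4, 2).reshape(c, h * r, w * r)

# conv4 outputs upscale_factor ** 2 = 9 channels for a 360x640 input
features = np.random.rand(9, 360, 640).astype(np.float32)
print(pixel_shuffle(features, 3).shape)  # (1, 1080, 1920)
```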

Choose your upscale factor and load in the pretrained weights:

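import torch.utils.model_zoo as model_zoo

upscale_factor = 3
torch_model = SuperResolutionNet(upscale_factor=upscale_factor)
# Pretrained weights (for upscale_factor=3) used by the PyTorch super-resolution ONNX tutorial
model_url = 'https://s3.amazonaws.com/pytorch/test_data/export/superres_epoch100-44c6958e.pth'
# Initialize model with the pretrained weights, keeping the tensors on the CPU
map_location = lambda storage, loc: storage
torch_model.load_state_dict(model_zoo.load_url(model_url, map_location=map_location))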

Choose the input image size and export the model as an ONNX model:


from pathlib import Path

import onnx
import torch

# set the model to inference mode
torch_model.eval()

batch_size = 1
# input size as (width, height); video frames are transposed to this order
# before inference, so the model input shape is (N, 1, 640, 360)
video_size = (640, 360)
# dummy input used to trace the model during export
x = torch.randn(batch_size, 1, video_size[0], video_size[1], requires_grad=True)
# make sure the output directory exists
Path("device").mkdir(exist_ok=True)
# Export the model
torch.onnx.export(torch_model,               # model being run
  x,                         # model input (or a tuple for multiple inputs)
  "device/super_resolution.onnx",   # where to save the model (can be a file or file-like object)
  export_params=True,        # store the trained parameter weights inside the model file
  opset_version=10,          # the ONNX opset version to export the model to
  do_constant_folding=True,  # whether to execute constant folding for optimization
  input_names = ['input'],   # the model's input names
  output_names = ['output'], # the model's output names
  )

onnx_model = onnx.load("device/super_resolution.onnx")
print("checking onnx model...")
onnx.checker.check_model(onnx_model)

Installation

Before running this section, make sure the OpenCL drivers are installed on your device if you plan to use OpenCL.

We now need to install TVM, its dependencies and the required Python packages on the device. TVM uses LLVM to compile its models, and we will also use OpenCL.

Install the dependencies

sudo apt install -y vim \
                    gcc \
                    g++ \
                    cmake \
                    python3 \
                    python3-dev \
                    python3-pip \
                    python3-setuptools \
                    python3-opencv \
                    git \
                    wget \
                    libtinfo-dev \
                    zlib1g-dev \
                    build-essential \
                    libedit-dev \
                    libxml2-dev \
                    protobuf-compiler \
                    clang \
                    llvm \
                    libatlas-base-dev \
                    gfortran

Install the Python packages
pip3 install --user numpy \
                decorator \
                attrs \
                tornado \
                psutil \
                xgboost \
                cloudpickle \
                onnx \
                pillow

Install TVM

Clone the TVM source and update all submodules

git clone --recursive https://github.com/apache/tvm tvm
cd tvm
git submodule init
git submodule update

Create a build directory, add your configuration to the CMake config, and finally build TVM.

mkdir build
cp cmake/config.cmake build
cd build
sed -i "s/USE_LLVM OFF/USE_LLVM ON/" config.cmake
sed -i "s/USE_OPENCL OFF/USE_OPENCL ON/" config.cmake
cmake ../.
make -j4

Finally, add the TVM Python path to your .bashrc file

cd ..
export TVM_HOME=$(pwd)
echo "export PYTHONPATH=$TVM_HOME/python:\${PYTHONPATH}" >> ~/.bashrc
source ~/.bashrc

Compiling the model and running inference

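In this section, we write a script that compiles the model for OpenCL and runs inference with it. Before running it, make sure you have copied the .onnx model from your development machine to the device.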

Import the required dependencies

from PIL import Image
import onnx
import numpy as np
import tvm
from tvm import te
import tvm.relay as relay
from pathlib import Path
import cv2
import time, sys

Compile the ONNX model for OpenCL

def load_model(onnx_model_path):
  onnx_model = onnx.load(onnx_model_path.as_posix())
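  # compile for the Mali GPU through OpenCL; host-side code is built for 64-bit Arm Linux
  target = tvm.target.Target("opencl -device=mali", host="llvm -mtriple=aarch64-linux-gnu")
  input_name = "input"
  # must match the shape used when exporting to ONNX: (N, C, width, height)
  shape_dict = {input_name: (1, 1, 640, 360)}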
  mod, params = relay.frontend.from_onnx(onnx_model, shape_dict)

  with tvm.transform.PassContext(opt_level=1):
    intrp = relay.build_module.create_executor("graph", mod, None, target)

  tvm_model = {"intrp": intrp, "params": params}

  return tvm_model

tvm_model_p = Path("device/super_resolution.onnx")
tvm_model = load_model(tvm_model_p)

Load the input video

def get_source_encoding_int(video_capture):
  return int(video_capture.get(cv2.CAP_PROP_FOURCC))

def get_frameSize(video_capture, scale):
  # (width, height) of the upscaled output frames
  return (int(video_capture.get(cv2.CAP_PROP_FRAME_WIDTH) * scale),
    int(video_capture.get(cv2.CAP_PROP_FRAME_HEIGHT) * scale))

def get_fps(video_capture):
  return int(video_capture.get(cv2.CAP_PROP_FPS))

video_p = Path("input_video.mp4")
video = cv2.VideoCapture(video_p.as_posix())

if not video.isOpened():
  raise RuntimeError(f"Failed to open video capture from file: {video_p.as_posix()}")

frame_count = range(int(video.get(cv2.CAP_PROP_FRAME_COUNT)))

Create a video writer for the output video, making sure to pass the new, upscaled frame size.

output_video_p = Path("output_video.mp4")

video_writer = cv2.VideoWriter(filename = output_video_p.as_posix(),
  fourcc = get_source_encoding_int(video), fps = get_fps(video),
  frameSize = get_frameSize(video, 3)
)
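`get_source_encoding_int` returns the codec as a packed integer, which is hard to read when debugging. A small sketch (the helper names are illustrative) that decodes it into its four characters and packs it back, using the same byte layout as `cv2.VideoWriter_fourcc`:

```python
def fourcc_to_str(code: int) -> str:
    """Decode a FOURCC integer (as returned for CAP_PROP_FOURCC) into 4 characters."""
    # the characters are packed little-endian: the first character is the lowest byte
    return "".join(chr((code >> (8 * i)) & 0xFF) for i in range(4))

def str_to_fourcc(text: str) -> int:
    """Pack 4 characters into a FOURCC integer, like cv2.VideoWriter_fourcc(*text)."""
    return sum(ord(ch) << (8 * i) for i, ch in enumerate(text))

print(fourcc_to_str(str_to_fourcc("mp4v")))  # mp4v
```

If your OpenCV build cannot write the source codec, passing a different code such as `str_to_fourcc("mp4v")` as `fourcc` is one option; which codecs are available depends on how OpenCV was built.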

Write the inference loop

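def inference(run, params, input_array):
  dtype = "float32"
  tvm_output = run(tvm.nd.array(input_array.astype(dtype)), **params).asnumpy()

  return tvm_output

intrp = tvm_model['intrp']
params = tvm_model['params']
# evaluate() builds the compiled graph, so call it once instead of once per frame
run = intrp.evaluate()

for _ in frame_count:
  # read the next frame (OpenCV returns BGR); the video must be 640x360
  frame_present, frame = video.read()
  if not frame_present:
    break

  # convert BGR to RGB before handing the frame to PIL
  frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
  input_frame = Image.fromarray(frame).convert("YCbCr")
  img_y, img_cb, img_cr = input_frame.split()
  # the model upscales only the Y (luma) channel, scaled to [0, 1] and
  # transposed to (width, height) to match the input shape (1, 1, 640, 360)
  img_y = np.asarray(img_y, dtype=np.float32).T / 255.0
  img_y = img_y[np.newaxis, np.newaxis, :, :]

  out = inference(run, params, img_y)
  # back to an 8-bit (height, width) luma image
  out = (out[0, 0].T * 255.0).clip(0, 255)
  out = Image.fromarray(out.astype(np.uint8), mode='L')

  # upscale the chroma channels with ordinary bicubic interpolation
  out_img_cb = img_cb.resize(out.size, Image.BICUBIC)
  out_img_cr = img_cr.resize(out.size, Image.BICUBIC)
  out_img = Image.merge('YCbCr', [out, out_img_cb, out_img_cr]).convert('RGB')
  # OpenCV expects BGR frames when writing
  out_img = cv2.cvtColor(np.array(out_img), cv2.COLOR_RGB2BGR)
  video_writer.write(out_img)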
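The tensor layout is the easiest part of this loop to get wrong, so here is a standalone sketch of the pre- and post-processing. It assumes the model expects the luma channel scaled to [0, 1] (as in the PyTorch example the network comes from) and the transposed (width, height) layout used when exporting; the helper names are hypothetical:

```python
import numpy as np

def to_model_input(y_plane: np.ndarray) -> np.ndarray:
    """(height, width) uint8 luma plane -> (1, 1, width, height) float32 in [0, 1]."""
    # transpose so the layout matches the exported input shape (1, 1, 640, 360)
    return (y_plane.T.astype(np.float32) / 255.0)[np.newaxis, np.newaxis]

def from_model_output(out: np.ndarray) -> np.ndarray:
    """(1, 1, width, height) float output -> (height, width) uint8 luma plane."""
    return (out[0, 0].T * 255.0).clip(0, 255).round().astype(np.uint8)

y = np.random.default_rng(0).integers(0, 256, size=(360, 640)).astype(np.uint8)
x = to_model_input(y)
print(x.shape)                                  # (1, 1, 640, 360)
print(np.array_equal(from_model_output(x), y))  # True
```

Feeding the model's output through `from_model_output` gives a (height, width) plane, here 3x larger in each dimension, ready for `Image.fromarray`.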

Finally, release the video capture and the video writer


video.release()
video_writer.release()

Summary

This guide showed how to deploy a super-resolution model on a single-board computer such as the Odroid N2+.

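We built the model with PyTorch and then used TVM to compile the network for the hardware. This let us run compute-heavy networks, such as this super-resolution network, on the device.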

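The use case in this example is just for fun, since each frame can take several seconds to process, but I hope this article has shown you TVM's potential for deploying different networks to different devices.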
