
张新栋 · March 27, 2020

How to DIY a Lightweight MobileNet-SSD Model

Object detection is a key enabling technology for many business problems. In ADAS, for example, FCW (forward collision warning) relies on object detection to detect and recognize vehicles and pedestrians ahead; face-recognition gates rely on a face detector to locate the person passing through and hand the face ROI to the recognition module for verification. Since this column focuses on AI algorithms for embedded devices, this article discusses how to DIY a MobileNet-SSD that runs in near real time on embedded hardware.
First published at: https://zhuanlan.zhihu.com/p/71148972
Author: 张新栋

Here we use Google's open-source Object Detection API; installation is covered in the official documentation and will not be repeated here. Training your own detector with the Object Detection API generally involves these steps: preparing the training data (tfrecord), setting up the preprocessing function, selecting the model, and configuring the training parameters. With these done, you can train by following the official instructions. Let's go through how to prepare each piece for a DIY model.

  • Data preparation

Start from your own annotated data, which may be in any format; typically each image comes with bounding boxes of different classes. When generating the tfrecord, you only need to conform to the tfrecord schema the Object Detection API expects. Below is simple example code; you can write a custom parser for your own annotation format.

import tensorflow as tf
from object_detection.utils import dataset_util


def create_tf_example(example):
    # TODO(user): Populate the following variables from your example.
    height = example['height'] # Image height
    width = example['width'] # Image width
    filename = example['filename'] # Filename of the image. Empty if image is not from file
    encoded_image_data = example['image'] # Encoded image bytes
    image_format = example['format'] # b'jpeg' or b'png'

    xmins = example['xmin'] # List of normalized left x coordinates in bounding box (1 per box)
    xmaxs = example['xmax'] # List of normalized right x coordinates in bounding box(1 per box)
    ymins = example['ymin'] # List of normalized top y coordinates in bounding box (1 per box)
    ymaxs = example['ymax'] # List of normalized bottom y coordinates in bounding box(1 per box)
    classes_text = example['text'] # List of string class name of bounding box (1 per box)
    classes = example['label'] # List of integer class id of bounding box (1 per box)

    tf_example = tf.train.Example(features=tf.train.Features(feature={
        'image/height': dataset_util.int64_feature(height),
        'image/width': dataset_util.int64_feature(width),
        'image/filename': dataset_util.bytes_feature(filename),
        'image/source_id': dataset_util.bytes_feature(filename),
        'image/encoded': dataset_util.bytes_feature(encoded_image_data),
        'image/format': dataset_util.bytes_feature(image_format),
        'image/object/bbox/xmin': dataset_util.float_list_feature(xmins),
        'image/object/bbox/xmax': dataset_util.float_list_feature(xmaxs),
        'image/object/bbox/ymin': dataset_util.float_list_feature(ymins),
        'image/object/bbox/ymax': dataset_util.float_list_feature(ymaxs),
        'image/object/class/text': dataset_util.bytes_list_feature(classes_text),
        'image/object/class/label': dataset_util.int64_list_feature(classes),
    }))
    return tf_example

So you need to parse your own annotation format and fill the fields of the Python dict (image/height, image/width, image/filename, and so on) one by one. Note that xmin, xmax, ymin and ymax are lists whose element values lie in [0, 1]; every coordinate has already been normalized by the image width and height. Example code that assembles the tfrecord files follows:

import os
import xml.etree.ElementTree as ET
import tensorflow as tf
from object_detection.utils import dataset_util
from PIL import Image
import io
import random

# One writer per split; the shuffled examples are divided between them below.
train_writer = tf.python_io.TFRecordWriter("./dms_train.tfrecords")
test_writer  = tf.python_io.TFRecordWriter("./dms_test.tfrecords")
examples = []


annotation_files = [
    "./annotations/2019-06-13.xml",
    "./annotations/2019-06-25.xml"
]

for annotation_file in annotation_files:
    # Annotations are dlib imglab-style XML; each file maps to an image
    # folder of the same name.
    root = ET.parse(annotation_file).getroot()
    image_folder = annotation_file.replace("./annotations/", "")
    image_folder = image_folder.replace(".xml", "")

    for image in root.findall("images/image"):
        filename = image.get("file")
        filename = os.path.join("./image-data/", image_folder, filename)
        boxes    = image.findall("box")

        with tf.gfile.GFile(filename, 'rb') as fid:
            encoded_jpg = fid.read()
            encoded_jpg_io = io.BytesIO(encoded_jpg)
            img  = Image.open(encoded_jpg_io)
            width, height = img.size

        xmins   = []
        ymins   = []
        xmaxs   = []
        ymaxs   = []
        texts   = []
        labels  = []
        example = {}

        for box in boxes:
            # Convert the pixel-space box (left, top, width, height) into
            # normalized [0, 1] coordinates.
            x = int(box.get("left"))   / float(width)
            y = int(box.get("top"))    / float(height)
            w = int(box.get("width"))  / float(width)
            h = int(box.get("height")) / float(height)
            label_text = box.findall("label")[0].text
            label = -1

            # Clamp boxes that extend past the image border to [0, 1].
            xmin = max(x, 0.0)
            ymin = max(y, 0.0)
            xmax = min(x + w, 1.0)
            ymax = min(y + h, 1.0)
  
            if "face" == label_text:
                label = 1
                labels.append(label)
                texts.append("face")
                xmins.append(xmin)
                ymins.append(ymin)
                xmaxs.append(xmax)
                ymaxs.append(ymax)
        
        if len(xmins) > 0:
            filename = filename.encode('utf8')
            example['filename'] = filename
            example['image']    = encoded_jpg
            example['format']   = b'jpeg'
            example['height']   = height
            example['width']    = width
            example['xmin'] = xmins
            example['xmax'] = xmaxs
            example['ymin'] = ymins
            example['ymax'] = ymaxs
            example['label']= labels
            example['text'] = texts
            examples.append(example)

# Shuffle, then route the first 100 examples to the test split and the
# rest to the training split.
idx = 0
random.shuffle(examples)
for example in examples:
    tf_example = create_tf_example(example)
    if idx < 100:
        test_writer.write(tf_example.SerializeToString())
    else:
        train_writer.write(tf_example.SerializeToString())
    idx += 1

train_writer.close()
test_writer.close()
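
As a quick sanity check (a minimal sketch using the TF1-era tf.python_io.tf_record_iterator; the file names simply mirror the writer paths above), you can count how many records landed in each split:

import tensorflow as tf

# Count the serialized examples in each split to confirm the writers worked.
for path in ["./dms_train.tfrecords", "./dms_test.tfrecords"]:
    count = sum(1 for _ in tf.python_io.tf_record_iterator(path))
    print(path, count)
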
  • Preprocessing function setup

The Object Detection API hard-codes its preprocessing function: by default it computes value * 2/255 - 1, mapping pixel values into [-1, 1]. If you have special preprocessing needs, for example whitening with the ImageNet std-mean values, you can modify preprocess in models/ssd_mobilenet_v1_feature_extractor.py along the following lines:

  def preprocess(self, resized_inputs):
    """SSD preprocessing.

    Whitens each channel with a fixed mean and standard deviation
    (ImageNet-style) instead of the default mapping to [-1, 1].

    Args:
      resized_inputs: a [batch, height, width, channels] float tensor
        representing a batch of images.

    Returns:
      preprocessed_inputs: a [batch, height, width, channels] float tensor
        representing a batch of images.
    """
    means   = tf.constant((123.0, 123.0, 123.0), dtype=tf.float32)  # per-channel means
    stddevs = tf.constant((58.0, 58.0, 58.0), dtype=tf.float32)     # per-channel stds
    output = tf.subtract(resized_inputs, means)
    output = tf.divide(output, stddevs)
    return output
    # return (2.0 / 255.0) * resized_inputs - 1.0  # the original default
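
A quick NumPy check of what this whitening does (the 123/58 constants mirror the code above): a mid-gray pixel maps to zero, and the extremes land about two standard deviations out:

import numpy as np

# Whiten a few pixel values the same way the modified preprocess() does.
pixels = np.array([0.0, 123.0, 255.0], dtype=np.float32)
print((pixels - 123.0) / 58.0)  # -> approximately [-2.12, 0.0, 2.28]
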
  • Model selection

Model selection here really means choosing the MobileNet-SSD parameters that fit your business scenario, and these are set in the model config file. The parameters that control model size are the input width and height, the depth_multiplier that scales the number of output channels of each depthwise layer, and the internals of the anchor_generator. For close-range face detection, for instance, a small 224x224 input with depth_multiplier 0.5 is enough; such a model runs in near real time on today's low- to mid-range embedded devices such as the RK3288 and RK3399. The anchor_generator offers more tunables: you can reduce the number of aspect_ratios entries (fewer output anchors), adjust min_scale and max_scale (which affect sensitivity to objects of different sizes), and change num_layers (the number of feature maps fed into the object detection layers; MobileNet-SSD feeds 6 by default). The sketch below shows how these settings determine the per-layer anchor scales.
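
The ssd_anchor_generator spreads anchor scales evenly from min_scale to max_scale across the feature layers; here is a minimal sketch of that interpolation (mirroring the formula used by the API's multi-grid anchor generator, with the values from the config below):

# Anchor scales are evenly interpolated across the feature layers.
min_scale, max_scale, num_layers = 0.2, 0.95, 6
scales = [min_scale + (max_scale - min_scale) * i / (num_layers - 1)
          for i in range(num_layers)]
print(scales)  # [0.2, 0.35, 0.5, 0.65, 0.8, 0.95]

Fewer aspect_ratios entries or fewer layers directly shrink the number of anchors the model must score and suppress at post-processing time.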

  • Training parameter configuration

The training configuration mainly affects detection quality, not speed. Many options are available; the usual ones to tune are the optimizer, batch size, learning rate and data augmentation. Data augmentation in particular deserves experimentation: random horizontal flips, random photometric changes, random crops and so on. Applied sensibly, it maximizes the value of your training data. A full example covering both the model selection and training parameter details follows:

# SSD with MobileNet v1, configured for a single-class (face) dataset.
# Users should configure the fine_tune_checkpoint field in the train config as
# well as the label_map_path and input_path fields in the train_input_reader and
# eval_input_reader. Search for "PATH_TO_BE_CONFIGURED" to find the fields that
# should be configured.
# TPU-compatible

model {
  ssd {
    num_classes: 1
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.2
        max_scale: 0.95
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.3333
      }
    }
    image_resizer {
      fixed_shape_resizer {
        height: 224
        width: 224
      }
    }
    box_predictor {
      convolutional_box_predictor {
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.8
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
        conv_hyperparams {
          activation: RELU_6,
          regularizer {
            l2_regularizer {
              weight: 0.00004
            }
          }
          initializer {
            truncated_normal_initializer {
              stddev: 0.03
              mean: 0.0
            }
          }
          batch_norm {
            train: true,
            scale: true,
            center: true,
            decay: 0.9997,
            epsilon: 0.001,
          }
        }
      }
    }
    feature_extractor {
      type: 'ssd_mobilenet_v1'
      min_depth: 16
      depth_multiplier: 0.5
      conv_hyperparams {
        activation: RELU_6,
        regularizer {
          l2_regularizer {
            weight: 0.00004
          }
        }
        initializer {
          truncated_normal_initializer {
            stddev: 0.03
            mean: 0.0
          }
        }
        batch_norm {
          train: true,
          scale: true,
          center: true,
          decay: 0.9997,
          epsilon: 0.001,
        }
      }
    }
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.99
        loss_type: BOTH
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
    normalize_loss_by_num_matches: true
    post_processing {
      batch_non_max_suppression {
        score_threshold: 1e-8
        iou_threshold: 0.6
        max_detections_per_class: 50
        max_total_detections: 50
      }
      score_converter: SIGMOID
    }
  }
}

train_config: {
  batch_size: 32
  num_batch_queue_threads: 1
  batch_queue_capacity: 2000
  optimizer {
    rms_prop_optimizer: {
      learning_rate: {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.001
          decay_steps: 18750
          decay_factor: 0.5
        }
      }
      momentum_optimizer_value: 0.9
      decay: 0.9
      epsilon: 1.0
    }
  }
  #fine_tune_checkpoint: "/home/shuai/models/ssd_mobilenet_v1_0.75_depth_300x300_coco14_sync_2018_07_03/model.ckpt"
  fine_tune_checkpoint: "/home/data/zhangxd/train_face_0.5_224x224/model.ckpt-200058"
  from_detection_checkpoint: true
  load_all_detection_checkpoint_vars: false
  # Note: the line below limits training to 200K steps; remove it to train
  # until manually stopped.
  num_steps: 200000
  data_augmentation_options {
    random_adjust_brightness{
    }
  }
  data_augmentation_options {
    random_image_scale {
    }
  }
  data_augmentation_options {
    random_jitter_boxes {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  max_number_of_boxes: 50
  unpad_groundtruth_tensors: false
}

train_input_reader: {
  tf_record_input_reader {
    input_path: "/home/data/zhangxd/face_det_data/face_det_train.tfrecords"
  }
  label_map_path: "/home/data/zhangxd/face_det_data/face_label_map.pbtxt"
  num_readers: 1
  prefetch_size: 256
  read_block_length: 32

}

eval_config: {
  num_examples: 1000
  visualization_export_dir: '/home/data/zhangxd/visualization'
  visualize_groundtruth_boxes: True
  min_score_threshold: 0.5
  num_visualizations: 100
  include_metrics_per_category: True
}

eval_input_reader: {
  tf_record_input_reader {
    input_path: "/home/data/zhangxd/face_det_data/face_det_test.tfrecords"
  }
  label_map_path: "/home/data/zhangxd/face_det_data/face_label_map.pbtxt"
  shuffle: false
  num_readers: 1
  prefetch_size: 32
  read_block_length: 16
}
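
Both reader configs point to face_label_map.pbtxt, which is not shown above; a minimal label map matching the single face class (id 1, the same id assigned in the tfrecord code earlier) would be:

item {
  id: 1
  name: 'face'
}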


  • Final words

With the preparation above complete, you can train using the officially provided scripts, as sketched below. At this point you have designed and trained a detector that runs in near real time on most embedded devices. If you are interested in deploying CNNs on embedded targets, check out the earlier posts in this column and try deploying the lightweight MobileNet-SSD you just trained. Comments, discussion and follows are all welcome. Thanks, everyone!
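
For reference, a typical pair of invocations (a sketch only; the paths are placeholders, while model_main.py and export_inference_graph.py are the standard entry points shipped with the TF1-era Object Detection API):

# Train with the pipeline config assembled above.
python object_detection/model_main.py \
    --pipeline_config_path=/path/to/pipeline.config \
    --model_dir=/path/to/train_dir

# Export a frozen inference graph from a trained checkpoint.
python object_detection/export_inference_graph.py \
    --input_type=image_tensor \
    --pipeline_config_path=/path/to/pipeline.config \
    --trained_checkpoint_prefix=/path/to/train_dir/model.ckpt-200000 \
    --output_directory=/path/to/exported_model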

