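Convolutional Neural Network Study Notes: DenseNet

The complete code and data are available in the author's GitHub repository.

Drawing on online material and the DenseNet paper, this post walks through DenseNet from start to finish. Most of the code and figures come from online sources, with thanks to their authors; the reference links are given later in the text.

The DenseNet paper is very well written and worth reading in full. A translation is available here:
Deep Learning Paper Translation and Analysis (15): Densely Connected Convolutional Networks

Since ResNet was proposed, variants of it have appeared one after another, each with its own characteristics and some improvement in performance. This post studies DenseNet (Dense Convolutional Network), which won the CVPR 2017 best paper award. The paper compares DenseNet mainly against ResNet and Inception networks. It borrows ideas from both but is a new architecture: the structure is not complicated, yet it is very effective, outperforming ResNet across the board on the CIFAR benchmarks. DenseNet can be said to absorb the most essential part of ResNet and build further innovations on top of it, which improves network performance further.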

1. ResNet VS DenseNet

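First, let us get a rough picture of DenseNet by comparing it with ResNet.

The figure below shows ResNet's shortcut connection mechanism (where + denotes element-wise addition).

As can be seen, in ResNet each layer is connected by a shortcut to an earlier layer (typically 2 to 3 layers back), and the connection is made by element-wise addition.

DenseNet follows the same basic idea as ResNet, but it builds dense connections from all preceding layers to each later layer, which is where its name comes from. Another hallmark of DenseNet is feature reuse, achieved by concatenating features along the channel dimension. These properties allow DenseNet to outperform ResNet with fewer parameters and lower computational cost.

Compared with ResNet, DenseNet proposes a more aggressive dense connection mechanism: all layers are connected to each other, that is, each layer takes all preceding layers as additional inputs. Whereas ResNet adds a shortcut element-wise, in DenseNet each layer is concatenated (Concat) with all preceding layers along the channel dimension (the feature maps of these layers have the same size, as explained later), and the result serves as the input of the next layer. For a network with L layers, DenseNet contains L(L+1)/2 connections in total, which is dense compared with ResNet. Because DenseNet directly concatenates feature maps from different layers, it achieves feature reuse and improves efficiency; this is the main difference between DenseNet and ResNet.

One point needs to be clear: dense connectivity exists only within a Dense Block; there is no dense connectivity between different Dense Blocks.

The figure below shows DenseNet's dense connection mechanism (where C denotes channel-wise concatenation).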

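1.1 The difference between the add and concatenate operations in Keras

Having discussed how ResNet and DenseNet differ, we should also look at the code level, since our goal is to implement the network.

The conclusion first: ResNet uses the add operation throughout, while DenseNet and Inception networks use the concatenate operation.

The Concatenate operation

Concatenation leaves H and W unchanged but increases the number of channels.

This is an important operation in network architecture design. It is often used to combine features, to fuse features extracted by several convolutional feature extractors, or to fuse information from output layers. DenseNet merges channels: concatenate joins tensors along the channel dimension, which means the number of features describing the image increases, while the information under each feature does not.

The Concatenate class and the concatenate function in Keras

Here we look directly at the source code, focusing only on the difference:

First, the Concatenate class: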

class Concatenate(_Merge):
    """Layer that concatenates a list of inputs.
 
    It takes as input a list of tensors,
    all of the same shape except for the concatenation axis,
    and returns a single tensor, the concatenation of all inputs.
 
    # Arguments
        axis: Axis along which to concatenate.
        **kwargs: standard layer keyword arguments.
    """

Next, the concatenate() function:

def concatenate(inputs, axis=-1, **kwargs):
    """Functional interface to the `Concatenate` layer.
 
    # Arguments
        inputs: A list of input tensors (at least 2).
        axis: Concatenation axis.
        **kwargs: Standard layer keyword arguments.
 
    # Returns
        A tensor, the concatenation of the inputs alongside axis `axis`.
    """
    return Concatenate(axis=axis, **kwargs)(inputs)

concatenate() is the functional interface to the Concatenate layer. Either one can be used, as long as it is called correctly. We verify this with code later.

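The Add operation

Addition leaves H, W, and C unchanged; only the element values change.

Addition superimposes information. ResNet sums values, so the number of channels is unchanged. With add, the amount of information under each feature describing the image increases, but the number of dimensions describing the image does not.

The Add class and the add function in Keras

Here we look directly at the source code, focusing only on the difference:

First, the Add class: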

class Add(_Merge):
    """Layer that adds a list of inputs.
 
    It takes as input a list of tensors,
    all of the same shape, and returns
    a single tensor (also of the same shape).
 
    # Examples
 
   python
        import keras
 
        input1 = keras.layers.Input(shape=(16,))
        x1 = keras.layers.Dense(8, activation='relu')(input1)
        input2 = keras.layers.Input(shape=(32,))
        x2 = keras.layers.Dense(8, activation='relu')(input2)
        # equivalent to added = keras.layers.add([x1, x2])
        added = keras.layers.Add()([x1, x2])
 
        out = keras.layers.Dense(4)(added)
        model = keras.models.Model(inputs=[input1, input2], outputs=out)
    
    """

Next, the add() function:

def add(inputs, **kwargs):
    """Functional interface to the `Add` layer.
 
    # Arguments
        inputs: A list of input tensors (at least 2).
        **kwargs: Standard layer keyword arguments.
 
    # Returns
        A tensor, the sum of the inputs.
 
    # Examples
    python
        import keras
 
        input1 = keras.layers.Input(shape=(16,))
        x1 = keras.layers.Dense(8, activation='relu')(input1)
        input2 = keras.layers.Input(shape=(32,))
        x2 = keras.layers.Dense(8, activation='relu')(input2)
        added = keras.layers.add([x1, x2])
 
        out = keras.layers.Dense(4)(added)
        model = keras.models.Model(inputs=[input1, input2], outputs=out)
    
    """
    return Add(**kwargs)(inputs)

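add() is the functional interface to the Add layer. Either one can be used, as long as it is called correctly. We verify this with code later.

Code demonstrating the difference between the four functions in Keras

The code is as follows: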

from keras.layers import Concatenate, Add, add, concatenate
import numpy as np
import tensorflow as tf
 
matrix1 = np.array([[1,2,3], [4,5,6]])
matrix2 = np.array([[11,22,33], [44,55,66]])
# Convert the NumPy arrays to tensors
t1 = tf.convert_to_tensor(matrix1)
t2 = tf.convert_to_tensor(matrix2)
print(t1)
print(t2)
'''
    [[1 2 3]
     [4 5 6]]
 
    [[11 22 33]
     [44 55 66]]
 
    Tensor("Const:0", shape=(2, 3), dtype=int32)
    Tensor("Const_1:0", shape=(2, 3), dtype=int32)
'''
exp_Add = Add()([t1, t2])
exp_Concatenate = Concatenate()([t1, t2])
print(exp_Add)
print(exp_Concatenate)
# In TensorFlow 1.x graph mode, evaluating a tensor requires a Session
with tf.Session() as sess:
    print('exp_Add is ', exp_Add.eval())
    print('exp_Concatenate is ', exp_Concatenate.eval())
'''
    exp_Add is  [[12 24 36]
     [48 60 72]]
    exp_Concatenate is  [[ 1  2  3 11 22 33]
     [ 4  5  6 44 55 66]]
 
    Tensor("add_1/add:0", shape=(2, 3), dtype=int32)
    Tensor("concatenate_1/concat:0", shape=(2, 6), dtype=int32)
'''
exp_Add1 = add([t1, t2])
exp_Concatenate1 = concatenate([t1, t2])
print(exp_Add1)
print(exp_Concatenate1)
'''
    Tensor("add_2/add:0", shape=(2, 3), dtype=int32)
    Tensor("concatenate_2/concat:0", shape=(2, 6), dtype=int32)
'''
with tf.Session() as sess:
    print(exp_Add1.eval() == exp_Add.eval())
    print(exp_Concatenate1.eval() == exp_Concatenate.eval())
'''
    [[ True  True  True]
     [ True  True  True]]
    [[ True  True  True  True  True  True]
     [ True  True  True  True  True  True]]
'''
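
As a framework-independent illustration of the same contrast, the sketch below uses NumPy in place of the Keras layers. This is a substitution made here for simplicity, not code from the original post, and the array shapes and values are made up. It only shows how the shapes behave: addition keeps (H, W, C), while concatenation along the last axis adds up the channels.

```python
import numpy as np

# Two toy "feature maps" in channels-last layout: (H, W, C) = (2, 2, 3)
a = np.arange(12).reshape(2, 2, 3)
b = np.ones((2, 2, 3), dtype=a.dtype)

# ResNet-style merge: element-wise sum, shape is unchanged
added = a + b

# DenseNet-style merge: join along the channel axis, channel counts add up
concatenated = np.concatenate([a, b], axis=-1)

print(added.shape)         # (2, 2, 3)
print(concatenated.shape)  # (2, 2, 6)
```

The first three channels of `concatenated` are exactly `a` and the last three are exactly `b`, which is the sense in which concatenation preserves earlier features for reuse.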

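2. DenseNet network architecture

As CNNs grow deeper, a critical problem emerges: as information about the input or the gradient passes through many layers, it can vanish or blow up. Research has shown that convolutional networks can be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and layers close to the output. To ensure maximum information flow between layers, the architecture proposed in the paper connects all layers directly with each other. To preserve the feed-forward nature, each layer obtains additional inputs from all preceding layers and passes its own feature maps on to all subsequent layers. Based on this observation, the paper proposes the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion.

The authors observed that an important feature of current deep networks is the inclusion of shorter connections, which make networks deeper, more accurate, and more efficient. Making full use of skip connections, they designed the Dense Convolutional Network, in which every layer receives the outputs of all preceding layers. A traditional convolutional network with L layers has L connections, whereas a DenseNet with L layers has L(L+1)/2 connections.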
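
The connection count can be checked with a few lines of Python. This is only a small sketch; the function name `dense_connection_count` is made up for illustration. It assumes layer l receives l inputs (the block input plus the outputs of the l - 1 earlier layers) and sums them up.

```python
def dense_connection_count(num_layers: int) -> int:
    """Count direct connections when layer l receives l inputs
    (the block input plus the outputs of the l - 1 earlier layers)."""
    return sum(l for l in range(1, num_layers + 1))


for L in (1, 5, 40):
    # The explicit sum agrees with the closed form L(L+1)/2
    assert dense_connection_count(L) == L * (L + 1) // 2
    print(L, dense_connection_count(L))
```

For L = 5 this gives 15 connections, compared with 5 in a plain feed-forward stack.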

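2.1 Highlights of DenseNet

Fewer parameters than ResNet
Bypass connections strengthen feature reuse
The network is easier to train and has a regularizing effect
It alleviates the gradient vanishing and model degradation problems

2.2 Analysis of the DenseNet network

DenseNet is a convolutional neural network with dense connections. In this network, any two layers are directly connected: the input of each layer is the union (concatenation) of the outputs of all preceding layers, and the feature maps learned by that layer are passed directly to all subsequent layers as input.

The figure below shows a five-layer dense block:

The forward pass of DenseNet shown in the figure above makes the dense connectivity easier to see. For example, the input of h3 includes not only x2 from h2 but also x0 and x1 from the two earlier layers, all concatenated along the channel dimension.

The figure below shows the DenseNet network structure. It contains four DenseBlocks in total, connected to each other through Transition layers.

CNNs generally use pooling or convolutions with stride > 1 to reduce the feature map size, whereas DenseNet's dense connectivity requires feature maps of the same size. To resolve this, DenseNet uses a DenseBlock + Transition structure. A DenseBlock is a module containing many layers whose feature maps all have the same size, with dense connections between the layers. A Transition module connects two adjacent DenseBlocks and reduces the feature map size through pooling.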

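2.3 Dense Block

First, the Dense Block network structure:

Dense Block module: BN + ReLU + Conv(3 * 3) + dropout

Transition layer module: BN + ReLU + Conv(1 * 1)(filter_num: m) + dropout + Pooling(2 * 2)

As we know, the DenseNet architecture consists mainly of DenseBlocks and Transitions. Next we study the implementation details, starting with the network structure:

Within a DenseBlock, all layers have feature maps of the same size, so they can be concatenated along the channel dimension. The nonlinear composite function H() in a DenseBlock uses the structure BN + ReLU + 3 * 3 Conv, as shown in the figure below:

Unlike ResNet, this architecture never combines features by summation before passing them to a layer; it combines them by concatenation. Consequently, the l-th layer (not counting the input layer) has l inputs, namely the feature maps extracted by all preceding layers. Each layer outputs k feature maps, that is, it uses k convolution kernels. In DenseNet, k is called the growth rate. It is a hyperparameter, and a relatively small k (such as 12) is usually enough to achieve good performance. If the input to the block has k0 channels, then the input to layer l has k0 + k(l - 1) channels. So even with a small k, the input to later layers in a DenseBlock becomes very large as the layer count grows. This is a consequence of feature reuse, however: only k feature maps are unique to each layer. Because of its dense connectivity, the researchers call the architecture the Dense Convolutional Network (DenseNet).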
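
To make the formula k0 + k(l - 1) concrete, here is a minimal sketch that lists the input channel count of each layer in a block. The helper name and the values k0 = 24 and k = 12 are illustrative assumptions, not settings taken from the paper.

```python
def dense_block_input_channels(k0: int, k: int, num_layers: int) -> list:
    """Input channel count of layers 1..num_layers: k0 + k * (l - 1)."""
    return [k0 + k * (l - 1) for l in range(1, num_layers + 1)]


# Illustrative values only: k0 = 24 input channels, growth rate k = 12
channels = dense_block_input_channels(k0=24, k=12, num_layers=6)
print(channels)  # [24, 36, 48, 60, 72, 84]
```

The input widens by exactly k per layer, while each layer itself still contributes only k new feature maps.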

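Because there is no need to relearn redundant feature maps, this dense connectivity pattern requires fewer parameters than traditional convolutional networks. A traditional feed-forward architecture can be viewed as an algorithm with a state that is passed from layer to layer. Each layer reads the state from its preceding layer and writes to the subsequent layer. It changes the state but also passes on information that needs to be preserved. The proposed DenseNet architecture explicitly differentiates between information that is added to the network and information that is preserved. DenseNet layers are very narrow (for example, 12 filters per layer), adding only a small set of feature maps to the "collective knowledge" of the network and keeping the remaining feature maps unchanged, and the final classifier makes its decision based on all feature maps in the network.

Besides having fewer parameters, another advantage of DenseNets is improved flow of information and gradients throughout the network, which makes them easy to train. Each layer has direct access to the gradients from the loss function and to the original input signal, which results in implicit deep supervision. This makes training deep networks easier. In addition, the researchers observed that dense connections have a regularizing effect, which reduces overfitting on tasks with smaller training sets.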

2.4 DenseNet-B

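First, the DenseNet-B network structure:

Dense Block module: BN + ReLU + Conv(1 * 1)(filter_num: 4k) + dropout + BN + ReLU + Conv(3 * 3) + dropout

Transition layer module: BN + ReLU + Conv(1 * 1)(filter_num: m) + dropout + Pooling(2 * 2)

Does dense connectivity introduce redundancy? No. At first glance the term "dense connectivity" suggests a huge increase in parameters and computation. In practice, however, DenseNet is more efficient than other networks; the key lies in the reduced computation per layer and in feature reuse. DenseNet lets the input of layer l directly influence all subsequent layers. The output of layer l is x_l = H_l([x_0, x_1, ..., x_{l-1}]), where [x_0, x_1, ..., x_{l-1}] denotes the preceding feature maps concatenated along the channel dimension. Because each layer has access to the outputs of all preceding layers, it needs to produce only a few feature maps itself, which is why DenseNet has far fewer parameters than other models. Dense connections effectively link every layer directly to both the input and the loss, which mitigates vanishing gradients, so making the network deeper is not a problem. Again, dense connectivity exists only within a Dense Block; there is none between different Dense Blocks, as the figure below shows:

There is no free lunch, and networks are no exception. Better convergence at the same depth naturally comes at a cost, and one of those costs is extremely heavy memory usage.

How large is the output of each DenseBlock?

Suppose each layer in an L-layer Dense Block outputs k feature maps, that is, the growth rate is k, and Bottleneck units are used. The input to layer L then has k0 + k(L - 1) channels (where k0 is the number of channels entering the block). The block's output concatenates the block input with the outputs of all L layers. Inside a Bottleneck unit the 1 * 1 convolution produces 4k intermediate feature maps, but each layer still contributes only k feature maps to the concatenation, so the Dense Block outputs k0 + k * L channels. As the Dense Block gets deeper, that is, as L increases, the number of output feature maps becomes large. To address this, the transition layer includes a 1 * 1 convolution that reduces the dimensionality.

To reduce the computation inside the Dense Block, Bottleneck units are added, as shown in the figure below: BN + ReLU + 1 * 1 Conv + BN + ReLU + 3 * 3 Conv. This is called the DenseNet-B structure. The 1 * 1 Conv produces 4k feature maps (k is the growth rate); its role is to reduce the number of features and thereby improve computational efficiency.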

2.5 DenseNet-BC

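First, the DenseNet-BC network structure:

Dense Block module: BN + ReLU + Conv(1 * 1)(filter_num: 4k) + dropout + BN + ReLU + Conv(3 * 3) + dropout

Transition layer module: BN + ReLU + Conv(1 * 1)(filter_num: θm, where 0 < θ < 1; the paper uses θ = 0.5) + dropout + Pooling(2 * 2)

The Transition layer mainly connects two adjacent DenseBlocks and reduces the feature map size. It consists of a 1 * 1 convolution and 2 * 2 average pooling, with the structure BN + ReLU + 1 * 1 Conv + 2 * 2 AvgPooling. The Transition layer can also compress the model. If the DenseBlock feeding a Transition produces feature maps with m channels, the Transition layer can generate ⌊θm⌋ feature maps (through its convolution layer), where θ ∈ (0, 1] is the compression factor. When θ = 1, the number of feature maps is unchanged across the Transition layer, that is, there is no compression. When the compression factor is less than 1, the structure is called DenseNet-C; the paper uses θ = 0.5. A structure that combines DenseBlocks using Bottleneck layers with Transitions whose compression factor is less than 1 is called DenseNet-BC.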
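
The compression rule ⌊θm⌋ can be sketched as a tiny helper. The function name `transition_output_channels` and the channel counts below are assumptions chosen for illustration; only the formula and θ = 0.5 come from the text.

```python
import math


def transition_output_channels(m: int, theta: float = 0.5) -> int:
    """Number of feature maps after a transition layer: floor(theta * m)."""
    if not 0 < theta <= 1:
        raise ValueError("theta must lie in (0, 1]")
    return math.floor(theta * m)


print(transition_output_channels(256, theta=1.0))  # 256, no compression
print(transition_output_channels(256, theta=0.5))  # 128
print(transition_output_channels(171, theta=0.5))  # 85, rounded down from 85.5
```

In a Keras model this value would simply be passed as the filter count of the transition layer's 1 * 1 convolution.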

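3. Strengths and weaknesses of DenseNet

Reference: https://blog.csdn.net/comway_Li/article/details/82055229

The core idea of DenseNet is to establish connections between different layers, making full use of features and further alleviating the vanishing gradient problem, so that deepening the network is not a problem and training works very well. In addition, the bottleneck layers, the Transition layers, and a small growth rate make the network narrower and reduce the number of parameters, which effectively suppresses overfitting and also reduces computation. DenseNet has many advantages, and its edge over ResNet is quite clear.

3.1 The motivation behind the algorithm

At present, the main challenges for deep convolutional networks are:

1. Underfitting: in general, the more complex a model is, the greater its representational power and the less prone it is to underfitting. Deep networks are different: their representational power is sufficient, but the optimization algorithm fails to reach the global optimum (largely solved by ResNet).
2. Overfitting: generalization ability degrades.
3. Deployment in real systems: how to improve efficiency and reduce memory and energy consumption.

So how can this redundancy be eliminated to obtain a more compact structure and better generalization? From stochastic depth networks we know that dropping a large fraction of layers during training still gives good results, which shows there is a lot of redundancy: each layer does very little and learns only a little.

The goal, therefore, is to reduce unnecessary computation and improve generalization.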

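3.2 Advantages of the algorithm

Overall, DenseNet's advantages are mainly the following:

Resistance to overfitting. Thanks to dense connectivity, DenseNet improves gradient backpropagation, making the network easier to train. Because each layer has direct access to the final error signal, it gets implicit "deep supervision". DenseNet therefore resists overfitting very well and is especially suited to applications with relatively scarce training data.

Fewer parameters and more efficient computation. This is somewhat counterintuitive. Because DenseNet implements shortcut connections by concatenating features, it achieves feature reuse, and with a small growth rate the feature maps unique to each layer are few. To reach accuracy comparable to ResNet, DenseNet needs only about half the computation. Computational efficiency is in strong demand in practical deep learning applications.
Stronger generalization. Without data augmentation, ResNet's performance on CIFAR-100 drops considerably while DenseNet's drops only slightly, indicating that DenseNet generalizes better.

One caveat: with a careless implementation, DenseNet can consume a lot of GPU memory. An efficient implementation is shown in the figure below; for more details see the paper Memory-Efficient Implementation of DenseNets. Frameworks such as PyTorch can reportedly apply this optimization automatically.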

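3.3 How to improve a DenseNet model

The bottleneck layer (1 * 1 convolution) at the start of each layer is very useful for reducing parameters and computation.
Doubling the layer width (growth rate) after each down-sampling step, as VGG and ResNet do, can improve DenseNet's computational efficiency (FLOPS efficiency).
As with other networks, DenseNet's depth and width should be scaled in a balanced way, although each DenseNet layer is much narrower than in other models.
Designing each layer to be narrow reduces DenseNet's efficiency on GPUs but may improve its efficiency on CPUs.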

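3.4 Is DenseNet memory-hungry?

DenseNet can consume a great deal of memory during training. This problem is actually caused by a suboptimal implementation. Current deep learning frameworks do not support DenseNet's dense connections well, so we have to rely on repeated concatenation operations that join the outputs of earlier layers with the output of the current layer and pass the result to the next layer. In most frameworks (such as Torch and TensorFlow), every concatenation allocates new memory to store the concatenated features. As a result, a network with L layers consumes memory equivalent to a network with L(L+1)/2 layers (the output of layer i is stored in memory (L - i + 1) times).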
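
The quadratic figure follows from adding up the per-layer copy counts. The sketch below does that arithmetic under the stated assumption that layer i's output is stored (L - i + 1) times; the function name and L = 6 are made up for illustration.

```python
def stored_copies(num_layers: int) -> list:
    """Copies of each layer's output kept by naive concatenation:
    layer i (1-based) is stored (L - i + 1) times."""
    return [num_layers - i + 1 for i in range(1, num_layers + 1)]


L = 6
copies = stored_copies(L)
print(copies)       # [6, 5, 4, 3, 2, 1]
print(sum(copies))  # 21, which equals L * (L + 1) / 2
```

A shared buffer that stores each output once would need only L slots, which is the linear cost the memory-efficient implementation aims for.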

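The idea behind the fix is actually not difficult: pre-allocate a single buffer shared by all concatenation layers in the network, which reduces DenseNet's memory consumption from quadratic to linear. During backpropagation, the outputs of the corresponding convolution layers are copied into this buffer again to reconstruct each layer's input features and then compute the gradients. Of course, because the network contains Batch Normalization layers, the implementation has a few details that need attention.

The new implementation greatly reduces DenseNet's GPU memory consumption during training. For example, the 190-layer DenseNet in the paper originally almost filled four GPUs with 12 GB of memory each, while the optimized code needs only 9 GB and can train on a single card.

The other concern is memory consumption at inference (or test) time, which matters most when deploying deep learning models in real products, especially on mobile devices. Unlike training, inference in a typical neural network does not need to keep the output of every layer, so the memory used by earlier layers can be freed once the next layer's features have been computed. DenseNet, by contrast, must keep the outputs of all preceding layers throughout. However, since each DenseNet layer produces few feature maps, its inference memory footprint is no larger than that of other networks.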

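4. Summary and analysis of the DenseNet algorithm

4.1 Distinguishing the three DenseNet variants

The paper proposes three structures, DenseNet, DenseNet-B, and DenseNet-BC. They were covered above, but are listed separately here; the differences are as follows:

DenseNet

Dense Block module: BN + ReLU + Conv(3 * 3) + dropout
Transition layer module: BN + ReLU + Conv(1 * 1)(filter_num: m) + dropout + Pooling(2 * 2)

DenseNet-B

Dense Block module: BN + ReLU + Conv(1 * 1)(filter_num: 4k) + dropout + BN + ReLU + Conv(3 * 3) + dropout
Transition layer module: BN + ReLU + Conv(1 * 1)(filter_num: m) + dropout + Pooling(2 * 2)

DenseNet-BC

Dense Block module: BN + ReLU + Conv(1 * 1)(filter_num: 4k) + dropout + BN + ReLU + Conv(3 * 3) + dropout
Transition layer module: BN + ReLU + Conv(1 * 1)(filter_num: θm, where 0 < θ < 1; the paper uses θ = 0.5) + dropout + Pooling(2 * 2)

DenseNet-B adds Bottleneck layers to the original DenseNet, mainly by adding a 1 * 1 convolution to the Dense Block module so that the feature maps fed to each layer's 3 * 3 convolution are reduced to 4k channels, which greatly reduces computation.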

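4.2 Problem 1: pooling

A neural network typically increases the channel count and shrinks the feature maps from input to output, and the operation that shrinks feature maps is pooling. Feature maps before and after pooling differ in size, so concatenation does not work across a pooling step. The paper therefore divides a large network into several dense blocks joined by transition layers (one of whose roles is pooling).

The code for the transition block is as follows: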

from keras.layers import (Activation, AveragePooling2D, BatchNormalization,
                          Conv2D, Dropout, ZeroPadding2D)
from keras.regularizers import l2


def transition_block(input, nb_filter, dropout_rate=None, pooltype=1, weight_decay=1e-4):
    # BN -> ReLU -> 1x1 Conv (sets the channel count to nb_filter)
    x = BatchNormalization(axis=-1, epsilon=1.1e-5)(input)
    x = Activation('relu')(x)
    x = Conv2D(nb_filter, (1, 1), kernel_initializer='he_normal', padding='same', use_bias=False,
               kernel_regularizer=l2(weight_decay))(x)

    if dropout_rate:
        x = Dropout(dropout_rate)(x)

    # pooltype selects the down-sampling variant
    if pooltype == 2:
        x = AveragePooling2D((2, 2), strides=(2, 2))(x)
    elif pooltype == 1:
        x = ZeroPadding2D(padding=(0, 1))(x)
        x = AveragePooling2D((2, 2), strides=(2, 1))(x)
    elif pooltype == 3:
        x = AveragePooling2D((2, 2), strides=(2, 1))(x)
    return x, nb_filter

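4.3 Problem 2: exponentially growing channel counts

Looking at the DenseNet formula, one immediately wonders how fast the channel count grows.

Let the input have c0 = k0 channels, and suppose convolution does not change the channel count. Then the first layer has c1 = c0 + k0 = 2k0 channels, the second layer has c2 = c0 + c1 + k0 = k0 + 2k0 + k0 = 4k0 channels, and so on. It is easy to show that c_n = 2^n * k0. Channel counts that grow exponentially like this are unacceptable: too many channels greatly increase the parameter count and slow the network down.

The paper therefore first imposes a constraint: the output of the function H_l() does not have c_n = 2^n * k0 channels but a fixed number k (the paper uses a dedicated term, growth rate, for this parameter). Every layer thus outputs a fixed number of channels, while its input [x_0, x_1, ..., x_{l-1}] has k0 + k(l - 1) channels. This difference in channel counts shows that H_l() involves dimensionality reduction.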
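
The gap between the two growth patterns is easy to see numerically. The following sketch is an illustration under assumed values (k0 = 16, k = 12) with made-up helper names: it builds the unconstrained recurrence c_n = c_0 + ... + c_{n-1} + k0 step by step and compares it with the linear input width k0 + k(l - 1).

```python
def unconstrained_channels(k0: int, n: int) -> int:
    """Channels at layer n if each layer outputs as many channels as it
    receives: c_n = c_0 + c_1 + ... + c_{n-1} + k0, with c_0 = k0."""
    c = [k0]
    for _ in range(n):
        c.append(sum(c) + k0)
    return c[n]


def growth_rate_input_channels(k0: int, k: int, l: int) -> int:
    """Input channels of layer l when every layer outputs exactly k maps."""
    return k0 + k * (l - 1)


k0, k = 16, 12  # illustrative values, not taken from the paper
for n in (1, 2, 10):
    print(n, unconstrained_channels(k0, n), growth_rate_input_channels(k0, k, n))
```

With these values, layer 10 would see 16384 channels without the constraint but only 124 with a fixed growth rate.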

The growth rate yields a fixed channel count through convolution:

def conv_block(input,growth_rate,dropout_rate=None,weight_decay=1e-4):
    x = BatchNormalization(axis=-1,epsilon=1.1e-5)(input)
    x = Activation('relu')(x)
    x = Conv2D(growth_rate,(3,3),kernel_initializer='he_normal', padding = 'same')(x)
    if(dropout_rate):
        x = Dropout(dropout_rate)(x)
    return x

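The paper argues that the fixed channel count k represents the global state of the network, and that as the feature maps gradually shrink, the information becomes more concentrated and turns into higher-level features.

The paper also finds, however, that increasing L and k improves performance. Note that results with k = 8 are a few percentage points worse than with k = 32, but k = 32 makes the network's GPU memory usage very large.

A 1 * 1 convolution can also be used, so the overall layer becomes BN-ReLU-Conv(1 * 1)-BN-ReLU-Conv(3 * 3). The 1 * 1 convolution outputs 4k channels, reducing the channel count first (k is small after all: 32 is not large, and a real project might even use 8).

The code for dense_block: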

from keras.layers import concatenate


def dense_block(x, nb_layers, nb_filter, growth_rate, droput_rate=0.2, weight_decay=1e-4):
    for i in range(nb_layers):
        # Each conv_block sees the concatenation of all earlier outputs
        cb = conv_block(x, growth_rate, droput_rate, weight_decay)
        x = concatenate([x, cb], axis=-1)
        nb_filter += growth_rate
    return x, nb_filter

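4.4 Problem 3: too many parameters

DenseNet keeps stacking convolution layers, so parameter growth is significant. A convolution layer is therefore generally used to compress the input: The initial convolution layer comprises 2k convolutions of size 7 * 7 with stride 2.

The corresponding code is: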

_nb_filter = 64
# conv 64  5*5 s=2
 
x = Conv2D(_nb_filter ,(5,5),strides=(2,2),kernel_initializer='he_normal', padding='same',
    use_bias=False, kernel_regularizer=l2(_weight_decay))(input)

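To prevent dense blocks from growing the channel count linearly until later layers have too many channels, transition layers generally reduce the channel count: to further improve model compactness, we can reduce the number of feature-maps at transition layers.

The corresponding code is: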

# 64 +  8 * 8 = 128
x ,_nb_filter = dense_block(x,8,_nb_filter,8,None,_weight_decay)
#128
x,_nb_filter = transition_block(x,128,_dropout_rate,2,_weight_decay)
 
#128 + 8 * 8 = 192
x ,_nb_filter = dense_block(x,8,_nb_filter,8,None,_weight_decay)
#192->128
x,_nb_filter = transition_block(x,128,_dropout_rate,2,_weight_decay)
 
#128 + 8 * 8 = 192
x ,_nb_filter = dense_block(x,8,_nb_filter,8,None,_weight_decay)

As can be seen, both transition blocks output 128 channels. The second transition block actually receives 192 input channels, so it reduces the channel count.
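
The channel arithmetic in the comments above can be reproduced without building a model. This is a plain-Python sketch with a hypothetical helper, `trace_channels`; it assumes each dense block adds (number of layers * growth rate) channels and that each transition block sets the channel count to a fixed value.

```python
def trace_channels(initial, growth_rate, layers_per_block, transition_out):
    """Follow the channel count through dense blocks and transition blocks.
    A transition follows every dense block except the last."""
    trace, channels = [], initial
    for idx, n_layers in enumerate(layers_per_block):
        channels += n_layers * growth_rate      # dense block: linear growth
        trace.append(("dense", channels))
        if idx < len(layers_per_block) - 1:
            channels = transition_out           # transition: 1 * 1 conv sets the width
            trace.append(("transition", channels))
    return trace


print(trace_channels(64, 8, [8, 8, 8], 128))
# [('dense', 128), ('transition', 128), ('dense', 192), ('transition', 128), ('dense', 192)]
```

The first transition leaves the width at 128, while the second one reduces 192 channels to 128.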

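4.5 Heat map analysis

From the start, DenseNet was designed so that a layer can use the feature maps of all preceding layers. To explore how features are reused, the authors ran an experiment. They trained a DenseNet with L = 40 and k = 12 and, for every convolution layer in each Dense Block, computed the average absolute weight that the layer assigns to the feature maps of each preceding layer. This average indicates how much the layer uses the features of that earlier layer. The figure below shows the heat map drawn from these averages:

From the figure we can draw the following conclusions:

Features extracted by very early layers can still be used directly by much deeper layers.
Even the Transition layers use features from all layers of the preceding DenseBlock.
Layers in the second and third DenseBlocks make little use of the output of the preceding Transition layer, which suggests that the transition layer outputs many redundant features. This supports DenseNet-BC, that is, the need for compression.
Although the final classification layer uses information from multiple layers of the preceding DenseBlock, it leans toward the last few feature maps, which suggests that some high-level features may be produced in the last few layers of the network.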

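5. Keras implementation

For training: on datasets without data augmentation, dropout with a rate of 0.2 is applied after each convolution layer.

One Keras implementation of DenseNet is shown below (for other versions, see my GitHub, linked above):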

'''
    this script is DenseNet model for Keras
    link: https://www.jianshu.com/p/274d050d517e
 
'''
from __future__ import print_function
from __future__ import absolute_import
from __future__ import division
 
from keras.models import Model
from keras.layers import (Activation, AveragePooling2D, BatchNormalization, Conv2D,
                          Dense, Dropout, GlobalAveragePooling2D, Input, MaxPooling2D)
from keras.layers import concatenate
from keras.regularizers import l2

import keras.backend as K
 
 
def conv_block(input_shape, nb_filter, bottleneck=False, dropout_rate=None, weight_decay=1e-4):
    '''
        Apply BatchNorm, Relu, 3*3 Conv2D, optional bottleneck block and dropout
        Args:
            input_shape: Input keras tensor
            nb_filter: number of filters
            bottleneck: add bottleneck block
            dropout_rate: dropout rate
            weight_decay: weight decay factor
        returns: keras tensor with batch_norm, relu and convolution2d added (optional bottleneck)
 
    '''
    # Channel (feature) axis: both concatenation and BN operate along it
    concat_axis = 1 if K.image_data_format() == 'channels_first' else -1
 
    x = BatchNormalization(axis=concat_axis, epsilon=1.1e-5)(input_shape)
    x = Activation('relu')(x)
 
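    # bottleneck: whether to use a bottleneck layer, i.e. a 1*1 convolution that compresses the channel count
    if bottleneck:
        inter_channel = nb_filter * 4
        # He normal initialization: zero-mean normal distribution with stddev sqrt(2 / fan_in), where fan_in is the fan-in of the weight tensor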
        x = Conv2D(inter_channel, (1, 1), kernel_initializer='he_normal', padding='same',
            use_bias=False, kernel_regularizer=l2(weight_decay))(x)
        x = BatchNormalization(axis=concat_axis, epsilon=1.1e-5)(x)
        x = Activation('relu')(x)
 
    x = Conv2D(nb_filter, (3, 3), kernel_initializer='he_normal', padding='same',
        use_bias=False)(x)
 
    if dropout_rate:
        x = Dropout(dropout_rate)(x)
 
    return x
 
 
def transition_block(input_shape, nb_filter, compression=1.0, weight_decay=1e-4, is_max=False):
    '''
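        A transition layer connects two dense blocks; no transition layer is needed after the last dense block.
        According to the paper, a transition layer has four parts:
            BatchNormalization, ReLU, 1*1 Conv, 2*2 average pooling
        Apply BatchNorm, ReLU, Conv2D, optional compression and pooling (average by default, max if is_max)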
        Args:
            input_shape: keras tensor
            nb_filter: number of filters
            compression: calculated as 1 - reduction; reduces the number of feature maps in the transition block
                    (the compression rate scales the channel count)
            weight_decay: weight decay factor
            is_max: use max pooling instead of average pooling
        return :
            keras tensor, after applying batch_norm, relu, conv and pooling
    '''
 
    # Channel (feature) axis: both concatenation and BN operate along it
    concat_axis = 1 if K.image_data_format() == 'channels_first' else -1
 
    x = BatchNormalization(axis=concat_axis, epsilon=1.1e-5)(input_shape)
    x = Activation('relu')(x)
    x = Conv2D(int(nb_filter * compression), (1, 1), kernel_initializer='he_normal',
        padding='same', use_bias=False, kernel_regularizer=l2(weight_decay))(x)
 
    # The paper uses average pooling for down-sampling; max pooling may preserve edges better, so it is offered as an option
    if is_max:
        x = MaxPooling2D((2, 2), strides=(2, 2))(x)
    else:
        x = AveragePooling2D((2, 2), strides=(2, 2))(x)
 
    return x
 
 
def dense_block(input_shape, nb_layers, nb_filter, growth_rate, bottleneck=False, dropout_rate=None,
    weight_decay=1e-4, grow_nb_filters=True, return_concat_list=False):
    '''
        Build a dense_block where the output of each conv_block is fed to subsequent ones.
        The loop below implements the dense connectivity of a Dense Block.

        Args:
            input_shape: keras tensor
            nb_layers: the number of layers of conv_block to append to the model
            nb_filter: number of filters
            growth_rate: growth rate
            bottleneck: add bottleneck blocks
            dropout_rate: dropout rate
            weight_decay: weight decay factor
            grow_nb_filters: flag to decide whether to allow the number of filters to grow
            return_concat_list: return the list of feature maps along with the actual output

        returns:
            keras tensor with nb_layers of conv_block appended

        x = concatenate([x, cb], axis=concat_axis) makes x carry the global state through every iteration:
        the first iteration takes x and produces cb1, the second takes [x, cb1] and produces cb2,
        the third takes [x, cb1, cb2] and produces cb3, and so on. growth_rate is the number of
        convolution kernels used by each conv_block, i.e. the number of channels it outputs.
    '''
    concat_axis = 1 if K.image_data_format() == 'channels_first' else -1

    x = input_shape
    x_list = [x]

    for i in range(nb_layers):
        # Feed the running concatenation, not the original block input
        cb = conv_block(x, growth_rate, bottleneck, dropout_rate, weight_decay)
        x_list.append(cb)

        x = concatenate([x, cb], axis=concat_axis)

        if grow_nb_filters:
            nb_filter += growth_rate

    if return_concat_list:
        return x, nb_filter, x_list
    else:
        return x, nb_filter
 
 
def DenseNet_model(input_shape, classes, depth=40, nb_dense_block=3, growth_rate=12, include_top=True,
        nb_filter=-1, nb_layers_per_block=[6, 12, 32, 32], bottleneck=False, reduction=0.0, dropout_rate=None,
        weight_decay=1e-4, subsample_initial_block=False, activation='softmax'):
    '''
        Build the DenseNet model
 
        Args:
            classes: number of classes
            input_shape: tuple of shape (channels, rows, columns) or (rows, columns, channels)
            include_top: flag to include the final Dense layer
            depth: number of layers
            nb_dense_block: number of dense blocks to add to end (generally = 3)
            growth_rate: number of filters to add per dense block
            nb_filter: initial number of filters. Default -1 indicates initial number of filters is 2 * growth_rate
            nb_layers_per_block: number of layers in each dense block.
                    Can be a -1, positive integer or a list.
                    If -1, calculates nb_layer_per_block from the depth of the network.
                    If positive integer, a set number of layers per dense block.
                    If list, nb_layer is used as provided. Note that list size must
                    be (nb_dense_block + 1)
            bottleneck: add bottleneck blocks
            reduction: reduction factor of transition blocks. Note : reduction value is inverted to compute compression
            dropout_rate: dropout rate
            weight_decay: weight decay rate
            subsample_initial_block: Set to True to subsample the initial convolution and
                    add a MaxPool2D before the dense blocks are added.
            subsample_initial:
            activation: Type of activation at the top layer. Can be one of 'softmax' or 'sigmoid'.
                    Note that if sigmoid is used, classes must be 1.
        Returns: keras tensor with nb_layers of conv_block appended
    '''
 
    concat_axis = 1 if K.image_data_format() == 'channels_first' else -1
 
    if type(nb_layers_per_block) is not list:
        print('nb_layers_per_block should be a list !!!')
        return 0
 
    final_nb_layer = nb_layers_per_block[-1]
    nb_layers = nb_layers_per_block[:-1]
 
    # compute initial nb_filter if -1 else accept users initial nb_filter
    if nb_filter <= 0:
        nb_filter = 2 * growth_rate
 
    # compute compression factor
    compression = 1.0 - reduction
 
    # initial convolution
    if subsample_initial_block:
        initial_kernel = (7, 7)
        initial_strides = (2, 2)
    else:
        initial_kernel = (3, 3)
        initial_strides = (1, 1)
 
    Inp = Input(shape=input_shape)
    x =Conv2D(nb_filter, initial_kernel, kernel_initializer='he_normal', padding='same',
        strides=initial_strides, use_bias=False, kernel_regularizer=l2(weight_decay))(Inp)
 
    if subsample_initial_block:
        x = BatchNormalization(axis=concat_axis, epsilon=1.1e-5)(x)
        x = Activation('relu')(x)
        x = MaxPooling2D((3, 3), strides=(2, 2), padding='same')(x)
 
    # add dense blocks
    for block_index in range(nb_dense_block-1):
        x, nb_filter = dense_block(x, nb_layers[block_index], nb_filter, growth_rate,
            bottleneck=bottleneck, dropout_rate=dropout_rate, weight_decay=weight_decay)
 
        # add transition block
        x = transition_block(x, nb_filter, compression=compression, weight_decay=weight_decay)
        nb_filter = int(nb_filter * compression)
 
    # the last dense block does not have a transition_block
    x, nb_filter = dense_block(x, final_nb_layer, nb_filter, growth_rate, bottleneck=bottleneck,
        dropout_rate=dropout_rate, weight_decay=weight_decay)
 
    x = BatchNormalization(axis=concat_axis, epsilon=1.1e-5)(x)
    x = Activation('relu')(x)
    x = GlobalAveragePooling2D()(x)
 
    if include_top:
        x = Dense(classes, activation=activation)(x)
 
    model = Model(inputs=Inp, outputs=x)
    model.summary()
 
    return model
 
 
 
if __name__ == '__main__':
    DenseNet_model(input_shape=(227, 227, 3), classes=1000, bottleneck=True,
        reduction=0.5)

Keras implementations: https://github.com/titu1994/DenseNet
https://github.com/flyyufelix/DenseNet-Keras
Implementation walkthrough: https://blog.csdn.net/shi2xian2wei2/article/details/84425777

Author: 战争热诚
Source: 博客园 (cnblogs)