Following the official Theano tutorial and my own experiments, this post tries to implement a CNN with Theano. The outline is as follows.

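1. Introduction to convolutional neural networks
    1.1 Origins of convolutional neural networks
    1.2 Local connectivity and weight sharing
        1.2.1 Local connectivity
        1.2.2 Weight sharing
        1.2.3 Details
2. Implementing a convolutional neural network
    2.1 The convolution operation
    2.2 Max pooling
3. Building a complete convolutional layer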

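Introduction to convolutional neural networks

Origins of convolutional neural networks (CNN)

Convolutional neural networks (CNNs) grew out of biological research and are a refinement of the multilayer perceptron. According to [1], the idea comes from Hubel and Wiesel's study of the cat's visual system. In short, they found two kinds of neurons that extract patterns from images: simple cells respond to edge-like patterns, while complex cells have larger receptive fields and respond to a pattern regardless of its exact position within a local region. (The original reads: "Complex cells have larger receptive fields and are locally invariant to the exact position of the pattern." I do not fully understand this part.)

At the network level, a traditional multilayer neural network (that is, a multilayer perceptron) is fully connected, so with image input it ends up with an enormous number of parameters. Applying the biological insight above to networks for image processing reduces the parameter count effectively.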

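Local connectivity and weight sharing

Local connectivity

Local connectivity, as the name suggests, means that in contrast to full connectivity each neuron in the next layer is connected to only a subset of the elements of the input matrix. The figure below gives an intuitive picture.

As the figure shows, for a $1000\times1000$ image with $10^6$ neurons in the next layer, full connectivity requires $1000\times1000\times10^6=10^{12}$ parameters. If we instead use the local connectivity shown on the right of the figure, where each neuron connects to only $10\times10$ pixels, the count drops to $10\times10\times10^6=10^8$.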

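Weight sharing

Weight sharing cuts the parameters further by telling the neurons in the next layer: stop keeping a separate set of parameters each, and share one! Because all $10^6$ hidden neurons share a single set of parameters, the total drops to a startling $10\times10=100$. Exciting, right? Can this really work? Recall the previous subsection: the biological findings give the idea a real-world basis, and later experiments confirmed that it is effective.

Of course, having that many neurons share one single set of parameters seems too stingy, so we can use several sets of parameters, the filters in the figure, to extract more features.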
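The three parameter counts above can be checked with a few lines of arithmetic. This is only a sketch: the function name `count_parameters` and its `n_filters` argument are made up for illustration, and biases are ignored, as they are in the text.

```python
# Parameter counts for the 1000x1000 example in the text (biases ignored).
image_pixels = 1000 * 1000   # number of input pixels
hidden_units = 10 ** 6       # neurons in the next layer
patch_pixels = 10 * 10       # pixels each neuron sees under local connectivity


def count_parameters(n_filters=1):
    """Return the weight count for each connection scheme."""
    return {
        "full": image_pixels * hidden_units,    # every pixel to every neuron
        "local": patch_pixels * hidden_units,   # one private 10x10 patch per neuron
        "shared": patch_pixels * n_filters,     # one 10x10 patch per filter
    }


counts = count_parameters()
print(counts["full"])    # 1000000000000
print(counts["local"])   # 100000000
print(counts["shared"])  # 100
```

With, say, 100 filters the shared count is still only `count_parameters(100)["shared"]`, that is 10000 weights.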

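Details

Now let us express weight sharing and local connectivity formally. Local connectivity simply extracts local information. Every hidden neuron performs this same operation with the same shared parameters, so the operation can be abstracted out, and it is one we have long been familiar with: convolution!

We define $h^k$ as the hidden layer associated with the $k$-th filter, with parameters $W^k$ and $b_k$ (one is written as a superscript and the other as a subscript for a reason). Then $h^k$ is given by:

$$h^k_{ij} = \tanh((W^k * x)_{ij} + b_k)$$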

As a reminder, here is the definition of convolution:

Recall the following definition of convolution for a 1D signal:

$$o[n] = f[n]*g[n] = \sum_{u=-\infty}^{\infty} f[u] g[n-u] = \sum_{u=-\infty}^{\infty} f[n-u] g[u]$$

This can be extended to 2D as follows:

$$o[m,n] = f[m,n]*g[m,n] = \sum_{u=-\infty}^{\infty} \sum_{v=-\infty}^{\infty} f[u,v] g[m-u,n-v]$$

With k filters there are effectively "k hidden layers", one feature map per filter.
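To make the 2D definition concrete, here is a minimal NumPy sketch of it for finite arrays. It assumes "valid" output, meaning only positions where the kernel lies fully inside the image are kept. The name `conv2d_valid` is made up for this illustration and is not a Theano function.

```python
import numpy as np


def conv2d_valid(image, kernel):
    """Direct 2D convolution of a 2D image with a 2D kernel ('valid' positions only)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    # The g[m-u, n-v] term in the definition means the kernel is flipped
    # along both axes before it is slid over the image.
    flipped = kernel[::-1, ::-1]
    out = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            out[m, n] = np.sum(image[m:m + kh, n:n + kw] * flipped)
    return out


image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0, 0.0],
                   [0.0, -1.0]])
result = conv2d_valid(image, kernel)
print(result.shape)  # (3, 3)
```

Without the flip the loop would compute cross-correlation instead. For a learned filter the difference only changes which orientation of the weights is stored.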

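We believe CNNs succeed because they extract image features according to certain "principles", and images themselves support such features. But why do CNNs also work in natural language processing? Some say this reflects what neural networks have in common (RNNs are also beginning to shine on images), but I do not fully understand it.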

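Implementing a convolutional neural network

The convolution operation

Theano already implements a 2D convolution operator: theano.tensor.nnet.conv2d. It is called conv2d because an image is a two-dimensional matrix. The operator takes two inputs:

4D tensor for the input images: [mini-batch size, number of input feature maps, image height, image width]
4D tensor for the weights: [number of feature maps at layer m, number of feature maps at layer m-1, filter height, filter width]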
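Given these two 4D layouts, the shape of the convolution output can be worked out by hand. The sketch below assumes the 'valid' border mode, in which the filter must fit entirely inside the image. The helper `conv_output_shape` is a made-up name for illustration, not part of Theano.

```python
def conv_output_shape(image_shape, filter_shape):
    """Output shape of a 'valid' 2D convolution for the two 4D layouts above."""
    batch, in_maps, height, width = image_shape
    out_maps, filter_in_maps, filter_h, filter_w = filter_shape
    if in_maps != filter_in_maps:
        raise ValueError("image and filters disagree on the number of input feature maps")
    return (batch, out_maps, height - filter_h + 1, width - filter_w + 1)


# One RGB image of 120x160 pixels, two 9x9 filters (illustrative values).
print(conv_output_shape((1, 3, 120, 160), (2, 3, 9, 9)))  # (1, 2, 112, 152)
```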

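The code below implements a convolution similar to the one in Figure 4. The input has three feature maps, one for each RGB channel, and the layer produces two output feature maps (from here on I say feature maps rather than filters). The image size is 120x160 and the kernel size is 9x9.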

import theano
from theano import tensor as T
from theano.tensor.nnet import conv2d
import numpy
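rng = numpy.random.RandomState(12580) # fixed seed for reproducibility
# instantiate the 4D input tensor
input = T.tensor4(name='input')
# initialize the weights as a Theano shared variable
w_shp = (2, 3, 9, 9) # weight shape: feature maps at layer m, feature maps at layer m-1, filter height, filter width
w_bound = numpy.sqrt(3 * 9 * 9) # range used for weight initialization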
W = theano.shared( numpy.array(
        rng.uniform(
            low = -1.0 / w_bound, 
            high = 1.0 / w_bound, 
            size = w_shp), 
        dtype=input.dtype), name='W')
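# initialize the bias, a 1D tensor
# Note: biases are usually initialized to zero, but here we draw them
# at random to simulate a "learned" effect.
b_shp = (2, )
b = theano.shared(numpy.asarray(
            rng.uniform(low=-.5, high=.5, size=b_shp),
            dtype=input.dtype), name='b')
# build the symbolic convolution of the input with the filters W
conv_out = conv2d(input, W)
# compute the final output passed on to layer m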
#   ``dimshuffle`` is a powerful tool in reshaping a tensor;
#   what it allows you to do is to shuffle dimension around
#   but also to insert new ones along which the tensor will be
#   broadcastable;
#   dimshuffle('x', 2, 'x', 0, 1)
#   This will work on 3d tensors with no broadcastable
#   dimensions. The first dimension will be broadcastable,
#   then we will have the third dimension of the input tensor as
#   the second of the resulting tensor, etc. If the tensor has
#   shape (20, 30, 40), the resulting tensor will have dimensions
#   (1, 40, 1, 20, 30). (AxBxC tensor is mapped to 1xCx1xAxB tensor)
#   More examples:
#    dimshuffle('x') -> make a 0d (scalar) into a 1d vector
#    dimshuffle(0, 1) -> identity
#    dimshuffle(1, 0) -> inverts the first and second dimensions
#    dimshuffle('x', 0) -> make a row out of a 1d vector (N to 1xN)
#    dimshuffle(0, 'x') -> make a column out of a 1d vector (N to Nx1)
#    dimshuffle(2, 0, 1) -> AxBxC to CxAxB
#    dimshuffle(0, 'x', 1) -> AxB to Ax1xB
#    dimshuffle(1, 'x', 0) -> AxB to Bx1xA
output = T.nnet.sigmoid(conv_out + b.dimshuffle('x', 0, 'x', 'x'))
f = theano.function([input], output)
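The `b.dimshuffle('x', 0, 'x', 'x')` call is the least obvious step in the code above. As a rough NumPy analogy (my own sketch, not Theano code), it corresponds to reshaping the bias vector to shape (1, n_filters, 1, 1) so that it broadcasts over the batch, height and width axes. The array sizes below are arbitrary.

```python
import numpy as np

conv_out = np.zeros((1, 2, 4, 5))   # (batch, feature maps, height, width)
b = np.array([0.5, -0.5])           # one bias per feature map

# NumPy counterpart of b.dimshuffle('x', 0, 'x', 'x'):
# insert length-1 axes for batch, height and width.
b_4d = b.reshape(1, -1, 1, 1)
shifted = conv_out + b_4d           # broadcasting adds b[k] to every pixel of map k

print(b_4d.shape)     # (1, 2, 1, 1)
print(shifted.shape)  # (1, 2, 4, 5)
```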

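Max pooling (MaxPooling)

Another important concept in CNNs is max pooling, a form of non-linear down-sampling: it partitions the image into non-overlapping regions and takes the maximum value of each region. Max pooling is useful in computer vision for the following two reasons:

It reduces the amount of computation.
It provides a degree of translation invariance, making the features more robust to position.

theano.tensor.signal.downsample.max_pool_2d implements a MaxPooling layer. It takes an N-dimensional tensor and a downscaling factor, given as two numbers for the height and the width. Let's look at an example: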

from __future__ import print_function  # print() then behaves the same on Python 2 and 3
import theano, numpy
import theano.tensor as T
from theano.tensor.signal import downsample
input = T.dtensor4('input')
maxpool_shape = (2, 2)
# ignore_border=True drops the incomplete windows at the right and bottom edges
pool_out = downsample.max_pool_2d(input, maxpool_shape, ignore_border=True)
f = theano.function([input],pool_out)
invals = numpy.random.RandomState(1).rand(3, 2, 5, 5)
print('With ignore_border set to True:')
print('invals[0, 0, :, :] =')
print(invals[0, 0, :, :])
print('output[0, 0, :, :] =')
print(f(invals)[0, 0, :, :])
# ignore_border=False also pools over the incomplete edge windows
pool_out = downsample.max_pool_2d(input, maxpool_shape, ignore_border=False)
f = theano.function([input],pool_out)
print('With ignore_border set to False:')
print('invals[1, 0, :, :] =\n ', invals[1, 0, :, :])
print('output[1, 0, :, :] =\n ', f(invals)[1, 0, :, :])

The output is:

With ignore_border set to True:
invals[0, 0, :, :] =
[[  4.17022005e-01   7.20324493e-01   1.14374817e-04   3.02332573e-01
    1.46755891e-01]
 [  9.23385948e-02   1.86260211e-01   3.45560727e-01   3.96767474e-01
    5.38816734e-01]
 [  4.19194514e-01   6.85219500e-01   2.04452250e-01   8.78117436e-01
    2.73875932e-02]
 [  6.70467510e-01   4.17304802e-01   5.58689828e-01   1.40386939e-01
    1.98101489e-01]
 [  8.00744569e-01   9.68261576e-01   3.13424178e-01   6.92322616e-01
    8.76389152e-01]]
output[0, 0, :, :] =
[[ 0.72032449  0.39676747]
 [ 0.6852195   0.87811744]]
With ignore_border set to False:
invals[1, 0, :, :] =
  [[ 0.01936696  0.67883553  0.21162812  0.26554666  0.49157316]
 [ 0.05336255  0.57411761  0.14672857  0.58930554  0.69975836]
 [ 0.10233443  0.41405599  0.69440016  0.41417927  0.04995346]
 [ 0.53589641  0.66379465  0.51488911  0.94459476  0.58655504]
 [ 0.90340192  0.1374747   0.13927635  0.80739129  0.39767684]]
output[1, 0, :, :] =
  [[ 0.67883553  0.58930554  0.69975836]
 [ 0.66379465  0.94459476  0.58655504]
 [ 0.90340192  0.80739129  0.39767684]]

Note that running this produces a warning saying that the downsample module has been moved to theano.tensor.signal.pool.
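To see what the two `ignore_border` settings do, here is a plain NumPy sketch of non-overlapping max pooling over the last two axes. It is my own illustration of the behaviour shown in the output above, not Theano's implementation, and it reuses the name `max_pool_2d` only for readability.

```python
import numpy as np


def max_pool_2d(x, pool=(2, 2), ignore_border=True):
    """Max-pool the last two axes of x over non-overlapping windows of size pool."""
    ph, pw = pool
    h, w = x.shape[-2:]
    if ignore_border:
        out_h, out_w = h // ph, w // pw               # drop incomplete edge windows
    else:
        out_h, out_w = -(-h // ph), -(-w // pw)       # ceiling division keeps them
    out = np.empty(x.shape[:-2] + (out_h, out_w), dtype=x.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Slices past the edge are truncated, which yields the partial windows.
            window = x[..., i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
            out[..., i, j] = window.max(axis=(-2, -1))
    return out


x = np.arange(25).reshape(1, 1, 5, 5)
print(max_pool_2d(x, ignore_border=True)[0, 0])
# [[ 6  8]
#  [16 18]]
print(max_pool_2d(x, ignore_border=False)[0, 0].shape)  # (3, 3)
```

A 5x5 input with a 2x2 pool therefore gives a 2x2 output when the border is ignored and a 3x3 output when it is kept, matching the shapes printed by the Theano example.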

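Building a complete convolutional layer

Now that the components needed for a convolutional layer have been introduced, we can build one. Straight to the code! The model follows LeNet (which performed well in earlier competitions and is now a fairly standard CNN architecture for image recognition).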

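# imports repeated from the earlier snippets so this block runs on its own
import numpy
import theano
import theano.tensor as T
from theano.tensor.nnet import conv2d
from theano.tensor.signal import downsample

class LeNetConvPoolLayer(object):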
    """Pool Layer of a convolutional network """
    def __init__(self, rng, input, filter_shape, image_shape, poolsize=(2, 2)):
        """
        Allocate a LeNetConvPoolLayer with shared variable internal parameters.
        :type rng: numpy.random.RandomState
        :param rng: a random number generator used to initialize weights
        :type input: theano.tensor.dtensor4 (must match image_shape)
        :param input: symbolic image tensor, of shape image_shape
        :type filter_shape: tuple or list of length 4 (layout described above)
        :param filter_shape: (number of filters, num input feature maps,
                              filter height, filter width)
        :type image_shape: tuple or list of length 4 (layout described above)
        :param image_shape: (batch size, num input feature maps,
                             image height, image width)
        :type poolsize: tuple or list of length 2
        :param poolsize: the downsampling (pooling) factor (#rows, #cols)
        """
        assert image_shape[1] == filter_shape[1]
        self.input = input
        # each hidden unit receives this many inputs:
        # "num input feature maps * filter height * filter width"
        fan_in = numpy.prod(filter_shape[1:])
        # each unit in the lower layer (the input image is at the bottom)
        # receives a gradient from:
        # "num output feature maps * filter height * filter width" /
        #   pooling size
        fan_out = (filter_shape[0] * numpy.prod(filter_shape[2:]) //
                   numpy.prod(poolsize))
        # initialize the weights uniformly in [-W_bound, W_bound]
        W_bound = numpy.sqrt(6. / (fan_in + fan_out))
        self.W = theano.shared(
            numpy.asarray(
                rng.uniform(low=-W_bound, high=W_bound, size=filter_shape),
                dtype=theano.config.floatX
            ),
            borrow=True
        )
        # the bias is a 1D tensor with one entry per output feature map
        b_values = numpy.zeros((filter_shape[0],), dtype=theano.config.floatX)
        self.b = theano.shared(value=b_values, borrow=True)
        # convolve the input feature maps with the filters
        conv_out = conv2d(
            input=input,
            filters=self.W,
            filter_shape=filter_shape,
            input_shape=image_shape
        )
        # downsample each feature map with max pooling
        pooled_out = downsample.max_pool_2d(
            input=conv_out,
            ds=poolsize,
            ignore_border=True
        )
        # (I do not fully understand this step yet.)
        # add the bias term. Since the bias is a vector (1D array), we first
        # reshape it to a tensor of shape (1, n_filters, 1, 1). Each bias will
        # thus be broadcasted across mini-batches and feature map
        # width & height
        self.output = T.tanh(pooled_out + self.b.dimshuffle('x', 0, 'x', 'x'))
        # store parameters of this layer
        self.params = [self.W, self.b]
        # keep track of model input
        self.input = input
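The weight initialization in the constructor can be tried out without Theano. The sketch below repeats the same fan_in / fan_out arithmetic in NumPy. The helper name `init_bound`, the filter shape (20, 1, 5, 5) and the seed are illustrative assumptions, not values taken from the code above.

```python
import numpy as np


def init_bound(filter_shape, poolsize=(2, 2)):
    """Return (fan_in, fan_out, W_bound) using the same formulas as the layer above."""
    # inputs per hidden unit: input feature maps * filter height * filter width
    fan_in = int(np.prod(filter_shape[1:]))
    # output feature maps * filter height * filter width / pooling size
    fan_out = int(filter_shape[0] * np.prod(filter_shape[2:]) // np.prod(poolsize))
    w_bound = np.sqrt(6.0 / (fan_in + fan_out))
    return fan_in, fan_out, w_bound


filter_shape = (20, 1, 5, 5)   # hypothetical: 20 filters of 5x5 on a 1-channel input
fan_in, fan_out, w_bound = init_bound(filter_shape)
print(fan_in, fan_out)         # 25 125

rng = np.random.RandomState(0)  # arbitrary seed
W = rng.uniform(low=-w_bound, high=w_bound, size=filter_shape)
print(W.shape)                 # (20, 1, 5, 5)
```

For this shape the bound is sqrt(6 / 150), about 0.2, so every initial weight lies in roughly [-0.2, 0.2].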

References

[1]: Deeplearning Tutorial: Convolutional Neural Networks (LeNet)
[2]: Jey Zhang: 卷积神经网络(CNN)学习笔记1：基础入门