
We use Tucker-2 decomposition on convolutional layers to compress CNNs. We test this compression technique on two data sets: FashionMNIST and CIFAR-10.

Check out the code on GitHub.

Convolutional Kernel Tensor

A convolutional layer in a neural network is represented by a four-way convolutional kernel tensor \(\mathcal{K}\in\mathbb{R}^{T\times S\times D\times D}\) with stride \(\Delta\) and padding \(P\), where \(T\), \(S\), and \(D\times D\) are the number of output channels, the number of input channels, and the kernel size, respectively. Given an input tensor \(\mathcal{X}\in\mathbb{R}^{S\times H\times W}\), the output tensor \(\mathcal{Y}\in\mathbb{R}^{T\times H^\prime \times W^\prime}\) is computed element-wise by:

\[y_{th^\prime w^\prime }=\sum_{i=1}^D\sum_{j=1}^D\sum_{s=1}^{S} k_{tsij}x_{sh_i w_j}\]

where \(h_i=(h^\prime -1)\Delta + i-P\) and \(w_j=(w^\prime -1)\Delta + j-P\). We call stride and padding spatial operations because they change the spatial size of the input. We denote such a convolutional layer by \(\text{Conv2D }\mathcal{K}(T,S,D,D),\Delta,P\).
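The formula above can be sanity-checked with a naive NumPy loop. This is a toy sketch; the function name and shapes are ours, not from the repo:

```python
import numpy as np

def conv2d_naive(K, X, stride=1, pad=0):
    """Directly evaluate y_{t h' w'} = sum_{i,j,s} k_{tsij} x_{s h_i w_j}.

    K: kernel (T, S, D, D); X: input (S, H, W)."""
    T, S, D, _ = K.shape
    Xp = np.pad(X, ((0, 0), (pad, pad), (pad, pad)))  # zero padding P
    Ho = (Xp.shape[1] - D) // stride + 1              # H'
    Wo = (Xp.shape[2] - D) // stride + 1              # W'
    Y = np.zeros((T, Ho, Wo))
    for h in range(Ho):
        for w in range(Wo):
            # the patch covers h_i = (h' - 1) * stride + i - P for i = 1..D
            patch = Xp[:, h*stride:h*stride + D, w*stride:w*stride + D]
            Y[:, h, w] = np.einsum('tsij,sij->t', K, patch)
    return Y

# e.g. a 2x2 all-ones kernel on an all-ones 3x3 single-channel image
# sums each 2x2 window, giving a (1, 2, 2) output filled with fours
print(conv2d_naive(np.ones((1, 1, 2, 2)), np.ones((1, 3, 3))))
```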

Tucker-2 Decomposition on Convolutional Kernel Tensor

In general, we could perform Tucker decomposition on all four modes of the convolutional kernel tensor. However, there are two reasons not to:

  • The spatial size of the kernel is usually very small (common values are 3 or 5). As a result, there is little to gain from reducing the dimensions of mode-3 or mode-4.
  • If we decomposed mode-3 and mode-4, we would have to handle the spatial operations after decomposition, which would make the procedure much more complicated.

Instead, we perform Tucker-2 Decomposition on the convolutional kernel tensor. Tucker-2 Decomposition enforces all factor matrices except two to be identity matrices, which can be expressed as:

\[\mathcal{K}\approx [\![\mathcal{C};\mathbf{U},\mathbf{V},\mathbf{I},\mathbf{I}]\!],k_{tsij}\approx\sum_{r_1=1}^{R_1}\sum_{r_2=1}^{R_2}c_{r_1r_2ij}u_{tr_1}v_{sr_2}\]

where \(R_1,R_2\) are the ranks we assign to mode-1 and mode-2. Following Kim et al. (2015), we perform variational Bayesian matrix factorization (VBMF) on the unfoldings \(\mathcal{K}_{(1)}\) and \(\mathcal{K}_{(2)}\) to determine \(R_1,R_2\).
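A minimal Tucker-2 sketch can be written with plain NumPy via truncated HOSVD on modes 1 and 2. The ranks are assumed already chosen, so the VBMF rank-selection step is omitted, and the function name is ours:

```python
import numpy as np

def tucker2(K, R1, R2):
    """Tucker-2 of a (T, S, D, D) kernel via truncated HOSVD on modes 1 and 2.

    Ranks R1, R2 are assumed given (e.g. from VBMF, not shown here).
    Returns core C (R1, R2, D, D) and factors U (T, R1), V (S, R2)."""
    T, S = K.shape[:2]
    # leading left singular vectors of the mode-1 unfolding (T x S*D*D)
    U = np.linalg.svd(K.reshape(T, -1), full_matrices=False)[0][:, :R1]
    # leading left singular vectors of the mode-2 unfolding (S x T*D*D)
    K2 = np.moveaxis(K, 1, 0).reshape(S, -1)
    V = np.linalg.svd(K2, full_matrices=False)[0][:, :R2]
    # core: contract K with U^T along mode-1 and V^T along mode-2
    C = np.einsum('tsij,tr,sq->rqij', K, U, V)
    return C, U, V
```

With full ranks \(R_1=T,R_2=S\) the reconstruction \([\![\mathcal{C};\mathbf{U},\mathbf{V},\mathbf{I},\mathbf{I}]\!]\) is exact; truncating the ranks trades approximation error for a smaller core and factors.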

We have the following equations:

\[z_{r_2hw}=\sum_{s=1}^{S} v_{sr_2}x_{shw},\quad z^\prime_{r_1h^\prime w^\prime}=\sum_{i=1}^D\sum_{j=1}^D\sum_{r_2=1}^{R_2} c_{r_1r_2ij}z_{r_2h_iw_j},\quad y_{th^\prime w^\prime}=\sum_{r_1=1}^{R_1}u_{tr_1}z^\prime_{r_1h^\prime w^\prime}\]

which is equivalent to replacing the original convolutional layer with a sequence of three convolutional layers, where the spatial operations occur only in the second layer:

\[\text{Conv2D } \mathcal{K}(T,S,D,D),\Delta,P\approx \left\{\begin{aligned} &\text{Conv2D } \mathcal{V}(R_2,S,1,1),0,0\\ &\text{Conv2D } \mathcal{C}(R_1,R_2,D,D),\Delta,P\\ &\text{Conv2D } \mathcal{U}(T,R_1,1,1),0,0\\ \end{aligned}\right.\]

where \(\mathcal{V},\mathcal{U}\) are formed from \(\mathbf{V}^\top,\mathbf{U}\) by adding size-one dimensions in mode-3 and mode-4.
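The equivalence between the factorized three-stage pipeline and direct convolution with the reconstructed kernel can be checked numerically. This is a sketch with made-up toy sizes, using a naive reference convolution rather than the repo's layers:

```python
import numpy as np

def conv(K, X, stride=1, pad=0):
    """Naive reference Conv2D: K is (T, S, D, D), X is (S, H, W)."""
    T, S, D, _ = K.shape
    Xp = np.pad(X, ((0, 0), (pad, pad), (pad, pad)))
    Ho = (Xp.shape[1] - D) // stride + 1
    Wo = (Xp.shape[2] - D) // stride + 1
    Y = np.zeros((T, Ho, Wo))
    for h in range(Ho):
        for w in range(Wo):
            patch = Xp[:, h*stride:h*stride + D, w*stride:w*stride + D]
            Y[:, h, w] = np.einsum('tsij,sij->t', K, patch)
    return Y

# toy sizes (hypothetical, not from the experiments)
T, S, D, R1, R2 = 4, 3, 3, 2, 2
rng = np.random.default_rng(0)
C = rng.standard_normal((R1, R2, D, D))
U = rng.standard_normal((T, R1))
V = rng.standard_normal((S, R2))
X = rng.standard_normal((S, 8, 8))

# direct path: convolve with the reconstructed kernel [[C; U, V, I, I]]
K = np.einsum('rqij,tr,sq->tsij', C, U, V)
y_direct = conv(K, X, stride=1, pad=1)

# factorized path: 1x1 conv (V^T), DxD core conv (C), 1x1 conv (U);
# stride and padding appear only in the middle layer
z = np.einsum('sq,shw->qhw', V, X)       # S -> R2 channels
zp = conv(C, z, stride=1, pad=1)         # spatial operations
y = np.einsum('tr,rhw->thw', U, zp)      # R1 -> T channels

print(np.allclose(y_direct, y))          # prints True
```

The two paths agree up to floating-point error because the 1×1 channel mixing commutes with zero padding, so the stride/padding bookkeeping stays entirely in the middle convolution.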

We can show that if we compress a convolutional layer in this way, the compression ratio (number of parameters before divided by after) and the speed-up ratio (number of multiply-add operations before divided by after) are:

\[\begin{aligned} \text{compression ratio}&=\frac{D^2ST}{SR_2+TR_1+D^2R_1R_2}\\ \text{speed-up ratio}&=\frac{D^2STH^\prime W^\prime }{SR_2HW+TR_1H^\prime W^\prime +D^2R_1R_2H^\prime W^\prime } \end{aligned}\]
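Plugging in concrete numbers shows the scale of the gain. The layer sizes and ranks below are illustrative assumptions, not values from the experiments:

```python
# illustrative layer: T=384 output channels, S=256 input channels, D=3,
# 13x13 feature maps with stride 1 and padding 1 (so H = H' = 13),
# and assumed ranks R1 = R2 = 64
T, S, D = 384, 256, 3
H = W = Hp = Wp = 13
R1 = R2 = 64

compression_ratio = (D**2 * S * T) / (S*R2 + T*R1 + D**2 * R1 * R2)
speedup_ratio = (D**2 * S * T * Hp * Wp) / (
    S*R2*H*W + T*R1*Hp*Wp + D**2 * R1 * R2 * Hp * Wp)

print(f"compression: {compression_ratio:.1f}x, speed-up: {speedup_ratio:.1f}x")
```

With stride 1 and same-size padding, \(H=H^\prime\) and \(W=W^\prime\), so the two ratios coincide (about 11.4x here); for a strided layer the speed-up ratio differs from the compression ratio.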

Code Structure

  • trainer.py: code for training and testing
  • tucker_layer.py: implementation of compression of a convolutional layer
  • decomposition.py: implementation of Tucker decomposition algorithms (HOSVD and HOOI), built on tensorly
  • vbmf.py: Variational Bayesian matrix factorization based on VBMF
  • AlexNet-MNIST.ipynb: apply compression on AlexNet with FashionMNIST data set
  • AlexNet-CIFAR10.ipynb: apply compression on AlexNet with CIFAR-10 data set

Usage

Please follow the Jupyter notebooks. Because the code reads and writes files on disk and downloads/upgrades libraries, a cloud or virtual environment with a high-performance GPU is highly recommended.

References

  • Tamara G Kolda and Brett W Bader. 2009. Tensor decompositions and applications. SIAM review 51, 3 (2009), 455–500.
  • Yong-Deok Kim, Eunhyeok Park, Sungjoo Yoo, Taelim Choi, Lu Yang, and Dongjun Shin. 2015. Compression of deep convolutional neural networks for fast and low power mobile applications. arXiv preprint arXiv:1511.06530 (2015).