Network Compression

[ deep-learning  network-pruning  quantization  bit-compression  low-rank-approximation  sparsity  seperable-filter  squeezenet  mobilenet  xception  ]

The trained network is typically too large to run efficiently on mobile device. For example, VGG16 used for image classification has more 130 Million parameter (about 600 MB on model size) and requires about 31 billion operations to classify an image, which is way to expensive to be done on mobile.

We need to shrink the network, which will be introduced below.

Network Prunning

This method tries to reduce the size of network, which could be

  • removing some layers: ThiNet
  • removing some channels: Channel Pruning, Discrimination-aware Channel Pruning
  • removing single filters: Deep Compression, Sparse-Winograd, Neuron Pruning

Distill the knowledge in neural network

This method is trying to train a small network to mimick a large network.

Examples include SqueezeNet, MobileNet

Low rank approximation

Major of the computation in neural network training and inference is matrix production. For matrix production, , if $A \in R^{m \times n}$ can be written as product of A1 and A2 where $A_1\in R^{m \times k}$ and $A_2\in R^{k \times n}$ with $k \lt m, n$, then we can write $A \times B = A_1 \times (A_2 \times B)$. The computational can be reduced by a factor of $\frac{2 \times k}{m + n}$

Bit Quantization

It has been shown in experiment that, for inference, 32 bit single precision float is not required, 16 bit or even 8 bit fixed point is sufficient. Some experiment even pushes to 4 bit or just 1 bit. Note, the mapping from 32 bit to 8 bit may not be linear.


  • 1 bit: XNORnet, ABCnet with Multiple Binary Bases, Bin-net with High-Order Residual Quantization, Bi-Real Net
  • 2 bit: Ternary weight networks, Trained Ternary Quantization
  • 8 bit: Learning Symmetric Quantization
  • 8 bit int: TensorFlow-lite, TensorRT
  • nonlinear: Intel INQ, log-net, CNNPack


The multiplication of a number to zero is always zero. If the parameter of the network contains a lot of zero, then we can save a lot of multiplication operation. We can set the parameter which are small enough to zero and roughly maintain the performance of the network. Sparsity constraint can be also added during the training stage to encourage the trainied network to be sparse.

Separable Filter

A seperate filter means the 2D convolution can be done as a sequence of two 1D convolution: . The cost ratio would be:



SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and < 0.5MB model size

It bases on three technique:

  • uses 1x1 convolution instead of 3x3 one;
  • reduce the input channel number for 3x3 convolution;
  • avoid stride in convolution layer


MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

IT bases on uses depth wise convolution and 1x1 convolution

MobileNets V2

Inverted Residuals and Linear Bottlenecks: Mobile Networks for Classification, Detection and Segmentation


Xception: Deep Learning with Depthwise Separable Convolutions

It is an optimization over Inception V3 via using depthwise separable convolution.


ShuffleNet: An Extremely Efficient Convolutional Neural Network for Mobile Devices

It is similar to mobile-net.


ShuffleNet V2: Practical Guidelines for Ecient CNN Architecture Design


EffNet: An Efficient Structure for Convolutional Neural Networks

EffNet uses spatial separable convolutions. It is very similar to MobileNet’s depthwise separable convolutions. The separable depthwise convolution is the rectangle colored in blue for EffNet block. It is made of depthwise convolution with a line kernel (1x3), followed by a separable pooling, and finished by a depthwise convolution with a column kernel (3x1).

Written on April 3, 2019