Revisiting Deep Learning on CPU, New Optimizations Made to TensorFlow and Caffe


Intel has been working hard to optimize popular deep learning frameworks such as TensorFlow and Caffe on Intel CPUs. Their work has reduced the performance gap between CPU and GPU from about 80x to 2x.

TensorFlow* Optimizations on Modern Intel® Architecture

This is joint work between Intel and Google.

Optimizing the performance of deep learning models on modern CPUs presents a number of challenges, not very different from those seen when optimizing other performance-sensitive applications in High Performance Computing (HPC):

  • Code refactoring is needed to take advantage of modern vector instructions. This means ensuring that all the key primitives, such as convolution, matrix multiplication, and batch normalization, are vectorized with the latest SIMD instructions (AVX2 for Intel Xeon processors and AVX-512 for Intel Xeon Phi processors).
  • Maximum performance requires paying special attention to using all the available cores efficiently. This means parallelization within a given layer or operation as well as parallelization across layers.
  • As much as possible, data has to be available when the execution units need it. This means a balanced use of prefetching, cache-blocking techniques, and data formats that promote spatial and temporal locality (see the toy sketch of cache blocking below).
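
To make the last point concrete, here is a toy illustration of cache blocking in NumPy. This is only a sketch of the idea, not Intel's implementation: a blocked matrix multiplication that works tile by tile so that each tile is reused while it is still resident in cache.

import numpy as np

def blocked_matmul(a, b, block=64):
    """Multiply a (n x k) by b (k x m) one small tile at a time."""
    n, k = a.shape
    _, m = b.shape
    c = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                # Each partial product touches only block-sized tiles of a, b, and c,
                # so the working set stays small enough to be reused from cache.
                c[i:i+block, j:j+block] += np.dot(a[i:i+block, p:p+block],
                                                  b[p:p+block, j:j+block])
    return c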

To meet these requirements, Intel developed a number of optimized deep learning primitives that can be used inside the different deep learning frameworks, ensuring that common building blocks are implemented efficiently. In addition to matrix multiplication and convolution, these building blocks include the following (a short TensorFlow sketch composing several of them follows the list):

  • Direct batched convolution
  • Inner product
  • Pooling: maximum, minimum, average
  • Normalization: local response normalization across channels (LRN), batch normalization
  • Activation: rectified linear unit (ReLU)
  • Data manipulation: multi-dimensional transposition (conversion), split, concat, sum and scale.
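
To show where these building blocks surface at the framework level, here is a minimal TensorFlow sketch, assuming the TF 1.x tf.layers API; the input shape and layer sizes are arbitrary example values. In an MKL-enabled build, layers like these are the ones dispatched to the optimized primitives.

import tensorflow as tf

# Toy graph built from the primitives listed above: direct convolution,
# batch normalization, ReLU activation, max pooling, and an inner product.
images = tf.placeholder(tf.float32, [None, 224, 224, 3])

x = tf.layers.conv2d(images, filters=64, kernel_size=3, padding="same")
x = tf.layers.batch_normalization(x, training=True)
x = tf.nn.relu(x)
x = tf.layers.max_pooling2d(x, pool_size=2, strides=2)
x = tf.reshape(x, [-1, 112 * 112 * 64])    # flatten before the fully connected layer
logits = tf.layers.dense(x, units=10)      # inner product (fully connected) layer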

The results on AlexNet can be found here.

The installation is also very simple:

# Python 2.7
pip install https://anaconda.org/intel/tensorflow/1.3.0/download/tensorflow-1.3.0-cp27-cp27mu-linux_x86_64.whl

# Python 3.5
pip install https://anaconda.org/intel/tensorflow/1.3.0/download/tensorflow-1.3.0-cp35-cp35m-linux_x86_64.whl

# Python 3.6
pip install https://anaconda.org/intel/tensorflow/1.3.0/download/tensorflow-1.3.0-cp36-cp36m-linux_x86_64.whl
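
Once the wheel is installed, the thread pools can be tuned from Python. Below is a minimal sketch; the thread counts are example values for an assumed machine, not an Intel recommendation.

import tensorflow as tf

# Cap TensorFlow's two thread pools to match the machine's physical cores.
config = tf.ConfigProto(
    intra_op_parallelism_threads=44,   # threads used inside a single op, e.g. one convolution
    inter_op_parallelism_threads=2)    # independent ops that may run concurrently

with tf.Session(config=config) as sess:
    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    print(sess.run(tf.reduce_sum(tf.matmul(a, b))))

Environment variables such as OMP_NUM_THREADS and KMP_AFFINITY can also be exported in the shell before launching Python, since the MKL-backed ops are threaded with OpenMP.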

Intel® Optimized Caffe

Intel has been optimizing Caffe as well.

Here is a comparison of the original Caffe with the optimized Caffe on CIFAR-10.

The installation is as simple as:

git clone https://github.com/intel/caffe

Then change the configuration file (Makefile.config):

BLAS := mkl

BLAS_INCLUDE := /opt/intel/mkl/include
BLAS_LIB := /opt/intel/mkl/lib/intel64

The other compilation and installation steps are identical to the original Caffe. The optimizations made include:

  • Use ‘mkl’ as the BLAS library: specify ‘BLAS := mkl’ in Makefile.config, and also point BLAS_INCLUDE and BLAS_LIB at your MKL include and lib directories.
  • Set the CPU utilization limit:
    echo "100" | sudo tee /sys/devices/system/cpu/intel_pstate/min_perf_pct
    echo "0" | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
    
  • Put ‘engine: “MKL2017”’ at the top of your train_val.prototxt or solver.prototxt file, or pass this option to the caffe tool: -engine “MKL2017”.
  • The current implementation uses OpenMP threads. By default, the number of OpenMP threads is set to the number of CPU cores, and each thread is bound to a single core to achieve the best performance. It is, however, possible to provide your own configuration through OpenMP environment variables such as KMP_AFFINITY, OMP_NUM_THREADS, or GOMP_CPU_AFFINITY. For the example run below, ‘OMP_NUM_THREADS = 64’ has been used.
  • Intel® Optimized Caffe* modifies many parts of the original BVLC Caffe* code to achieve better code parallelization with OpenMP. Depending on the other processes running in the background, it is often useful to adjust the number of threads used by OpenMP. For single-node runs on the Intel Xeon Phi™ product family, we recommend OMP_NUM_THREADS = number_of_cores - 2.
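
Below is a minimal pycaffe sketch of such a run, tying the options above together. The file name is hypothetical and the thread settings are just the example values mentioned above; the solver/network prototxt is assumed to carry engine: “MKL2017” as described earlier.

# The OpenMP variables are normally exported in the shell before launching;
# setting them here is only safe because the OpenMP runtime has not loaded yet.
import os
os.environ.setdefault("OMP_NUM_THREADS", "64")
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")

import caffe

caffe.set_mode_cpu()                              # Intel Optimized Caffe targets the CPU
solver = caffe.get_solver("solver.prototxt")      # prototxt assumed to specify engine: "MKL2017"
solver.step(100)                                  # run 100 training iterations
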
Written on October 13, 2017