Image Segmentation

[ deep-learning  image-segmentation  fcn  unet  segnet  dilated-convolutions  refinenet  pspnet  deeplab  mask-rcnn  skip  scribble  ]

There are two types of image segmentation:

  • semantic segmentation: assign a class label to each pixel
  • instance segmentation: extract the boundary/mask of each instance in the image

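For concreteness, the two label formats can be contrasted with a tiny NumPy example (the 4x4 "image", the class ids, and the two-object layout are made up for illustration):

```python
import numpy as np

# Semantic map: every pixel gets a class id (0 = background, 1 = cat).
# Both cats collapse into the same class; the map cannot tell them apart.
semantic = np.array([[1, 1, 0, 0],
                     [1, 1, 0, 0],
                     [0, 0, 1, 1],
                     [0, 0, 1, 1]])

# Instance masks: one boolean mask per object, so the two cats stay separate
# even though they share a class.
cat_a = np.zeros((4, 4), dtype=bool); cat_a[:2, :2] = True
cat_b = np.zeros((4, 4), dtype=bool); cat_b[2:, 2:] = True
instance_masks = [("cat", cat_a), ("cat", cat_b)]

# The semantic map is exactly the union of the same-class instance masks.
assert np.array_equal(semantic.astype(bool), cat_a | cat_b)
```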

  • Fully Convolutional Network (FCN): FCN converts the fully connected layers of traditional image-classification networks, e.g., AlexNet, VGG16, into convolutional layers. Thus it can generate a per-pixel probability map.
  • UNet: combines two parts: the left part uses convolution and max-pooling to extract features; the right part uses upsampling and skip connections (inputs from the corresponding layers of the left part) to generate the label map.
  • SegNet: similar to UNet, but it doesn’t use skip connections to copy features from the lower layers of the left part (referred to as the encoder network); instead it reuses the encoder’s max-pooling indices for upsampling in the decoder.
  • Dilated Convolutions: the problem with FCN is that pooling followed by up-sampling causes information loss. Dilated convolution resolves this by adding dilation to the convolution, which enlarges the receptive field without reducing the output size the way pooling does.
  • RefineNet: similar to UNet, but uses ResNet as the backbone.
  • PSPNet: applies the idea of spatial pyramid pooling to image segmentation.
  • DeepLab: combines atrous convolution (similar to dilated convolution) with the pyramid pooling idea of PSPNet.
  • Mask R-CNN: uses the idea of object detection for instance segmentation: each detected bounding box gets a per-pixel response map, to which a per-pixel classifier is then applied to generate the mask.
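The effect of dilation described above can be sketched in NumPy: a minimal 1-D "valid" dilated convolution (the helper name and the toy signal are assumptions for illustration) shows the receptive field growing with the dilation rate while no pooling is involved:

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """'Valid' 1-D convolution (correlation) with a dilated kernel.

    Dilation d spaces the kernel taps d apart, so a length-k kernel
    covers a receptive field of d*(k-1)+1 samples without any pooling.
    """
    k = len(w)
    span = dilation * (k - 1) + 1
    out_len = len(x) - span + 1
    return np.array([
        sum(w[j] * x[i + j * dilation] for j in range(k))
        for i in range(out_len)
    ])

x = np.arange(10, dtype=float)
w = np.array([1.0, 1.0, 1.0])

# dilation=1 is an ordinary convolution; dilation=2 widens the receptive
# field from 3 to 5 samples, and the output only shrinks by the larger
# kernel span -- there is no pooling factor to undo later.
print(dilated_conv1d(x, w, 1))  # [ 3.  6.  9. 12. 15. 18. 21. 24.]
print(dilated_conv1d(x, w, 2))  # [ 6.  9. 12. 15. 18. 21.]
```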

Common Techniques

  • transposed convolution: the conjugate (adjoint) of a convolution operator, whose forward propagation is the backward propagation of the corresponding convolution operation, and vice versa
  • skip: combine the outputs of intermediate layers to obtain features at multiple levels
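The pairing between convolution and transposed convolution is easiest to see in matrix form. A small NumPy sketch (the helper `conv_matrix` is a made-up name for illustration) shows that multiplying by the transpose maps a short output back to the original input length, which is why transposed convolution is used for upsampling:

```python
import numpy as np

def conv_matrix(w, in_len):
    """Matrix form of a 'valid' 1-D convolution with kernel w."""
    k = len(w)
    out_len = in_len - k + 1
    M = np.zeros((out_len, in_len))
    for i in range(out_len):
        M[i, i:i + k] = w  # each row slides the kernel one step
    return M

w = np.array([1.0, 2.0, 1.0])
C = conv_matrix(w, in_len=6)   # maps length-6 inputs to length-4 outputs

x = np.arange(6, dtype=float)
y = C @ x                      # ordinary convolution: output is shorter

# Transposed convolution: multiply by C.T, mapping length-4 back to
# length-6. The forward pass of C.T is exactly the backward pass of C.
z = C.T @ y
print(y.shape, z.shape)        # (4,) (6,)
```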

Labels for Training Data

  • Scribble: uses a few simple scribbles as the labels of the training image. Cost function is \(\sum_{i}\psi _i^{scr}\left(y_i|X,S\right)+\sum_{i}-\log P\left(y_i| X,\theta\right)+\sum_{i,j}\psi _{ij}\left(y_i,y_j|X\right)\)
  • Image-level label: the label is provided at the image level and there is no pixel-level label, as in the image classification case. Cost function is \(\underset{\theta ,P}{\text{minimize}}\quad D(P(X)||Q(X|\theta ))\quad \text{subject to}\quad A\overrightarrow{P} \geqslant \overrightarrow{b},\ \sum_{X}P(X)=1\)
  • Bounding box and label: the labels are bounding boxes with their class labels, as in the object detection case. Cost function is \(P\left ( x,y,z;\theta \right ) = P\left ( x \right )\left (\prod_{m=1}^{M} P\left ( y_m|x;\theta \right )\right )P\left ( z|y \right )\)
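The structure shared by these cost functions, a network likelihood term plus label-consistency regularizers, can be sketched for the scribble case. All the numbers below (the probabilities, the scribble set, and the pairwise weight) are made-up assumptions for illustration, not values from the paper:

```python
import numpy as np

probs = np.array([[0.9, 0.1],   # P(y_i | X, theta) for 4 pixels, 2 classes
                  [0.6, 0.4],   # (assumed network outputs)
                  [0.3, 0.7],
                  [0.2, 0.8]])
labels = np.array([0, 0, 1, 1])  # a candidate labeling y
scribbles = {0: 0, 3: 1}         # pixel index -> scribbled class (assumed)
LAMBDA = 0.5                     # assumed pairwise weight

# Unary scribble term: zero when a scribbled pixel keeps its scribble's
# class, a huge penalty otherwise.
unary_scr = sum(0.0 if labels[i] == c else 1e9 for i, c in scribbles.items())

# Network term: negative log-likelihood of the labeling under the model.
nll = -np.log(probs[np.arange(len(labels)), labels]).sum()

# Pairwise term: penalize neighboring pixels that disagree (Potts model).
pairwise = LAMBDA * sum(labels[i] != labels[i + 1]
                        for i in range(len(labels) - 1))

energy = unary_scr + nll + pairwise
print(energy)
```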
Written on April 2, 2019