# Image Segmentation

[deep-learning, image-segmentation, fcn, unet, segnet, dilated-convolutions, refinenet, pspnet, deeplab, mask-rcnn, skip, scribble]

There are two types of image segmentation:

- semantic segmentation: assign a class label to each pixel
- instance segmentation: extract the boundary/mask of each object instance in the image
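
To make the difference concrete, here is a toy sketch. It assumes a tiny hand-written label map with one hypothetical foreground class, and it uses 4-connected components as a stand-in for a real instance segmentation model (real models can also separate touching objects, which this cannot). The helper name `instances_from_semantic` is made up for illustration.

```python
from collections import deque

# Toy 4x6 semantic label map: 0 = background, 1 = one hypothetical object class.
# Both objects share the same label, so the semantic map cannot tell them apart.
semantic = [
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]

def instances_from_semantic(label_map, target_class):
    """Split one class into instances using 4-connected components (toy stand-in)."""
    h, w = len(label_map), len(label_map[0])
    instance_ids = [[0] * w for _ in range(h)]
    next_id = 0
    for r in range(h):
        for c in range(w):
            if label_map[r][c] != target_class or instance_ids[r][c]:
                continue
            # Unvisited pixel of the target class: start a new instance.
            next_id += 1
            instance_ids[r][c] = next_id
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < h and 0 <= nx < w
                            and label_map[ny][nx] == target_class
                            and instance_ids[ny][nx] == 0):
                        instance_ids[ny][nx] = next_id
                        queue.append((ny, nx))
    return instance_ids, next_id

instance_map, num_instances = instances_from_semantic(semantic, target_class=1)
for row in instance_map:
    print(row)
print("instances found:", num_instances)
```

The semantic map has a single foreground label, while the instance map gives each object its own id.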

# Examples

- Fully Convolutional Network (FCN): FCN converts the fully connected layers of a traditional image classification network, e.g., AlexNet or VGG16, into convolutional layers. Thus it can generate a class probability map for every pixel.
- UNet: it combines two parts: the left part uses convolution and max-pooling to extract features; the right part uses upsampling and skip connections (input from the corresponding layer of the left part) to generate the label map.
- SegNet: similar to UNet, but it does not use skip connections to bring in feature maps from the left part (referred to as the encoder network); instead, the decoder upsamples using the max-pooling indices saved by the encoder.
- Dilated Convolutions: the problem in FCN is that pooling followed by up-sampling loses spatial detail. Dilated convolution resolves this problem by adding a dilation factor (`dilate`) to the convolution, which enlarges the receptive field without reducing the output size as pooling does.
- RefineNet: similar to UNet, but uses ResNet as the base network.
- PSPNet: applies the idea of spatial pyramid pooling to image segmentation, pooling features at several scales to capture global context.
- DeepLab: combines atrous convolution (another name for dilated convolution) with spatial pyramid pooling, as in PSPNet.
- Mask R-CNN: extends object detection to instance segmentation. On top of the Faster R-CNN detector, an extra branch predicts a response map inside each detected bounding box (region of interest), and a per-pixel sigmoid is then applied to generate the mask.
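
The trade-off between pooling and dilation can be illustrated in one dimension. The sketch below is a minimal pure-Python illustration, not any library's implementation: `dilated_conv1d`, `receptive_field`, and `max_pool1d` are hypothetical helper names, and zero "same" padding is an assumption made so that the output length can be compared directly.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """Zero-padded ('same') 1D convolution (cross-correlation) with a dilation factor."""
    k = len(kernel)
    span = dilation * (k - 1)            # distance between the first and last kernel tap
    pad_left = span // 2
    padded = [0.0] * pad_left + list(signal) + [0.0] * (span - pad_left)
    return [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]

def receptive_field(kernel_size, dilation):
    """Number of input positions covered by one output of a single dilated layer."""
    return dilation * (kernel_size - 1) + 1

def max_pool1d(signal, size=2):
    """Non-overlapping max pooling; the output is `size` times shorter."""
    return [max(signal[i:i + size]) for i in range(0, len(signal) - size + 1, size)]

signal = [float(v) for v in range(8)]
kernel = [1.0, 1.0, 1.0]

out_d1 = dilated_conv1d(signal, kernel, dilation=1)
out_d2 = dilated_conv1d(signal, kernel, dilation=2)
pooled = max_pool1d(signal, size=2)

print(len(out_d1), receptive_field(3, 1))  # same length as the input, small receptive field
print(len(out_d2), receptive_field(3, 2))  # same length as the input, larger receptive field
print(len(pooled))                         # pooling halves the resolution
```

With the same three kernel weights, raising the dilation widens the receptive field while the output keeps the input resolution, whereas pooling halves it.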

# Common Techniques

- transposed convolution: the transpose (adjoint) of the convolution operator; its forward propagation is the backward propagation of a convolution, and vice versa. It is commonly used as a learnable upsampling layer.
- skip: combine the outputs of intermediate layers with later layers to obtain features at multiple levels
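
The relationship between convolution and transposed convolution can be checked numerically. The following is a minimal 1D sketch in pure Python, assuming no padding and a stride of 2; `conv1d`, `conv_transpose1d`, and `dot` are hypothetical helper names. It verifies the adjoint identity \(\langle Cx, y\rangle = \langle x, C^{T}y\rangle\), which is what makes the forward pass of one operator the backward pass of the other.

```python
def conv1d(x, kernel, stride=1):
    """Valid (unpadded) strided 1D convolution (cross-correlation)."""
    k = len(kernel)
    n_out = (len(x) - k) // stride + 1
    return [sum(kernel[j] * x[i * stride + j] for j in range(k)) for i in range(n_out)]

def conv_transpose1d(y, kernel, stride=1):
    """Transpose of conv1d: scatter each input value through the kernel."""
    k = len(kernel)
    out = [0.0] * ((len(y) - 1) * stride + k)
    for i, value in enumerate(y):
        for j in range(k):
            out[i * stride + j] += value * kernel[j]
    return out

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

kernel = [1.0, 2.0, 3.0]
x = [1.0, 0.0, -1.0, 2.0, 4.0, 3.0, -2.0]   # length 7
y = [0.5, -1.0, 2.0]                        # length 3, the output size of conv1d on x

down = conv1d(x, kernel, stride=2)          # downsamples: 7 -> 3
up = conv_transpose1d(y, kernel, stride=2)  # upsamples:   3 -> 7

# Adjoint check: <conv(x), y> equals <x, conv_transpose(y)>.
print(dot(down, y), dot(x, up))
```

Note that the transposed convolution restores the spatial size, not the original values, so it is not an inverse of the convolution.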

# Labels for Training Data

- Scribble uses a few simple scribbles as the label of the training image. Cost function is \(\sum_{i}\psi_i^{scr}\left(y_i|X,S\right)+\sum_{i}-\log P\left(y_i|X,\theta\right)+\sum_{i,j}\psi_{ij}\left(y_i,y_j|X\right)\)
- Image-level label: the label is provided only at the image level and there is no pixel-level label, as in image classification. Cost function is \(\underset{\theta,P}{\text{minimize}}\quad D\left(P(X)\,\|\,Q(X|\theta)\right)\quad\text{subject to}\quad A\vec{P}\geq\vec{b},\ \sum_{X}P(X)=1\)
- Bounding box and label: the label is a set of bounding boxes and their class labels, as in object detection. The probabilistic model behind the cost function is \(P\left(x,y,z;\theta\right)=P\left(x\right)\left(\prod_{m=1}^{M}P\left(y_m|x;\theta\right)\right)P\left(z|y\right)\)
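
Two building blocks recur in the formulas above: the per-pixel negative log-likelihood \(\sum_i -\log P(y_i|X,\theta)\) from the scribble cost, and the divergence \(D(P\|Q)\) from the image-level objective. The sketch below computes both in pure Python. It assumes that \(P(y_i|X,\theta)\) is a softmax over per-pixel class scores and that \(D\) is the KL divergence; the logits, labels, and function names are made-up illustrations, and the potentials \(\psi\) and the constraints are not modelled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one pixel's class scores."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pixel_nll(logits_per_pixel, labels):
    """Sum over pixels of -log P(y_i | X, theta)."""
    return sum(
        -math.log(softmax(logits)[y])
        for logits, y in zip(logits_per_pixel, labels)
    )

def kl_divergence(p, q):
    """D(P || Q) = sum_x P(x) log(P(x) / Q(x)); terms with P(x) = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical network outputs for 3 pixels and 2 classes, with their (propagated) labels.
logits = [[2.0, 0.0], [0.0, 0.0], [-1.0, 3.0]]
labels = [0, 1, 1]
loss = pixel_nll(logits, labels)
print("negative log-likelihood:", loss)

# Hypothetical latent distribution P and network distribution Q for one pixel.
print("KL divergence:", kl_divergence([0.9, 0.1], softmax([2.0, 0.0])))
```

In a real training loop these quantities would be computed on tensors by a deep learning framework, so this is only meant to show what each term measures.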

Written on April 2, 2019