# Image Segmentation

There are two types of image segmentation:

• semantic segmentation: assign a class label to each pixel
• instance segmentation: extract the boundary/mask of each object instance in the image
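
To make the difference between the two outputs concrete, here is a minimal sketch on a made-up 4x6 image. The class ids, the toy label map, and the helper `instance_masks` are all hypothetical; the helper uses 4-connected components as a stand-in for a real instance segmentation model, so it would merge two instances that touch each other.

```python
# Toy example: two separate "cat" objects on a background.
# Class ids and the label map are invented for illustration.
BACKGROUND, CAT = 0, 1

# Semantic segmentation output: one class label per pixel.
semantic = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1],
    [0, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
]


def instance_masks(label_map, class_id):
    """Split one class of a semantic map into per-instance binary masks.

    Uses 4-connected components, which is only a stand-in for a real
    instance segmentation model (touching instances would be merged).
    """
    h, w = len(label_map), len(label_map[0])
    seen = [[False] * w for _ in range(h)]
    masks = []
    for r in range(h):
        for c in range(w):
            if label_map[r][c] != class_id or seen[r][c]:
                continue
            # Flood-fill one connected region into its own mask.
            mask = [[0] * w for _ in range(h)]
            stack = [(r, c)]
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                mask[y][x] = 1
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and not seen[ny][nx]
                            and label_map[ny][nx] == class_id):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            masks.append(mask)
    return masks


# Instance segmentation output: one binary mask per object.
masks = instance_masks(semantic, CAT)
print(len(masks), "cat instances")
```

The semantic map says only "these pixels are cat", while the instance output keeps the two cats apart as separate masks.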

# Examples

• Fully Convolutional Network (FCN): FCN converts the fully connected layers of a traditional image classification network, e.g., AlexNet or VGG16, into convolutional layers. It can therefore generate a probability map over the classes for each pixel.
• UNet: combines two parts. The left part uses convolution and max-pooling to extract features; the right part uses upsampling and skip connections (inputs from the lower layers of the left part) to generate the label map.
• SegNet: similar to UNet, but it doesn’t use skip connections to bring in the inputs from the lower layers of the left part (referred to as the encoder network).
• Dilated Convolutions: the problem with FCN is that pooling followed by upsampling loses information. Dilated convolution addresses this by adding a dilation rate to the convolution, which enlarges the receptive field without reducing the output size as pooling does.
• RefineNet: similar to UNet, but uses ResNet as the base network.
• PSPNet: applies the idea of spatial pyramid pooling to image segmentation.
• DeepLab: combines atrous convolution (another name for dilated convolution) with spatial pyramid pooling, as in PSPNet.
• Mask R-CNN: applies the idea of object detection to instance segmentation. Each detected bounding box gets its own response map, and a per-pixel sigmoid is then applied to generate the mask of that instance.
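
The trade-off described for dilated convolutions can be checked on a 1D toy signal. This is a minimal sketch under stated assumptions: the helpers `conv1d` and `max_pool1d` are hypothetical, the kernel length is assumed to be odd, and zero padding is used to keep the output the same size as the input.

```python
def conv1d(signal, kernel, dilation=1):
    """Same-size 1D convolution (cross-correlation form) with zero padding.

    Kernel taps are spaced `dilation` samples apart. Assumes an odd
    kernel length so the padding is symmetric.
    Returns the output and the receptive field of one output sample.
    """
    k = len(kernel)
    span = dilation * (k - 1) + 1      # input samples covered by one output
    pad = (span - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    out = [
        sum(kernel[j] * padded[i + j * dilation] for j in range(k))
        for i in range(len(signal))
    ]
    return out, span


def max_pool1d(signal, size=2):
    """Non-overlapping max pooling; the output is `size` times shorter."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]


signal = [float(v) for v in range(8)]
kernel = [1.0, 1.0, 1.0]

dense, rf_dense = conv1d(signal, kernel, dilation=1)
dilated, rf_dilated = conv1d(signal, kernel, dilation=2)
pooled = max_pool1d(signal, size=2)

print("dense:   length", len(dense), "receptive field", rf_dense)
print("dilated: length", len(dilated), "receptive field", rf_dilated)
print("pooled:  length", len(pooled))
```

With the same three weights, the dilated version sees five input samples instead of three and still produces one output per input sample, whereas pooling halves the resolution.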

# Common Techniques

• transposed convolution: the adjoint (transpose) of the convolution operator. Its forward propagation is the backward propagation of a convolution, and vice versa.
• skip connections: combine the outputs of intermediate layers to obtain features at multiple levels.
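
The relationship between convolution and transposed convolution can be verified numerically. Writing a convolution as a matrix C, the transposed convolution is C^T, so the two must satisfy <Cx, y> = <x, C^T y>. The sketch below checks this in 1D; the function names and the test values are made up for illustration, and the input length is chosen so that the stride divides `len(x) - len(kernel)` evenly.

```python
def conv1d_valid(x, kernel, stride=1):
    """Strided 1D convolution (cross-correlation form), no padding."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k))
            for i in range(0, len(x) - k + 1, stride)]


def conv_transpose1d(y, kernel, stride=1):
    """Transposed convolution: scatter each input value, weighted by the
    kernel, into the (larger) output. This is also how the gradient of
    conv1d_valid with respect to its input is computed."""
    k = len(kernel)
    out = [0.0] * ((len(y) - 1) * stride + k)
    for i, v in enumerate(y):
        for j in range(k):
            out[i * stride + j] += v * kernel[j]
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


kernel = [1.0, 2.0, 3.0]
x = [1.0, 0.0, 2.0, -1.0, 3.0]   # input to the convolution (length 5)
y = [0.5, -1.0]                  # a vector in the convolution's output space

cx = conv1d_valid(x, kernel, stride=2)        # downsamples 5 -> 2
cty = conv_transpose1d(y, kernel, stride=2)   # upsamples 2 -> 5

# Adjoint identity: <C x, y> == <x, C^T y>
lhs = dot(cx, y)
rhs = dot(x, cty)
print(lhs, rhs)
```

The strided convolution maps 5 samples to 2 and the transposed convolution maps 2 samples back to 5, which is why it is used as a learnable upsampling layer.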

# Labels for Training Data

• Scribble: uses a few simple scribbles as the label of a training image. The cost function is $$\sum_{i}\psi_i^{scr}\left(y_i|X,S\right)+\sum_{i}-\log P\left(y_i|X,\theta\right)+\sum_{i,j}\psi_{ij}\left(y_i,y_j|X\right)$$
• Image-level label: the label is provided only at the image level and there is no pixel-level label, as in image classification. The cost function is $$\underset{\theta,P}{\text{minimize}}\quad D\left(P(X)\,\|\,Q(X|\theta)\right)\quad \text{subject to}\quad A\overrightarrow{P} \geqslant \overrightarrow{b},\ \sum_{X}P(X)=1$$
• Bounding box and label: the label consists of bounding boxes and their class labels, as in object detection. The cost function is $$P\left(x,y,z;\theta\right) = P\left(x\right)\left(\prod_{m=1}^{M} P\left(y_m|x;\theta\right)\right)P\left(z|y\right)$$
Written on April 2, 2019