Anchor Free Object Detection

[ deep-learning  object-detection  anchor-free  centernet  cornernet  extremenet  unitbox  ssd  yolo  densebox  fsaf  fcos  foveabox  ga-rpn  ]

The most successful single-stage object detection algorithms, e.g., YOLO and SSD, all rely on anchors that are refined into the final detection locations. For those algorithms, the anchors are typically defined on a grid over the image coordinates at all possible locations, with different scales and aspect ratios.
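To make the anchor grid concrete, here is a minimal sketch in plain Python. The function name `make_anchors`, the `(cx, cy, w, h)` layout, and the convention that a ratio means height divided by width are assumptions chosen for illustration, not the exact scheme used by YOLO or SSD.

```python
import math

def make_anchors(feat_h, feat_w, stride, scales, ratios):
    """Return anchors as (cx, cy, w, h), one set per feature-map cell."""
    anchors = []
    for row in range(feat_h):
        for col in range(feat_w):
            # Center of the cell, mapped back to image coordinates.
            cx = (col + 0.5) * stride
            cy = (row + 0.5) * stride
            for scale in scales:
                for ratio in ratios:  # assumed convention: ratio = height / width
                    w = scale / math.sqrt(ratio)
                    h = scale * math.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return anchors

anchors = make_anchors(feat_h=4, feat_w=4, stride=16,
                       scales=[32, 64], ratios=[0.5, 1.0, 2.0])
print(len(anchors))  # 4 * 4 cells * 2 scales * 3 ratios = 96
```

The anchor count grows with the product of grid size, number of scales, and number of ratios, which is where the speed cost comes from.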

Though much faster than their two-stage counterparts, single-stage algorithms are still limited in speed and accuracy by the choice of anchor boxes: fewer anchors lead to better speed but degrade accuracy. As a result, many recent works try to design anchor-free object detection algorithms.

UnitBox: An Advanced Object Detection Network

UnitBox uses an Intersection over Union (IoU) loss function for bounding box prediction.
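As a rough illustration, the sketch below computes the IoU of two axis-aligned boxes and turns it into a loss using the negative logarithm, which is the form I recall from the UnitBox paper. The `(x1, y1, x2, y2)` box layout, the function names, and the small `eps` added for numerical safety are assumptions of this sketch.

```python
import math

def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def iou_loss(pred, target, eps=1e-7):
    """Negative log of IoU; eps (an assumption) avoids log(0)."""
    return -math.log(iou(pred, target) + eps)

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))       # overlap 1, union 7
print(iou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # close to 0 for a perfect match
```

Unlike a per-coordinate L2 loss, this treats the four box sides jointly, so the loss is the same for small and large boxes with the same overlap ratio.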

DenseBox: Unifying Landmark Localization and Object Detection

DenseBox directly computes the bounding box and its label from the feature map.

CornerNet: Detecting Objects as Paired Keypoints

In CornerNet, the bounding box is uniquely defined by its top-left and bottom-right corners, each of which is detected by one of the two branches. Corner pooling, which borrows the idea of the integral image, is applied to detect the corners.
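The link to integral images is that corner pooling can be computed with a single running accumulation per direction, with a max in place of the sum. The sketch below shows top-left corner pooling on nested Python lists; the function and argument names are hypothetical, and the bottom-right variant would scan in the opposite directions.

```python
def top_left_corner_pool(feat_vertical, feat_horizontal):
    """Top-left corner pooling on two 2D feature maps of equal size.

    feat_vertical is max-pooled downward along each column and
    feat_horizontal is max-pooled rightward along each row; the two
    pooled maps are then added element-wise.
    """
    h, w = len(feat_vertical), len(feat_vertical[0])
    pooled_v = [row[:] for row in feat_vertical]
    pooled_h = [row[:] for row in feat_horizontal]
    # Running max from the bottom row upward: each cell sees everything below it.
    for i in range(h - 2, -1, -1):
        for j in range(w):
            pooled_v[i][j] = max(pooled_v[i][j], pooled_v[i + 1][j])
    # Running max from the right column leftward: each cell sees everything to its right.
    for i in range(h):
        for j in range(w - 2, -1, -1):
            pooled_h[i][j] = max(pooled_h[i][j], pooled_h[i][j + 1])
    return [[pooled_v[i][j] + pooled_h[i][j] for j in range(w)] for i in range(h)]

# Evidence for a left edge at (row 2, col 1) and a top edge at (row 1, col 2).
left_edge = [[0, 0, 0], [0, 0, 0], [0, 2, 0]]
top_edge = [[0, 0, 0], [0, 0, 3], [0, 0, 0]]
pooled = top_left_corner_pool(left_edge, top_edge)
print(pooled)  # the strongest response is at (row 1, col 1), the top-left corner
```

The point of the pooling is that a corner usually lies outside the object, so there is no local evidence at the corner itself; the response has to be gathered from the edges to its right and below.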

ExtremeNet: Bottom-up Object Detection by Grouping Extreme and Center Points

Similar to CornerNet, it formulates the problem of finding a bounding box as finding a set of keypoints. But instead of two corners as in CornerNet, it requires four extreme points (top-most, left-most, bottom-most, right-most) and one center point, each of which is detected as a peak in the corresponding heatmap.
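A minimal sketch of the grouping step, under stated assumptions: four extreme points give a box, and the box is kept only if the center heatmap responds at its geometric center. The function names, the `(x, y)` point layout, and the threshold value are hypothetical choices for illustration.

```python
def box_from_extremes(top, left, bottom, right):
    """Build (x1, y1, x2, y2) and the geometric center from four (x, y) extreme points."""
    x1, y1 = left[0], top[1]
    x2, y2 = right[0], bottom[1]
    center = ((x1 + x2) / 2, (y1 + y2) / 2)
    return (x1, y1, x2, y2), center

def is_valid_group(center_heatmap, center, threshold=0.1):
    """Accept the group if the center heatmap is high enough at the geometric center."""
    cx, cy = int(center[0]), int(center[1])
    return center_heatmap[cy][cx] >= threshold

# A 5x5 center heatmap with a single strong response in the middle.
center_heatmap = [[0.0] * 5 for _ in range(5)]
center_heatmap[2][2] = 0.9

box, center = box_from_extremes(top=(3, 0), left=(0, 2), bottom=(2, 4), right=(4, 1))
print(box, center, is_valid_group(center_heatmap, center))
```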

FSAF: Feature Selective Anchor-Free

It is built on a feature pyramid network: for each object instance, the pyramid level used for prediction is selected dynamically as the one with the optimal resolution, rather than assigned by a hand-crafted rule.
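As I understand the paper, "optimal" here means the level whose anchor-free branch gives the smallest loss for that instance during training. The sketch below shows only that selection step; `select_level` and the example loss values are hypothetical.

```python
def select_level(level_losses):
    """Return the index of the pyramid level with the smallest loss for one instance."""
    return min(range(len(level_losses)), key=lambda i: level_losses[i])

# Hypothetical per-level losses for a single object instance (levels 0..4).
losses = [0.92, 0.41, 0.37, 0.55, 1.10]
print(select_level(losses))  # index of the smallest loss
```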

FCOS: Fully Convolutional One-Stage

FCOS is anchor-box free, as well as proposal free. FCOS works by predicting a 4D vector (l, t, r, b) encoding the location of a bounding box at each foreground pixel (supervised by ground-truth bounding box information during training).

This is done in a per-pixel fashion: for each pixel, the network tries to predict a bounding box from it, together with its class label. To account for pixels that are far from the center of the ground-truth object, a center-ness score is also predicted, which down-weights the predictions from those pixels.
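The regression target and the center-ness score can be sketched as follows. The center-ness formula is the one given in the FCOS paper; the function names and the `(x0, y0, x1, y1)` box layout are assumptions of this sketch, and the location is assumed to lie strictly inside the box.

```python
import math

def fcos_targets(x, y, box):
    """Distances (l, t, r, b) from location (x, y) to the four sides of box."""
    x0, y0, x1, y1 = box
    return x - x0, y - y0, x1 - x, y1 - y

def centerness(l, t, r, b):
    """1.0 at the box center, decaying toward 0 near the sides.

    Assumes all four distances are positive (location strictly inside the box).
    """
    return math.sqrt((min(l, r) / max(l, r)) * (min(t, b) / max(t, b)))

box = (0, 0, 10, 10)
print(centerness(*fcos_targets(5, 5, box)))  # at the center
print(centerness(*fcos_targets(2, 5, box)))  # off-center, lower score
```

At test time the center-ness is multiplied with the classification score, so boxes predicted from off-center pixels rank lower before non-maximum suppression.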

If a location falls into multiple bounding boxes, it is considered an ambiguous sample. In that case, FCOS simply chooses the bounding box with minimal area as its regression target.
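The minimal-area rule can be sketched in a few lines. The helper name `choose_target` and the `(x0, y0, x1, y1)` box layout are hypothetical; locations that fall in no box are treated as background and return `None`.

```python
def choose_target(x, y, boxes):
    """Return the smallest-area box containing (x, y), or None for background."""
    containing = [
        (x0, y0, x1, y1)
        for (x0, y0, x1, y1) in boxes
        if x0 <= x <= x1 and y0 <= y <= y1
    ]
    if not containing:
        return None
    return min(containing, key=lambda b: (b[2] - b[0]) * (b[3] - b[1]))

boxes = [(0, 0, 10, 10), (4, 4, 6, 6)]
print(choose_target(5, 5, boxes))    # inside both boxes: the smaller one wins
print(choose_target(1, 1, boxes))    # inside the large box only
print(choose_target(20, 20, boxes))  # inside no box
```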

A Feature Pyramid Network is used on top of the backbone for multi-level prediction.

FoveaBox: Beyond Anchor-based Object Detector

It is very similar to FCOS.

Region Proposal by Guided Anchoring (GA-RPN)

In GA-RPN, the anchor (defined as a tuple of its location and shape) is learned instead of manually defined. Feature extraction is then adapted to the learned anchor.

CenterNet: Objects as Points

CenterNet represents each object by the center point of its bounding box. Once the center is found, other properties such as the box shape and the pose are computed from the features at that location.
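A simplified sketch of the decoding step: centers are taken as local maxima of a heatmap in a 3x3 neighborhood, and each box is built from the size predicted at that location. The function name, the threshold, and the per-pixel `sizes` map are assumptions of this sketch; the sub-pixel offset and the output stride used in the paper are omitted.

```python
def decode_centers(heatmap, sizes, threshold=0.3):
    """Return (x1, y1, x2, y2, score) for every 3x3 local maximum above threshold.

    heatmap[y][x] is the center score; sizes[y][x] is the predicted (width, height).
    """
    h, w = len(heatmap), len(heatmap[0])
    boxes = []
    for y in range(h):
        for x in range(w):
            score = heatmap[y][x]
            if score < threshold:
                continue
            # Scores in the 3x3 window around (x, y), clipped at the borders.
            window = [heatmap[ny][nx]
                      for ny in range(max(0, y - 1), min(h, y + 2))
                      for nx in range(max(0, x - 1), min(w, x + 2))]
            if score < max(window):
                continue  # a neighbor is stronger, so this is not a peak
            bw, bh = sizes[y][x]
            boxes.append((x - bw / 2, y - bh / 2, x + bw / 2, y + bh / 2, score))
    return boxes

heatmap = [[0.0] * 4 for _ in range(4)]
heatmap[1][1] = 0.9  # a peak
heatmap[1][2] = 0.5  # next to a stronger peak, suppressed
heatmap[3][3] = 0.8  # a second peak
sizes = [[(2, 2)] * 4 for _ in range(4)]
detections = decode_centers(heatmap, sizes)
print(detections)
```

Because peaks are picked directly from the heatmap, this kind of decoding needs no box-level non-maximum suppression.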

CenterNet: Object Detection with Keypoint Triplets

Despite sharing the CenterNet name, it builds on CornerNet and is very similar to ExtremeNet: the bounding box is defined by a pair of corner points, and the response of the center point is used to confirm the detection and its label.

CornerNet-Lite: Efficient Keypoint Based Object Detection

CornerNet-Lite combines two variants: CornerNet-Saccade (which uses an attention mechanism) and CornerNet-Squeeze.

Center and Scale Prediction: A Box-free Approach for Object Detection

As in GA-RPN, the bounding box is defined by its center and shape, which are computed by two branches of the neural network.

Written on May 22, 2019