Siamese Network Based Single Object Tracking

[ cir  rasnet  c-rpn  sint++  cfnet  siamfc  deep-learning  dsiam  mbst  siammask  single-object-tracking  dasiamrpn  structsiam  sint  siamrpn  siamfc-tri  siamese  sa-siam  siam-bm  densesiam  ]

A Siamese network is an artificial neural network that applies the same weights to two different input vectors, working in tandem, to compute comparable output vectors.
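As a minimal illustration of the weight-sharing idea, the sketch below passes two inputs through the same embedding function and compares the resulting vectors. The weight matrix, the `tanh` non-linearity and the cosine similarity are assumptions chosen for brevity; they are not taken from any particular tracker.

```python
import math

# Hypothetical 2x3 weight matrix; both branches use this same object.
SHARED_WEIGHTS = [[0.5, -0.2, 0.1],
                  [0.3, 0.8, -0.5]]

def embed(vector, weights=SHARED_WEIGHTS):
    """Project an input vector with the shared weights, then apply tanh."""
    return [math.tanh(sum(w * v for w, v in zip(row, vector)))
            for row in weights]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def siamese_similarity(x1, x2):
    """Embed both inputs with the same weights and compare the outputs."""
    return cosine_similarity(embed(x1), embed(x2))

if __name__ == "__main__":
    print(siamese_similarity([1.0, 2.0, 3.0], [1.0, 2.0, 2.5]))
```

Because both branches share one set of weights, the two output vectors live in the same space, which is what makes them directly comparable.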

Partially based on 基于孪生网络的目标跟踪算法汇总 (A Summary of Siamese Network Based Object Tracking Algorithms).


The figure below summarizes the history of Siamese network based trackers.


project page and code

This is the first work to propose using a Siamese network for visual tracking.


Fully-Convolutional Siamese Networks for Object Tracking, Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr., The European Conference on Computer Vision (ECCV) Workshops, 2016. project code

SiamFC formulates tracking as a similarity learning problem: a fully convolutional Siamese network is trained to locate an exemplar image within a larger search image. The network compares an exemplar image z (the initial appearance of the object) to a candidate image x of the same size and returns a high score if the two images depict the same object and a low score otherwise. Because the network is fully convolutional, all candidates within the search image are scored in a single evaluation, which yields a score map. The backbone is AlexNet.
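The comparison of an exemplar against every candidate position can be sketched as a plain cross-correlation. In the sketch below, small single-channel grids stand in for the learned feature maps, and the function names are hypothetical; it only illustrates how a score map arises, not how SiamFC is implemented.

```python
def cross_correlate(search, exemplar):
    """Slide the exemplar over the search grid and return the score map.

    Only positions where the exemplar fits entirely are evaluated.
    """
    search_h, search_w = len(search), len(search[0])
    ex_h, ex_w = len(exemplar), len(exemplar[0])
    score_map = []
    for top in range(search_h - ex_h + 1):
        row = []
        for left in range(search_w - ex_w + 1):
            row.append(sum(search[top + i][left + j] * exemplar[i][j]
                           for i in range(ex_h) for j in range(ex_w)))
        score_map.append(row)
    return score_map

def argmax_2d(score_map):
    """Return the (row, col) of the highest score."""
    _, row, col = max((value, r, c)
                      for r, line in enumerate(score_map)
                      for c, value in enumerate(line))
    return row, col

if __name__ == "__main__":
    exemplar = [[1, 2],
                [3, 4]]
    search = [[0, 0, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 3, 4],
              [0, 0, 0, 0]]
    scores = cross_correlate(search, exemplar)
    print(argmax_2d(scores))  # (1, 2): where the exemplar pattern sits
```

The peak of the score map gives the estimated target position relative to the search region.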

For training, pairs are obtained from a dataset of annotated videos by extracting exemplar and search images that are centred on the target. The images are extracted from two frames of a video that both contain the object and are at most T frames apart. The elements of the score map are considered to belong to a positive example if they are within radius R of the centre. Note that the network is trained offline and is not updated during tracking.
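The two training rules above (frames at most T apart, positives within radius R of the centre) can be sketched as follows. The sketch assumes that every frame of the video contains the object, measures R directly in score-map cells, and encodes labels as +1 and -1; these choices and the function names are assumptions for illustration.

```python
import math
import random

def sample_frame_pair(num_frames, max_gap, rng):
    """Pick two frame indices that are at most max_gap frames apart."""
    first = rng.randrange(num_frames)
    low = max(0, first - max_gap)
    high = min(num_frames - 1, first + max_gap)
    second = rng.randint(low, high)  # inclusive on both ends
    return first, second

def make_label_map(height, width, radius):
    """Label score-map cells +1 within `radius` of the centre, else -1."""
    centre_row, centre_col = (height - 1) / 2, (width - 1) / 2
    return [[1 if math.hypot(row - centre_row, col - centre_col) <= radius
             else -1
             for col in range(width)]
            for row in range(height)]

if __name__ == "__main__":
    rng = random.Random(0)
    print(sample_frame_pair(num_frames=200, max_gap=100, rng=rng))
    for line in make_label_map(5, 5, radius=1):
        print(line)
```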


project and code

CFNet improves on SiamFC by adding a dedicated layer for the correlation operation: a correlation filter implemented as a differentiable layer.



Compared with SiamFC, DSiam introduces two fast online transformations, i.e. a target variation transformation and a background suppression transformation, which let SiamFC adapt to target changes while suppressing background interference.



SINT++ improves on SINT by introducing an auto-encoder and a GAN to generate variations of positive samples, which enhances the robustness of the tracker.


Two networks, namely A-Net and S-Net, are used to extract appearance features and semantic features, respectively. In addition, an attention mechanism is introduced.


RASNet further explores the idea of the attention mechanism. It uses three attention mechanisms to remove the need for online model updates during tracking:

  • residual attentions: distinctiveness of current tracking target over the common samples
  • general attentions: learned from all training samples
  • channel attentions: selecting semantic attributes for different contexts

The backbone of the attention module in RASNet is an hourglass-like convolutional neural network (CNN) that learns contextualized, multi-scale feature representations.


High Performance Visual Tracking with Siamese Region Proposal Network, Bo Li, Wei Wu, Zheng Zhu, Junjie Yan, Xiaolin Hu., IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. project code in pytorch code in tensorflow

SiamRPN focuses on the speed of deep neural network based tracking. It consists of a Siamese subnetwork for feature extraction, with a template branch and a detection branch, and a region proposal subnetwork, with a classification branch and a regression branch. In the inference phase, the proposed framework is formulated as a local one-shot detection task, where the bounding box in the first frame is the only exemplar.

Similar to Faster R-CNN, the classification branch predicts foreground/background and the regression branch computes the bounding box offsets for each of the k anchors. The template branch of the Siamese subnetwork is pre-computed on the first frame, and its outputs serve as the kernels of the correlation layers (denoted by a star), which then act as ordinary convolution layers during online tracking.

AlexNet is used as the backbone.
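How anchor offsets turn into a box can be sketched with the usual Faster R-CNN style parameterization: centre shifts are scaled by the anchor size and sizes are scaled exponentially. The `(cx, cy, w, h)` tuple layout and the function names are assumptions of this sketch, and real trackers typically apply extra ranking heuristics before picking a proposal, which are omitted here.

```python
import math

def decode_box(anchor, offsets):
    """Apply regression offsets (dx, dy, dw, dh) to an anchor (cx, cy, w, h)."""
    ax, ay, aw, ah = anchor
    dx, dy, dw, dh = offsets
    return (ax + dx * aw,        # shift centre x by a fraction of the width
            ay + dy * ah,        # shift centre y by a fraction of the height
            aw * math.exp(dw),   # scale width
            ah * math.exp(dh))   # scale height

def best_proposal(anchors, foreground_scores, offsets):
    """Decode the anchor with the highest foreground score."""
    best = max(range(len(anchors)), key=lambda i: foreground_scores[i])
    return decode_box(anchors[best], offsets[best]), foreground_scores[best]

if __name__ == "__main__":
    anchors = [(10, 10, 4, 2), (10, 10, 2, 4)]
    scores = [0.3, 0.9]
    offsets = [(0.0, 0.0, 0.0, 0.0), (0.5, -1.0, 0.0, 0.0)]
    print(best_proposal(anchors, scores, offsets))
```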



Similar to SiamFC, but trained with a triplet loss.
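For readers unfamiliar with the idea, the sketch below shows a generic margin-based triplet loss on embedding vectors. It is the textbook formulation, used here only to illustrate the concept; the exact loss used by the tracker may differ, and the function names and margin value are assumptions.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Penalize triplets where the negative is not at least `margin`
    farther from the anchor than the positive is."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

if __name__ == "__main__":
    anchor, positive = [0.0, 0.0], [0.0, 1.0]
    print(triplet_loss(anchor, positive, negative=[3.0, 0.0]))  # 0.0
    print(triplet_loss(anchor, positive, negative=[1.0, 0.0]))  # 1.0
```

The loss is zero once the negative is sufficiently farther away than the positive, so training effort concentrates on the triplets that are still confusable.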



StructSiam proposes a local structure learning method, which simultaneously considers the local patterns of the target and their structural relationships for more accurate target tracking. To this end, a local pattern detection module is designed to automatically identify discriminative regions of the target objects.


Distractor-aware Siamese Networks for Visual Object Tracking, Zheng Zhu, Qiang Wang, Bo Li, Wu Wei, Junjie Yan, Weiming Hu., The European Conference on Computer Vision (ECCV), 2018. code

DaSiamRPN observes that the features used in most Siamese tracking approaches can only discriminate the foreground from non-semantic backgrounds, whereas semantic backgrounds act as distractors. To solve this problem, an effective sampling strategy is introduced in the training stage to address the imbalanced distribution of training data. During inference, a novel distractor-aware module is designed to perform incremental learning.

Sampling strategies:

  • Diverse categories of positive pairs improve generalization; they can be obtained by applying data augmentation to object detection datasets.
  • Semantic negative pairs improve the discriminative ability. In existing algorithms, most negative samples are non-semantic (just background rather than real objects) and are easily classified. To address this, the constructed negative pairs consist of labelled targets from both the same category and different categories.
  • Data augmentation is customized for visual tracking, e.g., motion blur.

During inference, the detection with the highest score is selected as the tracking result, whereas other detections with sufficiently high scores are used as hard negative samples to incrementally update the tracker. For long-term tracking, the failure case (object out of scene) needs to be handled, so an iterative local-to-global search strategy is designed to re-detect the target.
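The split between the tracking result and the hard negatives can be sketched as below. The sketch assumes the detections have already been de-duplicated, represents each detection as a `(box, score)` pair, and uses a hypothetical score threshold; the incremental update itself is not shown.

```python
def split_detections(detections, distractor_threshold):
    """Return the top-scoring detection and the remaining high-scoring ones.

    `detections` is a non-empty list of (box, score) pairs.
    """
    if not detections:
        raise ValueError("at least one detection is required")
    ranked = sorted(detections, key=lambda det: det[1], reverse=True)
    target = ranked[0]
    # Everything else that still scores highly is treated as a distractor.
    distractors = [det for det in ranked[1:] if det[1] >= distractor_threshold]
    return target, distractors

if __name__ == "__main__":
    detections = [((5, 5, 20, 20), 0.40),
                  ((50, 40, 22, 21), 0.95),
                  ((90, 42, 20, 19), 0.80)]
    target, distractors = split_detections(detections, distractor_threshold=0.7)
    print(target)       # ((50, 40, 22, 21), 0.95)
    print(distractors)  # [((90, 42, 20, 19), 0.8)]
```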



DensSiam is a novel convolutional Siamese architecture that uses the concept of dense layers (similar to DenseNet) and connects each dense layer to all layers in a feed-forward fashion with a similarity-learning function. DensSiam also includes a self-attention mechanism that forces the network to pay more attention to non-local features during offline training.



MBST learns multiple Siamese networks and uses a selection module to choose among them on the fly during inference.


project code

Siam-BM addresses scale change and rotation during tracking by augmenting the query patches via rotation and scaling.
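One way to picture the augmentation is as a grid of rotated and scaled candidate boxes around the current target. The sketch below only computes the corner points of such candidates; the angle and scale values, the function names and the box layout are assumptions for illustration rather than the settings used by Siam-BM.

```python
import math

def transformed_corners(centre, size, angle_deg, scale):
    """Corners of a box of `size` = (w, h), scaled and rotated about `centre`."""
    cx, cy = centre
    half_w, half_h = size[0] * scale / 2, size[1] * scale / 2
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    corners = []
    for dx, dy in ((-half_w, -half_h), (half_w, -half_h),
                   (half_w, half_h), (-half_w, half_h)):
        corners.append((cx + dx * cos_a - dy * sin_a,
                        cy + dx * sin_a + dy * cos_a))
    return corners

def candidate_patches(centre, size, angles, scales):
    """Enumerate one candidate per (angle, scale) combination."""
    return [(angle, scale, transformed_corners(centre, size, angle, scale))
            for angle in angles for scale in scales]

if __name__ == "__main__":
    for angle, scale, corners in candidate_patches(
            centre=(50.0, 40.0), size=(20.0, 10.0),
            angles=(-8.0, 0.0, 8.0), scales=(0.95, 1.0, 1.05)):
        print(angle, scale, [(round(x, 1), round(y, 1)) for x, y in corners])
```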


C-RPN applies the cascading idea to SiamRPN.


SiamRPN++: Evolution of Siamese Visual Tracking with Very Deep Networks, Bo Li, Wei Wu, Qiang Wang, Fangyi Zhang, Junliang Xing, Junjie Yan., IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

This paper studies how to apply modern neural network backbones (e.g., ResNet) to Siamese network based trackers. This is achieved via a simple yet effective spatial aware sampling strategy. Moreover, the authors propose a new model architecture that performs layer-wise and depth-wise aggregations, which not only further improves the accuracy but also reduces the model size.

When deep networks are used in Siamese network based trackers, the decrease in accuracy comes from the loss of strict translation invariance caused by padding in the convolutions. To address this, the spatial aware sampling strategy is introduced.
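The sampling idea can be sketched as follows, under the assumption that it amounts to shifting the target away from the image centre by a uniformly drawn offset during training, so that the network does not learn to always respond at the centre. The shift range and function name are hypothetical.

```python
import random

def sample_shifted_centre(centre, max_shift, rng):
    """Shift the target centre by a uniform offset in [-max_shift, max_shift]."""
    cx, cy = centre
    return (cx + rng.uniform(-max_shift, max_shift),
            cy + rng.uniform(-max_shift, max_shift))

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical search-image centre and shift range, in pixels.
    for _ in range(3):
        print(sample_shifted_centre((127.0, 127.0), max_shift=64.0, rng=rng))
```

With `max_shift` set to zero, every training target sits at the centre, which is the setting in which a padded network can pick up a centre bias.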



This paper discusses how to use modern very deep networks, e.g., ResNet, for tracking. It is similar to SiamRPN++.


Fast Online Object Tracking and Segmentation: A Unifying Approach, Qiang Wang, Li Zhang, Luca Bertinetto, Weiming Hu, Philip H.S. Torr., IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. project code

SiamMask formulates the problems of visual tracking and visual object segmentation as a joint learning of three tasks:

  • to learn a measure of similarity between the target object and multiple candidates in a sliding window fashion
  • bounding box regression using a Region Proposal Network
  • class-agnostic binary segmentation: binary labels are only required during offline training to compute the segmentation loss and not online during segmentation/tracking

Once trained, SiamMask solely relies on a single bounding box initialisation, operates online without updates and produces object segmentation masks and rotated bounding boxes at 55 frames per second.
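To show how a box can be read off a predicted mask, the sketch below thresholds a grid of mask probabilities and takes the tightest axis-aligned box around the foreground cells. This is a simplification: SiamMask reports rotated boxes, and the threshold value, grid layout and function name here are assumptions.

```python
def mask_to_box(mask_probs, threshold=0.5):
    """Return (left, top, right, bottom) around cells above `threshold`.

    Returns None when no cell exceeds the threshold.
    """
    rows = [r for r, line in enumerate(mask_probs)
            if any(p > threshold for p in line)]
    if not rows:
        return None
    cols = [c for c in range(len(mask_probs[0]))
            if any(line[c] > threshold for line in mask_probs)]
    return (min(cols), min(rows), max(cols), max(rows))

if __name__ == "__main__":
    mask = [[0.1, 0.2, 0.1, 0.0],
            [0.1, 0.9, 0.8, 0.0],
            [0.0, 0.7, 0.6, 0.1],
            [0.0, 0.1, 0.2, 0.0]]
    print(mask_to_box(mask))  # (1, 1, 2, 2)
```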


PySOT is a software system designed by the SenseTime Video Intelligence Research team. It implements state-of-the-art single object tracking algorithms, including SiamRPN and SiamMask. It is written in Python and powered by the PyTorch deep learning framework. The project also contains a Python port of a toolkit for evaluating trackers.

PySOT includes implementations of the following visual tracking algorithms:

using the following backbone network architectures:

Additional backbone architectures may be easily implemented. For more details about these models, please see References below.

The evaluation toolkit supports the following datasets:

Written on May 15, 2019