Face Landmark Detection

[ deep-learning  face  face-landmark  face-landmark-detection  ]

Facial landmark detection algorithms aim to automatically identify the locations of key facial landmark points in facial images or videos. These key points are either dominant points describing the unique location of a facial component (e.g., an eye corner) or interpolated points connecting those dominant points around the facial components and the facial contour.

Facial landmark detection algorithms can be divided into three major categories, according to how they use facial appearance and shape information:

  • holistic methods: explicitly build models to represent the global facial appearance and shape information
  • Constrained Local Model (CLM) methods: explicitly leverage the global shape model but build local appearance models
  • regression-based methods: implicitly capture facial shape and appearance information

This note will focus on the regression-based methods, especially those based on deep learning.

Deep convolutional network cascade for facial point detection

This paper (Cascaded CNN) proposes a three-stage cascade of neural networks to predict facial landmark locations. Each stage consists of several networks, each covering one or several landmarks, and takes the outputs of all networks of the previous stage, so every landmark prediction receives context information from the whole face.

The first stage takes a large portion of the face as input and is already able to generate high-precision predictions; the next two stages take patches centered at the predictions of the first stage and refine them further.
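
The coarse-to-fine flow can be sketched roughly as follows. This is only an illustration of the control flow, not the paper's implementation: `coarse_predictor` and `refiners` are hypothetical stand-ins for trained networks, the patch half-sizes are arbitrary, and it is assumed that each refiner returns a position in patch coordinates.

```python
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Image = List[List[float]]


def crop_patch(image: Image, center: Point, half: int) -> Tuple[Image, Tuple[int, int]]:
    """Crop a square patch around `center`, clipped to the image bounds.

    Returns the patch and the (x, y) image coordinates of its top-left pixel.
    """
    height, width = len(image), len(image[0])
    # Clamp the center so the patch always overlaps the image.
    cx = min(max(int(round(center[0])), 0), width - 1)
    cy = min(max(int(round(center[1])), 0), height - 1)
    x0, x1 = max(cx - half, 0), min(cx + half + 1, width)
    y0, y1 = max(cy - half, 0), min(cy + half + 1, height)
    return [row[x0:x1] for row in image[y0:y1]], (x0, y0)


def cascade_predict(
    image: Image,
    coarse_predictor: Callable[[Image], Sequence[Point]],
    refiners: Sequence[Callable[[Image], Point]],
    half_sizes: Sequence[int],
) -> List[Point]:
    """Run a coarse prediction, then refine each landmark stage by stage."""
    landmarks = list(coarse_predictor(image))
    for refine, half in zip(refiners, half_sizes):
        refined = []
        for point in landmarks:
            patch, (x0, y0) = crop_patch(image, point, half)
            # Assumption: the refiner returns a position in patch coordinates.
            px, py = refine(patch)
            refined.append((x0 + px, y0 + py))
        landmarks = refined
    return landmarks
```

Passing two refiners with decreasing half-sizes mirrors the three-stage layout described above: one coarse pass over the face, followed by two refinement passes over progressively smaller patches.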

Note that the convolutional layers used in Cascaded CNN are slightly different from typical ones: each map in a convolutional layer is evenly divided into p by q regions, and weights are shared only locally within each region.
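
To make the idea of locally shared weights concrete, here is a minimal single-channel sketch in plain Python. It is an illustration under simplifying assumptions (one input map, one output map, no bias or non-linearity, "valid" cross-correlation), and the function name `locally_shared_conv2d` is made up for this note.

```python
from typing import List

Matrix = List[List[float]]


def locally_shared_conv2d(image: Matrix, kernels: List[List[Matrix]]) -> Matrix:
    """Valid cross-correlation whose kernel depends on the output region.

    `kernels[i][j]` is the k x k kernel used for every output position that
    falls into region (i, j) of a p x q grid laid over the output map.
    """
    p, q = len(kernels), len(kernels[0])
    k = len(kernels[0][0])
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    output = [[0.0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        region_row = y * p // out_h  # which of the p row bands this output is in
        for x in range(out_w):
            region_col = x * q // out_w  # which of the q column bands
            kernel = kernels[region_row][region_col]
            output[y][x] = sum(
                kernel[dy][dx] * image[y + dy][x + dx]
                for dy in range(k)
                for dx in range(k)
            )
    return output
```

With p = q = 1 this reduces to an ordinary convolutional layer with globally shared weights; larger p and q let different face regions (e.g., eyes versus mouth) learn different filters.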

Facial landmark detection by deep multi-task learning

This work (TCDCN) proposes to optimize facial landmark detection together with heterogeneous but subtly correlated tasks, e.g., head pose estimation and facial attribute inference. The motivation is that one task (e.g., head pose estimation) can help the others.
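
A joint objective of this kind can be sketched as a landmark regression loss plus weighted classification losses for the auxiliary tasks. The code below is a simplified illustration, not the paper's exact formulation: the task names, the weights in `aux_weights`, and the omission of regularization are all assumptions.

```python
import math
from typing import Dict, Sequence


def landmark_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Half squared error over the flattened landmark coordinates."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(pred, target))


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Negative log-likelihood of the true class."""
    return -math.log(probs[label])


def multi_task_loss(
    landmark_pred: Sequence[float],
    landmark_target: Sequence[float],
    aux_probs: Dict[str, Sequence[float]],
    aux_labels: Dict[str, int],
    aux_weights: Dict[str, float],
) -> float:
    """Main-task loss plus a weighted sum of auxiliary-task losses."""
    total = landmark_loss(landmark_pred, landmark_target)
    for task, probs in aux_probs.items():
        total += aux_weights[task] * cross_entropy(probs, aux_labels[task])
    return total
```

Setting an auxiliary weight to zero removes that task from the objective, which is how a stopped task can be modeled in this sketch.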

Because different tasks may have different learning difficulties and convergence rates, a task-wise early stopping criterion is proposed, based on the training error and the validation error of each task.
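
The sketch below shows one way such a per-task check could look. It is deliberately simplified and does not reproduce the paper's exact criterion: it stops a task when its training error has plateaued over a recent window and its validation error is well above the best training error. The thresholds `min_improvement` and `max_gap` and the function name are assumptions made for illustration.

```python
from typing import Sequence


def should_stop_task(
    train_errors: Sequence[float],
    val_errors: Sequence[float],
    window: int = 3,
    min_improvement: float = 0.01,
    max_gap: float = 0.5,
) -> bool:
    """Decide whether an auxiliary task should stop contributing to training."""
    if len(train_errors) < window or not val_errors:
        return False  # not enough history yet
    recent = train_errors[-window:]
    best_train = min(train_errors)
    if recent[0] <= 0 or best_train <= 0:
        return False  # errors are assumed to be strictly positive
    # Relative drop of the training error across the window (plateau test).
    improvement = (recent[0] - recent[-1]) / recent[0]
    # Relative gap between current validation error and best training error.
    gap = (val_errors[-1] - best_train) / best_train
    return improvement < min_improvement and gap > max_gap
```

In a training loop, this check would be evaluated separately for every auxiliary task after each validation pass, and a task for which it returns `True` would be dropped from the joint objective.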

Compared with the Cascaded CNN mentioned above, TCDCN has only one network and predicts all landmarks together.


HyperFace also tries to solve face detection, landmark localization, pose estimation and gender recognition jointly, but does so by fusing the intermediate layers of the network. This is based on the observation that while the lower-layer features are effective for landmark localization and pose estimation, the higher-layer features are suitable for more complex tasks such as detection or classification.

The network is based on R-CNN.

Coarse-to-Fine Auto-Encoder Networks (CFAN) for Real-Time Face Alignment

CFAN also proposes a cascade of neural networks, but it uses auto-encoders instead of CNNs. The first stage quickly predicts the landmark locations from a low-resolution image; each following stage refines the prediction of the previous stage using a higher-resolution image.

The first stage uses the whole face image as input and generates the locations of all landmarks. Each following stage uses patches extracted around each landmark and concatenates them, so that the face shape constraint is exploited.
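
The "extract around each landmark and concatenate" step can be illustrated as follows. This is a simplification that uses raw pixel patches with zero padding; the features actually used by CFAN may differ, and the function name `joint_local_features` is hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[float]]
Point = Tuple[float, float]


def joint_local_features(image: Image, landmarks: Sequence[Point], half: int) -> List[float]:
    """Concatenate flattened square patches centered at every landmark.

    Pixels outside the image are treated as zeros, so the result always has
    len(landmarks) * (2 * half + 1) ** 2 entries.
    """
    height, width = len(image), len(image[0])
    features: List[float] = []
    for x, y in landmarks:
        cx, cy = int(round(x)), int(round(y))
        for yy in range(cy - half, cy + half + 1):
            for xx in range(cx - half, cx + half + 1):
                inside = 0 <= yy < height and 0 <= xx < width
                features.append(image[yy][xx] if inside else 0.0)
    return features
```

Because all patches end up in one vector, a stage that consumes it sees every landmark's neighborhood at once, which is what lets it respect the overall face shape rather than moving each point independently.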

Mnemonic descent method: A recurrent process applied for end-to-end face alignment

This paper proposes a combined, jointly trained recurrent convolutional neural network (MDM) for face landmark detection, in which convolutional modules extract features for each landmark and a recurrent module carries information across the cascade steps.

Specifically, at the first time step the inputs are patches extracted at the landmark locations of the mean face model. The convolutional modules extract features from those patches, and the recurrent module computes the landmark location offsets from the features. At the next time step, the landmark locations are updated according to the offsets.
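
The iteration described above can be written as a short loop. This is a structural sketch only: `extract_features` and `recurrent_step` are hypothetical callables standing in for the trained convolutional and recurrent modules, and the zero-initialized hidden state is an assumption.

```python
from typing import Any, Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def mdm_align(
    image: Any,
    mean_shape: Sequence[Point],
    extract_features: Callable[[Any, Sequence[Point]], Any],
    recurrent_step: Callable[[Any, List[float]], Tuple[List[float], Sequence[Point]]],
    num_steps: int,
    hidden_size: int,
) -> List[Point]:
    """Start from the mean shape and apply predicted offsets step by step."""
    shape = list(mean_shape)
    hidden = [0.0] * hidden_size  # assumed initial recurrent state
    for _ in range(num_steps):
        # Features are sampled at the current landmark estimates.
        features = extract_features(image, shape)
        # The recurrent module updates its state and predicts per-landmark offsets.
        hidden, offsets = recurrent_step(features, hidden)
        shape = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(shape, offsets)]
    return shape
```

The hidden state is what distinguishes this from a plain cascade: each step can take into account what earlier steps have already seen, instead of treating every refinement independently.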

Face alignment across large poses: A 3d solution

The proposed method, 3DDFA, addresses face landmark detection under large pose variations. The major challenge is that under large pose variations not all face landmarks are visible; this is addressed by fitting a 3D face model to the 2D face image.

The 3D face model is based on the 3D morphable model (3DMM), which expresses a face shape as the mean face model $\bar{S}$ plus linear combinations of $A$, the principal components computed from faces with neutral expression, and $B$, the components that account for expression changes.
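
Written out, a linear shape model of this kind takes the following form. The coefficient symbols $\alpha$ and $\beta$ are naming assumptions made for this note and may not match the notation of the paper:

$$
S = \bar{S} + A\,\alpha + B\,\beta
$$

Here $S$ is the 3D face shape, $\alpha$ holds the identity (shape) coefficients applied to $A$, and $\beta$ holds the expression coefficients applied to $B$. Fitting the model to an image then amounts to estimating $\alpha$, $\beta$ and the pose that projects $S$ onto the 2D face.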

Written on April 15, 2019