# Loss Functions

[deep-learning

loss

cross-entropy

constrastive

triplet

hinge

]
Loss functions are frequently used in supervised machine learning to minimize the differences between the predicted output of the model and the ground truth labels. In other words, it is used to measure how good our model can predict the true class of a sample from the dataset. Here I would like to list some frequently-used loss functions and give my intuitive explanation.

# Cross Entropy Loss

Cross Entropy Loss is usually used in classification problems. In essence, it is a measure of difference between the desired probablity distribution and the predicted probablity distribution. Suppose the classification is binary classification problem, the label $y$ are 0, 1. Then the loss function for a single sample in the dataset is expressed as:

\[-ylog(x)-(1-y)log(1-x)\]For K-class (K>2) classification problems, the predicted probablity output for a single sample is a vector of length K: $[p_0, p_1,…,p_K]$. The Cross Entropy Loss is extended as:

\[\sum_k{-log(p_k)}\]# Hinge Loss

Hinge loss is also widely used for classification problem, especially in support vector machine:

\[\max{(\epsilon - y f(x), 0)}\]# Contrastive Loss

Contrastive Loss is often used in image retrieval tasks to learn discriminative features for images. During training, an image pair is fed into the model with their ground truth relationship y: y equals 1 if the two images are similar and 0 otherwise. The loss function for a single pair is:

\[\sum_{i,j}{y_{i,j} \lvert x_i - x_j\rvert^2 + (1-y_{i,j})\max{(\epsilon - \lvert x_i - x_j\rvert^2, 0)}}\]An extension to constrative loss is allowing the label to be 1 if $x_1$ is better than $x_2$, -1 if $x_1$ is worse than $x_2$ or 0 if $x_1$ is similar as $x_2$:

\[\sum_{(i,j)\in y_{i,j}=0}{\lvert x_i - x_j\rvert} + \sum_{(i,j)\in y_{i,j}\neq0}\max{(\epsilon - y_{i, j}( x_i - x_j), 0)}\]# Triplet Loss

Triplet Loss is another loss commonly used in CNN-based image retrieval. During training process, an image triplet $(x_a,x_n,x_p)$ is fed into the model as a single sample, where $x_a,x_n,x_p$ represent the anchor, postive and negative images respectively. The idea behind is that distance between anchor and positive images should be smaller than that between anchor and negative images. The formal definition is:

\[max(\lVert f(x_a) - f(x_p)\rVert_2^2 - \lVert f(x_a) - f(x_n)\rVert_2^2 + \epsilon, 0)\]