# Logistic (Log) Loss Function

A Logistic (Log) Loss Function is a convex loss function that is defined as the negative log-likelihood of a logistic model that returns predicted probabilities for its training data.
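The link between "negative log-likelihood" and the per-sample loss in the definition above can be sketched as follows. This is a minimal illustration, assuming binary labels in {0, 1} and a model that outputs the probability of class 1; the function names and the data are hypothetical and do not come from any library.

```python
from math import log, prod

def bernoulli_likelihood(y_true, y_prob):
    # Probability the model assigns to the observed labels, assuming
    # independent samples: p for a label of 1, (1 - p) for a label of 0.
    return prod(p if y == 1 else 1.0 - p for y, p in zip(y_true, y_prob))

def negative_log_likelihood(y_true, y_prob):
    # Sum of per-sample log losses; dividing by len(y_true) gives the mean log loss.
    return -sum(log(p if y == 1 else 1.0 - p) for y, p in zip(y_true, y_prob))

y_true = [1, 0, 1]         # illustrative labels
y_prob = [0.9, 0.2, 0.6]   # illustrative predicted P(y = 1)

likelihood = bernoulli_likelihood(y_true, y_prob)
nll = negative_log_likelihood(y_true, y_prob)
print(likelihood, nll, -log(likelihood))  # the last two values should agree
```

Working with the sum of logs rather than the product of probabilities avoids underflow when there are many samples.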

**AKA:** Logistic (Log) Loss Function, Cross-Entropy Loss Function.

**Context:**
- input: $y_{train}$, training data.
- output: $y_{pred}$, a Log Loss Value (computed from the predicted probabilities for each training data value, each of which ranges from 0 to 1).
- It measures the performance of a classification model whose output is a probability value between 0 and 1.
- It can (often) be used for a Binary Classification Task with Predicted Probability.

**Example(s):**
- `theano.tensor.nnet.nnet.sigmoid_binary_crossentropy()`, Theano's implementation.
- `sklearn.metrics.log_loss()`, SciKit Learn's implementation.
- …

**Counter-Example(s):**
- an Exponential Loss Function,
- a Hinge-Loss Function, as used by SVMs.
- a Huber Loss Function,
- a Kullback-Leibler Loss Function,
- a Savage Loss Function,
- a Square Loss Function,
- a Tangent Loss Function.

**See:** Squared Error Function, Cross-Entropy Measure, Mean Absolute Error, Mean Squared Error.

## References

### 2021a

- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Loss_functions_for_classification#Logistic_loss Retrieved:2021-3-7.
- The logistic loss function can be generated using (2) and Table-I as follows:
$$\begin{aligned} \phi(v) &= C[f^{-1}(v)]+\left(1-f^{-1}(v)\right)\, C'\left[f^{-1}(v)\right] \\ &= \frac{1}{\log(2)}\left [\frac{-e^v}{1+e^v}\log\frac{e^v}{1+e^v}-\left(1-\frac{e^v}{1+e^v}\right)\log\left(1-\frac{e^v}{1+e^v}\right)\right ]+\left(1-\frac{e^v}{1+e^v}\right) \left [\frac{-1}{\log(2)}\log\left(\frac{\frac{e^v}{1+e^v}}{1-\frac{e^v}{1+e^v}}\right)\right] \\ &=\frac{1}{\log(2)}\log(1+e^{-v}). \end{aligned}$$
The logistic loss is convex and grows linearly for negative values, which makes it less sensitive to outliers. The logistic loss is used in the LogitBoost algorithm.
The minimizer of $I[f]$ for the logistic loss function can be directly found from equation (1) as: $$f^*_\text{Logistic}= \log\left(\frac{\eta}{1-\eta}\right)=\log\left(\frac{p(1\mid x)}{1-p(1\mid x)}\right).$$ This function is undefined when $p(1\mid x)=1$ or $p(1\mid x)=0$ (tending toward ∞ and −∞ respectively), but predicts a smooth curve which grows when $p(1\mid x)$ increases and equals 0 when $p(1\mid x)=0.5$.

It's easy to check that the logistic loss and binary cross entropy loss (Log loss) are in fact the same (up to a multiplicative constant $\frac{1}{\log(2)}$). The cross entropy loss is closely related to the Kullback–Leibler divergence between the empirical distribution and the predicted distribution. The cross entropy loss is ubiquitous in modern deep neural networks.
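The equivalence claimed above can be checked numerically. The sketch below assumes labels $y \in \{-1, +1\}$ for the logistic loss, labels $t \in \{0, 1\}$ for the cross-entropy, and a sigmoid link between the raw score and the predicted probability; the function names and the score value are illustrative.

```python
from math import exp, log

def logistic_loss(v):
    # phi(v) = log(1 + exp(-v)) / log(2), where v = y * f(x) and y is in {-1, +1}.
    return log(1.0 + exp(-v)) / log(2.0)

def binary_cross_entropy(t, p):
    # Natural-log cross-entropy for a label t in {0, 1} and predicted P(t = 1) = p.
    return -(t * log(p) + (1 - t) * log(1.0 - p))

def sigmoid(f):
    return 1.0 / (1.0 + exp(-f))

f = 0.7  # illustrative raw model score
for y in (-1, +1):
    t = (y + 1) // 2  # map {-1, +1} to {0, 1}
    lhs = logistic_loss(y * f)
    rhs = binary_cross_entropy(t, sigmoid(f)) / log(2.0)
    print(y, lhs, rhs)  # the two loss values should agree for each label
```

Dividing the natural-log cross-entropy by $\log(2)$ is the multiplicative constant mentioned in the quote; it only changes the unit from nats to bits.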


### 2021b

- (Wikipedia, 2021) ⇒ https://en.wikipedia.org/wiki/Cross_entropy Retrieved:2021-3-7.
- In information theory, the **cross-entropy** between two probability distributions $p$ and $q$ over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is optimized for an estimated probability distribution $q$, rather than the true distribution $p$.
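A small sketch of this "average number of bits" reading, assuming discrete distributions given as lists of probabilities over the same events. The function names and the two distributions are illustrative.

```python
from math import log2

def entropy_bits(p):
    # Average code length in bits when the code is optimized for p itself.
    return -sum(pi * log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    # Average code length in bits when events follow p but the code is optimized for q.
    return -sum(pi * log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # illustrative true distribution
q = [0.25, 0.25, 0.5]   # illustrative estimated distribution

print(entropy_bits(p))           # bits needed with the ideal code
print(cross_entropy_bits(p, q))  # bits needed with the mismatched code
```

The gap between the two values is the Kullback–Leibler divergence from $q$ to $p$, which is zero only when the two distributions coincide.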


### 2021c

- (ML Glossary, 2021) ⇒ https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html Retrieved:2021-03-06.
- QUOTE: Cross-entropy loss, or log loss, measures the performance of a classification model whose output is a probability value between 0 and 1. (...).
Cross-entropy and log loss are slightly different depending on context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing.

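**Code**

```python
from math import log

def CrossEntropy(yHat, y):
    # y is the true label; yHat is the predicted probability that y == 1.
    if y == 1:
        return -log(yHat)
    else:
        return -log(1 - yHat)
```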

In binary classification, where the number of classes $M$ equals 2, cross-entropy can be calculated as:

$-\left(y\log\left(p\right)+\left(1-y\right)\log\left(1-p\right)\right)$

If $M>2$ (i.e. multiclass classification), we calculate a separate loss for each class label per observation and sum the result.

$-\displaystyle \sum_{c=1}^{M} y_{o,c}\log\left(p_{o,c}\right)$
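As a rough check that the two formulas above agree, the sketch below evaluates the multiclass sum for a single observation and compares it with the binary form when $M = 2$. It assumes one-hot labels and class probabilities that sum to one; the function names are illustrative.

```python
from math import log

def multiclass_cross_entropy(y_onehot, probs):
    # -sum_c y_{o,c} * log(p_{o,c}) for one observation o.
    # Terms with y_{o,c} == 0 contribute nothing and are skipped.
    return -sum(y * log(p) for y, p in zip(y_onehot, probs) if y != 0)

def binary_cross_entropy(y, p):
    # -(y * log(p) + (1 - y) * log(1 - p)) for one observation.
    return -(y * log(p) + (1 - y) * log(1 - p))

p = 0.8  # illustrative predicted probability of the positive class
for y in (0, 1):
    two_class = multiclass_cross_entropy([y, 1 - y], [p, 1 - p])
    print(y, two_class, binary_cross_entropy(y, p))  # the two losses should match
```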


### 2018a

- (Fast AI, 2018a) ⇒ http://wiki.fast.ai/index.php/Log_Loss
- QUOTE: Logarithmic loss (related to cross-entropy) measures the performance of a classification model where the prediction input is a probability value between 0 and 1. The goal of our machine learning models is to minimize this value. A perfect model would have a log loss of 0. Log loss increases as the predicted probability diverges from the actual label. So predicting a probability of 0.012 when the actual observation label is 1 would be bad and result in a high log loss. There is a more detailed explanation of the justifications and math behind log loss here. …
… To calculate log loss from scratch, we need to include the MinMax function (see below). Numpy implements this for us with np.clip()

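```python
from math import log
import numpy as np

def logloss(true_label, predicted, eps=1e-15):
    # Clip the prediction away from 0 and 1 so that log() stays finite.
    p = np.clip(predicted, eps, 1 - eps)
    if true_label == 1:
        return -log(p)
    else:
        return -log(1 - p)
```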
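The numbers behind the quote's example can be reproduced with a pure-Python variant of the same function. In this sketch `min`/`max` stand in for `np.clip`, and the function name and test values other than 0.012 are illustrative.

```python
from math import log

def clipped_log_loss(true_label, predicted, eps=1e-15):
    # Pure-Python stand-in for np.clip: keep the prediction inside [eps, 1 - eps].
    p = min(max(predicted, eps), 1 - eps)
    return -log(p) if true_label == 1 else -log(1 - p)

bad = clipped_log_loss(1, 0.012)    # low probability for the true class
good = clipped_log_loss(1, 0.95)    # high probability for the true class
capped = clipped_log_loss(1, 0.0)   # finite only because of the clipping

print(bad, good, capped)
```

Predicting 0.012 for a true label of 1 costs roughly 4.4, against about 0.05 for a prediction of 0.95. Without clipping, a prediction of exactly 0 would make the logarithm undefined.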


### 2018b

- (DeepLearning,2018) ⇒ http://deeplearning.net/software/theano/library/tensor/nnet/nnet.html#theano.tensor.nnet.nnet.sigmoid_binary_crossentropy
- QUOTE: It is equivalent to binary_crossentropy(sigmoid(output), target), but with more efficient and numerically stable computation, especially when taking gradients.
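One common way to get this kind of stability is to work directly on the raw score instead of on the sigmoid output. The sketch below illustrates that idea with a well-known algebraic rewrite; it is an assumption about the general technique, not a description of how Theano implements `sigmoid_binary_crossentropy`, and both function names are hypothetical.

```python
from math import exp, log, log1p

def naive_sigmoid_bce(x, t):
    # Apply the sigmoid first, then the cross-entropy. For large |x| the
    # sigmoid rounds to exactly 0 or 1 and log() fails, or exp() overflows.
    p = 1.0 / (1.0 + exp(-x))
    return -(t * log(p) + (1 - t) * log(1 - p))

def stable_sigmoid_bce(x, t):
    # Algebraically equal rewrite: max(x, 0) - x * t + log(1 + exp(-|x|)).
    # exp() only ever sees a non-positive argument, so it cannot overflow.
    return max(x, 0.0) - x * t + log1p(exp(-abs(x)))

print(naive_sigmoid_bce(0.7, 1), stable_sigmoid_bce(0.7, 1))  # agree for moderate scores
print(stable_sigmoid_bce(40.0, 0))                            # still finite for an extreme score
```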

### 2017a

- (WikiFastAI) ⇒ http://wiki.fast.ai/index.php/Log_Loss#Log_Loss_vs_Cross-Entropy
- QUOTE: Log loss and cross-entropy are slightly different depending on the context, but in machine learning when calculating error rates between 0 and 1 they resolve to the same thing. As a demonstration, where p and q are the sets p∈{y, 1−y} and q∈{ŷ, 1−ŷ} we can rewrite cross-entropy as:
- p = set of true labels
- q = set of prediction
- y = true label
- ŷ = predicted prob

- Which is exactly the same as log loss!
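The rewritten formula itself is not reproduced in the excerpt above. A reconstruction from the standard definition of cross-entropy and the symbols listed, offered as a sketch rather than as the source's exact notation:

$$H(p, q) = -\sum_{x} p(x)\,\log q(x) = -\bigl(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\bigr)$$

The sum runs over the two outcomes, with $p$ taking the values $y$ and $1 - y$ and $q$ taking the values $\hat{y}$ and $1 - \hat{y}$.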


### 2017b

- (Kaggle, 2017) ⇒ https://www.kaggle.com/c/bioresponse/discussion/1831
- QUOTE:

```python
from math import log

def log_loss(predicted, target):
    if len(predicted) != len(target):
        print('lengths not equal!')
        return
    target = [float(x) for x in target]  # make sure all float values
    predicted = [min([max([x, 1e-15]), 1 - 1e-15]) for x in predicted]  # within (0,1) interval
    return -(1.0 / len(target)) * sum([target[i] * log(predicted[i]) +
                                       (1.0 - target[i]) * log(1.0 - predicted[i])
                                       for i in range(len(target))])

if __name__ == '__main__':  # if you run at the command line as 'python utils.py'
    actual = [0, 1, 1, 1, 1, 0, 0, 1, 0, 1]
    pred = [0.24160452, 0.41107934, 0.37063768, 0.48732519, 0.88929869,
            0.60626423, 0.09678324, 0.38135864, 0.20463064, 0.21945892]
    print(log_loss(pred, actual))
```

### 2016

- (Program Creek, 2016) ⇒ https://www.programcreek.com/python/example/86075/sklearn.metrics.log_loss
- QUOTE:

```python
# Note: np, sp, mvmean and binarize_predictions are not defined in this excerpt.
def log_loss(solution, prediction, task='binary.classification'):
    '''Log loss for binary and multiclass.'''
    [sample_num, label_num] = solution.shape
    eps = 1e-15

    pred = np.copy(prediction)  # beware: changes in prediction occur through this
    sol = np.copy(solution)
    if (task == 'multiclass.classification') and (label_num > 1):
        # Make sure the lines add up to one for multi-class classification
        norma = np.sum(prediction, axis=1)
        for k in range(sample_num):
            pred[k, :] /= sp.maximum(norma[k], eps)
        # Make sure there is a single label active per line for multi-class classification
        sol = binarize_predictions(solution, task='multiclass.classification')
        # For the base prediction, this solution is ridiculous in the multi-label case

    # Bounding of predictions to avoid log(0),1/0,...
    pred = sp.minimum(1 - eps, sp.maximum(eps, pred))
    # Compute the log loss
    pos_class_log_loss = - mvmean(sol * np.log(pred), axis=0)
    if (task != 'multiclass.classification') or (label_num == 1):
        # The multi-label case is a bunch of binary problems.
        # The second class is the negative class for each column.
        neg_class_log_loss = - mvmean((1 - sol) * np.log(1 - pred), axis=0)
        log_loss = pos_class_log_loss + neg_class_log_loss
        # Each column is an independent problem, so we average.
        # The probabilities in one line do not add up to one.
        # log_loss = mvmean(log_loss)
        # print('binary {}'.format(log_loss))
        # In the multilabel case, the right thing is to AVERAGE not sum
        # We return all the scores so we can normalize correctly later on
    else:
        # For the multiclass case the probabilities in one line add up one.
        log_loss = pos_class_log_loss
        # We sum the contributions of the columns.
        log_loss = np.sum(log_loss)
        # print('multiclass {}'.format(log_loss))
    return log_loss
```

### 2015

- (SciKit-Learn, 2015) ⇒ http://scikit-learn.org/stable/modules/generated/sklearn.metrics.log_loss.html
- Log loss, aka logistic loss or cross-entropy loss.
- QUOTE: This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier’s predictions. For a single sample with true label $y_t \in \{0,1\}$ and estimated probability $y_p$ that $y_t = 1$, the log loss is: [math]\displaystyle{ -\log P(y_t|y_p) = -(y_t \log(y_p) + (1 - y_t) \log(1 - y_p)) }[/math]

### 2014

- (Kaggle, 2014) ⇒ https://www.kaggle.com/wiki/LogarithmicLoss
- QUOTE: [math]\displaystyle{ \operatorname{log loss} = -\frac{1}{N}\sum_{i=1}^N\sum_{j=1}^My_{ij}\log(p_{ij}) }[/math]
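A direct transcription of this double sum, as a minimal sketch. It assumes `y_onehot` holds one-hot rows $y_{ij}$ and `probs` holds the matching predicted probabilities $p_{ij}$; the clipping by `eps` is an added assumption (it is not part of the quoted formula) to keep the logarithm finite, and the function name and data are illustrative.

```python
from math import log

def mean_multiclass_log_loss(y_onehot, probs, eps=1e-15):
    # -(1/N) * sum_i sum_j y_ij * log(p_ij), with p_ij clipped to [eps, 1 - eps].
    n = len(y_onehot)
    total = 0.0
    for y_row, p_row in zip(y_onehot, probs):
        for y, p in zip(y_row, p_row):
            total += y * log(min(max(p, eps), 1 - eps))
    return -total / n

y_onehot = [[1, 0, 0], [0, 1, 0]]                  # illustrative labels, N = 2, M = 3
probs = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25]]     # illustrative predictions

loss = mean_multiclass_log_loss(y_onehot, probs)
print(loss)
```

Both samples give the true class a probability of 0.5, so the mean loss here equals $\log 2$.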