# What function defines accuracy in Keras when the loss is mean squared error (MSE)?

How is accuracy defined when the loss function is mean squared error? Is it the mean absolute percentage error?

The model I use has a linear output activation and is compiled with `loss='mean_squared_error'`:

```
model.add(Dense(1))
```

and the output looks like this:

```
Epoch 99/100
1000/1000 [==============================] - 687s 687ms/step - loss: 0.0463 - acc: 0.9689 - val_loss: 3.7303 - val_acc: 0.3250
Epoch 100/100
1000/1000 [==============================] - 688s 688ms/step - loss: 0.0424 - acc: 0.9740 - val_loss: 3.4221 - val_acc: 0.3701
```

So what does e.g. `val_acc: 0.3250` mean? Mean squared error should be a scalar, not a percentage - shouldn't it? So is `val_acc` the mean squared error, the mean percentage error, or some other function?

From the definition of MSE on Wikipedia: https://en.wikipedia.org/wiki/Mean_squared_error

The MSE is a measure of the quality of an estimator—it is always non-negative, and values closer to zero are better.

Does that mean a value of `val_acc: 0.0` is better than `val_acc: 0.325`?

edit: more examples of the output of the accuracy metric as I train - the accuracy increases with more training, while the loss (MSE) decreases. Is accuracy even well defined for MSE, and how is it defined in Keras?

```
lAllocator: After 14014 get requests, put_count=14032 evicted_count=1000 eviction_rate=0.0712657 and unsatisfied allocation rate=0.071714
1000/1000 [==============================] - 453s 453ms/step - loss: 17.4875 - acc: 0.1443 - val_loss: 98.0973 - val_acc: 0.0333
Epoch 2/100
1000/1000 [==============================] - 443s 443ms/step - loss: 6.6793 - acc: 0.1973 - val_loss: 11.9101 - val_acc: 0.1500
Epoch 3/100
1000/1000 [==============================] - 444s 444ms/step - loss: 6.3867 - acc: 0.1980 - val_loss: 6.8647 - val_acc: 0.1667
Epoch 4/100
1000/1000 [==============================] - 445s 445ms/step - loss: 5.4062 - acc: 0.2255 - val_loss: 5.6029 - val_acc: 0.1600
Epoch 5/100
783/1000 [======================>.......] - ETA: 1:36 - loss: 5.0148 - acc: 0.2306
```


There are at least two separate issues with your question.

The first one should be clear by now from the comments by Dr. Snoopy and the other answer: accuracy is meaningless in a regression problem such as yours; see also the comment by patyork in this Keras thread. For good or bad, the fact is that Keras will not "protect" you or any other user from putting meaningless requests in your code, i.e. you will not get any error, or even a warning, that you are attempting something that does not make sense, such as requesting accuracy in a regression setting.

Having clarified that, the other issue is:

Since Keras does indeed return an "accuracy", even in a regression setting, what exactly is it and how is it calculated?

To shed some light here, let's revert to a public dataset (since you do not provide any details about your data), namely the Boston house price dataset (saved locally as `housing.csv`), and run a simple experiment as follows:

```
import numpy as np
import pandas
import keras

from keras.models import Sequential
from keras.layers import Dense

# load the dataset (13 numeric features, house price as target)
dataframe = pandas.read_csv("housing.csv", delim_whitespace=True, header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:13]
Y = dataset[:,13]

# a simple regression model (illustrative architecture)
model = Sequential()
model.add(Dense(13, input_dim=13, activation='relu'))
model.add(Dense(1, activation='linear'))
# Compile model asking for accuracy, too:
model.compile(loss='mean_squared_error', optimizer='adam',
              metrics=['accuracy'])

model.fit(X, Y,
          batch_size=5,
          epochs=100,
          verbose=1)
```

As in your case, the model fitting history (not shown here) shows a decreasing loss and a roughly increasing accuracy. Let's now evaluate the model performance on the same training set, using the appropriate Keras built-in function:

```
score = model.evaluate(X, Y, verbose=0)
score
# [16.863721372581754, 0.013833992168483997]
```

The exact contents of the `score` array depend on what exactly we have requested during model compilation; in our case here, the first element is the loss (MSE), and the second one is the "accuracy".

At this point, let us have a look at the definition of Keras `binary_accuracy` in the `metrics.py` file:

```
def binary_accuracy(y_true, y_pred):
    return K.mean(K.equal(y_true, K.round(y_pred)), axis=-1)
```

So, after Keras has generated the predictions `y_pred`, it first rounds them, and then checks to see how many of them are equal to the true labels `y_true`, before getting the mean.

Let's replicate this operation using plain Python & Numpy code in our case, where the true labels are `Y`:

```
y_pred = model.predict(X)
l = len(Y)
acc = sum([np.round(y_pred[i]) == Y[i] for i in range(l)]) / l
acc
# array([0.01383399])
```

Well, bingo! This is actually the same value returned by `score` above...

To make a long story short: since you (erroneously) request `metrics=['accuracy']` in your model compilation, Keras will do its best to satisfy you, and will return some "accuracy" indeed, calculated as shown above, despite this being completely meaningless in your setting.

There are quite a few settings where Keras, under the hood, performs rather meaningless operations without giving any hint or warning to the user; two of them I have happened to encounter are:

• Giving meaningless results when, in a multi-class setting, one happens to request `loss='binary_crossentropy'` (instead of `categorical_crossentropy`) with `metrics=['accuracy']` - see my answers in Keras binary_crossentropy vs categorical_crossentropy performance? and Why is binary_crossentropy more accurate than categorical_crossentropy for multiclass classification in Keras?

• Disabling completely Dropout, in the extreme case when one requests a dropout rate of 1.0 - see my answer in Dropout behavior in Keras with rate=1 (dropping all input units) not as expected

Tuesday, June 1, 2021


There is just one cross (Shannon) entropy defined as:

```
H(P||Q) = - SUM_i P(X=i) log Q(X=i)
```

In machine learning usage, `P` is the actual (ground truth) distribution, and `Q` is the predicted distribution. All the functions you listed are just helper functions which accept different ways of representing `P` and `Q`.
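To make the formula concrete, here is a minimal numpy sketch (the distributions are made up for illustration) that evaluates H(P||Q) directly:

```python
import numpy as np

# H(P||Q) = - SUM_i P(X=i) log Q(X=i)
P = np.array([1.0, 0.0, 0.0])   # ground truth: a hard target for class 0
Q = np.array([0.7, 0.2, 0.1])   # predicted distribution (made-up values)

# with a hard target only the true class contributes: H = -log(0.7)
H = -np.sum(P * np.log(Q))
```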

There are basically 3 main things to consider:

• there are either 2 possible outcomes (binary classification) or more. If there are just two outcomes, then `Q(X=1) = 1 - Q(X=0)`, so a single float in (0,1) identifies the whole distribution; this is why a neural network for binary classification has a single output (as does logistic regression). If there are K>2 possible outcomes, one has to define K outputs (one per each `Q(X=...)`)

• one either produces proper probabilities (meaning that `Q(X=i)>=0` and `SUM_i Q(X=i) = 1`) or one just produces a "score" and has some fixed method of transforming the score to a probability. For example, a single real number can be "transformed to a probability" by taking the sigmoid, and a set of real numbers can be transformed by taking their softmax, and so on.

• there is a `j` such that `P(X=j)=1` (there is one "true class"; targets are "hard", like "this image represents a cat"), or there are "soft targets" (like "we are 60% sure this is a cat, but 40% it is actually a dog").
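The second point, turning raw scores into probabilities, can be sketched in plain numpy (the score values are made up):

```python
import numpy as np

# made-up raw scores (logits) from a network with K = 3 outputs
scores = np.array([2.0, 1.0, -1.0])

def softmax(z):
    # shift by the max for numerical stability; result sums to 1
    e = np.exp(z - z.max())
    return e / e.sum()

Q = softmax(scores)   # a proper distribution: Q(X=i) >= 0, SUM_i Q(X=i) = 1

# in the binary case a single score s is mapped through a sigmoid,
# giving Q(X=1) = sigmoid(s) and Q(X=0) = 1 - sigmoid(s)
s = 0.5
q1 = 1.0 / (1.0 + np.exp(-s))
```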

Depending on these three aspects, a different helper function should be used:

```
                                  outcomes     what is in Q    targets in P
-------------------------------------------------------------------------------
binary CE                                2      probability         any
categorical CE                          >2      probability         soft
sparse categorical CE                   >2      probability         hard
sigmoid CE with logits                   2      score               any
softmax CE with logits                  >2      score               soft
sparse softmax CE with logits           >2      score               hard
```

In the end one could just use "categorical cross entropy", as this is how it is mathematically defined. However, since things like hard targets or binary classification are very popular, modern ML libraries provide these additional helper functions to make things simpler. In particular, "stacking" sigmoid and cross entropy might be numerically unstable, but if one knows these two operations are applied together, there is a numerically stable version of them combined (which is implemented in TF).
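To see why the combined form matters, here is a small numpy sketch comparing a naive sigmoid-then-cross-entropy against a standard numerically stable combined formulation (the logit values are made up):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def naive_bce(z, x):
    # sigmoid followed by cross entropy; breaks down for large |x|
    q = sigmoid(x)
    return -(z * np.log(q) + (1 - z) * np.log(1 - q))

def stable_bce(z, x):
    # combined, numerically stable formulation
    return np.maximum(x, 0) - x * z + np.log1p(np.exp(-abs(x)))

# both agree for moderate logits
print(naive_bce(1.0, 2.0), stable_bce(1.0, 2.0))

# for extreme logits the naive version blows up: sigmoid(100) rounds
# to exactly 1.0 in float64, so log(1 - q) becomes log(0)
print(naive_bce(0.0, 100.0))   # inf (numpy warns about log(0))
print(stable_bce(0.0, 100.0))  # 100.0, the correct value
```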

It is important to notice that if you apply the wrong helper function, the code will usually still execute, but the results will be wrong. For example, if you apply a softmax_* helper to a binary classifier with one output, your network will be considered to always produce "True" at the output.
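A quick numpy illustration of that last point: softmax over a single output unit always returns 1.0, whatever the score is (the logits below are made up):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# a "binary" network with a single output unit, mistakenly fed to softmax:
logits = np.array([[3.2], [-5.0], [0.1]])   # arbitrary made-up scores
probs = softmax(logits)
print(probs)   # every row becomes [1.], regardless of the score
```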

As a final note - this answer considers classification, it is slightly different when you consider multi label case (when a single point can have multiple labels), as then Ps do not sum to 1, and one should use sigmoid_cross_entropy_with_logits despite having multiple output units.

Saturday, June 12, 2021


From `model` documentation:

loss: String (name of objective function) or objective function. See losses. If the model has multiple outputs, you can use a different loss on each output by passing a dictionary or a list of losses. The loss value that will be minimized by the model will then be the sum of all individual losses.

...

loss_weights: Optional list or dictionary specifying scalar coefficients (Python floats) to weight the loss contributions of different model outputs. The loss value that will be minimized by the model will then be the weighted sum of all individual losses, weighted by the `loss_weights` coefficients. If a list, it is expected to have a 1:1 mapping to the model's outputs. If a dict, it is expected to map output names (strings) to scalar coefficients.

So, yes, the final loss will be the "weighted sum of all individual losses, weighted by the `loss_weights` coefficients".

You can check the code where the loss is calculated.

Also, what does it mean during training? Is the loss2 only used to update the weights on layers where y2 comes from? Or is it used for all the model's layers?

The weights are updated through backpropagation, so each loss will affect only layers that connect the input to the loss.

For example:

```
                       +----+
                       | C  |-->loss1
                      /+----+
                     /
                    /
   +----+    +----+
-->| A  |--->| B  |
   +----+    +----+
                    \
                     \
                      \+----+
                       | D  |-->loss2
                       +----+
```
• `loss1` will affect A, B, and C.
• `loss2` will affect A, B, and D.
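As a quick numeric sketch of the weighted sum (the loss values and weights below are made up):

```python
# made-up per-output losses for one batch
loss1, loss2 = 0.80, 1.30
loss_weights = [1.0, 0.5]

# the value the model minimizes: the weighted sum of the individual losses
total_loss = sum(w * l for w, l in zip(loss_weights, [loss1, loss2]))
print(round(total_loss, 2))   # 1.45
```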
Tuesday, July 27, 2021


You can use:

```
mse = ((A - B)**2).mean(axis=ax)
```

Or

```
mse = (np.square(A - B)).mean(axis=ax)
```
• with `ax=0` the average is taken over the rows, for each column, returning an array of per-column values
• with `ax=1` the average is taken over the columns, for each row, returning an array of per-row values
• with `ax=None` the average is taken over all elements of the array, returning a scalar value
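As a quick sanity check of the `ax` options, a small worked example (the matrices are made up):

```python
import numpy as np

A = np.array([[1., 2.],
              [3., 4.]])
B = np.array([[1., 4.],
              [3., 0.]])

se = (A - B) ** 2            # squared errors: [[0, 4], [0, 16]]
print(se.mean(axis=None))    # 5.0, scalar MSE over all elements
print(se.mean(axis=0))       # per-column means: [0., 10.]
print(se.mean(axis=1))       # per-row means: [2., 8.]
```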
Sunday, August 1, 2021


It does not work because `K.shape` returns a symbolic shape, which is itself a tensor, not a tuple of int values. To get the value from a tensor, you have to evaluate it within a session. See the documentation for this. To get a real value prior to evaluation time, use `K.int_shape`: https://keras.io/backend/#int_shape

However, `K.int_shape` will not work here either, as it is just static metadata and will not normally reflect the current batch size; it holds the placeholder value `None` there.

The solution you found (have a control over the batch size and use it inside the loss) is indeed a good one.

I believe the problem is that you need to know the batch size at definition time to build the Variable, while it will be known only at session run time.

If you were working with it as with a tensor, it should be ok, see this example.

Thursday, November 25, 2021