
Inconsistencies with the behavior of bias initializers, leading to poor performance in some cases #20873

Closed
@lompabo

Description


Hello,

I've noticed some (potentially harmful) inconsistencies in bias initializers when running a simple test of the keras package, i.e. using a shallow MLP to learn a sine wave function in the [-1, 1] interval.
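
For reference, this is roughly the kind of test I am talking about (the exact sizes, activation and training settings below are illustrative, not the ones from my original script):

```python
import numpy as np
import keras

# Toy regression task: learn sin(2*pi*x) on [-1, 1].
x = np.linspace(-1.0, 1.0, 256).reshape(-1, 1)
y = np.sin(2 * np.pi * x)

# A shallow MLP; with the default bias_initializer="zeros" this tends to
# get stuck in a poor local optimum on this problem.
model = keras.Sequential([
    keras.layers.Input(shape=(1,)),
    keras.layers.Dense(64, activation="tanh"),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(x, y, epochs=2000, verbose=0)
print("final MSE:", model.evaluate(x, y, verbose=0))
```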

Context

Most of the time (or for deep enough networks), using the default zero initialization for biases is fine. However, for this simple problem having randomized biases is essential, since without them the neurons end up too similar (redundant) and training converges to a very poor local optimum.

The official guide suggests using weight initializers for biases as well.

Now:

  • The default initialization from native PyTorch leads to good results that improve, as expected, as the network size grows (see the PyTorch snippet after this list).
  • Several keras initializers are expected to behave similarly or identically to PyTorch (i.e. VarianceScaling and all its subclasses), but they fail to produce good results, regardless of the number of neurons in the hidden layer.
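
For comparison, this is what I mean by the native PyTorch default: nn.Linear draws its bias from a uniform distribution whose bound depends on the number of layer inputs (the snippet below just checks that bound, with arbitrary sizes):

```python
import math
import torch

layer = torch.nn.Linear(in_features=8, out_features=64)

# reset_parameters() initializes the bias from U(-1/sqrt(fan_in), 1/sqrt(fan_in)),
# where fan_in is the number of inputs (8 here), not the number of units (64).
bound = 1.0 / math.sqrt(layer.in_features)
print(layer.bias.min().item() >= -bound, layer.bias.max().item() <= bound)
```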

Issues

The issues are due to the fact that all RandomInitializer subclasses, in their __call__ method, only have access to the shape they need to fill.

In case of bias vectors for Dense layers, this shape is a one element tuple, i.e. (n,) where n is the number of units in the current layer.

The compute_fans function in this case reports a fan in of n, which is actually the number of units, i.e. the fan out.

Unfortunately, the correct fan in is not accessible, since the number of layer inputs is not included in the shape of the bias vector.

This makes the official description of the VarianceScaling initializer incorrect when applied to neuron biases. The same holds for the description of the Glorot, He, LeCun initializers, which are implemented as VarianceScaling subclasses.
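
A quick way to see the effect in isolation (this just calls the initializer the way a Dense layer does for its bias, so it should reflect the current behavior as I understand it):

```python
import numpy as np
import keras

for n in (8, 64, 512):
    init = keras.initializers.GlorotUniform(seed=0)
    bias = np.asarray(init((n,)))  # the shape a Dense(n) layer passes for its bias
    # With fan_in = fan_out = n, the uniform limit is sqrt(6 / (n + n)) = sqrt(3 / n),
    # so the spread collapses toward zero as the layer gets wider.
    print(f"units={n:4d}  bias std={bias.std():.4f}  glorot limit={np.sqrt(3 / n):.4f}")
```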

In my simple example, as soon as the shallow network has more than a handful of neurons, all size-dependent initializers produce so little variability that they behave very similarly to a zero initialization (i.e. incredibly poorly). What stumped me (before I understood the problem) is that the larger the network, the worse the behavior.

About possible fixes

I can now easily fix the issue by computing bounds for RandomUniform initializers externally so as to replicate the default PyTorch behavior, but this is not an elegant solution -- and I am worried other users may have encountered similar problems without noticing.
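
For completeness, my current workaround looks roughly like the helper below (dense_with_torch_like_bias is just an illustrative name; the point is that the fan in has to be passed in by hand):

```python
import math
import keras

def dense_with_torch_like_bias(units, fan_in, **kwargs):
    # Replicates PyTorch's default bias initialization,
    # U(-1/sqrt(fan_in), 1/sqrt(fan_in)), since the bias initializer
    # itself never gets to see the input dimension.
    bound = 1.0 / math.sqrt(fan_in)
    return keras.layers.Dense(
        units,
        bias_initializer=keras.initializers.RandomUniform(minval=-bound, maxval=bound),
        **kwargs,
    )

# Hidden layer of a 1-input MLP: fan_in is 1, so the biases keep a wide spread.
hidden = dense_with_torch_like_bias(64, fan_in=1, activation="tanh")
```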

If the goal is to compute the fan in correctly, I am afraid there is no easy fix short of restructuring the RandomInitializer API to give it access to more information.

However, the real goal here is not computing the fan in per se, but preserving the properties that the size-dependent initializers attempt to enforce. I would need to read more of the literature on the topic before suggesting a theoretically sound fix from this perspective; I would be willing to do that, in case the keras team is fine with going in this direction.
