Tensorboard not working with Trainer Pattern #20809

Open
GeraudK opened this issue Jan 24, 2025 · 7 comments
Labels: type:Bug, keras-team-review-pending (Pending review by a Keras team member.)


GeraudK commented Jan 24, 2025

I'm using the Keras Trainer pattern as illustrated here. The issue with this pattern is that when you use the TensorBoard callback, only the top-level weights are recorded.

The reason is that TensorBoard records the weights of all the layers in self.model.layers here. But this list is equal to [<Sequential name=sequential, built=True>], and the weights attribute of that Sequential object is [].
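
This is easy to verify (trainer_1 is the MyTrainer instance built in the full example below):

print(trainer_1.layers)
# [<Sequential name=sequential, built=True>]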

I tried several things:

  1. Passing a CallbackList (built around model_a instead of the trainer) to fit, but this fails because model_a has no optimizer.
  2. Overriding the layers property on the Trainer so that it flattens recursively (recursive=True; see the sketch after this list), but the weights were still not showing in TensorBoard, suggesting that something else is going on.
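
A minimal sketch of attempt 2, for reference (RecursiveTrainer is a hypothetical name, and _flatten_layers is the private helper that keras.Model.layers calls with recursive=False in Keras 3, so this is illustrative only):

class RecursiveTrainer(MyTrainer):
    @property
    def layers(self):
        # keras.Model.layers normally passes recursive=False here; flatten
        # the whole layer tree instead so nested layers become visible.
        return list(self._flatten_layers(include_self=False, recursive=True))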

I'm open to any suggestions here.

Full example:

import os

os.environ["KERAS_BACKEND"] = "tensorflow"

import tensorflow as tf
import keras
from keras.callbacks import TensorBoard

# Load MNIST dataset and standardize the data
mnist = keras.datasets.mnist
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0

class MyTrainer(keras.Model):
    def __init__(self, model):
        super().__init__()
        self.model = model
        # Create loss and metrics here.
        self.loss_fn = keras.losses.SparseCategoricalCrossentropy()
        self.accuracy_metric = keras.metrics.SparseCategoricalAccuracy()

    @property
    def metrics(self):
        # List metrics here.
        return [self.accuracy_metric]

    def train_step(self, data):
        x, y = data
        with tf.GradientTape() as tape:
            y_pred = self.model(x, training=True)  # Forward pass
            # Compute loss value
            loss = self.loss_fn(y, y_pred)

        # Compute gradients
        trainable_vars = self.trainable_variables
        gradients = tape.gradient(loss, trainable_vars)

        # Update weights
        self.optimizer.apply_gradients(zip(gradients, trainable_vars))

        # Update metrics
        for metric in self.metrics:
            metric.update_state(y, y_pred)

        # Return a dict mapping metric names to current value.
        return {m.name: m.result() for m in self.metrics}

    def test_step(self, data):
        x, y = data

        # Inference step
        y_pred = self.model(x, training=False)

        # Update metrics
        for metric in self.metrics:
            metric.update_state(y, y_pred)
        return {m.name: m.result() for m in self.metrics}

    def call(self, x):
        # Equivalent to `call()` of the wrapped keras.Model
        x = self.model(x)
        return x

model_a = keras.models.Sequential(
    [
        keras.layers.Flatten(input_shape=(28, 28)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dropout(0.2),
        keras.layers.Dense(10, activation="softmax"),
    ]
)

callbacks = [TensorBoard(histogram_freq=1)]
trainer_1 = MyTrainer(model_a)
trainer_1.compile(optimizer=keras.optimizers.SGD())
trainer_1.fit(
    x_train, y_train, epochs=5, batch_size=64, validation_data=(x_test, y_test), callbacks=callbacks,
)

GeraudK commented Jan 29, 2025

After some more investigation, I found that nothing is wrong with TensorBoard or the Trainer pattern; the issue lies in how the weights are named (basically, the names aren't unique). Inside the TensorBoard callback, if you look at the self.model.layers[0].weights object, you find the following list:

[<KerasVariable shape=(784, 256), dtype=float32, path=sequential_2/dense_4/kernel>,
 <KerasVariable shape=(256,), dtype=float32, path=sequential_2/dense_4/bias>,
 <KerasVariable shape=(256, 10), dtype=float32, path=sequential_2/dense_5/kernel>,
 <KerasVariable shape=(10,), dtype=float32, path=sequential_2/dense_5/bias>]

Now, if you look at the name of each weight, you will find the following:

['kernel', 'bias', 'kernel', 'bias']

This leads TensorBoard to save everything under just two names: kernel and bias.
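
The listings above can be reproduced outside the callback (using trainer_1 from the example, after it has been built; inside the callback, self.model is the trainer, so layers[0] is the wrapped Sequential):

seq = trainer_1.layers[0]             # the wrapped Sequential model
print(seq.weights)                    # the paths are unique per layer...
print([w.name for w in seq.weights])  # ...but the names collide:
# ['kernel', 'bias', 'kernel', 'bias']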

[screenshot: TensorBoard histograms grouped under a single kernel tag and a single bias tag]


GeraudK commented Jan 29, 2025

I fixed it like this, although this obviously works around the symptom rather than the underlying cause:

from keras.callbacks import TensorBoard as Base
from pathlib import Path

__all__ = ["TensorBoard"]


def find_new_name(weights_names, name: str) -> str:
    """Return `name`, or the first `name_<i>` suffix that avoids a collision."""
    if name not in weights_names:
        return name

    for i in range(1, 1000):
        new_name = f"{name}_{i}"
        if new_name not in weights_names:
            return new_name
    raise ValueError("Could not find a new name")


class TensorBoard(Base):
    # Cache mapping id(weight) -> unique tag name. Note this is a class
    # attribute, so it is shared across callback instances.
    weights_names = {}

    def get_weight_name(self, weight):
        key = id(weight)
        name = self.weights_names.get(key)
        values = self.weights_names.values()

        if name is None:
            # Drop the first two components of the variable path, then
            # de-duplicate whatever remains with a numeric suffix.
            name = str(Path(*Path(weight.path).parts[2:]))
            name = find_new_name(values, name)
            self.weights_names[key] = name

        return name

    def _log_weights(self, epoch):
        """Logs the weights of the Model to TensorBoard."""
        with self._train_writer.as_default():
            for layer in self.model.layers:
                for weight in layer.weights:
                    weight_name = self.get_weight_name(weight)
                    # Add a suffix to prevent summary tag name collision.
                    histogram_weight_name = weight_name + "/histogram"
                    self.summary.histogram(histogram_weight_name, weight, step=epoch)
                    if self.write_images:
                        # Add a suffix to prevent summary tag name
                        # collision.
                        image_weight_name = weight_name + "/image"
                        self._log_weight_as_image(weight, image_weight_name, epoch)
            self._train_writer.flush()
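
Usage is then a drop-in replacement for keras.callbacks.TensorBoard; assuming the patch above is saved in a local module (the file name patched_tensorboard.py is hypothetical):

from patched_tensorboard import TensorBoard

trainer_1.compile(optimizer=keras.optimizers.SGD())
trainer_1.fit(
    x_train, y_train, epochs=5, batch_size=64,
    validation_data=(x_test, y_test),
    callbacks=[TensorBoard(histogram_freq=1)],
)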

dhantule added the type:Bug and keras-team-review-pending (Pending review by a Keras team member.) labels on Jan 30, 2025

harshaljanjani commented Feb 6, 2025

Thanks for posting the issue @GeraudK!
I believe I'm getting closer to the fix: it now produces the intended behavior, but not without a compromise on the old test cases. The change is breaking in nature, and I have yet to find a good tradeoff, but I'm posting it here to check whether this is indeed the behavior you're requesting. Thanks!

Old Behavior:

[screenshot: old behavior]

New Behavior:

[screenshot: new behavior]

Edit: I do have a fix that is backward compatible and replicates the behavior above, but I think it's worth waiting and deliberating over it, since the objectives of this issue have become much clearer.


GeraudK commented Feb 6, 2025

Hi @harshaljanjani, thank you for taking on this issue. Yes, this is roughly what I'm looking for, although I would expect a separate section per layer, as below for the layers last-neuron and output-block, which are different layers in my model (ignore the gradients part).

[screenshot: TensorBoard with separate sections for the last-neuron and output-block layers]

harshaljanjani commented:

Sounds great @GeraudK. Honestly, while that sounds like a lovely change, it would require a major refactor to implement while ensuring backward compatibility, which calls for guidance from the collaborators. Until then, I think the primary focus should be on figuring out whether a minor change can fix the bug this issue was raised for; the feature you're describing could then be incorporated in a later PR down the issue thread. Thanks!

@GeraudK
Copy link
Author

GeraudK commented Feb 6, 2025

@harshaljanjani I'm not talking about the gradients, which of course are beyond this issue. But I think TensorBoard should at least show different histograms for different layers; otherwise, what's the point? The main goal of these histograms is to see how the weights change from one epoch to the next and which parts of the network are training. Without that granularity, I'm not sure what the feature is for.

Searching the web for histogram examples shows that this used to be the case: you would always get one section per layer (see the image below).

[screenshot: an older TensorBoard example with one histogram section per layer]

harshaljanjani commented:

I see the broader scope of the issue now. It seemed to me that the focus was on identifying the root cause of the parameters being combined in the source code and on replicating the grouping achieved by the hack fix in this thread, but I now see that bringing back the layer-wise division and interpretable histograms is the key concern; thanks for the clarification.
To the collaborators and members, I'd really appreciate your thoughts on the refactor and the best way to move forward. Thanks a lot!
