
Use default value of initial_scale_power if FP16 scaling params not provided #4986

Closed

Conversation

ShukantPal
Contributor

@ShukantPal ShukantPal commented Jan 21, 2024

dynamic_loss_scale_args is None if any of the FP16 scaling params is not specified in the config: https://github.com/microsoft/DeepSpeed/blob/9d2660d2a3fac767972f01ac96858b2605ffc0e4/deepspeed/runtime/config.py#L215

In that case, DeepSpeed appears to use 2^32 as the initial scale instead of the 2^16 default specified in the docs here: https://www.deepspeed.ai/docs/config-json/#fp16-training-options
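A minimal sketch of the intended fix, assuming a dict-like args object; the names `DEFAULT_INITIAL_SCALE_POWER` and `get_initial_loss_scale` are illustrative, not DeepSpeed's actual identifiers:

```python
# Hypothetical sketch: fall back to the documented default of
# initial_scale_power = 16 when no dynamic loss-scale args are provided,
# instead of silently using 2**32.
DEFAULT_INITIAL_SCALE_POWER = 16  # per the DeepSpeed config docs

def get_initial_loss_scale(dynamic_loss_scale_args):
    """Return the starting loss scale, using 2**16 when args are absent."""
    if dynamic_loss_scale_args is None:
        power = DEFAULT_INITIAL_SCALE_POWER
    else:
        power = dynamic_loss_scale_args.get("init_scale_power",
                                            DEFAULT_INITIAL_SCALE_POWER)
    return 2 ** power
```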

@loadams loadams requested a review from tohtana as a code owner January 22, 2025 17:28
@loadams loadams removed the request for review from mrwyattii January 22, 2025 17:33
@loadams loadams self-assigned this Jan 22, 2025
@loadams
Collaborator

loadams commented Jan 29, 2025

@ShukantPal - I know this PR is old, but I'm now seeing the following test failure on it:

            raise TimeoutError
        if self._success:
            return self._value
        else:
>           raise self._value
E           Exception: Current loss scale already at minimum - cannot decrease scale anymore. Exiting run.
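For context, this error comes from dynamic loss scaling: on each overflow the scaler halves the scale, and once the scale bottoms out it aborts. A generic sketch of that mechanism (not DeepSpeed's actual class; all names here are illustrative):

```python
# Generic sketch of a dynamic loss scaler: the scale is reduced on each
# gradient overflow, and training aborts once it cannot shrink further.
class LossScaler:
    def __init__(self, init_scale=2 ** 16, scale_factor=2.0, min_scale=1.0):
        self.scale = float(init_scale)
        self.scale_factor = scale_factor
        self.min_scale = min_scale

    def handle_overflow(self):
        """Shrink the scale after an overflow; fail if already at minimum."""
        if self.scale <= self.min_scale:
            raise Exception("Current loss scale already at minimum - "
                            "cannot decrease scale anymore. Exiting run.")
        self.scale = max(self.scale / self.scale_factor, self.min_scale)
```

Starting from too large a scale (e.g. 2^32 instead of 2^16) means more overflows early in training, but the error above indicates the scale was driven all the way down to the minimum.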

@ShukantPal
Contributor Author

Hi @loadams, I no longer have the bandwidth to support this PR (switched jobs :)). Feel free to close if this change is no longer applicable.

@loadams
Collaborator

loadams commented Jan 29, 2025

> Hi @loadams, I no longer have the bandwidth to support this PR (switched jobs :)). Feel free to close if this change is no longer applicable.

No problem @ShukantPal - thanks for the update. I'll close this PR and open an issue to track the bug and make the needed fixes.

@loadams loadams closed this Jan 29, 2025
@tjruwase
Contributor

@loadams, I will tackle this in #6976

@loadams
Collaborator

loadams commented Jan 29, 2025

> @loadams, I will tackle this in #6976

Thanks, you beat me to it!

tjruwase added a commit that referenced this pull request Jan 29, 2025
tjruwase added a commit that referenced this pull request Feb 6, 2025