Add options for weight sharing #408
Closed
borzunov wants to merge 7 commits into lucidrains:main from
Conversation
Owner
@borzunov Interesting! Weight sharing has been shown to work with ALBERT, though it is still not popular in practice with language models. Will do some testing on my end before merging this one in! Thank you!
Contributor
Author
@lucidrains Oh, I am afraid this code has already been merged as part of #409. That PR was branched from this one, sorry for the confusion :( If you eventually decide that this feature shouldn't be in the repo, feel free to remove weight sharing (or I can prepare a PR removing it myself).
Owner
No problem! Let's keep the feature! Thank you 🙏
Description
This PR adds options for:

- Sharing weights across some of the transformer layers. This technique was introduced in the ALBERT paper. For a model with shared weights to match the quality of a model without sharing, you need more compute but fewer parameters. Thus, it is a way to trade extra computation for reduced GPU memory consumption and reduced communication between GPUs (in the case of distributed training).
- Sharing weights between the input and output embeddings. This technique is commonly used in seq2seq models. For example, fairseq provides the `--share-input-output-embed` option.

Experiments
Originally, I implemented this for our project, where we train a DALL-E-like model collaboratively over the Internet on LAION-400M.
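For readers unfamiliar with the technique, the layer-sharing option can be sketched in plain PyTorch. This is an illustrative sketch only, not the PR's actual implementation; the class and argument names are hypothetical, and the cyclic reuse scheme shown here is one possible variant:

```python
import torch
import torch.nn as nn

class SharedDepthTransformer(nn.Module):
    # Illustrative sketch (not the PR's code): allocate only
    # `num_unique` layers and cycle through them `depth` times,
    # trading extra compute for fewer parameters (ALBERT-style).
    def __init__(self, dim=64, depth=6, num_unique=2, heads=4):
        super().__init__()
        self.depth = depth
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(num_unique)
        )

    def forward(self, x):
        for i in range(self.depth):
            # the same parameter set is reused at several depth positions
            x = self.layers[i % len(self.layers)](x)
        return x
```

With `num_unique=2` and `depth=6`, the forward pass still runs six layer applications, but only two layers' worth of parameters are stored and synchronized between GPUs.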
We use both options above in a model with the following config:
The model produces reasonable outputs, so weight sharing can indeed be used for DALL-E in practice. These pictures were generated after completing 1/3 of the training schedule:
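For reference, the second option (tying the input and output embeddings, as fairseq's `--share-input-output-embed` does) can be sketched as follows. This is an illustrative example, not the PR's actual code; `TiedEmbeddingLM` and its internals are hypothetical:

```python
import torch
import torch.nn as nn

class TiedEmbeddingLM(nn.Module):
    # Illustrative sketch (names are hypothetical, not the PR's code):
    # the output projection reuses the input embedding matrix, so the
    # vocab_size x dim weight is stored only once.
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.to_logits = nn.Linear(dim, vocab_size, bias=False)
        self.to_logits.weight = self.embed.weight  # weight tying

    def forward(self, tokens):
        x = self.embed(tokens)
        # ... the transformer body would run here ...
        return self.to_logits(x)
```

Because `nn.Linear` stores its weight as `(out_features, in_features)`, its shape matches the `(vocab_size, dim)` embedding matrix exactly, so the assignment ties the two without any transpose.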