Bug fix in `ModularTokenizer.decode()` when the input is `torch.Tensor` type #376

SagiPolaczek · 2024-10-19T11:28:55Z

Another "solution" would be to validate that the input value is a list of integers.

The main problem with the current implementation is that the user is not made aware of the type mismatch: the token just appears as missing (!)
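
For illustration, the validation alternative mentioned above could look roughly like the sketch below. It is not part of this PR: `validate_token_ids` is a hypothetical helper name, and the choice to raise `TypeError` (and to reject `bool`) is an assumption. The point is only that a type mismatch fails loudly instead of producing a silently missing token.

```python
from typing import Iterable, List


def validate_token_ids(ids: Iterable) -> List[int]:
    """Hypothetical helper: return ids as a list of plain ints or raise TypeError."""
    checked: List[int] = []
    for position, token_id in enumerate(ids):
        # bool is a subclass of int, so it is rejected explicitly (an assumption).
        if isinstance(token_id, bool) or not isinstance(token_id, int):
            raise TypeError(
                f"ids[{position}] has type {type(token_id).__name__}; expected int"
            )
        checked.append(token_id)
    return checked
```

With a check like this, iterating directly over a tensor would raise immediately, because its elements are tensors rather than Python integers.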

mosheraboh · 2024-10-21T07:31:51Z

fuse/data/tokenizers/modular_tokenizer/modular_tokenizer.py

@@ -1148,6 +1149,9 @@ def decode(self, ids: Iterable, skip_special_tokens: Optional[bool] = False) ->
        Returns:
            str: _description_
        """
+        if isinstance(ids, Tensor):
+            # Tokens in 'self.decoder_dict' are integers, and not singletons


singletons?

Do we also need to move it to the CPU first, in case the tensor is on the GPU?

Good question. It is done automatically, per the docs:

Returns the tensor as a (nested) list. For scalars, a standard Python number is returned, just like with item(). Tensors are automatically moved to the CPU first if necessary.

https://pytorch.org/docs/stable/generated/torch.Tensor.tolist.html
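
A minimal sketch of the conversion discussed here, under stated assumptions: `to_plain_ids` and the toy `decoder_dict` are hypothetical names, and the helper duck-types on `tolist` rather than using the `isinstance(ids, Tensor)` check from the diff, so that it runs without PyTorch installed. The standard library's `array.array` stands in for a tensor only because it also exposes `tolist()`.

```python
from array import array
from typing import Any, List


def to_plain_ids(ids: Any) -> List[int]:
    """Hypothetical helper: turn tensor-like input into a list of Python ints."""
    if hasattr(ids, "tolist"):
        converted = ids.tolist()
        # For a scalar, tolist() returns a bare number rather than a list.
        return converted if isinstance(converted, list) else [converted]
    return list(ids)


# Toy mapping with integer keys (not the real vocabulary).
decoder_dict = {5: "A", 7: "C"}

ids = array("i", [5, 7])  # stand-in for a 1-D tensor of token ids
decoded = "".join(decoder_dict[token_id] for token_id in to_plain_ids(ids))
```

Because the keys of the mapping are plain integers, the lookups succeed once the input has been converted; with a real `torch.Tensor`, `tolist()` would also handle the copy to the CPU, as the quoted documentation states.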

singletons?

Meaning a tensor with a single item in it. I'll make it clearer.

mosheraboh

LGTM

convert to list in case of a tensor, avoiding using .item()

85c2277

mosheraboh reviewed Oct 21, 2024

clear doc

743ae45

mosheraboh approved these changes Oct 21, 2024

SagiPolaczek merged commit 7c59aff into master Oct 21, 2024
5 checks passed
