question about streaming infer #56

Open
VJJJJJJ1 opened this issue Nov 26, 2024 · 2 comments

Comments

@VJJJJJJ1

hi, I am trying to implement a streaming WavTokenizer. I set causal = True in the encoder without any other modification, and replaced all nn.Conv1d layers in the decoder with SConv1d. For example, in WavTokenizer/decoder/modules.py, I changed self.dwconv = nn.Conv1d(dim, dim, kernel_size=7, padding=3, groups=dim) to self.dwconv = SConv1d(dim, dim, kernel_size=7, groups=dim, causal=True). In the AttenBlock, after multiplying q and k, I add a causal mask as follows:
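For reference, a causal depthwise conv like the SConv1d replacement described above can be emulated by left-padding a plain nn.Conv1d by kernel_size - 1, so output frame t only sees inputs up to t (CausalDWConv1d is a hypothetical stand-in, not the SConv1d implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalDWConv1d(nn.Module):
    """Hypothetical sketch of a causal depthwise conv: left-pad by
    (kernel_size - 1) instead of symmetric padding, so no output frame
    depends on future input frames."""

    def __init__(self, dim, kernel_size):
        super().__init__()
        self.pad = kernel_size - 1
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim)

    def forward(self, x):                    # x: (b, dim, t)
        x = F.pad(x, (self.pad, 0))          # pad only on the left
        return self.conv(x)                  # output: (b, dim, t)
```

A quick way to check causality: perturb the last input frame and confirm that all earlier output frames are unchanged.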

```python
import torch

# compute attention
b, c, h = q.shape                       # h = number of time frames
q = q.permute(0, 2, 1)                  # (b, h, c)
w_ = torch.bmm(q, k)                    # (b, h, h); w_[b, i, j] = sum_c q[b, i, c] * k[b, c, j]
w_ = w_ * (int(c) ** (-0.5))

# apply causal mask: position i may only attend to positions j <= i
mask = torch.tril(torch.ones(h, h, device=w_.device))
w_ = w_.masked_fill(mask == 0, float('-inf'))  # set future positions to -inf
w_ = torch.nn.functional.softmax(w_, dim=2)
```

Is my modification correct? Unfortunately, in my experiments distortion appears at the end of the audio.

Thank you for your reply!
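One way to sanity-check the mask itself is to wrap the attention math above in a helper and verify that perturbing a future frame leaves earlier outputs unchanged (causal_attention is a hypothetical helper name mirroring the block's math, not code from the repo):

```python
import torch


def causal_attention(q, k, v):
    """Causal self-attention over 1-D frames, mirroring the masked
    attention math in the question. q, k, v: (b, c, h)."""
    b, c, h = q.shape
    w_ = torch.bmm(q.permute(0, 2, 1), k) * (c ** -0.5)    # (b, h, h)
    mask = torch.tril(torch.ones(h, h, device=w_.device))  # lower-triangular
    w_ = w_.masked_fill(mask == 0, float('-inf'))
    w_ = torch.softmax(w_, dim=2)          # row i attends only to j <= i
    return torch.bmm(v, w_.permute(0, 2, 1))               # (b, c, h)
```

If perturbing k and v at the last frame changes only the last output frame, the mask is doing its job and the end-of-audio distortion likely comes from elsewhere (e.g. the convolutions' padding at sequence boundaries).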

@keepingitneil

@VJJJJJJ1 I'm working on the same thing - want to collaborate?

@jishengpeng
Owner


Thank you for your attention. Our convolution operation directly modifies the padding logic. We will consider releasing a streaming version in the future.
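For anyone experimenting in the meantime: modifying the padding logic for streaming typically means each causal conv must carry its left context across chunk boundaries. A minimal sketch (hypothetical class name; assumes a causal depthwise conv as in the question) that caches the last kernel_size - 1 input frames between calls:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class StreamingDWConv1d(nn.Module):
    """Hypothetical chunk-wise streaming wrapper for a causal depthwise
    conv: the last (kernel_size - 1) input frames are cached between
    calls, so chunked inference matches full-sequence causal inference."""

    def __init__(self, dim, kernel_size):
        super().__init__()
        self.ctx = kernel_size - 1
        self.conv = nn.Conv1d(dim, dim, kernel_size, groups=dim)
        self.state = None                     # cached left context

    def forward(self, x):                     # x: (b, dim, chunk_len)
        if self.state is None:                # first chunk: zero context
            self.state = x.new_zeros(x.shape[0], x.shape[1], self.ctx)
        x = torch.cat([self.state, x], dim=2)  # prepend cached frames
        self.state = x[:, :, -self.ctx:]       # cache for the next chunk
        return self.conv(x)                    # (b, dim, chunk_len)
```

Concatenating the per-chunk outputs should reproduce the output of running the same causal conv over the whole sequence in one pass, which is a useful equivalence test when debugging streaming artifacts.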
