Hi authors,
Thank you for releasing the code and detailed appendices for “Score-Entropy Discrete Diffusion (SEDD)”—v3 (2024-10-06). The paper has been very helpful for my own work. While reading the supplementary Algorithm 1 (“Training with DWDSE”), I noticed a small mismatch between the theoretical loss in the main text and the pseudocode in Appendix B, and I wanted to check whether I’m misunderstanding something.
What I believe the correct weight should be
In Eq. (10) of the paper (and again in Theorem 3.6), the inner sum for the DWDSE term is weighted by the transition rate
$$Q_t(x_t, y)$$
For the two noise processes this specialises to:
| Noise kernel | Non-diagonal weight $Q_t(x_t,y)$ |
| --- | --- |
| Uniform | $\sigma(t) \cdot 1_{{y \neq x_t}}$ |
| Absorb | $\sigma(t) \cdot 1_{{y = \text{[MASK]}}}$ |
What the pseudocode currently does
In Appendix B, Algorithm 1, the weight is hard-coded as:
$$\sigma(t)\,(1 - \delta_{x_t}(y)) \qquad \text{// omit some sum operators}$$
That equals $Q_t$ for the Uniform kernel, but for the Absorb kernel it would give non-zero weight to all $y\neq x_t$ (not only to [MASK]).
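To make the discrepancy concrete, here is a small NumPy sketch of how I read the three weights. The function names, the toy vocabulary of size 5, and the choice of putting [MASK] at the last index are my own assumptions for illustration; none of this is taken from the paper or the released code.

```python
import numpy as np


def uniform_weight(x_t, vocab_size, sigma_t):
    """sigma(t) * 1{y != x_t}, evaluated for every candidate token y."""
    w = np.full(vocab_size, sigma_t, dtype=float)
    w[x_t] = 0.0
    return w


def absorb_weight(vocab_size, sigma_t, mask_id):
    """sigma(t) * 1{y == [MASK]}, following the Absorb row as I wrote it above."""
    w = np.zeros(vocab_size, dtype=float)
    w[mask_id] = sigma_t
    return w


def pseudocode_weight(x_t, vocab_size, sigma_t):
    """sigma(t) * (1 - delta_{x_t}(y)), the line in Algorithm 1."""
    delta = np.zeros(vocab_size, dtype=float)
    delta[x_t] = 1.0
    return sigma_t * (1.0 - delta)


# Toy setting (hypothetical values): 5 tokens, [MASK] is index 4, x_t is token 1.
vocab_size, mask_id, x_t, sigma_t = 5, 4, 1, 0.5

w_uniform = uniform_weight(x_t, vocab_size, sigma_t)
w_absorb = absorb_weight(vocab_size, sigma_t, mask_id)
w_pseudo = pseudocode_weight(x_t, vocab_size, sigma_t)

print("uniform   :", w_uniform)
print("absorb    :", w_absorb)
print("pseudocode:", w_pseudo)
print("pseudocode matches uniform:", np.array_equal(w_pseudo, w_uniform))
print("pseudocode matches absorb :", np.array_equal(w_pseudo, w_absorb))
```

Under this reading, the pseudocode weight coincides with the Uniform weight but puts mass on four tokens where the Absorb weight puts mass on one.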
Request for confirmation
Could you please confirm whether the pseudocode line is indeed a small typo?
I might have misunderstood the intended interpretation of $Q_t$, so any clarification would be greatly appreciated. I believe this issue may also help others interested in SEDD.
Thanks again for the excellent work and for open-sourcing everything.
Best regards,
Zhenwei