Replies: 1 comment
Thanks for sharing your ideas! Regarding your first question: Yes, we do plan to open-source the SFT training code. You'll be able to follow the triplet construction method described in our technical report to organize your data and training pipeline. We're also actively working on adding support for high-frequency paralinguistic features such as [Cough]. As for your second question: that's a great idea! Currently, the tags are still a discrete closed set, though we've observed that the model can generalize to some unseen ones. In the future, we plan to support open-set natural language descriptions for tags, and we're already running experiments in this direction. We'll keep you posted as we make progress!
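Step-Audio-EditX is an excellent project, especially in its handling of interjections (paralinguistic features). It is the best open-source project I have found in this area.
I have a few ideas about extending the paralinguistic features and would like to discuss whether they are feasible on top of the current model:
First, the model already supports several interjection tags such as [Breathing] and [Laughter]. Would it be possible for users to extend this vocabulary through their own training? For example, I would like to add a coughing sound ([Cough]) or other similar tags. Could this be done by providing a small number of cough audio samples and then applying LoRA or full fine-tuning? If so, does the team plan to release training scripts or a guide?
Second, the tags in the current documentation appear to be atomic, for example Surprise-oh and Question-ei. I wonder whether they could be decoupled into compositional tags for training, for example [Surprise, oh] and [Question, ei].
This might bring one benefit: if the model learns the compositional relationship, it may generalize to new combinations that never appeared in the training data. For example, after learning Surprise and ei separately, the model might be able to synthesize the effect of [Surprise, ei] on its own. If the custom tags from the first point also become possible, we could even create entirely new effects such as [Surprise, Cough], without having to encode a large number of tag combinations into the vocabulary.
If both of these could be realized, they would greatly raise the model's capability ceiling and the naturalness of its synthesized output. I am not a technical professional, and these ideas may be hard to implement for now, but I would still appreciate hearing your thoughts on them. Thank you!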
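To make the compositional idea concrete, here is a minimal Python sketch. It is not taken from the Step-Audio-EditX codebase and does not touch the model. It assumes atomic tags follow an `Emotion-sound` pattern, as in Surprise-oh and Question-ei; the function names and the two-tag list are hypothetical and only illustrate which combinations would be new to the model.

```python
from itertools import product

# Hypothetical tag list: only these two atomic tags are named in this post.
ATOMIC_TAGS = ["Surprise-oh", "Question-ei"]


def split_tag(tag):
    """Split an atomic tag such as 'Surprise-oh' into (emotion, sound)."""
    emotion, sound = tag.split("-", 1)
    return emotion, sound


def unseen_combinations(tags):
    """Return (emotion, sound) pairs that never occur among the atomic tags."""
    seen = {split_tag(t) for t in tags}
    emotions = sorted({emotion for emotion, _ in seen})
    sounds = sorted({sound for _, sound in seen})
    return [pair for pair in product(emotions, sounds) if pair not in seen]


def format_tag(emotion, sound):
    """Render a pair in the compositional form proposed above."""
    return f"[{emotion}, {sound}]"


if __name__ == "__main__":
    for emotion, sound in unseen_combinations(ATOMIC_TAGS):
        print(format_tag(emotion, sound))
```

With only two atomic tags there are two unseen pairs, one of which is the [Surprise, ei] example. Whether the model could actually synthesize such a pair is the open question here; the sketch only enumerates the candidates.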