Full GPU Rewrite, Performance Boost + misc #7
base: dev
Benchmark <0.75GB VRAM usage
Similarity checker
|
Hey @RobViren, not going to push this to main just yet. I wanted to get your eyes on it, and hopefully you get some time to try it out. 3x FASTER, totally on GPU, and a tiny footprint. |
|
Oh dang! You've been at work on this. I had not heard of speechbrain. Is it still avoiding overfitting and sounding like a demon? Kudos on the better GPU usage, super impressive. Gonna run it tonight. |
|
Haha, yeah, I've noticed that you really need a voice that's already pretty close to get decent results. Resemblyzer will rate things as super similar when in reality they aren't. SB (SpeechBrain), by contrast, is much more critical, even rating target audios by the same speaker from the same recording as only a partial match, though technically their cos sim threshold is only 0.25. That's why I think benchmarking a speaker against themselves is the way to go.

Also, from working with kokolab before, I have a huge voice library and a toolbox of different voice model "surgery" methods that I think would fit nicely into a methodology here: surgery -> randomwalk -> surgery, repeated until it's pretty close.

Another idea: porting kokoro to TPU, since it's by far the biggest time sink. That could give another 5x or more speed boost for searching the latent space, plus allow batch evals to find the best directional heading for mutations. |
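A minimal sketch of the "benchmark a speaker against themselves" idea, assuming speaker embeddings are plain lists of floats. The function names and the two-clip setup are hypothetical, not taken from this PR: the similarity between two clips of the target speaker is treated as the practical ceiling, and a candidate's raw cosine similarity is scaled against it.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def self_benchmarked_score(candidate, target_a, target_b):
    """Scale candidate-vs-target similarity by the target's own ceiling.

    target_a and target_b are assumed to be embeddings of two different
    clips of the same target speaker (hypothetical setup).
    """
    ceiling = cosine_similarity(target_a, target_b)
    if ceiling <= 0.0:
        raise ValueError("target clips do not match each other")
    raw = cosine_similarity(candidate, target_a)
    return raw / ceiling


# Toy 2-D embeddings, for illustration only.
target_a = [1.0, 0.0]
target_b = [0.8, 0.6]   # same speaker, but only 0.8 similar to target_a
candidate = [0.6, 0.8]  # raw similarity to target_a is 0.6
relative = self_benchmarked_score(candidate, target_a, target_b)
```

With a strict scorer, a raw 0.6 looks weak, but relative to a self-similarity ceiling of 0.8 it is 75% of what the speaker achieves against their own audio.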
# Conflicts:
#   utilities/fitness_scorer.py
#   utilities/initial_selector.py
#   utilities/kvoicewalk.py
|
Yeah. I really think maintaining a map of the results would really help to
guide other walks. Still believe genetic algo is the way to go. Only even
remotely feasible because of the small size. TPU would be great. It is just
monkeys on a keyboard trying to clone a voice.
|
|
Agreed! I have some ideas for making a 'smart' randomwalk that can do 3 things (together or as separate explorative modes).

A: Batch process and compare like 10-20 samples at once. Rank them by score and, along the axis of improved score, continue randomizing into those select nodes and/or negative blend with the worst voices, just like moving away from a voice in kokovoicelab. Continue until the voice score begins to degrade, then go back to scanning direction in batch comparison. Rinse, repeat. Like picking up on a signal but not being sure where it's coming from: just keep going until it gets fainter, then reassess the next direction.

B: As the voice gets closer to the target audio similarity, decrease the size of randomization allowed. When far away, move at lightspeed; when close by, go to impulse thrusters.

C: Feature-iterative randomwalk. Target features in order of importance to human recognition: 1. Pitch, 2. Prosody, etc. (whatever that order might be), but maximize similarity for one feature at a time, then move on to maximize the next without regressing the previous features. A roundabout approach, like using one planet's gravity to slow you down without leaving the solar system, so to speak.

C2: Feature-focused randomwalk. Target a single feature and maximize for similarity. Create a matching voice for each feature, then cobble them together like Frankenstein's monster.

======

By the way, I just restarted my WSL instance completely, and on a fresh instance it's clocking a 7-8x speedup (as opposed to the 3x I'd thought). The audio is a little shorter, but fingers crossed the actual performance gains are higher than expected. If you get benchmarks on your system, please share! |
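Modes A and B could be combined roughly as below. This is a toy sketch, not the PR's implementation: the voice is a plain list of floats, `score` is a stand-in fitness (negative squared distance to a known target) rather than a real speaker-similarity model, and every name and default here is an assumption made for illustration.

```python
import random


def score(voice, target):
    """Stand-in fitness: 0 is a perfect match, more negative is worse."""
    return -sum((v - t) ** 2 for v, t in zip(voice, target))


def step_size(current_score, far_step=1.0, near_step=0.01, scale=10.0):
    """Mode B: shrink the mutation size as the score approaches 0."""
    distance = min(1.0, -current_score / scale)
    return near_step + (far_step - near_step) * distance


def batch_walk(start, target, rounds=200, batch=16, seed=0):
    rng = random.Random(seed)
    best, best_score = list(start), score(start, target)
    for _ in range(rounds):
        sigma = step_size(best_score)
        # Mode A: sample a batch of mutations and rank them by score.
        candidates = [[v + rng.gauss(0.0, sigma) for v in best]
                      for _ in range(batch)]
        ranked = sorted(candidates, key=lambda c: score(c, target),
                        reverse=True)
        top_score = score(ranked[0], target)
        if top_score <= best_score:
            continue  # no candidate improved; rescan with a new batch
        # Keep moving along the winning direction until the score degrades.
        direction = [c - b for c, b in zip(ranked[0], best)]
        current, current_score = ranked[0], top_score
        while True:
            nxt = [c + d for c, d in zip(current, direction)]
            nxt_score = score(nxt, target)
            if nxt_score <= current_score:
                break
            current, current_score = nxt, nxt_score
        best, best_score = current, current_score
    return best, best_score


start = [0.0] * 8
target = [1.0] * 8
final_voice, final_score = batch_walk(start, target)
```

In a real run the expensive part is scoring each candidate (synthesis plus embedding), so the batch would only pay off if candidates can be evaluated together on the GPU.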
|
Oh, and I should add too: take the worst performers vs target_audio during the top_performers method, and use them to push the starting voice strongly in the right direction. So even if there's no great matching voice (e.g. no deep masculine voices in KokoroTTS), you can still use really "off" voices to your advantage. Gravel - https://voca.ro/17dMqrKJIXLR Here's a map to see where they live in Kokoro latent space (right-hand side): |
|
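The "push away from the worst performers" idea above can be sketched as a negative blend. This assumes voices are flat lists of floats; `push_away` and its `strength` parameter are hypothetical names, not functions from the repo.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def push_away(start, worst_voices, strength=0.5):
    """Move `start` away from the centroid of the worst-scoring voices.

    result = start + strength * (start - centroid(worst_voices))
    """
    centroid = mean_vector(worst_voices)
    return [s + strength * (s - c) for s, c in zip(start, centroid)]


# Toy 2-D example: the worst voices sit low on the second axis,
# so the starting voice is pushed upward along it.
nudged = push_away([1.0, 1.0], [[0.0, 0.0], [2.0, 0.0]], strength=0.5)
```

The first axis is unchanged here because the worst voices average out to the starting value on it; only the axis where they consistently differ produces a push.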
Why can't you do gradient descent on the cross entropy, optimizing the voice embedding? Similar to how we fine-tune other models, but optimize the weights. |
Depends on what you're trying to do (i.e. voice cloning vs voice crafting), and many different hammers can functionally do the same task. |
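For contrast with the random walk, here is a toy sketch of the gradient-based alternative raised above. It makes a strong simplifying assumption: similarity is computed directly on the embedding, so the gradient has a closed form. In the real pipeline the gradient would have to flow through synthesis and the speaker encoder, and nothing in this thread confirms that path is differentiable end to end. It also uses cosine similarity rather than cross entropy, and all names are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def cosine_grad(v, t):
    """Gradient of cos(v, t) with respect to v.

    d/dv cos(v, t) = t / (|v||t|) - cos(v, t) * v / |v|^2
    """
    norm_v = math.sqrt(sum(x * x for x in v))
    norm_t = math.sqrt(sum(y * y for y in t))
    c = cosine(v, t)
    return [ti / (norm_v * norm_t) - c * vi / (norm_v * norm_v)
            for vi, ti in zip(v, t)]


def optimize_embedding(start, target, lr=0.5, steps=200):
    """Plain gradient ascent on cosine similarity to the target."""
    v = list(start)
    for _ in range(steps):
        grad = cosine_grad(v, target)
        v = [vi + lr * gi for vi, gi in zip(v, grad)]
    return v


start = [1.0, 0.0, 0.0]   # orthogonal to the target: similarity 0
target = [0.0, 1.0, 1.0]
tuned = optimize_embedding(start, target)
```

Unlike the random walk, each step here uses exact directional information, which is why it needs far fewer evaluations when a usable gradient is available.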

Full GPU Implementation and Optimization
Settings Configuration, Debug, Memory, Process Times logging, misc
Stuff ToDo: