@RobertAgee
Collaborator

Full GPU Implementation and Optimization

  • Updated feature analysis to SOTA SpeechBrain, TorchAudio, nn.audio methods
  • All feature analysis, tensor operations now on GPU, can run in parallel
  • Reused/simplified tensor operations wherever possible
  • Added a minimum feature similarity gate (feat sim must be >= best feat sim - 0.01) that runs before the self-sim check, so the expensive audio gen is bypassed if feat sim regresses too much
  • ~3x faster on RTX4070 8GB, 10-15% CPU utilization on Intel i9-14900HX
  • <0.75 GB allocation / 1.5 reserved
  • Patched Kokoro's memory leak (needs routine cache clearing; overhead memory usage is now capped)
  • 10,000 iterations:
Random Walk Final Results for gravelierjej
Duration: 56.80 minutes
Best Voice: out/gravelierjej_jejraven_20250616_161107/gravelierjej_9550_0.42_0.48_jejraven.pt
Best Score: 0.42
Best Similarity: 0.48
Random Walk pt and wav files ---> out/gravelierjej_jejraven_20250616_161107
0it [56:48, ?it/s, GPU Stats: 0.6771GB allocated, 1.4491GB reserved
Process Times: Audio1 gen: 0.286750s, Audio2 gen: 0.222268s, Target Sim: 0.020977s,  Self Sim: 0.018055s, Feat Sim: 0.049192s, Total: 0.597364s]
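
The minimum feature similarity gate from the list above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the function name `passes_feature_gate` and the constant `FEAT_SIM_TOLERANCE` are hypothetical, and only the 0.01 margin comes from the description.

```python
# Hypothetical names; only the 0.01 margin is taken from the PR description.
FEAT_SIM_TOLERANCE = 0.01


def passes_feature_gate(feat_sim: float, best_feat_sim: float,
                        tolerance: float = FEAT_SIM_TOLERANCE) -> bool:
    """Return True when a candidate is worth the expensive second audio
    generation and self-sim check, i.e. its feature similarity has not
    dropped more than `tolerance` below the best value seen so far."""
    return feat_sim >= best_feat_sim - tolerance


if __name__ == "__main__":
    best = 0.30
    for candidate in (0.35, 0.295, 0.28):
        action = "run self-sim" if passes_feature_gate(candidate, best) else "skip"
        print(f"feat_sim={candidate:.3f} -> {action}")
```

The point of the gate is ordering: the cheap check runs first, so a candidate that regresses never pays for the second audio generation.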

Settings Configuration, Debug, Memory, Process Times logging, misc

  • set true in utilities/kvw_config.json
  • loads automatically at program start
  • lots of stuff.... sweats lol
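
For readers who have not opened the config file, here is a rough sketch of how boolean logging switches could be loaded at program start. The key names (`debug`, `log_memory`, `log_process_times`) and the `Settings` class are assumptions made for illustration; the actual keys in utilities/kvw_config.json may differ.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class Settings:
    # Hypothetical switches; the real keys in kvw_config.json may differ.
    debug: bool = False
    log_memory: bool = False
    log_process_times: bool = False


def load_settings(raw_json: str) -> Settings:
    """Parse a JSON object and keep only the keys Settings knows about.
    Missing keys fall back to the defaults (everything off)."""
    data = json.loads(raw_json)
    known = {f.name for f in fields(Settings)}
    return Settings(**{k: bool(v) for k, v in data.items() if k in known})


if __name__ == "__main__":
    # In real use the string would be read from the config file on disk.
    print(load_settings('{"debug": true, "log_process_times": true}'))
```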

Stuff ToDo:

  • Most noted in code
  • Revisit scoring methodologies for any new optimizations possible (penalty, weighing, etc)
    • Notably, SpeechBrain cosine similarity is more accurate than Resemblyzer
    • Idea: Use collection of target audios compared to themselves to get average similarity, use that for confirmation (~0.80)
  • Clean up docstrings
  • Move scorevoice() -> FitnessScorer
  • Add more feature-wise mutation strategy
  • Clean up variable naming (make it more legible)
  • Add convenient save/reloads
  • Offer an option to disable checkpoint wav/pt saves (useful for early checkpoints, performance crippler)
  • Use smaller kokoro model
  • Diagnose where speech bottlenecks are, speed up
  • Clean up console printing
  • Consider if unifying speech/voice generators makes sense performance wise
  • Reduce signature objects in calls if possible
  • Add more GPU feature analysis
  • Merge some functionalities from my kokovoicelab fork
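
The self-benchmark idea in the list above (compare a collection of target audios against each other and use the average as the reference point) could look something like this. It is a sketch under assumptions: embeddings are plain lists of floats here, whereas in practice they would come from a speaker encoder such as SpeechBrain, and the helper names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def self_similarity_baseline(embeddings):
    """Average pairwise cosine similarity across clips of the same speaker.
    The result estimates how high a 'true match' actually scores."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two target clips")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    clips = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.85, 0.15, 0.3]]
    print(f"baseline: {self_similarity_baseline(clips):.3f}")
```

A candidate voice's target similarity can then be judged against this baseline rather than against a fixed global threshold.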

@RobertAgee
Collaborator Author

Hey @RobViren, I'm not going to push this to main just yet. I wanted to get your eyes on it, and hopefully you get some time to try it out. It's 3x faster, runs totally on GPU, and has a tiny footprint.

@RobViren
Owner

Oh dang! You've been at work on this. I had not heard of SpeechBrain. Is it still avoiding overfitting and sounding like a demon? Kudos on the better GPU usage, super impressive. Gonna run it tonight.

@RobertAgee
Collaborator Author

Haha, yeah, I've noticed that you really need a voice that's already pretty close to get decent results. Resemblyzer will rate things as super similar when in reality they aren't. SB, by contrast, is much more critical, even rating target audios by the same speaker from the same recording as only a partial match. Technically their cos sim threshold is only 0.25, though, which is why I think benchmarking a speaker against themselves is the way to go.

Also, from working with kokolab before, I have a huge voice library and a toolbox of different voice model "surgery" methods that I think would fit nicely into a methodology here, so it's like surgery -> randomwalk -> surgery, repeated until it's pretty close.

Another idea: porting Kokoro to TPU. Kokoro is by far the biggest time sink, so that could give another 5x or more speed boost for searching the latent space faster, plus allow batch evals to find the best directional heading for mutations.

# Conflicts:
#	utilities/fitness_scorer.py
#	utilities/initial_selector.py
#	utilities/kvoicewalk.py
@RobViren
Owner

RobViren commented Jun 17, 2025 via email

@RobertAgee
Collaborator Author

Agreed! I have some ideas for making a 'smart' randomwalk wherein it can do 3 things (together or as separate explorative modes).

A: Batch process and compare something like 10-20 samples at once. Rank them by score and, along the axis of improved score, continue randomizing into those selected nodes and/or negative blend with the worst voices, just like moving away from a voice in kokovoicelab. Continue until the voice score begins to degrade, then go back to scanning for direction with a batch comparison. Rinse, repeat. It's like picking up on a signal without being sure where it's coming from: keep going until it gets fainter, then reassess the next direction.

B: As the voice gets closer to the target audio similarity, decrease the size of randomization allowed. When far away, move at lightspeed; when close by, go to impulse thrusters.

C: Feature-iterative randomwalk - target features in order of importance to human recognition: 1. Pitch, 2. Prosody, etc., whatever that order might be. Maximize feature similarity for one feature at a time, then move on to maximize the next without regressing the previous features. A roundabout approach, like using one planet's gravity to slow you down without leaving the solar system, so to speak.

C2: Feature-focused randomwalk - target a single feature and maximize for similarity. Create a matching voice for each feature, then cobble them together like Frankenstein's monster.
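
To make modes A and B above concrete, here is a toy sketch that combines them. Everything in it is an assumption for illustration: voices are plain lists of floats rather than Kokoro tensors, `score_fn` stands in for the real scorer and is assumed to return a value in [0, 1], and the names `step_size` and `batch_step` are hypothetical.

```python
import random


def step_size(similarity, max_step=0.5, min_step=0.01):
    """Mode B: large mutations when far from the target, small ones when close."""
    s = min(max(similarity, 0.0), 1.0)
    return min_step + (max_step - min_step) * (1.0 - s)


def batch_step(voice, score_fn, rng, batch_size=16):
    """Mode A: sample a batch, rank it, then move toward the best candidate
    and away from the worst one. Never returns a voice that scores lower
    than the input."""
    sigma = step_size(score_fn(voice))
    candidates = [[v + rng.gauss(0.0, sigma) for v in voice]
                  for _ in range(batch_size)]
    ranked = sorted(candidates, key=score_fn, reverse=True)
    best, worst = ranked[0], ranked[-1]
    # Heading estimate: the direction from the worst sample to the best one.
    proposal = [v + 0.5 * (b - w) for v, b, w in zip(voice, best, worst)]
    return max([voice, best, proposal], key=score_fn)


if __name__ == "__main__":
    target = [0.3, -0.2, 0.8]

    def toy_score(v):
        # Stand-in for the real scorer: 1.0 at the target, lower elsewhere.
        return 1.0 / (1.0 + sum((x - t) ** 2 for x, t in zip(v, target)))

    rng = random.Random(0)
    voice = [0.0, 0.0, 0.0]
    for _ in range(50):
        voice = batch_step(voice, toy_score, rng)
    print(f"final score: {toy_score(voice):.3f}")
```

In a real run, each candidate score costs an audio generation, so the batch size trades search quality against time per step.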

======

By the way, I just restarted my WSL instance completely, and on a fresh instance it's clocking a 7-8x speedup (as opposed to the 3x I'd thought). The audio is a little shorter, but fingers crossed the actual performance gains are higher than expected. If you get benchmarks on your system, please share!

Step:577  Target Sim:0.255 Self Sim:0.673 Feature Sim:0.296 Score:0.34 Diversity:0.10 | 577/10000 [01:10<17:52,  8.79it/s, GPU Stats: 0.6749GB allocated, 1.2541GB reserved]]
0it [01:11, ?it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved
0it [01:11, ?it/s, GPU Stats: 0.6749GB allocated, 1.2541GB reserved Sim: 0.010980s,  Self Sim: 0.018089s, Feat Sim: 0.046040s, Total: 0.357781s]
Step:580  Target Sim:0.261 Self Sim:0.699 Feature Sim:0.291 Score:0.34 Diversity:0.05 | 580/10000 [01:11<21:41,  7.24it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved]]
0it [01:11, ?it/s, GPU Stats: 0.6749GB allocated, 1.2541GB reserved
0it [01:11, ?it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved Sim: 0.011041s,  Self Sim: 0.014831s, Feat Sim: 0.046310s, Total: 0.345149s]
Step:1228 Target Sim:0.262 Self Sim:0.689 Feature Sim:0.294 Score:0.35 Diversity:0.03 | 1228/10000 [02:19<15:05,  9.68it/s, GPU Stats: 0.6794GB allocated, 1.2562GB reserved]]
0it [02:19, ?it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved
0it [02:19, ?it/s, GPU Stats: 0.6794GB allocated, 1.2562GB reserved Sim: 0.010740s,  Self Sim: 0.014637s, Feat Sim: 0.050137s, Total: 0.356808s]
Step:1738 Target Sim:0.258 Self Sim:0.696 Feature Sim:0.304 Score:0.35 Diversity:0.06 | 1738/10000 [03:11<13:48,  9.97it/s, GPU Stats: 0.6759GB allocated, 1.2562GB reserved]]

Random Walk Final Results for my_new_voice
Duration: 17.74 minutes
Best Voice: out/my_new_voice_tpih-78_20250616_215525/my_new_voice_5913_0.40_0.32_tpih-78.pt
Best Score: 0.40
Best Similarity: 0.32
Random Walk pt and wav files ---> out/my_new_voice_tpih-78_20250616_215525
0it [17:44, ?it/s, GPU Stats: 0.6800GB allocated, 1.2562GB reserved
Process Times: Audio1 gen: 0.090282s, Audio2 gen: 0.195020s, Target Sim: 0.010740s,  Self Sim: 0.016085s, Feat Sim: 0.052650s, Total: 0.364867s]

@RobertAgee
Collaborator Author

RobertAgee commented Jun 17, 2025

Oh, and I should add: take the worst performers vs target_audio during the top_performers method and use them to push the starting voice strongly in the right direction. So even if there's no great matching voice (e.g. no deep masculine voices in KokoroTTS), you can still use really "off" voices to your advantage.
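
One way to read "push the starting voice away from the worst performers" is a negative blend against their centroid. The sketch below assumes voices are plain lists of floats (real Kokoro voices are tensors in .pt files), and both the function name `push_away_from_worst` and the `strength` parameter are hypothetical.

```python
def push_away_from_worst(start, worst_voices, strength=0.5):
    """Move `start` away from the centroid of the worst-scoring voices.
    strength=0 leaves the voice unchanged; larger values push harder."""
    if not worst_voices:
        return list(start)
    n = len(worst_voices)
    # Per-dimension mean of the worst voices.
    centroid = [sum(column) / n for column in zip(*worst_voices)]
    return [s + strength * (s - c) for s, c in zip(start, centroid)]


if __name__ == "__main__":
    start = [1.0, 0.0]
    worst = [[0.0, 0.0], [0.0, 2.0]]
    print(push_away_from_worst(start, worst))
```

Whether a push like this actually improves the score would still need to be confirmed by re-scoring the result against the target audio.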

Gravel - https://voca.ro/17dMqrKJIXLR
The Narrator - https://vocaroo.com/1lS1gUoIZYRu
Narrator Lite - https://vocaroo.com/1ezeDU6Nzw9R
King Arthur - https://voca.ro/13JsCly5B1oX

Here's a map to see where they live in Kokoro latent space (right hand side):

[Image: voice_pca_plot (2)]

@hidoba

hidoba commented Jul 4, 2025

Why can't you do gradient descent on the cross entropy, optimizing the voice embedding? Similar to how we fine-tune other models but optimize the weights.

@RobertAgee
Collaborator Author

> Why can't you do gradient descent on the cross entropy, optimizing the voice embedding? Similar to how we fine-tune other models but optimize the weights.

It depends on what you're trying to do (i.e. voice cloning vs voice crafting), and many different hammers can functionally do the same task.
