Full GPU Rewrite, Performance Boost + misc #7

RobertAgee · 2025-06-16T23:20:47Z

Full GPU Implementation and Optimization

Updated feature analysis to SOTA SpeechBrain, TorchAudio, nn.audio methods
All feature analysis and tensor operations now run on the GPU and can run in parallel
Reused/simplified tensor operations wherever possible
Added a minimum feature similarity gate (feat sim must be >= best feat sim - 0.01) before the self-sim check, to bypass expensive audio generation if feat sim regresses too much
~3x faster on an RTX 4070 8GB, 10-15% CPU utilization on an Intel i9-14900HX
<0.75 GB allocated / 1.5 GB reserved
Patched Kokoro's memory leak (needs routine cache clearing; caps overhead memory usage)
10,000 iterations:

Random Walk Final Results for gravelierjej
Duration: 56.80 minutes
Best Voice: out/gravelierjej_jejraven_20250616_161107/gravelierjej_9550_0.42_0.48_jejraven.pt
Best Score: 0.42
Best Similarity: 0.48
Random Walk pt and wav files ---> out/gravelierjej_jejraven_20250616_161107
0it [56:48, ?it/s, GPU Stats: 0.6771GB allocated, 1.4491GB reserved
Process Times: Audio1 gen: 0.286750s, Audio2 gen: 0.222268s, Target Sim: 0.020977s,  Self Sim: 0.018055s, Feat Sim: 0.049192s, Total: 0.597364s]
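
For anyone reading along, the minimum feature similarity gate described above could look roughly like the sketch below. The function names and the callable-based structure are assumptions for illustration only, not the code in this PR; only the 0.01 tolerance comes from the description.

```python
from typing import Callable, Optional

# Tolerance taken from the PR description: feat sim must be >= best feat sim - 0.01.
FEAT_SIM_TOLERANCE = 0.01


def passes_feature_gate(feat_sim: float, best_feat_sim: float,
                        tolerance: float = FEAT_SIM_TOLERANCE) -> bool:
    """Return True when feature similarity has not regressed past the tolerance."""
    return feat_sim >= best_feat_sim - tolerance


def score_candidate(feat_sim: float, best_feat_sim: float,
                    expensive_self_sim: Callable[[], float]) -> Optional[float]:
    """Run the costly self-similarity step only for candidates that pass the gate.

    `expensive_self_sim` is a hypothetical stand-in for generating the second
    audio clip and comparing it against the first.
    """
    if not passes_feature_gate(feat_sim, best_feat_sim):
        return None  # skipped: no second audio generation for this candidate
    return expensive_self_sim()
```

Passing the expensive step in as a callable keeps the gate cheap: nothing is generated unless the candidate is still within tolerance of the best feature similarity seen so far.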

Settings Configuration, Debug, Memory, Process Times logging, misc

set true in utilities/kvw_config.json
loads automatically at program start
lots of stuff.... sweats lol
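
A minimal sketch of how settings might be loaded at program start, assuming a flat JSON file of boolean flags. The key names (`debug`, `log_memory`, `log_process_times`) are hypothetical; check utilities/kvw_config.json for the real ones.

```python
import json
from pathlib import Path

# Hypothetical keys; the real names in utilities/kvw_config.json may differ.
DEFAULTS = {"debug": False, "log_memory": False, "log_process_times": False}


def load_config(path: str = "utilities/kvw_config.json") -> dict:
    """Merge the JSON file over the defaults; use the defaults if the file is missing."""
    config = dict(DEFAULTS)
    config_file = Path(path)
    if config_file.exists():
        config.update(json.loads(config_file.read_text(encoding="utf-8")))
    return config
```

Merging over defaults means a flag that is absent from the file simply stays off, so older config files keep working when new settings are added.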

Stuff ToDo:

Most noted in code
Revisit scoring methodologies for any new optimizations possible (penalty, weighting, etc.)
- Notably, SpeechBrain cosine similarity is more accurate than Resemblyzer
- Idea: compare a collection of target audios against each other to get an average similarity, and use that as the confirmation threshold (~0.80)
Clean up docstrings
Move scorevoice() -> FitnessScorer
Add more feature-wise mutation strategy
Clean up variable naming (make it more legible)
Add convenient save/reloads
Offer an option to disable checkpoint wav/pt saves (useful for early checkpoints, but a performance crippler)
Use smaller kokoro model
Diagnose where speech bottlenecks are, speed up
Clean up console printing
Consider if unifying speech/voice generators makes sense performance wise
Reduce signature objects in calls if possible
Add more GPU feature analysis
Merge some functionalities from my kokovoicelab fork
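
The self-similarity baseline idea from the list above (comparing the target audios against each other) can be sketched as follows. This assumes each target clip has already been turned into a speaker embedding by some encoder such as SpeechBrain; that step is not shown, and plain Python lists stand in for tensors.

```python
import math
from itertools import combinations
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def self_similarity_baseline(embeddings: Sequence[Sequence[float]]) -> float:
    """Average pairwise cosine similarity across clips of the same speaker."""
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two target embeddings")
    return sum(cosine_similarity(a, b) for a, b in pairs) / len(pairs)
```

The resulting average would then serve as the realistic ceiling for "sounds like the same speaker", instead of expecting a generated voice to reach 1.0.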

Benchmark <0.75GB VRAM usage

Similarity checker

RobertAgee · 2025-06-16T23:22:51Z

Hey @RobViren, not going to push this to main just yet. I wanted to get your eyes on it first, and hopefully you get some time to try it out. 3x FASTER, totally on GPU, and a tiny footprint.

RobViren · 2025-06-16T23:57:28Z

Oh dang! You've been at work on this. I had not heard of SpeechBrain. Is it still avoiding overfitting and sounding like a demon? Kudos on the better GPU usage, super impressive. Gonna run it tonight.

RobertAgee · 2025-06-17T00:15:35Z

Haha, yeah, I've noticed that you really need a voice that's already pretty close to get decent results. Resemblyzer will rate things as super similar when in reality they aren't. SpeechBrain (SB), by contrast, is much more critical: it even rates target audios by the same speaker from the same recording as only a partial match, though technically their cos sim threshold is only 0.25. That's why I think benchmarking a speaker against themselves is the way to go.

Also, from working with kokolab before, I have a huge voice library and a toolbox of different voice model "surgery" methods that I think would fit nicely into a methodology here, so it's like surgery->randomwalk->surgery, repeated until it's pretty close.

Another idea: port Kokoro to TPU, since Kokoro is by far the biggest time sink. That could give another 5x or more speed boost to search the latent space faster, plus allow batch evals to find the best directional heading for mutations.

# Conflicts: # utilities/fitness_scorer.py # utilities/initial_selector.py # utilities/kvoicewalk.py

RobViren · 2025-06-17T00:48:15Z

Yeah. I really think maintaining a map of the results would really help to guide other walks. Still believe genetic algo is the way to go. Only even remotely feasible because of the small size. TPU would be great. It is just monkeys on a keyboard trying to clone a voice.


RobertAgee · 2025-06-17T02:26:40Z

Agreed! I have some ideas for making a 'smart' randomwalk wherein it can do 3 things (together or as separate explorative modes).

A: Batch process and compare 10-20 samples at once. Rank them by score and, along the axis of improved score, continue randomizing into those selected nodes and/or negative blend with the worst voices, just like moving away from a voice in kokovoicelab. Continue until the voice score begins to degrade, then go back to scanning for direction with a batch comparison; rinse, repeat. It's like picking up on a signal without being sure where it's coming from: keep going until it gets fainter, then reassess the next direction.

B: As the voice gets closer to the target audio similarity, decrease the size of randomization allowed. When far away, move at lightspeed, when close by go to impulse thrusters.

C: Feature-iterative randomwalk - target features in order of importance for human recognition (1. Pitch, 2. Prosody, etc., whatever that order might be). Maximize feature similarity for one feature at a time, then move on to maximize the next without regressing the previous features. A roundabout approach, like using one planet's gravity to slow you down without leaving the solar system, so to speak.

C2: Feature-focused randomwalk - target single feature and maximize for similarity. Create a matching voice for each feature. Then cobble them together like frankenstein's monster.
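
To make modes A and B a bit more concrete, here is a rough sketch that combines them: sample a batch of mutations, keep the best one only if it improves the score, and shrink the mutation size as the score rises. Everything here is an assumption for illustration (plain lists instead of voice tensors, a score in the 0 to 1 range, Gaussian noise, and the default step bounds); it is not the random walk in this PR.

```python
import random
from typing import Callable, List, Optional

Vector = List[float]


def step_size(similarity: float, max_step: float = 0.5, min_step: float = 0.01) -> float:
    """Mode B: large steps when far from the target, small steps when close."""
    similarity = min(max(similarity, 0.0), 1.0)  # clamp to [0, 1]
    return min_step + (max_step - min_step) * (1.0 - similarity)


def batch_step(current: Vector, score_fn: Callable[[Vector], float],
               batch_size: int = 16, rng: Optional[random.Random] = None) -> Vector:
    """Mode A: score a batch of random mutations and keep the best if it improves."""
    rng = rng or random.Random()
    current_score = score_fn(current)
    sigma = step_size(current_score)
    candidates = [[x + rng.gauss(0.0, sigma) for x in current]
                  for _ in range(batch_size)]
    best = max(candidates, key=score_fn)
    return best if score_fn(best) > current_score else current
```

Because a step is only accepted when it beats the current score, the score never decreases between calls; a real implementation would likely also track the direction of the winning mutation rather than resampling from scratch each time.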

======

By the way, I just restarted my WSL instance completely, and on a fresh instance it's clocking a 7-8x speedup (as opposed to the 3x I'd thought). The audio is a little shorter, but fingers crossed the actual performance gains are higher than expected. If you get benchmarks on your system, please share!

Step:577  Target Sim:0.255 Self Sim:0.673 Feature Sim:0.296 Score:0.34 Diversity:0.10 | 577/10000 [01:10<17:52,  8.79it/s, GPU Stats: 0.6749GB allocated, 1.2541GB reserved]]
0it [01:11, ?it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved
0it [01:11, ?it/s, GPU Stats: 0.6749GB allocated, 1.2541GB reserved Sim: 0.010980s,  Self Sim: 0.018089s, Feat Sim: 0.046040s, Total: 0.357781s]

Step:580  Target Sim:0.261 Self Sim:0.699 Feature Sim:0.291 Score:0.34 Diversity:0.05 | 580/10000 [01:11<21:41,  7.24it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved]]
0it [01:11, ?it/s, GPU Stats: 0.6749GB allocated, 1.2541GB reserved
0it [01:11, ?it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved Sim: 0.011041s,  Self Sim: 0.014831s, Feat Sim: 0.046310s, Total: 0.345149s]

Step:1228 Target Sim:0.262 Self Sim:0.689 Feature Sim:0.294 Score:0.35 Diversity:0.03 | 1228/10000 [02:19<15:05,  9.68it/s, GPU Stats: 0.6794GB allocated, 1.2562GB reserved]]
0it [02:19, ?it/s, GPU Stats: 0.6750GB allocated, 1.2541GB reserved
0it [02:19, ?it/s, GPU Stats: 0.6794GB allocated, 1.2562GB reserved Sim: 0.010740s,  Self Sim: 0.014637s, Feat Sim: 0.050137s, Total: 0.356808s]

Step:1738 Target Sim:0.258 Self Sim:0.696 Feature Sim:0.304 Score:0.35 Diversity:0.06 | 1738/10000 [03:11<13:48,  9.97it/s, GPU Stats: 0.6759GB allocated, 1.2562GB reserved]]

Random Walk Final Results for my_new_voice
Duration: 17.74 minutes
Best Voice: out/my_new_voice_tpih-78_20250616_215525/my_new_voice_5913_0.40_0.32_tpih-78.pt
Best Score: 0.40
Best Similarity: 0.32
Random Walk pt and wav files ---> out/my_new_voice_tpih-78_20250616_215525
0it [17:44, ?it/s, GPU Stats: 0.6800GB allocated, 1.2562GB reserved
Process Times: Audio1 gen: 0.090282s, Audio2 gen: 0.195020s, Target Sim: 0.010740s,  Self Sim: 0.016085s, Feat Sim: 0.052650s, Total: 0.364867s]

RobertAgee · 2025-06-17T02:49:46Z

Oh, and I should add: take the worst performers vs target_audio during the top_performers method and use them to push the starting voice strongly in the right direction. So even if there's no great matching voice (e.g. no deep masculine voices in KokoroTTS), you can still use really "off" voices to your advantage.
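
A tiny sketch of what that push could look like, assuming voices are flat vectors and "pushing away" means extrapolating from the centroid of the worst voices through the starting voice. The function names and the `strength` parameter are made up for illustration.

```python
from typing import List, Sequence

Vector = List[float]


def centroid(voices: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of a non-empty set of equal-length voice vectors."""
    count = len(voices)
    return [sum(values) / count for values in zip(*voices)]


def push_away(start: Sequence[float], worst_voices: Sequence[Sequence[float]],
              strength: float = 0.5) -> Vector:
    """Move the starting voice away from the centroid of the worst-scoring voices."""
    worst = centroid(worst_voices)
    return [s + strength * (s - w) for s, w in zip(start, worst)]
```

With `strength=0` the starting voice is unchanged; larger values move it further along the line pointing away from the badly matching voices.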

Gravel - https://voca.ro/17dMqrKJIXLR
The Narrator - https://vocaroo.com/1lS1gUoIZYRu
Narrator Lite - https://vocaroo.com/1ezeDU6Nzw9R
King Arthur - https://voca.ro/13JsCly5B1oX

Here's a map to see where they live in Kokoro latent space (right hand side):

hidoba · 2025-07-04T16:16:40Z

Why can't you do gradient descent on the cross entropy, optimizing the voice embedding? Similar to how we fine-tune other models, where we optimize the weights.

RobertAgee · 2025-07-04T16:58:15Z

Why can't you do gradient descent on the cross entropy, optimizing the voice embedding? Similar to how we fine-tune other models, where we optimize the weights.

Depends on what you're trying to do (ie voice cloning vs voice crafting), and many different hammers can functionally do the same task.

RobertAgee and others added 17 commits June 13, 2025 01:35

messy, started adding speechbrain implementation

96507c6

Create generated.mp3

aa3c0b6

upload readme audio files

7ccf4ea

Update and rename generated.mp3 to .gitkeep

a4912e8

add speechbrain to fitness_scorer for greater accuracy

ad8c9bb

GPU optimization in process

2fbc497

GPU optimization, tensor cleanup, memory use tracking

f2eede1

Further GPU optimization, memory use tracking

52dc9b6

Benchmark <0.75GB VRAM usage

Full GPU optimization, memory and timing logs, config settings

81f1ea2

messy, started adding speechbrain implementation

4675774

add speechbrain to fitness_scorer for greater accuracy

63910d3

GPU optimization in process

c7e6b6d

GPU optimization, tensor cleanup, memory use tracking

d9fcc8e

Further GPU optimization, memory use tracking

67f5f98

Benchmark <0.75GB VRAM usage

Full GPU optimization, memory and timing logs, config settings

4e67736

Merge branch 'RobViren:main' into similarity-checker

60aee79

Merge pull request #1 from RobertAgee/similarity-checker

b834cf1

Similarity checker

RobertAgee added 2 commits June 16, 2025 20:21

small lints

4535210

Merge remote-tracking branch 'origin/main'

5de2b40

# Conflicts: # utilities/fitness_scorer.py # utilities/initial_selector.py # utilities/kvoicewalk.py

RobertAgee added 3 commits June 16, 2025 21:48

Add packages to toml, update config settings

e607c0e

correct spelling in pyproject.toml

66b2688

correct spelling in pyproject.toml

cbc75f1

RobertAgee added 2 commits June 17, 2025 00:47

small bug fix for interpolate_start

f3cbf03

improve pt saving by moving to cpu first

30c1884
