Mimicking anyone's voice using RVC 2 (in real-time!) Part 2 - Voice transfer

↝      Description

In the last year, I have fiddled around a ton with Generative AI models, though that has always been in the visual sector; be it AI upscaling or my dives into generating images using Disco Diffusion, and later on Stable Diffusion (which I still do very frequently). Apparently, not only the visual AI models have taken huge strides in the last few months, but the audio side of things as well.

↝      Tags


↝      Date

August 2, 2023

Part 1?

Check out how to actually train a voice here if you haven't already →.

Vocal Isolation

Reiterating from part 1:

Dataset advice

Isolate your voice as much as possible, in the highest quality possible. This may seem like obvious advice, but it needs to be underlined: As for all machine learning: BS in, BS out.

I recommend using Ultimate Vocal Remover→ to isolate your desired voice using some nifty AI models. Of the models available, the best ones seem to be "Kim Vocal 2" and "MDX-NET_Main_438"; The latter is a VIP model, though you can get the VIP code from UVR's Patreon→.

For the best results, it's essential to eliminate reverb and echo from your dataset. Ideally, you should minimize these effects during the recording process. However, in case there just is reverb, there's a solution available under MDX-Net called Reverb HQ, which allows you to export reverbless audio by selecting the 'No Other' option. In some cases, Reverb HQ might not completely remove the echo. One option is to process the vocal output through a VR Architecture model: I'd recommend De-Echo-DeReverb. If, for some reason, that's still not enough, there's still the De-Echo normal model, which is the most aggressive echo removal tool available and can be used as a last resort.

To remove noise in the audio, I'd recommend using Audacity's Noise gate or its other noise removal tools.

Be sure to export both "Other" and "No other" versions of your audio in UVR, as you'll need the voice for the transfer process, and the music to put your transferred voice on top of in the final step.

Vocal Transfer

↝       Use volume envelope

The lower you set this to, the more it will capture the original volume range of the original song. A value of 1.0 will be equally loud throughout the whole conversion; 0 will make it mimic the volume range of the original as much as possible.
I would recommend you set this volume setting to a decently low value such as 0.25 or 0.2.

↝       Transpose / Pitch

This changes the given audio in half-steps. E.g., -12 or 12 will both stay in the original key but at different octaves. Good for extreme differences, like a male singer doing a song sang by a female.
Just fiddle around with this to see where your model performs best (and what sounds most like the person you trained the voice on). Remember what you set this to though, as you'll need to adjust the pitch of the original song (without the vocals) to match the converted voice, else it sounds off. You can do so in Audacity by going to Effect > Pitch > Change Pitch.

↝       Search feature ratio

This value controls how much influence the .index file has on the voice model's output. (The index mainly controls the "accent" for the model.)
If your model's dataset isn't very long or it's not very high quality (or both), this should be lower, and if it's a high quality model, you can afford to go a bit higher. Generally, my recommended value would be 0.6-0.8, and reduce it if you think it's truly necessary.

↝       Advanced Settings: Pitch extraction algorithm

The best option here is generally rmvpe, and I would recommend that, or mangio-crepe with different hop sizes between 64-192 for most cases. Mangio-crepe tends to be better for "smoothed out" results, but higher hop length values will lead to less pitch precision.
You should experiment to see which sounds best for your specific song if you're unsure.

Once you've dialed in your settings, you can go ahead and convert your voice! On a 4090 using mangio-crepe it is reasonably fast, converting a 3-minute song in about 10 seconds.

Vocal + not Vocals = Song

As mentioned before, you'll need to adjust the pitch of the original song to match the converted voice (if you've adjusted the pitch). Also, now is the time to use the "Other" version of your isolated voice (Which will be the song's instrumentals), as you'll need to put it on top of the converted voice. The simplest way I've found is to layer them on top of each other in Audacity, and then exporting as MP3. Et voilà, we're all done!

Part 3: Real-time voice conversion

In Part 3, I'll show you how to use the model we've trained to change your voice in real-time from microphone input →.


Please, please, please don't use this information for malicious purposes. I'm not responsible for any damage you may cause with this information. I'm just a guy who likes to play around with AI models, and I'm sharing my findings with you.

If you have any questions, feel free to send me an email. I'll try to answer as soon as possible.

A lot of this information was taken from kalomaze's RVC 2 cover guide→, just omitting some irrelevant information and adding my own findings of what works best and what doesn't.

←      Go Back