Mimicking anyone's voice using RVC 2 (in real-time!) Part 3 - Real-time

↝      Description

In the last year, I have fiddled around a ton with Generative AI models, though always in the visual sector, be it AI upscaling or my dives into generating images using Disco Diffusion and, later on, Stable Diffusion (which I still do very frequently). Apparently it's not only the visual AI models that have taken huge strides in the last few months; the audio side of things has as well.

↝      Tags

#AI

↝      Date

August 4, 2023

Parts 1 & 2?

Check out how to actually train a voice here if you haven't already →,

and how to transfer it to existing audio vocals here →.

Setting up the client

For real-time voice conversion, we're going to use W-Okada's Voice Changer client. You can find its GitHub repository here →.
Scrolling down, you'll want the CUDA version if you're running an NVIDIA card; on AMD, look for the DirectML version instead. Sadly, there's still no Linux version of this client, so you'll need to use Windows for this.

Run, VC client, run!

Once you've downloaded the client, you'll need to extract it. Look for "start_http.bat" and run it; ignore the Windows safety warnings (trusting open-source software and all that).

The client already comes with a few preset models, though of course we want to use our trained voice model from Parts 1 & 2. To do so, click on the "Edit" button next to the voice models, scroll down to an empty slot, and "Upload" a new RVC model. For the Model, choose the .pth file from the training process that you were happy with; as for the Index, we don't actually need it, so leave that empty.

This is what it should look like when adding your model
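If you want to sanity-check your .pth file before uploading it, you can peek inside it with PyTorch. This is just an optional sketch; "my_voice.pth" is a placeholder for your own model file.

```python
import torch  # pip install torch

# Optional sanity check: load the checkpoint on CPU and look at its structure.
# "my_voice.pth" is a placeholder name for your trained model from Parts 1 & 2.
ckpt = torch.load("my_voice.pth", map_location="cpu")
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # RVC checkpoints usually bundle weights plus some config/metadata
else:
    print(type(ckpt))
```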

You can now select the model, click on "Start", and try it out! There is some delay between you speaking and the trained voice speaking back, but we'll be able to adjust that with a few settings later. Take it for a spin and see how it sounds!

Turning the knobs

↝       Gain

The output might be a little quieter than we'd like, so turning up the "out" gain to something like 2 might fix that; just keep in mind that we don't want to blow anyone's ears off.
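For intuition, here's a minimal sketch (not the client's actual code) of what an output gain of 2 does to float audio samples, and why cranking it too far causes clipping:

```python
import numpy as np

def apply_gain(samples: np.ndarray, gain: float = 2.0) -> np.ndarray:
    """Scale float audio (expected range -1..1) and clip anything that overshoots."""
    return np.clip(samples * gain, -1.0, 1.0)

# A gain of 2 is roughly +6 dB. Anything pushed outside -1..1 gets clipped,
# which is the harsh distortion you hear when the gain is set too high.
```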

↝       Tune

This is actually the pitch of the output, so if your trained voice is higher or lower than your actual one, fiddle with this slider until it sounds right.
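If the Tune slider works in semitones (an assumption on my part, but it behaves that way), the math behind it is the standard equal-temperament ratio:

```python
def semitone_ratio(semitones: float) -> float:
    """Frequency ratio for a pitch shift of n semitones (12-tone equal temperament)."""
    return 2.0 ** (semitones / 12.0)

print(semitone_ratio(12))   # 2.0 -> one octave up
print(semitone_ratio(-12))  # 0.5 -> one octave down
```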

↝       Chunk

The chunk size determines how much of your voice is processed at once. The higher the value, the more delay between you speaking and the output, but since the model has more data to work with, the result is more accurate. I'd recommend starting with a value of 384 (which works out to about one second of delay) and seeing how much further down you're willing to go.
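The exact unit of the chunk slider depends on the client version, but assuming one chunk unit is 128 samples at 48 kHz (which lines up with 384 coming out to about one second), the delay works out like this:

```python
SAMPLE_RATE = 48_000          # assumption: the client processes audio at 48 kHz
SAMPLES_PER_CHUNK_UNIT = 128  # assumption: one chunk unit = 128 samples

def chunk_delay_seconds(chunk: int) -> float:
    """Rough delay added by a given chunk setting."""
    return chunk * SAMPLES_PER_CHUNK_UNIT / SAMPLE_RATE

print(chunk_delay_seconds(384))  # ~1.02 s, matching the "one second" above
print(chunk_delay_seconds(192))  # ~0.51 s, at the cost of some accuracy
```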

↝       Threshold

I've never seen turning this down make the output sound better, so just max it out at 0.001.
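Assuming the threshold is a linear amplitude value (my reading of the slider, not something the client documents), 0.001 corresponds to about -60 dBFS, i.e. only near-silence gets treated as silence:

```python
import math

threshold = 0.001  # the value recommended above
print(20 * math.log10(threshold))  # -60.0 dBFS: only very quiet input gets gated out
```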

Making it work in other apps

Great, so we can now hear what we say in someone else's voice a second or so later. But how do we get to the fun part of actually using it?

We need to "pipe" the output from the client as a microphone input into other apps. To do so, we'll need to install a virtual audio cable. I'd recommend VB-Cable →.
Install it (you might need to restart your PC), and then go to your sound settings. Under "Input", you should now see "CABLE Output (VB-Audio Virtual Cable)". Set that as your default input device. In the VC client, set the input device to your microphone, and the output device to the VB-Cable input.
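If you want to double-check from code that the cable's endpoints actually show up, the sounddevice package can list them. A throwaway sketch:

```python
import sounddevice as sd  # pip install sounddevice

# Print every audio device whose name mentions the virtual cable.
for idx, dev in enumerate(sd.query_devices()):
    if "CABLE" in dev["name"]:
        direction = "input" if dev["max_input_channels"] > 0 else "output"
        print(f'{idx}: {dev["name"]} ({direction})')
```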

Now, you can go to Discord, Zoom, or whatever other app you want to use this in, and set the input device to the VB-Cable output (do this explicitly if the app lets you, just to be sure it isn't still pinned to your normal microphone). You should now be able to use your trained voice in real-time in any app you want! Call a fellow student as your professor and tell them they're failing your class, or call your boss as your coworker and tell them you're quitting. The possibilities are endless! (Please don't actually do this, this is just a joke, don't do anything malicious with this. Please please please. Thank you.)

Moral complications & Disclaimer

As mentioned before, please don't use this information for malicious purposes. I'm not responsible for any damage you may cause with this information. I'm just a guy who likes to play around with AI models, and I'm sharing my findings with you.

Now we've run through the whole process of cloning someone else's voice and being able to convincingly use it in real-time. As exciting as this sounds and as much fun as it is to play around with, I'd like to take a moment to talk about the moral implications of this.
As with any technology, there are good and bad ways to use it. I'm not here to tell you what to do with this information, but I'd like to ask you to think about the consequences of your actions. I'm not going to go into detail about what could happen if this technology is used maliciously, but I'm sure you can imagine the possibilities. Be careful who you're talking to over the phone, what data you're giving out, and what you're saying in general. In this tutorial we've seen that with just a few minutes of training data, which can come from anywhere (voice recordings, a recorded call, a short video), we can create a model that convincingly mimics someone. Don't trust everything you hear, and be mindful of other people's privacy as well as your own.

I'm planning to write more about the importance of privacy and an open internet in the future.
If you have any questions, feel free to send me an email. I'll try to answer as soon as possible.

I'd like to thank Jarod for this video →, which helped me a lot in understanding the process of setting everything up and using it myself.
Thank you for reading!
