Mimicking anyone's voice using RVC 2 (in real-time!) Bonus - Text to Speech

↝      Description

In the last year, I have fiddled around a ton with Generative AI models, though that has always been in the visual sector; be it AI upscaling or my dives into generating images using Disco Diffusion, and later on Stable Diffusion (which I still do very frequently). Apparently, not only the visual AI models have taken huge strides in the last few months, but the audio side of things as well.

↝      Tags

#AI

↝      Date

August 11, 2023

Previously on: Voice Synthesis with me

After learning how to train a voice model on anyone's voice →, transferring it to existing audio vocals →, and even using it in real-time voice calls →, I thought it would be a good idea to loop around and go a step further: Text to Speech.

There are already some paid (ew), closed-source (ew ew ew) tools for this out there, though of course we'll do it the open-source way using TorToiSe TTS.
If at any point you get lost, check out the official wiki here →. Also be sure to check out this video by Jarod →, which basically runs through this whole process and you can follow along quite nicely.

TorToiSetup

Before embarking on the exciting journey of voice cloning with TorToiSe, there are a few prerequisites you need to have in place:

Once we're ready to roll, let's go ahead and set up TorToiSe.

↝       Windows

  1. Open Command Prompt from the Start Menu and navigate to your desired working directory using cd.
  2. Run the command: git clone https://git.ecker.tech/mrq/ai-voice-cloning.
  3. Execute the setup script based on your GPU:
    • AMD: Run setup-directml.bat
    • NVIDIA: Run setup-cuda.bat

↝       Linux enthusiasts

  1. Download the TorToiSe repository by running: git clone https://git.ecker.tech/mrq/ai-voice-cloning.
  2. Navigate into the downloaded directory.
  3. Make shell scripts executable: chmod +x *.sh.
  4. Based on your GPU, run the respective setup script:
    • AMD: Run ./setup-rocm.sh
    • NVIDIA: Run ./setup-cuda.sh
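
If you tend to mix up which script belongs to which setup, here is a tiny lookup sketch. The script names are the ones listed above; the function name and the "windows"/"linux" and "amd"/"nvidia" keys are just labels I made up for illustration, not anything the repository provides.

```python
# Setup scripts as listed in the steps above, keyed by (OS, GPU vendor).
SETUP_SCRIPTS = {
    ("windows", "amd"): "setup-directml.bat",
    ("windows", "nvidia"): "setup-cuda.bat",
    ("linux", "amd"): "./setup-rocm.sh",
    ("linux", "nvidia"): "./setup-cuda.sh",
}


def setup_script(os_name: str, gpu_vendor: str) -> str:
    """Return the setup script for an OS / GPU combination (hypothetical helper)."""
    key = (os_name.strip().lower(), gpu_vendor.strip().lower())
    if key not in SETUP_SCRIPTS:
        raise ValueError(f"No setup script listed for {os_name!r} with {gpu_vendor!r}")
    return SETUP_SCRIPTS[key]


if __name__ == "__main__":
    print(setup_script("Linux", "NVIDIA"))
```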

With TorToiSe set up correctly, we can start the WebUI using start.bat on Windows or ./start.sh on Linux. This will (hopefully) open a browser window on 127.0.0.1:7860 right away. If not, you can go there yourself.
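
If no browser window shows up and you are not sure whether the WebUI is running at all, a quick port check can help. This is a minimal sketch using only Python's standard library; the function name is my own, and it only tells you that something is listening on 127.0.0.1:7860, not that it is actually TorToiSe.

```python
import socket


def webui_is_up(host: str = "127.0.0.1", port: int = 7860, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds (hypothetical helper)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or the host is unreachable.
        return False


if __name__ == "__main__":
    print("WebUI reachable:", webui_is_up())
```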

T(o)raining a voice

Now that we have TorToiSe set up, we can start training a voice model; one thing we need before diving into that is data. In German we have a saying: "Aller guten Dinge sind drei", which translates to "All good things come in threes". So without further ado, once more:

Dataset advice

Isolate your voice as much as possible, in the highest quality possible. This may seem like obvious advice, but it needs to be underlined. As with all machine learning: BS in, BS out.

I recommend using Ultimate Vocal Remover→ to isolate your desired voice using some nifty AI models. Of the models available, the best ones seem to be "Kim Vocal 2" and "MDX-NET_Main_438"; the latter is a VIP model, though you can get the VIP code from UVR's Patreon→.

For the best results, it's essential to eliminate reverb and echo from your dataset. Ideally, you should minimize these effects during the recording process. However, if your recordings do contain reverb, there's a solution available under MDX-Net called Reverb HQ, which allows you to export reverbless audio by selecting the 'No Other' option. In some cases, Reverb HQ might not completely remove the echo. One option is to process the vocal output through a VR Architecture model: I'd recommend De-Echo-DeReverb. If, for some reason, that's still not enough, there's still the De-Echo normal model, which is the most aggressive echo removal tool available and can be used as a last resort.

To remove noise in the audio, I'd recommend using Audacity's Noise gate or its other noise removal tools.

Finally: You don't need to cut up your audio into pieces yourself; RVC will cut it into 4-second chunks automatically. As for how much training data you need, I'd recommend at least a few minutes of audio, but the more, the merrier. If you still have your dataset from the previous articles, just go ahead and use that!
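
To check whether you actually have "a few minutes" of audio, you can add up the length of your files. The sketch below uses Python's built-in wave module, so it only handles uncompressed PCM .wav files; the function names are mine, and the chunk estimate simply takes the 4-second figure above at face value.

```python
import wave
from pathlib import Path

CHUNK_SECONDS = 4  # chunk length mentioned above, used only for a rough estimate


def wav_seconds(path) -> float:
    """Duration of one PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def dataset_summary(folder):
    """Return (total seconds, rough number of full chunks) for all .wav files in folder."""
    total = sum(wav_seconds(p) for p in sorted(Path(folder).glob("*.wav")))
    return total, int(total // CHUNK_SECONDS)


# Example usage (the folder name is a placeholder):
# total, chunks = dataset_summary("my_recordings")
# print(f"{total / 60:.1f} minutes, roughly {chunks} chunks")
```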

Create a subfolder in /voices/, name it whatever you want your voice to be called and move your training audio files there.
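
If you'd rather script this step, something like the following would do. It is a sketch under a few assumptions: the repository root is wherever you cloned ai-voice-cloning, the list of audio extensions is my guess at what you might have lying around, and it copies instead of moving so your originals stay untouched.

```python
import shutil
from pathlib import Path

# Assumption: these are the audio formats in your dataset; adjust as needed.
AUDIO_SUFFIXES = {".wav", ".mp3", ".flac"}


def add_voice(repo_root, voice_name, source_dir):
    """Copy audio files from source_dir into <repo_root>/voices/<voice_name>."""
    voice_dir = Path(repo_root) / "voices" / voice_name
    voice_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(Path(source_dir).iterdir()):
        if src.is_file() and src.suffix.lower() in AUDIO_SUFFIXES:
            shutil.copy2(src, voice_dir / src.name)
            copied.append(src.name)
    return copied


# Example usage (both paths are placeholders):
# add_voice("ai-voice-cloning", "my_voice", "my_recordings")
```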

Now that we have our data ready, we can finally start training! In the Generate tab of the UI, click on Refresh voice list; switch to the Training tab and select your voice from the dropdown menu under Dataset source. Leave all settings as they are - if you want a deep-dive into what they mean and how to change them accordingly, check out the wiki I linked earlier - and click on Transcribe and process. Once that's done and we have preprocessed our data for TorToiSe, switch to the Generate Configuration tab. Here we can change the settings for the actual training process. Again, select your dataset in the dropdown (you may need to refresh the dataset list once more) and choose your epochs:

With 320 voice samples (generated from the preprocess method under /training/your_voice/audio), I got great results with 200 epochs.

Leave the settings as they are once again, except for the Save and Validation frequency. Save on some disk space by choosing something like 50, as that should be enough to revert to an earlier model if you suspect overtraining without saving way too many models on your drive.
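
To get a feeling for what a given save frequency means for your drive, here is a bit of arithmetic as code. I'm assuming the frequency is counted in epochs and that a model is written every time the epoch count hits a multiple of it; the function is purely illustrative and not part of the UI.

```python
def saved_epochs(total_epochs: int, save_every: int) -> list:
    """Epochs at which a model would be saved, assuming a save every `save_every` epochs."""
    if total_epochs < 1 or save_every < 1:
        raise ValueError("total_epochs and save_every must be positive")
    return list(range(save_every, total_epochs + 1, save_every))


if __name__ == "__main__":
    print(saved_epochs(200, 50))      # [50, 100, 150, 200]
    print(len(saved_epochs(200, 5)))  # 40
```

With 200 epochs, a frequency of 50 leaves four models to fall back on, while a frequency of 5 would leave forty.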

Click on Validate training configuration to smartly adjust the rest of the settings, and save the automatically edited training configuration with the aptly named Save training configuration.

Moving on to the Run training tab, refresh the config list and select the config we just saved. Now, everything is set up to finally, finally start the training process! Simply click on Train and watch the magic happen: On the right side, in the training metrics, we'll get to see the loss curves of the training process. Something nice to look at while waiting for it to finish (be patient, it might take a few hours depending on your GPU)!

TorToiSe: in action!

Just now, while writing this post, I realized the capitalization of TorToiSe is the way it is because of it being a TTS model. I'm not too quick on puns it seems.

Hitting Refresh model list in the Settings tab of the UI, we can now select our freshly trained model from the dropdown menu under Autoregressive Model. While we're there, uncheck Slimmer Computed Latents.

Switching to the Generate tab once more, we can at last generate some audio! Select your voice from the dropdown menu under Voice, enter some text as a prompt and brace yourself for now having your very own model that can generate audio from just text!

Of course, the default settings will not sound super convincing, so I'll post the ones I frequently use here, but feel free to play around and see what sticks for your case (more info? Head to the wiki!):

↝       Voice Chunks

This should automatically be set to 0 as we've trained our own model, but just in case, check that it is 0.

↝       Samples

The more samples, the better it sounds (significantly), but this also increases the time it takes to generate the audio (significantly). Start with 16 and go from there.

↝       Iterations

The more iterations, the better it sounds, but this also increases the time it takes to generate the audio. Start with 64 and go from there.

↝       Temperature

Increasing the temperature leads to more *interesting* results, be it for better or worse; it also highly depends on the trained voice. I still like it at 1.

Experimental settings:

↝       Length penalty

The model will try to generate shorter audio; in my case, maxing it out at 8 helped a lot.

↝       Repetition penalty

Setting this to low values will lead to great results such as "Welcome to myyyyaiiiihhhaiiiihhhiiiiaaaaaaaiiii". I also often max this out at 8.
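
For quick reference, here are the values from above collected in one place. This is just a plain Python dictionary for note-keeping; the key names are my own shorthand for the UI labels and not identifiers the WebUI understands.

```python
# Starting values from the sections above; key names are informal shorthand.
MY_SETTINGS = {
    "voice_chunks": 0,
    "samples": 16,            # starting point, raise for better quality
    "iterations": 64,         # starting point, raise for better quality
    "temperature": 1.0,
    "length_penalty": 8,      # experimental, maxed out
    "repetition_penalty": 8,  # experimental, maxed out
}

if __name__ == "__main__":
    for name, value in MY_SETTINGS.items():
        print(f"{name}: {value}")
```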

Big brain time

As you may notice, the audio generated from TorToiSe can sound alright, though it is often worse quality than what we previously accomplished with RVC 2. No wonder, as the model has less to work with and needs to come up with a convincing-sounding voice from just text instead of having a voice already as a base to be modified. The ones amongst you who may be familiar with being called "Part-time Sherlock Holmes" might deduce what comes next: Let's run the generated audio from TorToiSe through our RVC model trained on the same voice! If you've forgotten how (I don't blame you), you can read up on it again in Part 2 of this series →. No need to split or pre-process the audio this time, just use it as input for RVC 2 and let it do its magic. Oftentimes, the result should sound a lot more realistic and akin to the original voice!
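
Conceptually, this is nothing more than chaining two stages: text goes into the TTS model, and its audio goes into the voice conversion model. The sketch below only shows that shape. Both stages are passed in as plain functions, and the stand-ins in the usage part are dummies; neither TorToiSe nor RVC exposes functions with these names as far as this post is concerned.

```python
from typing import Callable


def speak_as(
    text: str,
    tts: Callable[[str], bytes],
    convert: Callable[[bytes], bytes],
) -> bytes:
    """Run text through a TTS stage, then pass its audio through a voice conversion stage."""
    raw_audio = tts(text)      # stage 1: text to (rough) speech
    return convert(raw_audio)  # stage 2: polish with the voice model


if __name__ == "__main__":
    # Dummy stand-ins so the sketch runs; replace them with however you call your tools.
    fake_tts = lambda text: text.encode("utf-8")
    fake_rvc = lambda audio: audio.upper()
    print(speak_as("hello there", fake_tts, fake_rvc))
```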

Moral complications & Disclaimer

As it cannot be stressed enough, please don't use this information for malicious purposes. I'm not responsible for any damage you may cause with this information. I'm just a guy who likes to play around with AI models, and I'm sharing my findings with you.

Now we've run through the whole process of cloning someone else's voice and being able to convincingly use it to read any text we enter. As exciting as this sounds and as much fun as it is to play around with, I'd like to take a moment to talk about the moral implications of this.
As with any technology, there are good and bad ways to use it. I'm not here to tell you what to do with this information, but I'd like to ask you to think about the consequences of your actions. I'm not going to go into detail about what could happen if this technology is used maliciously, but I'm sure you can imagine the possibilities. Be careful who you are talking to over the phone, what data you're giving out, and what you're saying in general. In this tutorial we've seen that with just a few minutes of training data that can come from anywhere, be it voice recordings, recording a call or a short video, we can create a model that can convincingly mimic someone. Don't trust everything you hear, and be mindful of other people's and your own privacy.

I'm planning to write more about the importance of privacy and an open internet in the future.
If you have any questions, feel free to send me an email. I'll try to answer as soon as possible.

Thank you for reading!
