How I Cloned My Wife’s Voice to Calm The Baby

This idea came up when my wife started planning a getaway with a friend, which would leave me alone in charge of the baby for an extended period. As any good engineer would, I started preparing my disaster recovery plan: snacks, songs, locations, and of course a voice clone of my wife to calm the baby.

In this tutorial we will create a cloned voice from a 60-second WAV sample of the target voice. Once the clone has been created, we can feed it text and get audio back. I tested the leading open-source models and will only cover the one with the best results in this tutorial.

The Build

First, let's install the requirements for the project:

pip install styletts2
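StyleTTS2 expects audio at 24 kHz, and a clean, roughly 60-second sample works well. If your recording is longer or at a different rate, here's a minimal prep sketch; the file names are placeholders, and soundfile comes along as a librosa dependency:

import librosa
import soundfile as sf

# Load the raw recording, resampling to StyleTTS2's expected 24 kHz
wave, sr = librosa.load('raw_recording.wav', sr=24000)

# Keep roughly the first 60 seconds (24,000 samples per second)
wave = wave[:60 * 24000]

# Strip leading/trailing silence so the clip is mostly speech
wave, _ = librosa.effects.trim(wave, top_db=30)

sf.write('wifes_voice.wav', wave, 24000)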

Next, let's generate the voice embedding that will be used to produce new audio in the same style. When using the StyleTTS2 Python package this happens seamlessly during inference, but let's look at how it's done under the hood (the mel-spectrogram settings and normalization constants are taken from the StyleTTS2 reference implementation):

import librosa
import torch
import torchaudio

# Mel-spectrogram transform and log-mel normalization constants from the
# StyleTTS2 reference implementation
to_mel = torchaudio.transforms.MelSpectrogram(
    n_mels=80, n_fft=2048, win_length=1200, hop_length=300)
mean, std = -4, 4

def preprocess(wave):
    wave_tensor = torch.from_numpy(wave).float()
    mel_tensor = to_mel(wave_tensor)
    mel_tensor = (torch.log(1e-5 + mel_tensor.unsqueeze(0)) - mean) / std
    return mel_tensor

def compute_style(self, path):
    """
    Compute the style vector, essentially an embedding that captures the
    characteristics of the target voice being cloned. This is a method on
    the StyleTTS2 model wrapper, which provides self.model and self.device.
    :param path: Path to target voice audio file
    :return: style vector
    """
    wave, sr = librosa.load(path, sr=24000)
    audio, index = librosa.effects.trim(wave, top_db=30)
    if sr != 24000:
        audio = librosa.resample(audio, orig_sr=sr, target_sr=24000)
    mel_tensor = preprocess(audio).to(self.device)

    # Run the trimmed audio through both encoders and concatenate the
    # acoustic style (ref_s) with the prosodic style (ref_p)
    with torch.no_grad():
        ref_s = self.model.style_encoder(mel_tensor.unsqueeze(1))
        ref_p = self.model.predictor_encoder(mel_tensor.unsqueeze(1))
    return torch.cat([ref_s, ref_p], dim=1)
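If your installed version of the package exposes compute_style on the model object (the method above suggests it does, but check your version's signature), you can compute the embedding once and cache it; a minimal sketch, where the shape comment is my expectation rather than a guarantee:

import torch
from styletts2 import tts

model = tts.StyleTTS2()

# Compute the style vector once and save it for reuse
style = model.compute_style('./wifes_voice.wav')
print(style.shape)  # expect something like torch.Size([1, 256])
torch.save(style, 'wifes_voice_style.pt')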

Now that we have our voice embedding, it can be used to generate new voice audio in that style.

from styletts2 import tts
tts = tts.StyleTTS2()

text = """Rock-a-bye baby on the treetops.
When the wind blows, the cradle will rock.
When the bough breaks, the cradle will fall.
And down will come baby, cradle and all.
"""

output = 'output-test.wav'
out = tts.long_inference(text,
                         alpha=0.3,
                         beta=0.6,
                         diffusion_steps=5,
                         output_wav_file=output,
                         target_voice_path='./wifes_voice.wav')

Listen to the output file and adjust the alpha and beta settings until the voice sounds right: lower values keep the result closer to the reference voice, while higher values lean more on the style the model predicts from the text. For me, the defaults produced a voice that was spot on for most phrases.
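If you'd rather compare settings systematically than tweak by ear one run at a time, you can sweep a few combinations using the same call as above:

# Generate one file per (alpha, beta) pair; lower values stay closer to the
# reference voice, higher values drift toward the text-predicted style
for alpha in (0.1, 0.3, 0.5):
    for beta in (0.3, 0.6, 0.9):
        tts.long_inference(text,
                           alpha=alpha,
                           beta=beta,
                           diffusion_steps=5,
                           output_wav_file=f'output-a{alpha}-b{beta}.wav',
                           target_voice_path='./wifes_voice.wav')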

Considerations

  • While StyleTTS2 is a great open-source model, paid products like ElevenLabs produce more precise clones
  • I also evaluated HierSpeech++ and Coqui.ai, but neither produced a better-sounding sample
  • Running this on a MacBook can take over 5 seconds even for short texts, so it's better to run on a GPU (see the timing sketch below)
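To see what you're actually getting on your hardware, wrap the inference call in a simple timer; this reuses the tts instance from above and leaves the remaining parameters at their defaults:

import time

start = time.perf_counter()
tts.long_inference('Goodnight, sweet baby.',
                   output_wav_file='timing-test.wav',
                   target_voice_path='./wifes_voice.wav')
print(f'Inference took {time.perf_counter() - start:.1f}s')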

StyleTTS2 — https://github.com/yl4579/StyleTTS2

Contact

Open for contract projects as a Project Leader or Individual Contributor. Let’s chat!

LinkedIn: https://www.linkedin.com/in/davidrichards5/
Email: david.richards.tech (@) gmail.com