Microsoft unveils AI that simulates a voice from just 3 seconds of audio

Microsoft has unveiled an AI tool called ‘VALL-E’ that can closely replicate a person’s voice and the speaker’s emotional tone. While most AI models that recreate human voices require a minute or more of recorded audio as input, VALL-E needs just a 3-second sample.

VALL-E is a “neural codec language model” that uses AI to produce text-to-speech audio. It builds on EnCodec, a neural audio codec model from Meta.

According to the company, once the artificial intelligence system has a recording of a person’s voice, it can make that person sound as if they are saying anything. It can even imitate the original speaker’s emotional tone and acoustic environment.

“Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot text to speech synthesis (TTS) system in terms of speech naturalness and speaker similarity,” said Microsoft.

Microsoft plans to continue developing the model to improve the accuracy and pronunciation of certain words.

The software is not available for public use. Microsoft has cited potential risks of misuse, such as spoofing voice identification or impersonating a specific speaker.
