Researchers at Microsoft have unveiled an impressive new text-to-speech AI model called Vall-E. It can mimic a voice, including its emotional tone and acoustics, and make it say whatever you want after listening to that voice for just a few seconds.

It’s the latest in a number of AI algorithms that can use recordings of a person’s voice to utter words and sentences the person has never spoken. It is also noteworthy for how small a fragment of speech it needs to approximate an entire human voice. For example, the Lyrebird algorithm from the University of Montreal in 2017 required a full minute of speech to analyze, whereas Vall-E needs only a 3-second audio snippet.

The AI has been trained on about 60,000 hours of English speech, presumably mostly from audiobook narrators. The researchers have presented a collection of samples in which Vall-E attempts to imitate various human voices. Some do a very good job of capturing the essence of the voice and creating new sentences that sound natural. You’ll have a hard time telling which ones are real and which ones are synthesized. Others give themselves away only when the AI puts the emphasis in odd places in the text.

Vall-E is particularly good at reproducing the audio environment of the original sample. If the sample sounds like it was recorded on a phone, so does the synthesis. Accents are pretty good too – at least American, British, and some European accents.

Emotionally, the results are less impressive. With speech samples labeled angry, sleepy, amused, or disgusted, things seem to go off track, and the synthesized speech sounds strangely distorted.

The implications of this kind of technology are very clear. On the positive side, at some point you could have Morgan Freeman narrate your shopping list while you push a trolley down the supermarket aisle. An actor’s unfinished performance could be completed using deepfake video and audio. Apple recently introduced a catalog of audiobooks read aloud by AI. It should come as no surprise if you’ll soon be able to switch narrators on the fly.

On the downside, this is not good news for voice actors or narrators, or indeed for listeners. AI could potentially create narration quickly and very cheaply, but don’t expect a lot of art from it. It won’t interpret Douglas Adams the way Stephen Fry does.

The technology is also very likely to attract scammers. If an impostor can keep you on the phone for 3 seconds, they can steal your voice and call your grandma with it. Or they can bypass the voice recognition security on your devices. This is exactly what the Terminator robot needs to make a phone call.

And of course, everyone is still waiting for the moment when the first deepfake speech by a politician fools the public, undermining the very notion of trusting your own eyes and ears.

The Microsoft Vall-E team has added a short ethics statement at the end of the demo page. It says that if the model is generalized to unseen speakers, the relevant components should be accompanied by speech editing models.

The rise of DALL-E, ChatGPT, various deepfake algorithms, and countless other artificial intelligences seems to have reached an inflection point in recent months, and the technology feels like it’s beginning to leap out of the lab and into the real world. Like all change, it brings opportunities and risks. We live in really interesting times.

Check out all the audio samples on the Vall-E demo page.


