An AI-generated image of a silhouette of a person.

Ars Technica

On Thursday, Microsoft researchers unveiled a new text-to-speech AI model called VALL-E that can closely simulate a person’s voice when given a 3-second audio sample. Once it has learned a particular voice, VALL-E can synthesize audio of that person saying anything, and it attempts to preserve the speaker’s emotional tone while doing so.

Its creators speculate that VALL-E could be used for high-quality text-to-speech applications, for speech editing in which a recording of a person is edited and changed from a text transcript (making them say things they didn’t originally say), and for creating audio content when combined with other generative AI models like GPT-3.

Microsoft calls VALL-E a “neural codec language model,” and it’s built on a technology called EnCodec that Meta announced in October 2022. Unlike other text-to-speech methods that typically synthesize speech by manipulating waveforms, VALL-E generates discrete audio codec codes from text and acoustic prompts. It basically analyzes how a person sounds, splits that information into discrete components (called “tokens”) thanks to EnCodec, and uses training data to match what it “knows” about how that voice would sound if it spoke phrases other than those in the 3-second sample. Or, as Microsoft puts it in the VALL-E paper:

To synthesize personalized speech (e.g., zero-shot TTS), VALL-E generates the corresponding acoustic tokens conditioned on the acoustic tokens of the 3-second enrolled recording and the phoneme prompt, which constrain the speaker and content information respectively. Finally, the generated acoustic tokens are used to synthesize the final waveform with the corresponding neural codec decoder.
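
The flow in that passage can be sketched in a few lines of Python. This is a toy illustration only, not Microsoft's code: every function name, the 1,024-entry token vocabulary, the prompt length, and the stand-in "model" that picks tokens pseudo-randomly are assumptions made for the example. It shows only the shape of the process: phoneme tokens and acoustic prompt tokens go in, new acoustic tokens come out, and a decoder turns those tokens into a waveform.

```python
import random

VOCAB_SIZE = 1024  # assumed token vocabulary size, for illustration only


def text_to_phonemes(text):
    """Hypothetical stand-in for a phonemizer: one integer id per letter."""
    return [ord(ch) % 64 for ch in text.lower() if ch.isalpha()]


def toy_language_model(phonemes, prompt_tokens, n_new, seed=0):
    """Hypothetical stand-in for the neural codec language model.

    A real model would predict each new acoustic token from the phonemes,
    the prompt tokens, and the tokens generated so far. Here the inputs
    only seed a random generator, so the output depends on them but
    carries no real acoustic meaning.
    """
    rng = random.Random(f"{phonemes}|{prompt_tokens}|{seed}")
    return [rng.randrange(VOCAB_SIZE) for _ in range(n_new)]


def toy_decoder(tokens):
    """Hypothetical stand-in for the codec decoder: token -> sample in [-1, 1]."""
    return [2 * t / (VOCAB_SIZE - 1) - 1 for t in tokens]


def synthesize(text, prompt_tokens, seed=0):
    phonemes = text_to_phonemes(text)   # constrains the content
    n_new = 4 * len(phonemes)           # arbitrary length rule for this toy
    tokens = toy_language_model(phonemes, prompt_tokens, n_new, seed)
    return toy_decoder(tokens)          # tokens -> "waveform"


# Pretend these are the acoustic tokens of the 3-second enrolled recording
# (they constrain the speaker); the count of 225 is arbitrary.
prompt_rng = random.Random(42)
prompt_tokens = [prompt_rng.randrange(VOCAB_SIZE) for _ in range(225)]

waveform = synthesize("hello world", prompt_tokens)
print(len(waveform))  # 40: ten letters, four toy samples per phoneme
```

The point of the sketch is the separation of roles: the text supplies what is said, the short prompt supplies who is saying it, and the decoder is a separate final step.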

Microsoft trained VALL-E’s speech synthesis capabilities on an audio library called LibriLight, assembled by Meta. It contains 60,000 hours of English-language speech from more than 7,000 speakers, most of it pulled from LibriVox public domain audiobooks. For VALL-E to produce a good result, the voice in the 3-second sample must closely match a voice in the training data.

Microsoft provides many audio samples of the AI model in action on the VALL-E samples website. Among the samples, the “Speaker Prompt” is the 3 seconds of audio that VALL-E must imitate. The “Ground Truth” is an existing recording of the same speaker saying a particular phrase for comparison purposes (kind of like the “control” in the experiment). The “Baseline” is an example of synthesis from a conventional text-to-speech method, and the “VALL-E” sample is the output from the VALL-E model.

Block diagram of VALL-E provided by Microsoft researchers.

Microsoft

To generate these results, the researchers fed VALL-E only the 3-second “Speaker Prompt” sample and a text string (what they wanted the voice to say). Compare the “Ground Truth” sample with the “VALL-E” sample: in some cases, the two are very close. While some VALL-E results sound computer-generated, others could be mistaken for human speech, which is the goal of the model.

In addition to preserving the speaker’s vocal timbre and emotional tone, VALL-E can also mimic the “acoustic environment” of the sample audio. For example, if the sample came from a phone call, the synthesized output simulates the acoustic and frequency characteristics of a phone call (that’s a fancy way of saying it sounds like a phone call, too). Microsoft’s samples (in the “Synthesis of Diversity” section) also show that VALL-E can generate variations in vocal tone by changing the random seed used in the generation process.
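
The effect of a random seed can be shown with a small, generic Python example. This is not VALL-E's sampling code: the `sample_tokens` function and the uniform distribution over 1,024 token values are assumptions standing in for whatever probabilities a real model would compute. The sketch only shows why the same inputs can yield different takes: the same seed reproduces a sequence exactly, while a different seed draws a different one.

```python
import random

# Hypothetical token vocabulary; a real model would assign each token
# its own probability instead of treating them all as equally likely.
TOKENS = list(range(1024))


def sample_tokens(seed, length=16):
    """Draw a token sequence from the same distribution using the given seed."""
    rng = random.Random(seed)
    return [rng.choice(TOKENS) for _ in range(length)]


take_a = sample_tokens(seed=1)
take_b = sample_tokens(seed=2)
repeat_a = sample_tokens(seed=1)

print(take_a == repeat_a)  # True: the same seed reproduces the same sequence
print(take_a == take_b)    # a different seed gives a different sequence
```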

Perhaps because VALL-E could fuel mischief and deception, Microsoft has not provided the VALL-E code for others to experiment with, so we were unable to test its capabilities. The researchers seem to be aware of the potential social harm this technology could bring. In the conclusion of their paper, they write:

“Since VALL-E could synthesize speech that maintains speaker identity, it may carry potential risks in misuse of the model, such as spoofing voice identification or impersonating a specific speaker. To mitigate such risks, it is possible to build a detection model to discriminate whether an audio clip was synthesized by VALL-E. We will also put Microsoft AI Principles into practice when further developing the models.”

Source: Microsoft’s new AI can simulate anyone’s voice with 3 seconds of audio
