
On the website Infinite Conversation, German filmmaker Werner Herzog and Slovenian philosopher Slavoj Žižek are having an open-ended chat about anything and everything. Their discussion is compelling, in part because these intellectuals have distinctive accents when speaking English, not to mention a tendency toward eccentric word choices. But they have something else in common: both voices are deepfakes, and the text they speak in those distinctive accents is generated by artificial intelligence.
I created this conversation as a warning. Improvements in machine learning have made deepfakes (images, videos, or sounds that are incredibly realistic but fake) too easy to create and too good in quality. At the same time, language-generating AI can produce large amounts of text quickly and cheaply. Combined, these technologies can do more than stage an endless conversation. They have the capacity to drown us in a sea of disinformation.
Machine learning, an AI technique that uses large amounts of data to “train” algorithms to improve iteratively at specific tasks, is undergoing a period of rapid growth. This is pushing the whole field of information technology to new levels, including speech synthesis: systems that produce speech humans can understand. As someone interested in the boundary space between humans and machines, I’ve always found it a fascinating application. So when, after a long history of small incremental improvements, advances in machine learning greatly improved speech synthesis and voice cloning over the last few years, I took notice.
Infinite Conversation started when I came across a text-to-speech program called Coqui TTS. Many projects in the digital domain start with the discovery of a previously unknown software library or open-source program. When I found this toolkit, with its active user community and extensive documentation, I saw that it had all the pieces I needed to clone a famous voice.
As an admirer of Werner Herzog’s work, persona and worldview, I have always been drawn to his voice and the way he speaks. I’m not alone: pop culture has turned Herzog into a literal cartoon. His cameos and collaborations include The Simpsons, Rick and Morty and Penguins of Madagascar. So when it came to tinkering with someone’s voice, there was no better option, especially since I knew I would have to listen to that voice for hours on end. It’s almost impossible to get tired of his dry delivery and heavy German accent, which convey a solemnity that cannot be ignored.
Building the training set for cloning Herzog’s voice was the easiest part of the process. Between his interviews, narrations, and audiobook work, there are literally hundreds of hours of audio that can be collected for training a machine learning model (or, in my case, fine-tuning an existing one). The output of a machine learning algorithm typically improves over “epochs”, cycles in which the neural network is trained on all of the training data. The algorithm can then produce sample results at the end of each epoch, giving the researcher material to review in order to assess the program’s progress. With the synthetic voice of Werner Herzog, listening to the model improve with each epoch felt like witnessing a metaphorical birth, his voice slowly coming to life in the digital realm.
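For readers who want to see what an epoch looks like in practice, here is a toy sketch in Python. It is not the voice model: it fits a single number `w` so that `w * x` approximates `y`, and the function name, data, and learning rate are all assumptions chosen for illustration. What it shares with the real thing is the shape of the loop: each epoch is one full pass over the training data, followed by a checkpoint a person can review.

```python
def train(data, epochs=5, lr=0.05):
    """Fit y = w * x by gradient descent, recording progress after each epoch."""
    w = 0.0
    history = []
    for epoch in range(epochs):
        # One epoch: a single full pass over every training example.
        for x, y in data:
            grad = 2 * (w * x - y) * x  # derivative of the squared error w.r.t. w
            w -= lr * grad
        # End-of-epoch "sample": here a loss value; for a voice model, an audio clip.
        loss = sum((w * x - y) ** 2 for x, y in data) / len(data)
        history.append((epoch + 1, w, loss))
    return history


# Toy data where the right answer is w = 2.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
history = train(data)

for epoch, w, loss in history:
    print(f"epoch {epoch}: w={w:.4f} loss={loss:.6f}")
```

In this toy case the loss shrinks with every epoch, which is the numerical counterpart of hearing a cloned voice get a little clearer each time.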
Once I was happy with Herzog’s voice, I moved on to the second voice and intuitively chose Slavoj Žižek. Like Herzog, Žižek has an interesting and quirky accent, a presence within the intellectual realm, and a connection to the world of cinema. He has also achieved some popular stardom thanks to his enthusiasm and sometimes controversial ideas.
At this point, I still had no idea what the final form of my project would be, but I was surprised by how easy and smooth the whole voice cloning process was. I took it as a warning to anyone who will pay attention: deepfakes have gotten too good and too easy to make. Just this month, Microsoft announced a new text-to-speech tool called VALL-E. Its researchers claim it can mimic any voice based on just three seconds of recorded speech. We are facing a crisis of trust, and we are completely unprepared for it.
To highlight the technology’s ability to generate massive amounts of disinformation, I settled on the idea of an endless conversation. All I needed was a large language model fine-tuned on text written by each of the two participants, and a simple program that controlled the back-and-forth so that the flow of the conversation felt natural and believable.
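A controller of this kind can be very small. The sketch below is a hypothetical reconstruction in Python, not the code behind the site: the `Speaker` type, the `converse` function, and the stub speakers are assumptions, and the stubs merely stand in for fine-tuned language models. Each speaker receives the transcript so far and returns the next line; the loop alternates between them without ever stopping.

```python
from itertools import cycle, islice
from typing import Callable, Iterator

# Hypothetical interface: a speaker sees the transcript so far and returns a line.
Speaker = Callable[[list[str]], str]


def converse(speakers: dict[str, Speaker], opening: str) -> Iterator[tuple[str, str]]:
    """Yield (name, line) pairs forever, alternating between the speakers."""
    transcript = [opening]
    for name in cycle(speakers):  # cycles through the dict's keys in order
        line = speakers[name](transcript)
        transcript.append(line)
        yield name, line


def make_stub(name: str) -> Speaker:
    # Stand-in for a fine-tuned language model (an assumption for this sketch).
    def reply(transcript: list[str]) -> str:
        return f"{name} turn {len(transcript)}"
    return reply


speakers = {"Herzog": make_stub("Herzog"), "Žižek": make_stub("Žižek")}

# The generator is infinite, so take only the first four turns.
first_four = list(islice(converse(speakers, "What is cinema?"), 4))
for name, line in first_four:
    print(f"{name}: {line}")
```

In a full pipeline, each line would presumably then be handed to the matching cloned voice for synthesis; that step is left out here.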
At its core, a language model predicts the next word in a sequence given the words that came before it. By fine-tuning a language model, we can reproduce the style and concepts that a particular person is likely to express, provided we have ample transcripts of that person speaking. I decided to use one of the major commercial language models available. That’s when I realized that it was already possible to generate fake dialogue, including synthetic speech, in less time than it takes to listen to it. This gave me a fitting name for the project: Infinite Conversation. After months of work, I published it online last October. Infinite Conversation will also be on display at the Misalignment Museum art installation in San Francisco beginning February 11th.
Once all the pieces fit together, I was pleasantly surprised by something I hadn’t thought of when I started the project. Like their real-life personas, my chatbot versions of Herzog and Žižek frequently converse on topics of philosophy and aesthetics. The esoteric nature of these topics allows listeners to temporarily ignore the occasional nonsense that the model produces. AI Žižek’s view of Alfred Hitchcock, for example, alternates between seeing the famous director as a genius and as a cynical manipulator. Another contradiction: while the real Herzog famously dislikes chickens, his AI imitator sometimes speaks compassionately about them. Because actual postmodern philosophy can itself seem muddled, as Žižek himself has pointed out, the lack of clarity in Infinite Conversation can be interpreted as deep ambiguity rather than impossible contradiction.
I believe this has contributed to the overall success of the project. Hundreds of Infinite Conversation visitors have listened for over an hour, and sometimes longer. As I state on the website, I hope that visitors to Infinite Conversation will not take what the chatbots say too seriously, and that they will become aware of the technology and its consequences. If this AI-generated chatter seems plausible, imagine realistic-sounding speech used to tarnish a politician’s reputation, defraud a business leader, or distract people with false information that sounds like human-reported news.
But there is also a bright side. A growing number of Infinite Conversation visitors report using the soothing voices of Werner Herzog and Slavoj Žižek as a form of white noise to help them fall asleep. That is a use of this new technology that I can get behind.
This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.