Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.
Real-time text-to-speech tool using artificial intelligence
Launched in early 2020, 15.ai began as a proof of concept of the democratization of voice acting and dubbing using technology. Its gratis and non-commercial nature (with the only stipulation being that the project be properly credited when used), its ease of use, its lack of any user account registration requirement, and its substantial improvements over existing text-to-speech implementations have been lauded by users; however, some critics and voice actors have questioned the legality and ethics of making such technology publicly available and readily accessible.
Several commercial alternatives emerged as 15.ai's popularity rose, leading to cases of misattribution and theft. In January 2022, it was discovered that Voiceverse NFT, a company that voice actor Troy Baker had announced a partnership with, had plagiarized 15.ai's work as part of their platform.
In September 2022, a year after its last stable release, 15.ai was temporarily taken down in preparation for a future update. As of October 2024, the website is still offline, with 15's most recent post being dated February 2023.
Features
HAL 9000, known for his sinister robotic voice, is one of the available characters on 15.ai.
The deep learning model used by the application is nondeterministic: each time speech is generated from the same string of text, the intonation of the speech will be slightly different. The application also supports manually altering the emotion of a generated line using emotional contextualizers (a term coined by the project): a sentence or phrase that conveys the emotion of the take, which serves as a guide for the model during inference.
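As an illustration of how this kind of nondeterminism can arise, the following sketch (in PyTorch) keeps dropout active at inference time, a technique used in Tacotron-style synthesizers so that repeated runs on the same input vary slightly in prosody. 15.ai's actual implementation is unpublished; the module names and dimensions here are assumptions.

    # Illustrative sketch only: prosodic variation via inference-time dropout,
    # as in Tacotron-style synthesizers. Not 15.ai's published code.
    import torch
    import torch.nn as nn

    class PrenetWithDropout(nn.Module):
        """Tacotron-style prenet whose dropout stays on even in eval mode."""
        def __init__(self, in_dim=80, hidden=256, p=0.5):
            super().__init__()
            self.fc1 = nn.Linear(in_dim, hidden)
            self.fc2 = nn.Linear(hidden, hidden)
            self.p = p

        def forward(self, x):
            # training=True forces stochastic dropout at inference, so two
            # runs on the same input produce slightly different outputs.
            x = nn.functional.dropout(torch.relu(self.fc1(x)), self.p, training=True)
            x = nn.functional.dropout(torch.relu(self.fc2(x)), self.p, training=True)
            return x

    prenet = PrenetWithDropout().eval()
    frame = torch.randn(1, 80)
    out_a, out_b = prenet(frame), prenet(frame)
    print(torch.allclose(out_a, out_b))  # False: same input, different output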
Emotional contextualizers are representations of the emotional content of a sentence deduced via transfer-learned emoji embeddings using DeepMoji, a deep neural network sentiment analysis algorithm developed by the MIT Media Lab in 2017. DeepMoji was trained on 1.2 billion emoji occurrences in Twitter data from 2013 to 2017, and has been found to outperform human subjects in correctly identifying sarcasm in tweets and other online modes of communication.
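A rough sketch of how an emotional contextualizer could be reduced to a fixed-size emotion vector and handed to a synthesizer. Here deepmoji_encode is a hypothetical stand-in for a real DeepMoji-style encoder (such as the open-source torchMoji port), not 15.ai's actual interface.

    # Hypothetical sketch of conditioning TTS on a DeepMoji-style emotion
    # embedding. `deepmoji_encode` is a placeholder, not a real API.
    import torch

    def deepmoji_encode(sentence: str) -> torch.Tensor:
        """Placeholder: a real implementation would return the penultimate
        layer of a DeepMoji-style network as a fixed-size emotion vector."""
        torch.manual_seed(hash(sentence) % (2**31))  # deterministic stand-in
        return torch.randn(64)

    text = "I can't believe you did that."
    contextualizer = "He shouted furiously."  # conveys the intended emotion

    emotion = deepmoji_encode(contextualizer)
    # A TTS model could concatenate this vector onto its text encoding so the
    # same words are rendered with the contextualizer's emotional coloring.
    print(emotion.shape)  # torch.Size([64])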
15.ai uses a multi-speaker model—hundreds of voices are trained concurrently rather than sequentially, decreasing the required training time and enabling the model to learn and generalize shared emotional context, even for voices with no exposure to such emotional context. Consequently, the entire lineup of characters in the application is powered by a single trained model, as opposed to multiple single-speaker models trained on different datasets. The lexicon used by 15.ai has been scraped from a variety of Internet sources, including Oxford Dictionaries, Wiktionary, the CMU Pronouncing Dictionary, 4chan, Reddit, and Twitter. Pronunciations of unfamiliar words are automatically deduced using phonological rules learned by the deep learning model.
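A minimal sketch of the general multi-speaker conditioning technique described above, in which a single model serves many voices by adding a learned per-speaker embedding to the text encoding. The class names and dimensions are illustrative assumptions, not 15.ai's architecture.

    # Generic multi-speaker conditioning sketch (PyTorch). One model serves
    # hundreds of voices via a learned per-speaker embedding.
    import torch
    import torch.nn as nn

    class MultiSpeakerEncoder(nn.Module):
        def __init__(self, vocab=100, d_model=256, num_speakers=300):
            super().__init__()
            self.text_emb = nn.Embedding(vocab, d_model)
            self.speaker_emb = nn.Embedding(num_speakers, d_model)

        def forward(self, phoneme_ids, speaker_id):
            # Broadcast the speaker vector across every phoneme position, so
            # shared layers can learn emotional context common to all voices.
            text = self.text_emb(phoneme_ids)         # (T, d_model)
            speaker = self.speaker_emb(speaker_id)    # (d_model,)
            return text + speaker                     # conditioned encoding

    enc = MultiSpeakerEncoder()
    ids = torch.tensor([3, 14, 15, 9, 2])             # toy phoneme ids
    print(enc(ids, torch.tensor(42)).shape)           # torch.Size([5, 256])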
The application supports a simplified version of a set of English phonetic transcriptions known as ARPABET to correct mispronunciations or to account for heteronyms, words that are spelled the same but are pronounced differently (such as the word read, which can be pronounced as either /ˈrɛd/ or /ˈriːd/ depending on its tense). While the original ARPABET codes, developed in the 1970s by the Advanced Research Projects Agency, supported 50 unique symbols to designate and differentiate between English phonemes, the CMU Pronouncing Dictionary's ARPABET convention (the set of transcription codes followed by 15.ai) reduces the symbol set to 39 phonemes by combining allophonic phonetic realizations into a single standard (e.g. AXR/ER; UX/UW) and by using multiple common symbols together to replace syllabic consonants (e.g. EN/AH0 N). ARPABET strings can be invoked in the application by wrapping the string of phonemes in curly braces within the input box (e.g. {AA1 R P AH0 B EH2 T} to denote /ˈɑːrpəˌbɛt/, the pronunciation of the word ARPABET).
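A small sketch of how curly-brace ARPABET spans might be separated from ordinary words when parsing the input box's contents. The tokenization details are assumptions rather than 15.ai's actual code.

    # Sketch of parsing curly-brace ARPABET spans out of an input line,
    # in the style the article describes. Details are assumed.
    import re

    ARPA_SPAN = re.compile(r"\{([^}]*)\}")

    def tokenize(line: str):
        """Split input into ('word', ...) and ('phonemes', [...]) chunks."""
        out, pos = [], 0
        for m in ARPA_SPAN.finditer(line):
            out += [("word", w) for w in line[pos:m.start()].split()]
            out.append(("phonemes", m.group(1).split()))  # e.g. ['R', 'EH1', 'D']
            pos = m.end()
        out += [("word", w) for w in line[pos:].split()]
        return out

    print(tokenize("I {R EH1 D} the book yesterday."))
    # [('word', 'I'), ('phonemes', ['R', 'EH1', 'D']), ('word', 'the'), ...]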
The following is a table of the 39 phonemes used by 15.ai and the CMU Pronouncing Dictionary, each with an example word:

Vowels: AA (odd), AE (at), AH (hut), AO (ought), AW (cow), AY (hide), EH (Ed), ER (hurt), EY (ate), IH (it), IY (eat), OW (oat), OY (toy), UH (hood), UW (two)
Consonants: B (be), CH (cheese), D (dee), DH (thee), F (fee), G (green), HH (he), JH (gee), K (key), L (lee), M (me), N (knee), NG (ping), P (pee), R (read), S (sea), SH (she), T (tea), TH (theta), V (vee), W (we), Y (yield), Z (zee), ZH (seizure)
Background
In 2016, with the proposal of DeepMind's WaveNet, deep-learning-based models for speech synthesis began to gain popularity as a method of modeling waveforms and generating human-like speech. Tacotron 2, a neural network architecture for speech synthesis developed by Google AI, was published in 2018 and required tens of hours of audio data to produce intelligible speech: trained on 2 hours of speech, the model produced intelligible output of mediocre quality, and trained on 36 minutes of speech, it was unable to produce intelligible speech at all.
For years, reducing the amount of data required to train a realistic, high-quality text-to-speech model has been a primary goal of researchers in the field of deep learning speech synthesis. The developer of 15.ai claims that as little as 15 seconds of data is sufficient to clone a voice up to human standards, a dramatic reduction from prior requirements.
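One published approach to cloning a voice from very little data is few-shot speaker adaptation, in which a pretrained multi-speaker model is frozen and only a new speaker embedding is optimized on the small dataset (see, e.g., Chen et al., "Sample Efficient Adaptive Text-to-Speech", 2019). 15.ai's actual method is unpublished; the sketch below illustrates the general idea with stand-in modules.

    # Hedged sketch of few-shot speaker adaptation. The "TTS stack" here is
    # a stand-in Linear layer; a real system would be a full synthesizer.
    import torch
    import torch.nn as nn

    d_model = 256
    pretrained_tts = nn.Linear(d_model, 80)   # stand-in for a frozen TTS stack
    for p in pretrained_tts.parameters():
        p.requires_grad = False               # reuse everything learned already

    new_speaker = nn.Parameter(torch.zeros(d_model))  # only thing we train
    opt = torch.optim.Adam([new_speaker], lr=1e-3)

    # ~15 seconds of audio might yield only a handful of (text, mel) pairs.
    tiny_dataset = [(torch.randn(d_model), torch.randn(80)) for _ in range(8)]

    for step in range(100):
        for text_enc, mel_target in tiny_dataset:
            mel_pred = pretrained_tts(text_enc + new_speaker)
            loss = nn.functional.mse_loss(mel_pred, mel_target)
            opt.zero_grad()
            loss.backward()
            opt.step()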
A landmark 2013 ruling in a case between Google and the Authors Guild held that Google Books, a service that searches the full text of printed copyrighted books, was transformative, thus meeting all requirements for fair use. This case set an important legal precedent for the field of deep learning and artificial intelligence: using copyrighted material to train a discriminative model or a non-commercial generative model was deemed legal. The legality of commercial generative models trained using copyrighted material is still under debate; due to the black-box nature of machine learning models, any allegations of copyright infringement via direct competition would be difficult to prove.
The Pony Preservation Project from 4chan's /mlp/ board has been integral to the development of 15.ai.
The developer has also worked closely with the Pony Preservation Project from /mlp/, the My Little Pony board of 4chan. The Pony Preservation Project, which began in 2019, is a "collaborative effort by /mlp/ to build and curate pony datasets" with the aim of creating applications in artificial intelligence. The Friendship Is Magic voices on 15.ai were trained on a large dataset crowdsourced by the Pony Preservation Project: audio and dialogue from the show and related media (including all nine seasons of Friendship Is Magic, the 2017 movie, spinoffs, leaks, and various other content voiced by the same voice actors) were parsed, hand-transcribed, and processed to remove background noise.
Reception
15.ai has been met with largely positive reception. Liana Ruppert of Game Informer described 15.ai as "simplistically brilliant." Lauren Morton of Rock, Paper, Shotgun and Natalie Clayton of PC Gamer called it "fascinating," and José Villalobos of LaPS4 wrote that it "works as easy as it looks." Users praised the ability to easily create audio of popular characters that sounds believable to those unaware that the voices had been synthesized by artificial intelligence: Zack Zwiezen of Kotaku reported that "[my] girlfriend was convinced it was a new voice line from GLaDOS' voice actor, Ellen McLain," while Rionaldi Chandraseta of Towards Data Science wrote that, upon watching a YouTube video featuring popular character voices generated by 15.ai, "[my] first thought was the video creator used cameo.com to pay for new dialogues from the original voice actors," and stated that "the quality of voices done by 15.ai is miles ahead of [Cameo]."
The tool was also well received overseas, especially in Japan. Takayuki Furushima of Den Fami Nico Gamer described 15.ai as "like magic," and Yuki Kurosawa of Automaton Media called it "revolutionary."
The My Little Pony: Friendship Is Magic fandom has seen a resurgence in video and musical content creation as a direct result, inspiring a new genre of fan-created content assisted by artificial intelligence. Some fanfictions have been adapted into fully voiced "episodes": The Tax Breaks is a 17-minute-long animated rendition of a fan-written story published in 2014 that uses voices generated with 15.ai alongside sound effects and audio editing, emulating the episodic style of the early seasons of Friendship Is Magic.
Viral videos from the Team Fortress 2 fandom that feature voices from 15.ai include Spy is a Furry (which has gained over 3 million combined views on YouTube across multiple videos) and The RED Bread Bank, both of which have inspired Source Filmmaker animated video renditions. Other fandoms have used voices from 15.ai to produce viral videos. As of July 2022, the viral video Among Us Struggles (which uses voices from Friendship Is Magic) has over 5.5 million views on YouTube. YouTubers, TikTokers, and Twitch streamers have also used 15.ai for their videos, such as FitMC's video on the history of 2b2t (one of the oldest running Minecraft servers) and datpon3's TikTok video featuring the main characters of Friendship Is Magic, which have 1.4 million and 510 thousand views, respectively.
Some users have created AI virtual assistants using 15.ai and external voice control software. One user on Twitter created a personal desktop assistant inspired by GLaDOS using 15.ai-generated dialogue in tandem with voice control system VoiceAttack, with the program being able to boot up applications, utter corresponding random dialogues, and thank the user in response to actions.
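A sketch of the glue logic such an assistant needs: mapping recognized commands to application launches and randomly chosen, pre-generated voice clips. The Twitter user's actual setup relied on VoiceAttack (a GUI tool rather than a library); the file names, commands, and player invocation below are hypothetical.

    # Illustrative glue code for a GLaDOS-style desktop assistant using
    # clips pre-generated with 15.ai. All paths/commands are hypothetical.
    import random
    import subprocess

    CLIPS = {  # locally saved clips, generated ahead of time (hypothetical)
        "launch": ["glados_opening_1.wav", "glados_opening_2.wav"],
        "thanks": ["glados_thanks_1.wav"],
    }

    def play(category: str) -> None:
        """Play a random clip so repeated commands don't sound canned."""
        clip = random.choice(CLIPS[category])
        # 'aplay' on Linux; Windows users might use winsound.PlaySound.
        subprocess.run(["aplay", clip], check=False)

    def on_command(command: str) -> None:
        if command == "open browser":
            subprocess.Popen(["firefox"])  # boot up an application...
            play("launch")                 # ...and utter a matching line
        elif command == "thank you":
            play("thanks")                 # thank the user in response

    on_command("open browser")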
Voiceverse NFT controversy
I’m partnering with @VoiceverseNFT to explore ways where together we might bring new tools to new creators to make new things, and allow everyone a chance to own & invest in the IP’s they create.
We all have a story to tell.
You can hate.
Or you can create.
What'll it be?
January 14, 2022
In December 2021, the developer of 15.ai posted on Twitter that they had no interest in incorporating non-fungible tokens (NFTs) into their work.
On January 14, 2022, it was discovered that Voiceverse NFT, a company with which video game and anime dub voice actor Troy Baker had announced a partnership, had plagiarized voice lines generated from 15.ai as part of their marketing campaign. Log files showed that Voiceverse had generated audio of Twilight Sparkle and Rainbow Dash from the show My Little Pony: Friendship Is Magic using 15.ai, pitched them up to make them sound unrecognizable from the original voices, and appropriated them without proper credit to falsely market their own platform, a violation of 15.ai's terms of service.
15 @fifteenai
I've been informed that the aforementioned NFT vocal synthesis company is actively attempting to appropriate my work for their own benefit.
After digging through the log files, I have evidence that some of the voices that they are taking credit for were indeed generated from my own site.
January 14, 2022
Voiceverse Origins @VoiceverseNFT
Hey @fifteenai we are extremely sorry about this. The voice was indeed taken from your platform, which our marketing team used without giving proper credit. Chubbiverse team has no knowledge of this. We will make sure this never happens again.
January 14, 2022
15 @fifteenai
Go fuck yourself.
January 14, 2022
A week before the partnership with Baker was announced, Voiceverse made a (now-deleted) Twitter post directly responding to a (now-deleted) video posted by Chubbiverse, an NFT platform with which Voiceverse had partnered, showcasing an AI-generated voice; Voiceverse claimed the voice was generated using its platform, remarking "I wonder who created the voice for this? ;)". A few hours after news of the partnership broke, the developer of 15.ai, alerted by another Twitter user asking for his opinion on the partnership (which he speculated "sounds like a scam"), posted screenshots of log files proving that a user of the website (IP address redacted) had submitted inputs of the exact words spoken by the AI voice in Chubbiverse's video, and responded to Voiceverse's claim directly, tweeting "Certainly not you :)".
Following the tweet, Voiceverse admitted to passing off voices plagiarized from 15.ai as its own platform's, claiming that its marketing team had used the project without giving proper credit and that the "Chubbiverse team has no knowledge of this." In response to the admission, 15 tweeted "Go fuck yourself." The final tweet went viral, accruing over 75,000 total likes and 13,000 total retweets across multiple reposts.
The partnership between Baker and Voiceverse was met with severe backlash and universally negative reception. Critics highlighted the environmental impact of NFT sales and their potential for exit scams. Commentators also pointed out the irony of Baker's announcement tweet, which ended with "You can hate. Or you can create. What'll it be?", coming only hours before the public revelation that the company in question had resorted to theft instead of creating its own product. Baker responded that he appreciated people sharing their thoughts and that their responses were "giving a lot to think about." He also acknowledged that the "hate/create" part of his announcement might have been "a bit antagonistic," and asked fans on social media to forgive him. Two weeks later, on January 31, Baker announced that he would discontinue the partnership with Voiceverse.
The phrase "high-fidelity" in TTS research is often used to describe vocoders that are able to reconstruct waveforms with very little distortion, and is not simply synonymous with "high quality." See the papers for HiFi-GAN, GAN-TTS, and parallel WaveNet for unbiased examples of this usage of terminology.
Translated from the original Spanish quote: "La dirección es 15.AI y funciona tan fácil como parece." ("The address is 15.AI and it works as easy as it looks.")
References
Notes
Kong, Jungil (2020). "HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis". arXiv:2010.05646v2.
Bińkowski, Mikołaj (2019). "High Fidelity Speech Synthesis with Adversarial Networks". arXiv:1909.11646v2.
Felbo, Bjarke (2017). "Using millions of emoji occurrences to learn any-domain representations for detecting sentiment, emotion and sarcasm". Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. pp. 1615–1625. arXiv:1708.00524. doi:10.18653/v1/D17-1169. S2CID 2493033.
Valle, Rafael (2020). "Mellotron: Multispeaker expressive voice synthesis by conditioning on rhythm, pitch and global style tokens". arXiv:1910.11997.