Major technology companies that use AI in their products, including Apple, Nvidia, Salesforce, and Anthropic, have found themselves in a new controversy. According to a report published by Proof News, a dataset used by these companies to train AI models included subtitles from YouTube videos.
The dataset titled "YouTube Subtitles" was published in 2020, and was created by EleutherAI. The publication found that it included subtitles from 173,536 YouTube videos downloaded from over 48,000 channels.
For one, the dataset appears to violate YouTube's terms and conditions, which prohibit accessing videos by "automated means." According to the publication, YouTube Subtitles is a 5.7GB (489-million-word) training dataset and includes subtitles from over 12,000 videos that have since been deleted from the platform.
Video transcriptions sourced from YouTube cover a wide range of creators and channels, including those with hundreds of millions of subscribers and those with more than 100,000 subscribers. The publication writes:
Proof News also found material from YouTube megastars, including MrBeast (289 million subscribers, two videos taken for training), Marques Brownlee (19 million subscribers, seven videos taken), Jacksepticeye (nearly 31 million subscribers, 377 videos taken), and PewDiePie (111 million subscribers, 337 videos taken). Some of the material used to train AI also promoted conspiracies such as the “flat-Earth theory.”
The YouTube Subtitles dataset falls under an umbrella called "The Pile," which includes several other training datasets. Most Pile datasets are open to anyone with enough space and computing power to access them.
EleutherAI representatives didn't respond to the publication's request for comment on the findings and the allegations that videos were scraped without permission. Many creators also didn't respond to the publication, and those who did said that their videos were used without their knowledge.
Proof News searched through online posts and white papers to find evidence of AI companies using the data and "linked subtitles in the dataset to videos on YouTube in order to determine whose creative material was used to train AI models."
However, it was unable to create a comprehensive list of companies that used this dataset, as AI companies don't often disclose the data they use to train models.
Marques Brownlee, one of the affected creators, wrote that he uses a paid service to generate YouTube transcriptions. "So companies that scrape transcripts are stealing *paid* work in more than one way. Not great," he added. Another creator, David Pakman, found a video on TikTok that used a script from one of his videos; only one commenter appeared to have recognized that it was fake.
Note that Apple and the other tech companies didn't download the subtitles themselves but trained their AI models on the dataset. Still, the episode is an example of the unintended consequences of AI. The creators who spoke to the publication said they are uncertain about the future and about the possibility that AI can be used to mimic their content.
Source: Proof News