Read this: AI is training itself on Hollywood's subtitles more than its scripts
A report from The Atlantic found every Best Picture nominee from 1950 to 2016, at least 616 episodes of The Simpsons, and more in an AI training set.
Not even those crazy, could-only-come-from-a-human-brain subtitles like “tentacles roiling wetly” from Stranger Things are safe from the AI slop pile. A new report from The Atlantic asserts that scriptwriters anxious about their hard work and proprietary content being used to train the thing that’s trying to take their jobs really have nothing to worry about. It’s just using the subtitles that capture the language they wrote with their human hearts and human brains, not the scripts themselves. See? Much better!
According to the outlet, subtitles from approximately 53,000 movies and 85,000 TV episodes were found in a large AI-training data set used by Apple, Anthropic, Meta, Nvidia, Salesforce, Bloomberg, and more. Among those titles are reportedly every film nominated for Best Picture from 1950 to 2016, at least 616 episodes of The Simpsons, 170 episodes of Seinfeld, 45 episodes of Twin Peaks, and every episode of The Wire, The Sopranos, and Breaking Bad. The set also includes data from books, YouTube video captions, and even subtitles capturing prewritten dialogue from various awards shows.
Wanna see if your favorite film or show is included in the set? The Atlantic has a search tool included in their report. (It probably is.)
All of this data comes from a site called OpenSubtitles.org, which started with a noble purpose—to aid Google Translate and other translation tools—but seems like it has always been a bit sketchy copyright-wise. At least someone’s happy about this development; Jörg Tiedemann, one of the data set’s creators, reportedly told The Atlantic that he was perfectly fine with OpenSubtitles being used to further erode the hard work of writers rooms even though that was not his original intention.
So why use subtitles instead of actual screenplays? According to the outlet, subtitles are “valuable because they’re a raw form of written dialogue” that mirrors the rhythms and intricacies of spoken conversation. “Well-written speech is a rare commodity in the world of AI-training data, and it may be especially valuable for training chatbots to ‘speak’ naturally,” the report continues. All of which raises the question: if this technology so desperately needs to steal others’ “well-written speech” to find its own voice, should it really be speaking at all?