Recent revelations about the use of extensive film and TV dialogue datasets in AI training highlight significant ethical and legal challenges facing the creative industries.
It has emerged that thousands of films and television episodes have been used in training datasets for generative AI systems, including those developed by prominent tech companies such as Apple, Nvidia, Meta, and Salesforce. An extensive dataset from OpenSubtitles.org, containing dialogue from more than 53,000 films and 85,000 television episodes, has been identified as a key resource in these training efforts. It encompasses dialogue from landmark cinematic and television works: every film nominated for Best Picture from 1950 to 2016, numerous episodes of long-running series such as “The Simpsons” and “Seinfeld”, and the complete runs of shows including “The Wire”, “The Sopranos”, and “Breaking Bad”.
The OpenSubtitles database consists largely of user-uploaded subtitle files extracted from a variety of sources, including DVDs and online streams. These files provide written dialogue that captures conversational nuance and rhythm, which AI developers find invaluable for training language models to produce more natural dialogue.
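To illustrate how such subtitle files become training text, the following is a minimal sketch that strips the timing cues from a SubRip (.srt) file so only the dialogue lines remain. The file name and parsing rules are illustrative assumptions, not the pipeline any particular company used.

```python
# Minimal sketch: extracting plain dialogue from a SubRip (.srt) subtitle file.
# The file name is hypothetical; real training pipelines will differ.
import re

def srt_to_dialogue(path: str) -> list[str]:
    lines = []
    with open(path, encoding="utf-8", errors="ignore") as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines, numeric cue indices, and timestamp lines
            # such as "00:01:02,000 --> 00:01:04,500".
            if not line or line.isdigit() or "-->" in line:
                continue
            # Drop simple formatting tags like <i>...</i>.
            lines.append(re.sub(r"<[^>]+>", "", line))
    return lines

if __name__ == "__main__":
    for utterance in srt_to_dialogue("example_episode.srt"):
        print(utterance)
```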
Key players in the tech industry, including Nvidia, Anthropic, and EleutherAI, have used this dataset to train large language models (LLMs) such as Nvidia’s NeMo Megatron and Anthropic’s Claude, a competitor to OpenAI’s ChatGPT. Many of these models have since been shared on platforms like Hugging Face for broader development and deployment, raising the prospect of AI-generated content that mimics human dialogue and speech competing with the work of human writers.
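As a rough illustration of how such data is consumed downstream, the sketch below loads a slice of the publicly listed OPUS OpenSubtitles corpus through the Hugging Face `datasets` library (see the dataset links at the end of this article). The dataset identifier, configuration keys, and record layout are assumptions based on the public listing, not a description of any one company’s training setup.

```python
# Minimal sketch (assumed dataset identifier and config): pulling a slice of
# the OPUS OpenSubtitles corpus from the Hugging Face Hub.
from datasets import load_dataset

# The corpus is organised as parallel language pairs; lang1/lang2 select one
# pair. Exact argument names and availability may vary by library version.
subs = load_dataset("open_subtitles", lang1="en", lang2="fr", split="train")

# Each record is expected to carry a translation dict keyed by language code.
sample = subs[0]
print(sample["translation"]["en"])
```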
Despite this widespread usage, permission was not obtained from the original scriptwriters and creators, raising significant legal questions. The lack of transparency around how creative works are used by AI companies remains a contentious issue. Some tech companies justify their use of copyrighted material in AI training under the “fair use” doctrine, although this defence has not yet been fully tested in court. The dataset’s derivative nature further complicates its legal status: subtitles are generally protected by the same copyright rules that apply to the original works from which they are derived.
Available since 2020, the OpenSubtitles dataset forms part of a larger collection known as The Pile, which brings together a wide range of text used to train AI systems, including philosophy papers, online forums, and even YouTube subtitles. Collections like this give tech companies ready access to high-quality data without the burden of sourcing and compiling it themselves.
Despite the contentious use of these datasets, their creators have expressed mixed feelings. Jörg Tiedemann, for example, has noted that the resource was originally developed to aid translation research rather than AI dialogue training. As AI technology evolves and its applications broaden, debates about the ethical use of creative content, and about how artists and content creators should be acknowledged and compensated, remain ever-relevant.
These developments signal a critical juncture not just for AI technology but also for how intellectual property rights are regarded in the digital age. As AI systems become more advanced and more prevalent, the repercussions for creative industries and the legal frameworks that govern them are set to be significant.
Source: Noah Wire Services
- https://ai.opensubtitles.com – This link corroborates the use of OpenSubtitles.org in generating and translating subtitles using AI, highlighting the platform’s capabilities and integration with the OpenSubtitles database.
- https://www.tensorflow.org/datasets/community_catalog/huggingface/open_subtitles – This link supports the existence of the OpenSubtitles dataset, detailing its composition, language coverage, and usage in machine learning models.
- https://github.com/sdtblck/Opensubtitles_dataset – This link provides evidence of the OpenSubtitles dataset being downloaded and parsed, which is used in various AI training efforts.
- https://ai.opensubtitles.com/faq – This FAQ page explains the integration of OpenSubtitles.org with AI.OpenSubtitles.com, the use of AI models, and the platform’s features, which align with the dataset’s usage in AI training.
- https://github.com/LAION-AI/Open-Assistant/issues/2747 – This link discusses a dataset based on subtitles from OpenSubtitles.org, specifically for Japanese movies and TV shows, highlighting the dataset’s extensive use in AI training.
- https://huggingface.co/datasets/Nan-Do/OpenSubtitlesJapanese – This link points to a specific dataset on Hugging Face, which is derived from OpenSubtitles.org and used in AI training, particularly for Japanese content.
- https://www.opensubtitles.org/ – This is the main website of OpenSubtitles.org, which is the source of the extensive dataset used in training AI models.
- https://huggingface.co/datasets/huggingface/open_subtitles – This link provides access to the OpenSubtitles dataset on Hugging Face, which is widely used by tech companies for training large language models.
- https://github.com/LAION-AI/Open-Assistant/tree/main/data/datasets/fd_dialogue – This link references a similar dataset used for training AI models, comparing it to the OpenSubtitles dataset and highlighting their shared use in AI development.
- https://www.noahwire.com – Although not directly linked to the specific article, this source is mentioned as the origin of the information regarding the use of OpenSubtitles in AI training and the associated legal and ethical issues.
- https://huggingface.co/blog/the-pile – This link discusses The Pile, a larger collection of datasets that includes OpenSubtitles, used for training AI systems, which aligns with the article’s mention of diverse data sources.