Recent revelations about the use of extensive film and TV dialogue datasets in AI training highlight significant ethical and legal challenges facing the creative industries.
It has emerged that thousands of films and television episodes have been used in training datasets for generative AI systems, including those developed by prominent tech companies such as Apple, Nvidia, Meta, and Salesforce. An extensive dataset from OpenSubtitles.org, containing dialogue from more than 53,000 films and 85,000 television episodes, has been identified as a key resource in these training efforts. It encompasses dialogue from landmark cinematic and television works: every film nominated for Best Picture from 1950 to 2016, numerous episodes of long-running series such as “The Simpsons” and “Seinfeld”, and the complete runs of shows including “The Wire”, “The Sopranos”, and “Breaking Bad”.
The OpenSubtitles database consists largely of user-uploaded subtitle files extracted from a variety of sources, including DVDs and online streams. These files provide written dialogue that captures conversational nuance and rhythm, which AI developers find invaluable for training language models to produce more natural dialogue.
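To illustrate how such subtitle files become training text, the following is a minimal sketch that strips the timing cues from a SubRip (.srt) file so only the dialogue lines remain. The file name and parsing rules are illustrative assumptions, not the pipeline any particular company used.

```python
# Minimal sketch: extracting plain dialogue from a SubRip (.srt) subtitle file.
# The file name is hypothetical; real training pipelines will differ.
import re

def srt_to_dialogue(path: str) -> list[str]:
    lines = []
    with open(path, encoding="utf-8", errors="ignore") as f:
        for raw in f:
            line = raw.strip()
            # Skip blank lines, numeric cue indices, and timestamp lines
            # such as "00:01:02,000 --> 00:01:04,500".
            if not line or line.isdigit() or "-->" in line:
                continue
            # Drop simple formatting tags like <i>...</i>.
            lines.append(re.sub(r"<[^>]+>", "", line))
    return lines

if __name__ == "__main__":
    for utterance in srt_to_dialogue("example_episode.srt"):
        print(utterance)
```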
Key players in the tech industry, including Nvidia, Anthropic, and EleutherAI, have used this dataset to train large language models (LLMs) such as Nvidia’s NeMo Megatron and Anthropic’s Claude, a competitor to OpenAI’s ChatGPT. Many of these models have since been shared on platforms like Hugging Face for broader development and deployment, raising the prospect of AI-generated content that mimics human dialogue and speech competing with the work of human writers.
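As a rough illustration of how such data is consumed downstream, the sketch below loads a slice of the publicly listed OPUS OpenSubtitles corpus through the Hugging Face `datasets` library (see the dataset links at the end of this article). The dataset identifier, configuration keys, and record layout are assumptions based on the public listing, not a description of any one company’s training setup.

```python
# Minimal sketch (assumed dataset identifier and config): pulling a slice of
# the OPUS OpenSubtitles corpus from the Hugging Face Hub.
from datasets import load_dataset

# The corpus is organised as parallel language pairs; lang1/lang2 select one
# pair. Exact argument names and availability may vary by library version.
subs = load_dataset("open_subtitles", lang1="en", lang2="fr", split="train")

# Each record is expected to carry a translation dict keyed by language code.
sample = subs[0]
print(sample["translation"]["en"])
```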
Despite this widespread usage, permission was not obtained from the original scriptwriters and creators, raising significant legal questions. The lack of transparency around how creative works are used by AI companies remains a contentious issue. Some tech companies justify their use of copyrighted material in AI training under the “fair use” doctrine, although this defence has not yet been fully tested in court. The dataset’s derivative nature further complicates its legal status: subtitles are generally protected by the same copyright rules that apply to the original works from which they are derived.
Available since 2020, the OpenSubtitles dataset forms part of a larger collection known as The Pile, which brings together a wide range of text used to train AI systems, including philosophy papers, online forums, and even YouTube subtitles. Collections like this give tech companies ready access to high-quality data without the burden of sourcing and compiling it themselves.
Despite the contentious use of these datasets, their creators have expressed mixed feelings. Jörg Tiedemann, for example, has noted that the resource was originally developed to aid translation research rather than AI dialogue training. As AI technology evolves and its applications broaden, debates about the ethical use of creative content, and about how artists and content creators should be acknowledged and compensated, remain ever-relevant.
These developments signal a critical juncture not just for AI technology but also for how intellectual property rights are regarded in the digital age. As AI systems become more advanced and more prevalent, the repercussions for creative industries and the legal frameworks that govern them are set to be significant.
Source: Noah Wire Services
- https://ai.opensubtitles.com – This link corroborates the use of OpenSubtitles.org in generating and translating subtitles using AI, highlighting the platform’s capabilities and integration with the OpenSubtitles database.
- https://www.tensorflow.org/datasets/community_catalog/huggingface/open_subtitles – This link supports the existence of the OpenSubtitles dataset, detailing its composition, language coverage, and usage in machine learning models.
- https://github.com/sdtblck/Opensubtitles_dataset – This link provides evidence of the OpenSubtitles dataset being downloaded and parsed, which is used in various AI training efforts.
- https://ai.opensubtitles.com/faq – This FAQ page explains the integration of OpenSubtitles.org with AI.OpenSubtitles.com, the use of AI models, and the platform’s features, which align with the dataset’s usage in AI training.
- https://github.com/LAION-AI/Open-Assistant/issues/2747 – This link discusses a dataset based on subtitles from OpenSubtitles.org, specifically for Japanese movies and TV shows, highlighting the dataset’s extensive use in AI training.
- https://huggingface.co/datasets/Nan-Do/OpenSubtitlesJapanese – This link points to a specific dataset on Hugging Face, which is derived from OpenSubtitles.org and used in AI training, particularly for Japanese content.
- https://www.opensubtitles.org/ – This is the main website of OpenSubtitles.org, which is the source of the extensive dataset used in training AI models.
- https://huggingface.co/datasets/huggingface/open_subtitles – This link provides access to the OpenSubtitles dataset on Hugging Face, which is widely used by tech companies for training large language models.
- https://github.com/LAION-AI/Open-Assistant/tree/main/data/datasets/fd_dialogue – This link references a similar dataset used for training AI models, comparing it to the OpenSubtitles dataset and highlighting their shared use in AI development.
- https://www.noahwire.com – Although not directly linked to the specific article, this source is mentioned as the origin of the information regarding the use of OpenSubtitles in AI training and the associated legal and ethical issues.
- https://huggingface.co/blog/the-pile – This link discusses The Pile, a larger collection of datasets that includes OpenSubtitles, used for training AI systems, which aligns with the article’s mention of diverse data sources.