Copyright and AI training data—transparency to the rescue?
Adam Buick
Journal of Intellectual Property Law & Practice, Volume 20, Issue 3, March 2025, Pages 182–192
https://doi.org/10.1093/jiplp/jpae102
Abstract
Generative Artificial Intelligence (AI) models must be trained on vast quantities of data, much of which is composed of copyrighted material. However, AI developers frequently use such content without seeking permission from rightsholders, leading to calls for requirements to disclose information on the contents of AI training data. These demands have won an early success through the inclusion of such requirements in the EU’s AI Act.
This article argues that such transparency requirements alone cannot rescue us from the difficult question of how best to respond to the fundamental challenges generative AI poses to copyright law. This is because the impact of transparency requirements is contingent on existing copyright laws; if these do not adequately address the challenges presented by generative AI, transparency will not provide a solution. This is exemplified by the transparency requirements of the AI Act, which are explicitly designed to facilitate the enforcement of the right to opt-out of text and data mining under the Copyright in the Digital Single Market Directive. Because the transparency requirements do not sufficiently address the underlying flaws of this opt-out, they are unlikely to provide any meaningful improvement to the position of individual rightsholders.
Transparency requirements are thus a necessary but not sufficient measure to achieve a fair and equitable balance between innovation and protection for rightsholders. Policymakers must therefore look beyond such requirements and consider further action to address the complex challenge presented to copyright law by generative AI.
1. Introduction
Since the debut of ChatGPT in late 2022, the attention of policymakers around the world has been captured by generative Artificial Intelligence (AI)—that is, AI models that are capable of creating data such as text, images, audio or video content.1 Widely viewed as a technology with transformative potential, generative AI is expected by many to add trillions of dollars to the global economy over the next decade,2 prompting numerous governments to declare increased innovation and investment in the technology as a key policy goal.3 In addition to the promised opportunities, however, generative AI also presents policymakers with significant challenges, such as how to respond to the technology’s potential to replicate or amplify existing biases and prejudices, spread misinformation and threaten the livelihoods of human workers (especially in the creative industries).4 The law is therefore faced with the difficult question of how the harms of generative AI can be mitigated without stifling innovation.
Given its role in governing the ownership and use of creative works, it is inevitable that copyright is one of the areas of law at the forefront of this regulatory challenge. Generative AI raises profound questions regarding the foundational assumptions that underpin copyright law,5 one of the most pressing of which concerns the data used to train generative AI models. These models require immense quantities of data, with the largest training datasets comprising millions of text documents, images, audio samples, or other forms of content.6 Most of this material is protected by copyright, but AI developers have frequently made little or no effort to seek the permission of rightsholders for the use of their works. As a result, rightsholders have brought dozens of court cases against AI developers in multiple jurisdictions on the grounds that this unauthorized use of their works constitutes copyright infringement.7 The AI developers counter that this use is covered by the various exceptions to the copyright holder’s otherwise exclusive right to authorize the reproduction of their work.8 If rightsholders are successful in even some of these cases, the resulting damages could be sufficient to bankrupt even the largest AI developers.9 Thus while generative AI serves as a ‘stress test’ for copyright law,10 copyright law, in turn, poses a potentially ‘existential threat’ to generative AI.11
In tandem with this increased attention regarding AI training data, developers have become markedly more secretive regarding the contents of such data—a trend that few believe is coincidental. As a result, organizations representing rightsholders across the world are now calling for AI developers to be required by law to be transparent regarding the contents of their training datasets, with the aim of enabling rightsholders to enforce their rights over their content.12 These calls for action have already led to significant policy developments. At the intergovernmental level, the Hiroshima AI Process Principles, agreed by the G7 nations in 2023, call for the implementation of appropriate measures to protect personal data and intellectual property, including through appropriate transparency of training datasets.13 Legislators have introduced bills that would mandate training data transparency.14 Most notably of all, training data transparency requirements are a feature of the EU’s AI Act, which was adopted in August 2024.15
This article argues that while such transparency requirements are not without merit, they will not by themselves rescue us from the complex task of balancing the interests of rightsholders, AI developers and society as a whole. This is because, in the context of copyright law, transparency requirements simply facilitate the enforcement of the law as it currently stands. Given that the legality of using copyright works to train generative AI models varies widely between jurisdictions, the impact of transparency requirements will thus also vary. Furthermore, given the scale of the challenges generative AI poses to the law of copyright, it is unlikely that present copyright laws will, by default, adequately address these issues in many (perhaps most) jurisdictions. Both these points are illustrated by the transparency provisions of the EU’s AI Act, which are explicitly designed to facilitate the enforcement of the widely criticized right to ‘opt-out’ of the text and data mining (TDM) exception under Article 4 of the Copyright in the Digital Single Market (CDSM) Directive.16 Because the transparency provisions do not address the fundamental problems with the opt-out, individual creators are unlikely to see any significant material benefit from these transparency provisions—which will nevertheless place additional burdens on AI developers.
Transparency requirements are therefore a necessary but insufficient condition to achieve a desirable outcome in this area. Policymakers should instead look beyond such requirements and engage with the difficult question of how to balance the competing interests of all relevant stakeholders. This is as much a question of social priorities as legal mechanisms, and the answers will depend on the specific legal, economic and cultural contexts of different jurisdictions.
The remainder of this article is structured as follows. Section 2 offers a concise overview of how generative AI models are trained using data, along with arguments in favour of training data transparency and the methods by which it might be achieved. Section 3 discusses the copyright implications of the unauthorized use of copyrighted works in AI training data, with a focus on how the legality of such use (and consequently the impact of transparency requirements) varies significantly between jurisdictions. Section 4 provides a detailed examination of the transparency requirements of the EU’s AI Act and argues that these are unlikely to provide meaningful material benefits to individual authors. Section 5 concludes.
2. Generative AI and training data
To fully grasp the significance of the debate around copyright and training data, it is necessary to first understand how such data is used to train a modern generative AI model. While AI models that could be described as ‘generative’ in some sense have existed for decades, the current wave of popular generative AI models such as OpenAI’s GPT series or Stability AI’s image generators are based on a subtype of machine learning known as ‘deep learning’.17 Like other forms of machine learning, deep learning makes use of ‘neural networks’—that is, connected units, or nodes, inspired by the structure of the human brain. What distinguishes deep learning from other neural network-based approaches is that it makes use of multiple layers of nodes, referred to as ‘deep’ layers. When information is passed through these deep layers, it is processed at different levels of complexity, with early layers typically identifying simple patterns and subsequent layers building on this foundation to recognize patterns of increasing complexity. This enables AI models to ‘learn’ from large quantities of data.18 In a generative AI model based on deep learning, the patterns and rules identified during the training process can then be leveraged to create new content.19 The extent to which the final generative AI model retains the data it has been trained on is not entirely clear. While it is generally accepted that AI models encode the patterns derived from the data during the deep-learning process as numerical parameters rather than storing the entire training dataset,20 in some cases generative AI models can recreate identical or near-identical copies of material found within their training data—a phenomenon known as ‘memorization’.21 The copyright implications of this uncertainty regarding the retention of training data are discussed further in Section 3 below.
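To make the layered structure described above more concrete, the following minimal sketch in Python (using the widely adopted PyTorch library) shows a toy ‘deep’ network. It is an illustration of the general architecture only; the dimensions, layer count and next-token objective are simplifying assumptions and do not reflect the design of any particular commercial model.

    import torch
    import torch.nn as nn

    class TinyDeepNet(nn.Module):
        """A toy 'deep' network: several stacked layers, each building on
        the patterns extracted by the layer before it."""
        def __init__(self, vocab_size: int = 1000, hidden: int = 64):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, hidden)  # raw tokens -> vectors
            self.layers = nn.Sequential(
                nn.Linear(hidden, hidden), nn.ReLU(),      # early layer: simple patterns
                nn.Linear(hidden, hidden), nn.ReLU(),      # deeper layer: more complex patterns
            )
            self.out = nn.Linear(hidden, vocab_size)       # scores for the next token

        def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
            return self.out(self.layers(self.embed(token_ids)))

    # Training adjusts the model's numerical parameters so that its
    # predictions fit the training data; the works themselves are not
    # stored in the model, although 'memorization' can still occur.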
Modern generative AI models require truly mind-boggling quantities of training data. The case of GPT-3, the breakthrough Large Language Model (LLM) announced by OpenAI in 2020, provides an illustration. In the initial paper describing the development and capabilities of GPT-3, the authors revealed that the data the model had been trained upon consisted of a refined version of the Common Crawl dataset,22 a smaller dataset of higher-quality web-based text called WebText2,23 two datasets of books and the English-language version of Wikipedia.24 In addition to this general training data, an AI model may then be further trained on a smaller, curated dataset to refine its capabilities for a particular domain, a process known as ‘fine-tuning’.25 As a result, a generative AI model may have been trained on millions of individual works.
2.1 Training data transparency
As noted in the introduction, there are widespread calls for the developers of AI models to be transparent regarding the content of their vast training datasets. While in many cases, these demands are driven by concerns regarding the unauthorized use of copyrighted content, it is worth noting that there are also non-copyright arguments in favour of training data transparency. For example, calls for training data transparency are also motivated by concerns that the content of training data may lead to biased or otherwise inequitable results.26 While all AI developers claim to test their models for such bias prior to release, training data transparency allows external parties—including those with different perspectives and priorities from the developers—to examine the training data in ways that would be beyond the scope of any individual team.27 Relatedly, transparency may help to build public trust in AI models by reducing the information asymmetry between model providers and consumers, thereby avoiding a ‘market for lemons’ scenario in which demand for generative AI models collapses.28
Despite these arguments in favour of transparency, however, AI developers have become markedly more secretive regarding their training data in recent years. Many major AI developers have shifted from detailed explanations of the data used to train a particular model to single-sentence descriptions.29 For example, while OpenAI disclosed the main sources of data for GPT-3, the paper introducing GPT-4 revealed only that the data on which the model had been trained was a mixture of ‘publicly available data (such as internet data) and data licensed from third-party providers’.30 The motivations behind this move away from transparency have not been articulated in any particular detail by AI developers, who in many cases have given no explanation at all. For its part, OpenAI justified its decision not to release further details regarding GPT-4 on the basis of concerns regarding ‘the competitive landscape and the safety implications of large-scale models’, with no further explanation within the report.31 Some limited additional elaboration on both arguments was subsequently provided by Ilya Sutskever, then OpenAI’s Chief Scientist, in an interview in March 2023. Sutskever explained that OpenAI believes that sharing further details regarding its training data would facilitate the replication of its cutting-edge AI models by competitors, while also enabling careless or malicious actors to develop their own powerful AI models more easily.32
While these lines of reasoning have some merit, there are also clear objections to both arguments from a public policy perspective. Preventing rivals from replicating innovative technology without investing comparable resources is a common goal for many firms, but one that is only occasionally in the public interest; furthermore, the absence of training data transparency requirements could itself facilitate anti-competitive practices, such as by enabling the largest AI developers to enter into preferential licensing agreements with entities with access to large pools of training data.33 And while the proliferation of potentially dangerous AI technology is a valid concern, this argument could justify any measure aimed at hindering the market entry of competitors. It is not clear why restricting access to training data would be an especially effective way to prevent dangerous AI tools from falling into the hands of bad actors, especially compared to restricting access to more sensitive information such as the weights of a particular AI model.34 Additionally, as noted above, failure to disclose details of AI training data also has the potential to cause harm by making it more difficult for regulators and third parties to identify potentially harmful or discriminatory behaviour that results from that data. In short, even if there are benefits to withholding information regarding training data, these come with obvious downsides; it would be inappropriate to leave the decision on how best to balance the competing concerns solely to AI companies, given that these companies have a vested interest in preventing the release of such data.
The official arguments that have been presented against transparency are therefore largely unconvincing; it is widely speculated that the primary motivation behind the increasing opacity with regards to training data is instead a desire by AI developers to avoid or minimize liability for infringement of copyright present in the training data. An investigation by the Washington Post in April 2024 reported that many companies involved in the development of AI do not even keep internal records of their training data because of fears that this could be used as evidence of copyright infringement or breach of data protection law.35 While the use of copyrighted content in training data is not necessarily infringement in many jurisdictions, as further discussed below, AI developers certainly have little to gain by inviting potential liability in this area.
2.2 Models of transparency
Training data transparency has multiple benefits, and the arguments against providing such transparency are not particularly persuasive. How, then, can such transparency be achieved? There are several different approaches, which provide varying levels of information regarding the contents of the data in question. The approach that achieves the highest degree of transparency is for AI developers to make the datasets used to train a particular AI model fully publicly accessible. Under this ‘full access’ approach, third parties (including rightsholders) can then view the training data themselves, and independently verify the content that has been used. Given that an AI developer must have access to the full dataset in order to complete the training process, this approach should always be possible in principle (assuming that the developer has not deleted the data once training is complete). Indeed, some developers have taken this approach, making full copies of their training data available online.36
Providing full access to the training data is unlikely to be workable for the majority of AI models, however. Firstly, there are logistical challenges associated with hosting an accessible repository of a training dataset containing hundreds of thousands or millions of individual works. Secondly, such an approach is, ironically, likely to come into conflict with copyright law; even if a rightsholder has agreed for their work to be included in a dataset through a licensing deal, they are unlikely to be happy for their works to effectively be made freely available to other developers through the fully accessible dataset. Similar issues arise regarding any personal data that might be contained within the dataset. One way to avoid the copyright and personal data issues of the full access approach is to permit users to request access to restricted parts of a dataset, which can then be approved or declined by the dataset owner upon the provision and verification of credentials—this has been referred to as ‘gated access’.37 Creating and maintaining the infrastructure necessary to manage gated access to a dataset, however, adds to the already significant logistical issues associated with full access.
Most discussions around increasing training data transparency focus on the provision of some kind of summary that provides key information on the dataset. This is much less burdensome than providing direct access to the content of the dataset—but its usefulness is heavily dependent on the information that is contained in the summary. In theory, such a summary could contain metadata on each item within a dataset—for example, the title, URL (if relevant), author, date of publication etc—thus allowing individual works to be identified. However, such data are often inaccurate or non-existent, especially in the case of data scraped directly from the internet. Providing or ensuring the accuracy of even basic information such as title or author for individual items would be resource intensive, potentially driving out smaller developers and thereby increasing market concentration.38 A number of frameworks for providing detailed summaries of training data without listing individual items have already been developed within the AI community—for example, The Dataset Nutrition Label and Datasheets for Datasets.39
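By way of illustration, a summary in the spirit of these frameworks might take the form of a structured record such as the following Python sketch. The field names and example values here are hypothetical and do not reproduce the actual schema of either framework; the point is that such a record describes a source of data rather than listing individual works.

    from dataclasses import dataclass, field

    @dataclass
    class DatasetSummary:
        """One entry in a training data summary: describes a source of
        data, not the individual works it contains."""
        name: str                           # hypothetical dataset name
        source_description: str             # narrative account of provenance
        collection_period: str
        approximate_size: str               # scale, rather than a full listing
        domains_crawled: list = field(default_factory=list)
        licensing_notes: str = ""

    example = DatasetSummary(
        name="example-web-text",            # invented for illustration
        source_description="Text scraped from publicly accessible news and blog sites",
        collection_period="January to June 2023",
        approximate_size="~2 million documents",
        domains_crawled=["example-news.com", "example-blog.org"],
        licensing_notes="Pages carrying a machine-readable rights reservation were excluded",
    )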
From the perspective of a rightsholder concerned that their works may have been used without their permission, transparency is useful chiefly to the extent that it enables them to establish whether or not a particular work appears in a given dataset. If training data summaries do not identify individual works, rightsholders will at least want a clear explanation of the sources of the data used—for example, existing datasets, the domains of data scraped from the internet, etc—to assess the likelihood that their works were used to train a specific AI model. This assumes, however, that once the unauthorized use of a given work has been identified, bringing and succeeding with a copyright infringement claim will be straightforward. As discussed in the following section, in practice the situation is more complicated, and varies significantly between jurisdictions.
3. Training data and copyright
3.1 The use of copyright content in machine learning
As a starting point, it is certainly the case that a large majority of the content used to train most generative AI models will be protected by the law of copyright. This is an inevitable result of the fact that copyright arises automatically for any work that meets a minimal set of requirements,40 and that the term of this protection is lengthy—at least the life of the author plus 50 years.41 While alternatives to making use of works that raise copyright issues do exist, these are not a viable replacement for the use of copyright works, at least at the time of writing. For example, while there is a large body of public domain works for which copyright has expired, almost all of these works will date from before the 1950s—any AI trained exclusively on such works would therefore be hopelessly outdated.42 Some copyrighted material, such as Wikipedia, is licensed under Creative Commons (CC) or other ‘copyleft’ licences which may permit the use of that content for training generative AI models.43 However, since only a small fraction of copyright protected content has been made available under such licences, it is unlikely that cutting-edge generative AI models could be trained exclusively on CC-licensed content.44 Additionally, the output of AI models trained using such content might be bound by the ‘Share Alike’ obligations CC licences often impose on derivative works—something that commercial AI developers would likely wish to avoid.45 It has also been suggested that at some point in the future some or all of the real-world data used to train AI systems could be replaced with ‘synthetic data’—that is, AI-generated data that is intended to closely resemble real-world data and thus act as a replacement for it.46 However, AI experts are divided on whether synthetic data will ever be able to meaningfully replace real-world data, and if so, when this will be possible.47
The fact that most AI training data are protected by copyright raises a number of problems. As noted above, deep learning-based generative AI models do not typically store copies of their training data—rather, the patterns derived from the data are encoded as numerical parameters. For this reason, many academic commentators have concluded that, in most cases, generative AI models cannot be considered to infringe the copyright in any of the works they were trained upon by their mere existence.48 However, this view is complicated somewhat by the phenomenon of memorization, whereby generative AI models can sometimes reproduce verbatim or near-verbatim portions of their training data.49 It is therefore conceivable that a court might still find a generative AI model to ‘contain’ a work on the basis of its ability to reproduce that work, even if the data is not stored within the model’s memory as it would be on a hard drive. Memorization also raises the issue of generative AI models infringing copyright through their output; if the output of a model directly reproduces some part of its training data, this could clearly be a potential infringement of the right of reproduction.50 However, even if a model’s output is not a verbatim or near-verbatim copy of any part of its training data, it might nevertheless infringe copyright if it contains recognizable protected elements of a work, such as a fictional character.51 Beyond the reproduction right, model output might also infringe other exclusive rights, such as the right to authorize translations of a work, adaptations of a work, or to communicate a work to the public.52
For the purposes of this article, however, the most important issue is the fact that, in the majority of cases, the training data must be reproduced at least once as part of the training process.53 This is at least arguably a prima facie infringement of the right of reproduction. It should be noted that some commentators have argued that both acts of temporary electronic reproduction and ‘non-expressive’ uses of works should not fall within the scope of copyright protection at all; if this were the case, most (if not all) of the copying involved in the training of a generative AI model would simply fall outside the scope of copyright protection entirely.54 However, most of the current debate around the use of copyrighted materials in AI training data is based on the assumption that the reproductions which take place during the training of a generative AI model are indeed copyright-relevant acts, and therefore require the permission of the rightsholder unless a relevant exception applies—a position that has already been confirmed by official sources in both the UK and EU.55
Clearing the rights for the extremely large number of works used to train an AI model would be exceedingly difficult. Even setting aside the expense of paying some kind of licence fee for the use of each work, the transaction costs associated with identifying and negotiating with individual rightsholders would be prohibitive.56 As already noted, AI developers have largely sidestepped this problem by simply making use of content without any meaningful effort to identify or seek permission from the rightsholders.57 This unauthorized reproduction of the copyrighted works forms the basis of the case against the AI developers in the majority of the ongoing training data litigation.58
3.2 Training data and copyright exceptions
As noted in the introduction, the defence of the AI developers to the accusations of mass infringement lies in the exceptions and limitations to copyright protection.59 However, as detailed below, the scope and application of these exceptions and limitations differs considerably between jurisdictions—which in turn profoundly influences the potential impact of introducing requirements for training data transparency.
3.2.1 The EU
Under EU law, Member States must provide a closed system of copyright exceptions—that is, the reproduction of a work without prior authorization will only be permitted if it falls within one of several specific exceptions. While a number of such exceptions are relevant to the process of training a generative AI model,60 the most significant of these comes from Articles 3 and 4 of the 2019 CDSM Directive, which permit ‘reproductions and extractions of lawfully accessible works and other subject matter’ for the purposes of text and data mining (TDM). While the CDSM Directive preceded the current hype around generative AI by several years, TDM is given a broad definition which covers most forms of machine learning,61 and the AI Act explicitly acknowledges the relevance of the TDM exceptions to the AI training process.62 Article 3 of the CDSM Directive permits TDM by ‘research organizations and cultural heritage institutions’ for the purposes of scientific research,63 while Article 4 permits TDM for any purpose—including by private companies for commercial reasons.64 Significantly, however, the latter of these exceptions is predicated, under Article 4(3), on the condition that the works and other subject matter being used for the purposes of TDM have not been ‘expressly reserved by their rightholders in an appropriate manner such as machine-readable means in the case of content made publicly available online’.65 The preamble to the Directive specifies that machine-readable means are the only appropriate means of reserving rights for content made publicly available online.66
This effective ‘veto’ over the use of a work for commercial TDM purposes was included to strengthen the position of rightsholders, theoretically allowing them to negotiate access to their works.67 This has led to criticism that the opt-out will inhibit the AI industry in the EU by increasing the cost of developing new AI models.68 However, there is considerable doubt as to the effectiveness of the opt-out in achieving its goal in practice. Commentators have identified two major barriers to the use of the opt-out. Firstly, there are currently no generally recognized standards or protocols for a machine-readable means of opting out of the Article 4 exception; consequently, there is no means for rightsholders to consistently reserve their rights, particularly for online content.69 Secondly, and more fundamentally, unless rightsholders are aware that their work has been used for the purposes of TDM, they have no way of knowing whether or not their opt-out has been respected.70 This significantly limits the usefulness of the opt-out in practice, and consequently undermines the bargaining power that the provision was meant to provide to rightsholders. As discussed further in Section 4, the transparency provisions of the EU’s AI Act are explicitly aimed at addressing this deficiency.
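Pending the emergence of a recognized standard, any illustration of a machine-readable opt-out is necessarily speculative. The following Python sketch assumes a hypothetical convention modelled on the long-established robots.txt protocol, under which a compliant TDM crawler identifying itself under an invented name checks a site’s robots.txt file before ingesting content; both the bot name and the use of robots.txt for this purpose are assumptions made purely for illustration.

    import urllib.robotparser

    TDM_BOT_NAME = "ExampleTDMBot"  # hypothetical user agent for a TDM crawler

    def may_mine(page_url: str, robots_url: str) -> bool:
        """Return True only if the site has not reserved its rights against
        our hypothetical TDM user agent via its robots.txt file."""
        parser = urllib.robotparser.RobotFileParser()
        parser.set_url(robots_url)
        parser.read()  # fetch and parse the site's robots.txt
        return parser.can_fetch(TDM_BOT_NAME, page_url)

    # A compliant data-collection bot would call may_mine() before adding a
    # page to a training corpus, skipping pages whose rightsholders have
    # expressed a reservation; the unresolved question is precisely how such
    # a reservation should be expressed so that all crawlers recognize it.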
3.2.2 The USA
In the USA, in contrast, the copyright and training data question turns on the doctrine of fair use. Fair use is an open exception to copyright, meaning there is no predefined list of activities permitted by the defence; rather, the fairness of a particular use must be addressed on a case-by-case basis. US jurisprudence emphasizes four factors as especially important in making such a determination—these are the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used in relation to the copyrighted work as a whole, and the effect of the use upon the potential market for or value of the copyrighted work.71
It is not yet clear under what circumstances, if any, the use of copyrighted material for the purposes of training a generative AI system will constitute fair use under US law. However, jurisprudence over the past 20 years has shown that the ‘purpose and character’ factor is especially important; almost all high-profile fair use cases in that time period have turned on the question of whether the purpose and character of the use in question can be said to be ‘transformative.’72 A use is transformative if it ‘adds something new, with a further purpose or different character, altering the first with new expression, meaning, or message’.73 This would seem to support the argument that the use of a work to train a generative AI model is covered by the defence, given that the purpose of such training is to allow the model to generate new content. There is precedent for finding large-scale machine copying to be fair use based on its transformational character, most notably the Authors Guild v HathiTrust74 and Authors Guild v Google75 cases. Of course, this must be set against the fact that the use of a work as part of the training data for a generative AI model is usually done for commercial reasons, involves that work being copied in its entirety, and may reduce the demand for the original work by enabling the production of similar, competing works. It is worth noting that in Google, the transformative nature of the use was seen as enough to outweigh the fact that Google is a for-profit company and that the entire works were copied,76 although the machine copying in that case did not raise the same concerns about market impact.
3.2.3 Other jurisdictions
The approach to AI-relevant copyright exceptions varies further amongst other jurisdictions. Some have introduced bespoke exceptions that enable machine learning from copyrighted materials, like the EU—although generally without the option of opting out. Japan, for example, introduced an exception in 2018 that permits the use of copyright materials without rightsholder permission for a broad range of computing-related uses, including TDM.77 Singapore has similarly introduced a new copyright exception that permits copying for ‘computational data analysis.’78 A number of jurisdictions have open-ended copyright exceptions that may or may not apply to the training of AI models, as in the USA—significantly, this group includes China, the world’s second most important hub for AI development.79 In some countries with such open exceptions, governments have moved to clarify that the use of copyright materials in training data is covered by existing exceptions—for example, the Israeli Ministry of Justice has issued an opinion that in most circumstances, the use of copyrighted materials is permitted under the existing fair use doctrines of Israeli copyright law.80 A third group of jurisdictions have a closed list of exceptions that either do not permit, or place significant limitations on, TDM or other machine-learning related uses of copyrighted materials. The UK, for example, permits TDM of copyrighted works only for non-commercial uses.81
3.2.4 Likelihood of future divergence
This range of approaches to the unauthorized use of copyrighted materials in training data is likely to diverge further, both because some governments will likely seek to introduce further exceptions to facilitate training in order to promote the AI industry domestically, and because challenges to pro-training copyright exceptions are probable. Particularly relevant to this second point is the fact that in the overwhelming majority of countries, any exception to the reproduction right must conform to the ‘three-step test’ under the Berne Convention; that is, the exception must be for a specific purpose, not conflict with the normal exploitation of the work, and not unreasonably harm the legitimate interests of the author.82 Generally speaking, a generative AI model will be capable, inter alia, of creating content that is similar to that on which it was trained; such content is a potential substitute for the original, and may therefore have a negative impact on its value. This means that exceptions that permit the unauthorized use of copyrighted materials for the purposes of training generative AI models may fall foul of the third element of the Berne three-step test, especially when the models are being trained for commercial purposes.83 The need to take into account the potential economic impact on the rightsholder is made explicit in the relevant exception in some jurisdictions.84 However, copyright exceptions in all Berne Convention signatory countries are vulnerable to this line of argument. An interpretation of the relationship between exceptions relevant to AI training and the three-step test that leads to further harmonization may eventually emerge from an international body, such as a WTO Dispute Settlement Panel. Until then, however, the test will be interpreted and applied by national courts, likely leading to further divergence in national approaches to these exceptions.
3.3 The varied impact of training data transparency
The lawfulness of using copyrighted materials without prior authorization from the rightsholder therefore varies significantly between legal systems—in some, the reproductions involved in the training process may not even constitute copyright-relevant acts, while in others these will require the explicit permission of the rightsholder outside of limited exceptions, with most jurisdictions falling somewhere in between. Consequently, the impact of a legal requirement for training data transparency will also vary. For example, regardless of whether or not the use of works in training data is found to be covered by fair use under US law, transparency requirements would have a very different impact in the USA compared to the EU, since fair use does not allow for rightsholders to opt out while the EU’s Article 4(3) CDSM exception does.85 While transparency can offer a valuable tool for the scrutiny of the content used in training data, its impact is ultimately constrained by the underlying copyright framework within a particular jurisdiction. This exposes an apparent flaw in the reasoning of the pro-rightsholder organizations mentioned in the introduction, whose advocacy for training data transparency requirements across various jurisdictions appears to be based on the assumption that similar requirements will lead to similar (pro-rightsholder) outcomes, irrespective of the surrounding legal context.
4. Training data transparency in the AI Act
The training data provisions of the EU’s recently passed AI Act exemplify both how the impact of transparency requirements is determined by a jurisdiction’s existing copyright laws and the limitations of implementing transparency requirements without also appropriately revisiting and, if necessary, revising those laws. As noted above, the transparency provisions contained within the AI Act are explicitly intended to facilitate the opt-out mechanism contained in Article 4(3) of the CDSM Directive. While these requirements will likely provide some useful information on the sources of the training data of AI models deployed in the EU, they are unlikely to meaningfully improve the position of rightsholders due in part to the pre-existing flaws in the opt-out they are designed to give effect to.
The transparency requirements of the AI Act are only one small part of a very large piece of legislation that runs to 180 recitals and 113 articles, most of which is formulated as product safety/consumer protection legislation and does not engage with intellectual property law.86 Indeed, the initial proposal for the AI Act did not include provisions addressing copyright law or training data transparency at all.87 However, following the surge of public interest in generative AI that accompanied the release of ChatGPT in November 2022, groups representing the creative industries and other rightsholders demanded that measures to prevent the unauthorized use of their content to train generative AI be added to the AI Act.88 This resulted in the inclusion of a provision in the negotiating position adopted by the European Parliament in June 2023, which would have required providers of generative AI to ‘document and make publicly available a sufficiently detailed summary of the use of training data protected under copyright law.’89
Critics, however, were quick to observe that providing even a ‘summary’ of the copyrighted materials used in a training dataset would be unworkable in practice, given the vast number of individual works used, the low requirements for copyright protection to arise, and the fact that most copyright works are not actively managed by their owners.90 It would appear that this criticism was also recognized by the drafters of the Act; in the final version, the provision on AI training data transparency and copyright has been split into two closely related provisions at Article 53(1)(c) and (d).
Article 53(1)(c) requires providers of general-purpose AI models to ‘put in place a policy to comply with Union law on copyright and related rights, and in particular to identify and comply with, including through state-of-the-art technologies, a reservation of rights expressed pursuant to Article 4(3) of Directive (EU) 2019/790’. Article 53(1)(d) requires providers of general-purpose AI models to ‘draw up and make publicly available a sufficiently detailed summary about the content used for training of the general-purpose AI model, according to a template provided by the AI Office.’
Along with the other obligations in Article 53, the training data transparency and copyright policy provisions apply to providers of general-purpose AI (GPAI) models. ‘Providers’ are defined elsewhere in the AI Act as ‘… a natural or legal person, public authority, agency or other body that develops an AI system or a general-purpose AI model or that has an AI system or a general-purpose AI model developed and places it on the market or puts the AI system into service under its own name or trade mark, whether for payment or free of charge.’91 A GPAI model is defined as one that ‘displays significant generality and is capable of competently performing a wide range of distinct tasks regardless of the way the model is placed on the market and that can be integrated into a variety of downstream systems or applications.’92 It is worth noting that the AI Act distinguishes between AI models and AI systems; a GPAI model is an essential component of a GPAI system, but does not become a GPAI system without the addition of further components such as a user interface.93
4.1 Meaning of Article 53(1)(c) and (d)
While brief, the requirements of Article 53(1)(c) and (d) raise a number of important questions. Perhaps the most fundamental of these is exactly what information is required for a summary of the training content to be considered ‘sufficiently detailed’. Some clarification is provided by the preamble, particularly Recital 107, which states in part that:
While taking into due account the need to protect trade secrets and confidential information, this summary should be generally comprehensive in its scope instead of technically detailed to facilitate parties with legitimate interests, including copyright holders, to exercise and enforce their rights under Union law, for example by listing the main data collections or sets that went into training the model, such as large private or public databases or data archives, and by providing a narrative explanation about other data sources used.94
Despite this guidance, considerable uncertainty remains as to what is required of GPAI model providers. The summaries under Article 53(1)(d) are clearly intended to give a broad overview of the sources of the training data rather than a detailed breakdown of the specific works used, yet must also contain enough information to allow rightsholders, as well as other parties with legitimate interests (a term that is not defined), to exercise and enforce their rights under Union law. While the recital stresses the importance of protecting the trade secrets and other confidential information of AI developers, this protection should not be an overriding concern; this reflects an attempt by the drafters of the AI Act to balance the interests of rightsholders and others against those of AI developers. The extent to which this balance has been achieved will become clearer when the AI Office releases its template for the summary (which according to the Preamble should be ‘simple, effective and allow the provider to provide the required summary in narrative form’).95
The requirements of Article 53(1)(c) are clearer, as they specifically compel AI providers to respect the opt-out in Article 4(3) of the CDSM Directive. As noted at Section 3.2.1, the relevance of this exception to generative AI is confirmed by the preamble, which clarifies that the use of copyright-protected content for the purposes of training and development of GPAI models requires ‘the authorisation of the rightsholder concerned unless relevant copyright exceptions and limitations apply’.96 The preamble further acknowledges that, while Directive 2019/790 introduces exceptions and limitations for the purpose of text and data mining, rightsholders can opt out of such TDM unless done for the purposes of scientific research, and that where the right to opt out has been exercised, providers of general-purpose AI models need to obtain rightsholder authorization.97
The mention of ‘state-of-the-art technologies’ in Article 53(1)(c) links to the CDSM Directive’s requirement that machine-readable means be used to express the opt-out. The use of the term ‘state-of-the-art’ suggests that GPAI model providers must continually update these means as the technology facilitating the identification of and compliance with the opt-out improves—although, as noted above, there is still no generally recognized standard for exercising the opt-out in Article 4(3) at the time of writing.
It is not clear, however, what precisely is meant by the references to ‘Union law on copyright and related rights’98 and ‘Union copyright law.’99 As Alexander Peukert observes, while the copyright laws of Member States have become increasingly harmonized over the last three decades, each Member State has its own national copyright regime. There is also no copyright equivalent of the unitary EU trade mark or Community design: the exclusive rights granted by a national copyright apply only within a country’s territory.100 At present, it is ambiguous whether the obligation to comply with Union copyright law should be interpreted as referring to an obligation to respect the collective national copyright laws of each Member State, or to comply only with those elements of copyright law that have been harmonized by EU law. Whatever is meant by ‘Union copyright law’, there is nothing in Article 53(1)(c) or the preamble to suggest this only applies to input data (although this was clearly the chief concern of the drafters). As such, policies to comply with Union copyright law will presumably need to also consider infringement through output, as discussed at Section 3.1.101
4.2 Evaluating the effectiveness of Article 53(1)(c) and (d)
The AI Act’s preamble clarifies that the goals of the transparency and copyright protection provisions are to protect the interests of rightsholders as well as other parties with a ‘legitimate interest’ in the contents of training data.102 Article 53 will apply 12 months after the AI Act’s entry into force,103 although codes of practice covering the obligations in Article 53, including the ‘adequate level of detail for the summary about the content used for training’, will be ready no later than nine months after the Act’s entry into force.104 The full implications of Article 53(1)(c) and (d) will only become apparent once this clarification of the requirements arrives.
However, even without knowing the full details of what the provisions will require, there are compelling grounds for scepticism as to whether Article 53(1)(c) and (d) will deliver meaningful material benefits to individual rightsholders. Three major obstacles to the effectiveness of the provisions in achieving their goals stand out: the lack of detail offered by the training data summaries, challenges enforcing the provisions in the case of AI models developed outside of the EU, and technical and logistical issues relating to the implementation of Article 4(3) of the CDSM Directive.
4.2.1 The lack of detail offered by the training data summaries
The requirement to provide summaries of training data at least nominally addresses one of the major issues with Article 4(3) of the CDSM Directive by assisting rightsholders in verifying whether their opt-outs have been respected. However, even without knowing the level of detail that will be required by the AI Office’s template, there is reason to doubt how useful these summaries will be. As noted at Section 2.2, training data transparency benefits rightsholders largely to the extent that it enables them to establish whether or not a particular work appears in a given dataset. As the preamble specifies that summaries should not be so detailed as to identify individual works, it is unclear how rightsholders will be able to determine whether any reservation of rights on their part has been respected. An obvious solution would be for AI providers to retain detailed records of the training data used in addition to the public summaries, in case of challenges from regulators or rightsholders. However, this is complicated by the fact that under Article 4(2) of the CDSM Directive, reproductions and extractions of works made under Article 4 may only be retained ‘for as long as is necessary for the purposes of text and data mining,’105 meaning that the copies should be deleted once the training process is completed.106 The matter of how rightsholders will be able to prove whether or not their works have been used in a particular training dataset therefore remains outstanding.
4.2.2 Challenges enforcing the provisions in the case of AI models developed outside of the EU
At present, nearly all major centres of the AI industry are located outside of the EU.107 This raises the question of how Article 53(1)(c) and (d) can be effectively enforced for the vast majority of generative AI models that are developed somewhere other than the EU. The preamble attempts to address this by stating that the policy to comply with Union copyright laws and give effect to the reservation of rights per Article 4(3) CDSM Directive applies to any provider that places a general-purpose AI model on the EU market, regardless of where the training took place, in order to ensure a ‘level playing field’ and prevent providers gaining a competitive advantage by training their models outside of the EU.108 It is not clear, however, that this approach will produce the intended result. As has already been observed, copyright is territorial in nature, and AI models are generally understood not to contain the works they have been trained upon.109 As a result, if the unauthorized reproductions of copyrighted material necessary for the training of an AI model were to be carried out entirely in a third country whose law permits such use without rightsholder permission, there would be no copyright infringement either in that country or within the territory of an EU Member State.110 Under this conventional understanding of copyright law, an AI provider could therefore ensure compliance with Union copyright law by making sure that none of their training data had been collected from servers based within the EU, and that all training took place outside of the EU’s borders.
Alternatively, some have interpreted the recital as meaning that GPAI models will be barred from entry into the EU market for failing to respect opt-outs under Article 4(3) of the CDSM Directive even when their training has taken place entirely in a jurisdiction which does not permit such a reservation of rights.111 In this scenario, an AI product could be excluded from the EU market in order to protect the interests of rightsholders, despite no actionable copyright infringement having ever occurred. This outcome would be highly unusual from the perspective of copyright theory. João Pedro Quintais reasonably points out that it would be problematic for something as radical as the de facto extraterritorial effect of copyright to be introduced ‘through the back door’ of a non-binding recital.112 If the first interpretation of the recital holds, however, one of the main impacts of Article 53(1)(c) is likely to be that it will heavily incentivize AI developers to ensure that the training of their models takes place outside of the EU.
4.2.3 Technical and logistical issues relating to the implementation of Article 4(3) of the CDSM Directive
Even if both of the issues discussed above can be overcome, however, the emergence of a viable market in which authors receive meaningful compensation for the use of their works in AI training data remains unlikely due to the inherent flaws of Article 4(3). A significant obstacle, of course, is the aforementioned lack of a widely accepted protocol for the reservation of rights under Article 4(3). Beyond this, however, a major and possibly insurmountable logistical barrier remains. As noted above, the transaction costs associated with negotiating a licence fee with rightsholders for such a large volume of works would be prohibitive. Given the sheer quantity of works involved, even a minor transaction cost per work is likely to be enough to render any approach based on the negotiation of licenses with individual rightsholders entirely unfeasible.113
Some potential solutions to this problem have been offered. For example, it is possible that an automated licensing system could be developed, with rightsholders expressing the terms (e.g. payment of a particular fee) under which they would be prepared to waive their opt-out in machine-readable form, which bots deployed to acquire training data could detect and comply with.114 However, much of the content available online has not been posted by the legitimate rightsholder, and the creation of a market for the licensing of works would incentivize dishonest actors to impersonate legitimate rightsholders. It is not clear how this approach would address the critical issue of verifying whether or not the entity offering to waive the opt-out over a work is, in fact, the rightful rightsholder.
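In deliberately simplified form, the decision logic of such a bot might resemble the following Python sketch. Every field name, the fee figures and the decision rules are hypothetical; the sketch also shows where the verification problem just described would arise.

    # Hypothetical machine-readable terms published alongside a work.
    terms = {
        "tdm_opt_out": True,             # rights reserved under art 4(3) by default...
        "waiver_available": True,        # ...but waivable on payment of a fee
        "fee_eur_per_work": 0.05,
        "rightsholder_verified": False,  # the impersonation problem noted above
    }

    def bot_decision(terms: dict, budget_per_work: float) -> str:
        if not terms["tdm_opt_out"]:
            return "mine"                # no reservation: the TDM exception applies
        if terms["waiver_available"] and terms["fee_eur_per_work"] <= budget_per_work:
            if not terms["rightsholder_verified"]:
                return "skip"            # cannot confirm who would be paid
            return "license"             # pay the fee and record the licence
        return "skip"                    # reservation stands; permission needed

    print(bot_decision(terms, budget_per_work=0.10))  # prints 'skip'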
Another, potentially complementary, solution to these logistical challenges would be the establishment of a collective rights management (CRM) organization in order to manage the conditional waiver of the Article 4(3) opt-out.115 CRM has been a successful means of clearing the rights associated with large numbers of individual works with highly fragmented owners in other areas, most notably the music industry.116 A centralized CRM organization for the management of training data licences would also provide a means to verify the legitimate rightsholders of works. However, the number of works involved in training a large AI model far exceeds even the biggest repertoires of works currently managed through CRM.117 The sheer volume of content involved in the training of generative AI models also means that it is extremely difficult to devise any remuneration system that could yield significant payments to individual creators—particularly sums that would adequately compensate for the loss of business that many creatives fear will be the result of increasing use of generative AI.118 Furthermore, while existing CRM organizations tend to govern licences for one particular type of work—such as pieces of music—an organization dedicated to the CRM of training data would have to manage a very wide variety of works, from books to songs, photographs, videos and social media posts.119 Given the complexity and scale involved, such a CRM organization would likely need to be directly established by a government, or at least require substantial government backing.120 Neither appears to be in the offing.
4.2.4 The likely impact of Article 53(1)(c) and (d)
While the issues discussed above cast doubt on whether Article 53(1)(c) and (d) will provide any meaningful material benefit to individual authors, there is little doubt that complying with the provisions will impose additional costs on AI developers. This could further hamper the EU’s already comparatively underpowered AI industry, an issue that ties into a broader concern that the AI Act will leave Europe’s tech industry ‘hiring lawyers while the rest of the world is hiring coders.’121 Moreover, the provisions may also discourage AI developers from launching AI products in the EU, as the required disclosures regarding training data sources could also assist rightsholders in bringing claims and negotiating licence fees in other jurisdictions.122 It is therefore possible that the transparency provisions of the AI Act could produce a ‘lose-lose’ scenario in which developers are deterred from launching new AI models in the EU, AI development shifts outside of the bloc, and rightsholders receive no additional compensation.123
Such an extreme scenario seems unlikely, however. This is partly because AI developers have the option of concluding licenses with organizations with large portfolios of works, such as publishers, which many already do.124 These agreements would simplify compliance with Article 53(1)(c) and (d); AI developers could cite the works included in the licensing agreement in their training data summaries, while Union copyright law would be respected through the terms of that agreement. Publishers and other gatekeeper organizations will be incentivized to impose standard contract terms requiring authors to waive their right to opt out of TDM in order to facilitate further such agreements; multiple commentators have noted that the asymmetry in bargaining power between authors and such organizations means that it could become difficult for professional authors to refuse such terms.125 The most likely outcome of the transparency provisions of the AI Act may therefore be that the providers of generative AI models conclude licensing agreements with publishers and other organizations with access to large bodies of high-quality content in order to meet their obligations under Article 53(1)(c) and (d).126 It is unlikely that much of this licensing revenue will reach individual authors, especially given that the amounts of money involved in the deals struck between publishers and AI developers to date are relatively small given the large number of works covered.127
5. Conclusion—transparency to the rescue?
This article has demonstrated that, while requirements for training data transparency have a number of clear benefits, much of the impact of such requirements is dependent on local copyright law—leading to widely varying outcomes between different jurisdictions. Such requirements do not and cannot resolve the complex challenges surrounding the use of copyrighted materials to train generative AI models by themselves.
This is clearly illustrated by the transparency requirements of the EU’s AI Act. As noted, there are a number of outstanding questions regarding the meaning of these provisions which will be clarified in the coming months and years. However, it is already clear that reliance on transparency requirements, supplemented with a requirement for a policy to respect Union copyright law, is a misguided approach to the drafters’ presumed goal of ensuring that individual authors are compensated for the use of their works in AI training data. Such requirements were never going to overcome the inherent logistical challenges posed by implementing the CDSM Directive’s opt-out. A better, although more challenging, approach to achieving this aim would instead have been to focus on creating new legal mechanisms that would avoid the issues associated with Article 4(3) CDSM Directive.128 Because they do not address the fundamental flaws in the existing framework for the use of copyrighted content by AI developers in the EU, the transparency provisions of the AI Act are unlikely to provide any meaningful improvement to the material condition of individual authors.
To be clear, none of this diminishes the advantages of requirements for training data transparency. However, policymakers around the world must now turn their attention beyond such requirements to the difficult task of determining how, and to what extent, the law of copyright should be amended to balance the interests of the various groups affected by generative AI. There is no 'one-size-fits-all' solution here: the best approach will vary depending on the specific legal, economic and cultural context of a given jurisdiction. In some cases, further measures to protect rightsholders may be appropriate. In others, particularly jurisdictions at risk of falling behind in the global AI race, the priority may be to ensure that copyright law does not restrict the development of a domestic AI industry. Policymakers should engage closely with key stakeholders to determine the most effective policies for their specific contexts. Given the rapid pace of AI development, these policies should be reassessed frequently to ensure that they remain relevant and effective. All of this must be managed amid both the sometimes-exaggerated hype regarding AI's economic potential and a growing public backlash against AI technology, particularly among those employed in the creative industries.
Some have suggested that the challenge of generative AI is so profound as to herald the end of copyright law.129 However, many new technologies—for example, radio, cassettes, home video and especially the internet—have prompted premature predictions of copyright's demise. While the law of copyright will undoubtedly be roiled by the fundamental questions raised by generative AI for years to come, it is likely that it will ultimately adapt, just as it has to previous technological advances. This does not mean that policymakers should be complacent; rather, decisive action is needed now to ensure that the correct balance is struck between incentivizing innovation and protecting rightsholders. Crafting effective policy to regulate a new technology during the early stages of its development is particularly difficult, as this is when the least is known about its societal impacts. Yet as the technology develops and its consequences become clearer, it also becomes more socially and economically entrenched, with the result that implementing policies to control it becomes much more difficult.130 This problem of entrenchment is especially pertinent to generative AI, as major tech companies are rapidly integrating these systems into widely used applications. It is therefore vital that policymakers act thoughtfully but swiftly to ensure that copyright law develops in a way that balances the interests of all stakeholders fairly. While training data transparency is an important tool in this effort, it cannot rescue us from the difficult questions of how this balance should be achieved.
Footnotes
Adam Zewe, ‘Explained: Generative AI’ (MIT Schwarzman College of Computing, 9 November 2023). Available at https://computing.mit.edu/news/explained-generative-ai/ (accessed 14 October 2024).
Bloomberg Intelligence, Generative AI 2024 Report (Bloomberg 2024).
See eg The White House (USA), Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, 30 October 2023, available at https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/ (accessed 1 November 2024); Ministère de l'Économie, des Finances et de la Relance (France), Stratégie Nationale pour l'Intelligence Artificielle, 22 May 2024, available at https://www.economie.gouv.fr/strategie-nationale-intelligence-artificielle (accessed 1 November 2024); HM Government (UK), National AI Strategy, 18 December 2022, available at https://www.gov.uk/government/publications/national-ai-strategy (accessed 1 November 2024).
There is already evidence that generative AI is behind a recent drop in demand for the services of some freelance workers—see further Ozge Demirci, Jonas Hannane and Xinrong Zhu, 'Who Is AI Replacing? The Impact of Generative AI on Online Freelancing Platforms' (2024) CESifo Working Paper No 11276.
See further, Mark Lemley, ‘How Generative AI turns Copyright Upside Down’ (2024) 25 Columbia Science & Technology Law Review 190.
Zewe (n 1).
For example, the Authors Guild's case against OpenAI in the USA, the case filed by photographer Robert Kneschke against LAION in Germany, and the claim brought by Getty Images against Stability AI in the UK; for more details on these cases and (many) other examples, see further Mishcon de Reya, 'Generative AI – Intellectual Property Cases and Policy Tracker' (Mishcon de Reya, 12 August 2024). Available at https://www.mishcon.com/generative-ai-intellectual-property-cases-and-policy-tracker (accessed 1 November 2024).
See eg Stability AI, 'Response to USCO Inquiry on Artificial Intelligence and Copyright' (October 2023), 8; Hugging Face, 'Hugging Face Response to the Copyright Office Notice of Inquiry on Artificial Intelligence and Copyright' (November 2023), 9; Anthropic, 'Notification of Inquiry Regarding Artificial Intelligence and Copyright Public Comments of Anthropic PBC' (October 2023), 3; Google, 'Artificial Intelligence and Copyright' (October 2023), 8–11.
Elizabeth Lopatto, 'OpenAI Searches for an Answer to its Copyright Problems' (The Verge, 30 August 2024), Available at https://www.theverge.com/2024/8/30/24230975/openai-publisher-deals-web-search (accessed 1 November 2024).
Daryl Lim, ‘Generative AI and Copyright: Principles, Priorities and Practicalities’ (2023) 18 Journal of Intellectual Property Law & Practice 841.
Pamela Samuelson, ‘Generative AI Meets Copyright’ (University of California, Berkeley, 26 April 2023). Available at https://news.berkeley.edu/2023/05/16/generative-ai-meets-copyright-law/ (accessed 1 November 2024).
See eg Professional Photographers of America, ‘PPA’s Comments on the Copyright’s Office NOI on Generative Artificial Intelligence’ (Professional Photographers of America, 17 November 2023). Available at https://www.ppa.com/articles/ppas-comments-on-the-copyrights-office-noi-on-generative-artificial-intelligence (accessed 1 November 2024); Staff, ‘Global Principles on Artificial Intelligence (AI)’ (News/Media Alliance, 6 September 2023). Available at https://www.newsmediaalliance.org/global-principles-on-artificial-intelligence-ai/ (accessed 1 November 2024); CISAC, ‘Australian Creators Welcome Establishment of Copyright and AI Reference Group’ (CISAC, 12 December 2023) Available at https://www.cisac.org/Newsroom/society-news/australian-creators-welcome-establishment-copyright-and-ai-reference-group (accessed 1 November 2024).
Hiroshima Process International Guiding Principles for Organizations Developing Advanced AI System (2023), 5.
See, eg, California Senate Bill 942 ‘California AI Transparency Act’ and US House Bill 7913 ‘Generative AI Copyright Disclosure Act of 2024ʹ.
Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) [hereafter 'AI Act'], art 53(1)(c) and (d).
See further Eleonora Rosati, ‘Copyright as an Obstacle or an Enabler? A European Perspective on Text and Data Mining and its Role in the Development of AI Creativity’ (2019) 27 Asia Pacific Law Review 198; Thomas Margoni and Martin Kretschmer, ‘A Deeper Look into the EU Text and Data Mining Exceptions: Harmonisation, Data Ownership, and the Future of Technology’ (2022) 71 GRUR International 685; Paul Keller and Zuzanna Warso, Defining Best Practices of Opting Out of ML Training (Open Future 2023); Gina Maria Ziaja, ‘The Text and Data Mining Opt-Out in Article 4(3) CDSMD: Adequate Veto Right for Rightholders or a Suffocating Blanket for European Artificial Intelligence Innovations?’ (2024) 10 Journal of Intellectual Property Law & Practice 453.
All references to ‘generative AI’ throughout this article should be understood as meaning deep learning-based generative AI models unless otherwise stated.
Ian Goodfellow, Yoshua Bengio and Aaron Courville, Deep Learning (MIT Press 2016) 6–8.
Zewe (n 1).
Pamela Samuelson, ‘Generative AI Meets Copyright’ (2023) 381 Science 158, 159; Matthew Sag, ‘Copyright Safety for Generative AI’ (2023) 61 Houston Law Review 295, 316–321.
See further Ivo Emanuilov and Thomas Margoni, ‘Forget Me Not: Memorisation in Generative Sequence Models Training on Open Source Licensed Code’ (2024) SSRN. Available at https://ssrn.com/abstract=4720990 (accessed 1 November 2024).
The Common Crawl database consists of ‘web page data, metadata extracts, and text extracts’ taken from the internet since 2008; Common Crawl, ‘Overview’ (Common Crawl, 2024). Available at https://commoncrawl.org/overview (accessed 1 November 2024).
The WebText2 dataset is composed of the text of popular outbound links from the social media site Reddit; The New York Times Company v OpenAI LP and Microsoft Corporation, Complaint, US District Court for the Southern District of New York, filed 27 December 2023, page 26. Available at https://nytco-assets.nytimes.com/2023/12/NYT_Complaint_Dec2023.pdf (accessed 1 November 2024).
Tom Brown and others, ‘Language Models are Few-shot Learners’ (2020) arXiv preprint arXiv:2005.14165, 9.
Dave Bergmann, ‘What is Fine-tuning?’ (IBM, 15 March 2024), Available at https://www.ibm.com/topics/fine-tuning (accessed 1 November 2024).
Mark Lemley and Bryan Casey, ‘Fair Learning’ (2021) 99 Texas Law Review 743, 757.
Yacine Jernite, ‘Training Data Transparency in AI: Tools, Trends, and Policy Recommendations’ (Hugging Face, 5 December 2023). Available at https://huggingface.co/blog/yjernite/data-transparency (accessed 1 November 2024).
For an in-depth discussion of the ‘market for lemons’ phenomenon, see George Akerlof, ‘The Market for “Lemons”: Quality Uncertainty and the Market Mechanism’ (1970) 84 The Quarterly Journal of Economics 488. For a discussion of how the requirement to be transparent with safety and efficacy works to the benefit of companies developing new products in other contexts, such as the pharmaceutical industry, see further Ariel Katz, ‘Pharmaceutical Lemons: Innovation and Regulation in the Drug Industry’ (2007) 14 Michigan Telecommunications & Technology Law Review 1.
Jernite (n 27).
OpenAI, ‘GPT-4 Technical Report’ (2023) arXiv preprint arXiv:2303.08774, 2.
ibid.
Sutskever said 'On the competitive landscape front — it's competitive out there… GPT-4 is not easy to develop. It took pretty much all of OpenAI working together for a very long time to produce this thing. And there are many many companies who want to do the same thing, so from a competitive side, you can see this as a maturation of the field… On the safety side … [t]hese models are very potent and they're becoming more and more potent. At some point it will be quite easy, if one wanted, to cause a great deal of harm with those models. And as the capabilities get higher it makes sense that you don't want want [sic] to disclose them'; James Vincent, 'OpenAI co-founder on company's past approach to openly sharing research: "We were wrong"' (The Verge, 15 March 2023), Available at https://www.theverge.com/2023/3/15/23640180/openai-gpt-4-launch-closed-research-ilya-sutskever-interview (accessed 1 November 2024).
Zuzanna Warso, Maximilian Gahntz and Paul Keller, Sufficiently Detailed? A Proposal for Implementing the AI Act’s Training Data Transparency Requirements for GPAI (Open Future, 2024).
Lawrence Lessig, ‘Not all AI Models should be Freely Available, Argues a Legal Scholar’ (The Economist, 29 July 2024), Available at https://www.economist.com/by-invitation/2024/07/29/not-all-ai-models-should-be-freely-available-argues-a-legal-scholar (accessed 1 November 2024).
Kevin Schaul, Szu Yu Chen and Nitasha Tiku, 'Inside the Secret List of Websites that make AI like ChatGPT Sound Smart' (The Washington Post, 19 April 2023), Available at https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/ (accessed 1 November 2024).
See eg Zhengzhong Liu and others, ‘LLM360: Towards Fully Transparent Open-Source LLMs’ (2023) arXiv preprint arXiv:2312.06550.
Aleck Tarkowski and Zuzanna Warso, Commons-Based Data Set Governance for AI (Open Future, 2024), 10.
Katharina de la Durantaye, ‘Garbage in, Garbage Out’ (2023), 17. Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4572952 (accessed 1 November 2024).
Sarah Holland and others, ‘The Dataset Nutrition Label’ (2018) 12 Data Protection and Privacy 1; Timnit Gebru and others, ‘Datasheets for Datasets’ (2021) 64 Communications of the ACM 86.
Berne Convention for the Protection of Literary and Artistic Works (1886) art 5.
Agreement on Trade-Related Aspects of Intellectual Property Rights (1994) art 12. In many developed countries, including the USA, UK, and EU Member States, the term of protection is the life of the author plus 70 years.
Aside from the fact that they are less likely to have been digitized, such works will not reflect modern developments, will use outdated language and will be less likely to represent authors from marginalized backgrounds. Moreover, they will be significantly more likely to contain views that we would now rightly recognise as abhorrent. Sag (n 20), 338.
This depends on the conditions of a particular CC license. In their online FAQ, the Creative Commons organization notes that ‘If someone uses a CC-licensed work with any new or developing technology, and if copyright permission is required, then the CC license allows that use without the need to seek permission from the copyright owner so long as the license conditions are respected’ [emphasis added]; Creative Commons, ‘Frequently Asked Questions’ (Creative Commons, 6 June 2024) Available at https://creativecommons.org/faq/#what-are-the-limits-on-how-cc-licensed-works-can-be-used-in-the-development-of-new-technologies-such-as-training-of-artificial-intelligence-software (accessed 1 November 2024).
Although for an interesting attempt to overcome this problem, see Aaron Gokaslan and others, 'CommonCanvas: Open Diffusion Models Trained on Creative-Commons Images' (2024) Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 8250.
See further Kacper Szkalej and Martin Senftleben, ‘Generative AI and Creative Commons Licences: The Application of Share Alike Obligation to Trained Models, Curated Datasets and AI Output’ (2024). Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4872366 (accessed 1 November 2024). Note that as things currently stand, CC licenses generally state that ‘Share Alike’ obligations do not apply if the work is used under a copyright exception; Szkalej and Senftleben (2024), 12.
James Jordon and others, 'Synthetic Data—What, Why and How?' (2022) arXiv preprint arXiv:2205.03257, 4.
ibid 36.
De la Durantaye (n 38), 4–6; Sag (n 20), 313–25; Szkalej and Senftleben (n 45), 8.
See further Nicholas Carlini and others, 'Quantifying Memorization Across Neural Language Models' (2022) arXiv preprint arXiv:2202.07646.
The right of reproduction is set out in the Berne Convention (1886) at art 9(1). Most (but not all) of the Berne Convention member states have gone on to sign the WIPO Copyright Treaty (1996), the Agreed Statements to which clarify that ‘the storage of a protected work in digital form in an electronic medium constitutes a reproduction within the meaning of Article 9 of the Berne Convention’; WIPO, ‘Agreed statements concerning the WIPO Copyright Treaty’ (20 December 1996) TRT/WCT/002, 1.
Matthew Sag has dubbed this the 'Snoopy Problem.' Sag (n 20), 327.
The exclusive rights to authorise the translation of a work or any adaptations of a work are set out at art 8 and art 12, respectively, of the Berne Convention (1886), while the exclusive right to communicate a work to the public is set out at art 8 of the WIPO Copyright Treaty (1996).
Lemley and Casey (n 26), 753.
See eg Jenny Quang, 'Does Training AI Violate Copyright Law?' (2021) 36 Berkeley Technology Law Journal 1407; Matthew Jockers, Matthew Sag and Jason Schultz, 'Don't Let Copyright Block Data Mining' (2012) 490 Nature 29.
The UK Intellectual Property Office, 'Consultation outcome: Artificial intelligence call for views: copyright and related rights' (Gov.uk, 23 March 2021), Available at https://www.gov.uk/government/consultations/artificial-intelligence-and-intellectual-property-call-for-views/artificial-intelligence-call-for-views-copyright-and-related-rights (accessed 1 November 2024); as discussed below, this is also confirmed by the EU AI Act at Recital 105.
Lemley and Casey (n 26), 759.
Lopatto (n 9).
Mishcon de Reya (n 7).
See eg Anthropic PBC, Notification of Inquiry Regarding Artificial Intelligence and Copyright, Public Comments of Anthropic PBC (Anthropic PBC 2023), Available at http://www.openfuture.eu/wp-content/uploads/2023/11/231111_copyright_offoce_noi_anthropic.pdf (accessed 1 November 2024).
For example, the exception for 'temporary acts of reproduction' under art 5(1) of Directive 2001/29/EC of the European Parliament and of the Council of 22 May 2001 on the harmonisation of certain aspects of copyright and related rights in the information society [InfoSoc Directive].
The definition given states that text and data mining means ‘any automated analytical technique aimed at analysing text and data in digital form in order to generate information which includes but is not limited to patterns, trends and correlations’; Directive (EU) 2019/790 of the European Parliament and of the Council of 17 April 2019 on copyright and related rights in the Digital Single Market [hereafter CDSM Directive] art 2(2).
AI Act Recital 105.
While generally better received than the art 4 exception, art 3 of the CDSM Directive has been criticised on the grounds that it excludes researchers not affiliated with a research organisation or cultural heritage institution from benefitting from the exception, even if they operate in the same manner as their institutionally affiliated peers; Christophe Geiger, Giancarlo Frosio and Oleksandr Bulayenko, 'Text and Data Mining: Articles 3 and 4 of the Directive 2019/790/EU' (2019) Centre for International Intellectual Property Studies Research Paper No 2019–08, 32. Available at https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3470653 (accessed 1 November 2024).
CDSM Directive arts 3–4.
CDSM Directive art 4(3).
CDSM Directive Recital 18.
Ziaja (n 16), 454.
Artha Dermawan, ‘Text and Data Mining Exceptions in the Development of Generative AI Models: What the EU Member States could Learn from the Japanese “Nonenjoyment” Purposes?’ (2023) 27 The Journal of World Intellectual Property 44, 53.
However, this may change in future; see further Paul Keller and Zuzanna Warso, Defining Best Practices of Opting Out of ML Training (Open Future 2023).
Ziaja (n 16), 456.
17 USC s 107.
Daniel Gervais, ‘A Social Utility Conception of Fair Use’ (2022) Vanderbilt Law Research Paper 22–35, 3.
Campbell v Acuff-Rose Music Inc 510 US 569 (1994).
Authors Guild v HathiTrust 755 F 3d 87 (2d Cir 2014).
Authors Guild v Google 804 F 3d 202 (2d Cir 2015).
ibid.
Japan, Amendment of the Copyright Act 2018, art 30–4. As Tatsuhiro Ueno notes, this exception is broader than that found in the CDSM Directive since it applies both to commercial and non-commercial uses, does not permit opt-outs from rightsholders, permits exploitation ‘by any means’, and does not require ‘lawful access’. This is balanced by the fact that the exception does not apply if the exploitation ‘would unreasonably prejudice the interests of the copyright owner.’ Tatsuhiro Ueno, ‘The flexible copyright exception for “non-enjoyment” purposes – recent amendments in Japan and its implication’ (2021) 70 GRUR International 145.
Singapore, Copyright Act 2021, s 244(2)(a). See further David Tan, ‘Designing a Future-Ready Copyright Regime in Singapore: Quick Wins and Missed Opportunities’ (2021) 70 GRUR International 1131.
Yudong Chen, ‘The Legality of Artificial Intelligence’s Unauthorised Use of Copyrighted Materials under China and U.S. Law’ (2023) 63 IDEA 241, 260.
Israel, Ministry of Justice, Opinion: Uses of Copyrighted Materials for Machine Learning (2022), 3.
UK, Copyright, Designs and Patents Act 1988, s 29A.
Berne Convention (1886) art 9(2). For further discussion on the Berne three-step test, especially as it applies to TDM provisions, see Eleonora Rosati, ‘No Step-free Copyright Exceptions: The Role of the Three-step in Defining Permitted Uses of Protected Content (Including TDM for AI-Training Purposes)’ (2024) 46 European Intellectual Property Review 262.
Saliltorn Thongmeensuk, ‘Rethinking Copyright Exceptions in the Era of Generative AI: Balancing Innovation and Intellectual Property Protection’ (2024) The Journal of World Intellectual Property 278, 284.
For example, under the fourth element of the fair use test in the USA. Similarly, the Japanese exception only applies if the use in question would not unreasonably prejudice the interests of the copyright owner; Amendment of the Copyright Act 2018, art 30–4.
De la Durantaye (n 38), 7.
Alexander Peukert, ‘Copyright in the Artificial Intelligence Act – A Primer’ (2024) 73 GRUR International 497.
European Commission, ‘Proposal for a Regulation of the European Parliament and of the Council laying down harmonised rules on Artificial Intelligence (Artificial Intelligence Act) and amending certain Union legislative acts’ COM/2021/206 final.
See eg Communia, ‘Policy paper #15 on using copyrighted works for teaching the machine’ (Communia, 26 April 2023), Available at https://communia-association.org/policy-paper/policy-paper-15-on-using-copyrighted-works-for-teaching-the-machine/ (accessed 1 November 2024); Authors’ Rights Initiative, ‘Call for Safeguards Around Generative AI’ (Authors’ Rights Initiative, 19 April 2023), Available at https://urheber.info/diskurs/call-for-safeguards-around-generative-ai (accessed 1 November 2024).
Amendments adopted by the European Parliament on 14 June 2023 on the proposal for a regulation of the European Parliament and of the Council on laying down harmonised rules on artificial intelligence (Artificial Intelligence Act) and amending certain Union legislative acts, art 28b(4)(a).
João Pedro Quintais, ‘Generative AI, copyright and the AI Act’ (Kluwer Copyright Blog, 9 May 2023), Available at https://copyrightblog.kluweriplaw.com/2023/05/09/generative-ai-copyright-and-the-ai-act/ (accessed 1 November 2024); de la Durantaye (n 38), 16–17.
AI Act art 3(3).
ibid art 3(63).
AI Act Recital 97.
ibid 107.
ibid.
ibid 105.
ibid.
AI Act art 53(1)(c).
AI Act Recital 104.
Peukert (n 86), 504.
ibid 507.
AI Act Recitals 104–108.
AI Act art 113(b).
AI Act art 56.
CDSM Directive art 4(2).
Rossana Ducato and Alain Strowel, ‘Ensuring Text and Data Mining: Remaining Issues with the EU Copyright Exceptions and Possible Ways Out’ (2021) 43 European Intellectual Property Review 322, 328.
The Economist, ‘Europe, a Laggard in AI, Seizes the Lead in its Regulation’ (The Economist, 10 December 2023), Available at https://www.economist.com/europe/2023/12/10/europe-a-laggard-in-ai-seizes-the-lead-in-its-regulation (accessed 1 November 2024).
AI Act Recital 106.
However, it is important to keep in mind the memorization issue discussed in s X, above. The CJEU has established that the reproduction of very short excerpts of a work (such as an 11-word headline in Infopaq C-5/08 and a two-second music clip in Pelham C-476/17) can still amount to copyright infringement. As such, if even minor remnants of the training data are somehow retained in the final AI model, this is likely to cause major issues under EU law; Szkalej and Senftleben (n 45), 9.
Peukert (n 86), 505–06.
Lutz Riede, Oliver Talhoff and Matthias Hofer, 'The AI Act: Calling for Global Compliance with EU Copyright?' (Freshfields Bruckhaus Deringer Technology Quotient, 5 April 2024), Available at https://technologyquotient.freshfields.com/post/102j4jw/the-ai-act-calling-for-global-compliance-with-eu-copyright (accessed 1 November 2024); Maureen Daly and Sarah Power, 'European Council prepares for debate on copyright under AI Act' (Pinsent Masons Out-Law, 15 July 2024), Available at https://www.pinsentmasons.com/out-law/news/eu-council-prepares-for-debate-on-copyright-under-ai-act (accessed 1 November 2024); Christian Frank and Gregor Schmid, 'AI, the Artificial Intelligence Act & Copyright' (Taylor Wessing, 13 May 2024), Available at https://www.taylorwessing.com/en/insights-and-events/insights/2024/05/ai-act-und-copyright (accessed 1 November 2024).
João Pedro Quintais, 'Generative AI, Copyright and the AI Act' (2024) SSRN, 13. Available at https://ssrn.com/abstract=4912701 (accessed 1 November 2024).
Martin Senftleben, 'AI Act and Author Remuneration – A Model for Other Regions?' (2024) SSRN, 10. Available at https://ssrn.com/abstract=4740268 (accessed 1 November 2024).
ibid.
Stanley Besen, ‘An Economic Analysis of the Artificial Intelligence-Copyright Nexus’ (2023) TechREG Chronicle 3, 8.
See further Daniel Gervais (ed), Collective Management of Copyright and Related Rights (2nd edn, Kluwer Law International 2010).
Besen (n 115), 8.
de la Durantaye (n 38), 11.
Besen (n 115), 8.
ibid.
Javier Espinoza, ‘Europe’s Rushed Attempt to Set the Rules for AI’ (Financial Times, 16 July 2024), Available at https://www.ft.com/content/6cc7847a-2fc5-4df0-b113-a435d6426c81 (accessed 1 September 2024).
Senftleben (n 113), 12.
ibid.
Lopatto (n 9).
Ziaja (n 16), 455.
Quintais (n 90), 17.
Lopatto (n 9).
Alternatives to such mechanisms have been proposed. For example, Martin Senftleben has suggested that instead of requiring prior authorization from rightsholders or providing an opt-out, AI developers could be required to pay a compulsory levy for the use of copyrighted works—which would ensure that compensation was passed on to individual authors while avoiding the transaction costs associated with managing individual opt-outs and licences. See further Martin Senftleben, 'Generative AI and Author Remuneration' (2023) 54 International Review of Intellectual Property and Competition Law 1535.
Alex Reisner, ‘Generative AI Is Challenging a 234-Year-Old Law’ (The Atlantic, 29 February 2024), Available at https://www.theatlantic.com/technology/archive/2024/02/generative-ai-lawsuits-copyright-fair-use/677595 (accessed 1 November 2024).
See further David Collingridge, The Social Control of Technology (Frances Pinter 1980).
Author notes
Lecturer in Law, School of Law, Ulster University, Belfast, UK.