17/10/24

AI and copyright: exploring exceptions for text and data mining

The rapid advancement of generative AI raises significant copyright concerns, particularly around the use of copyright-protected content for training AI models through text and data mining (TDM). AI developers must determine whether their TDM activities require rights holders’ authorisation or whether they can rely on copyright exceptions. Meanwhile, rights holders need to manage their online content and may need to explicitly reserve their rights to prevent AI developers from using that content for TDM purposes.

In a previous Law-Now article, we addressed the issue of copyright infringement when using protected material for training generative AI tools. This article will guide you through the complexities of copyright exceptions in the EU, with a reference to US law.

1. Copyright exceptions

In Belgium, where copyright is largely harmonised at the European level, there is a closed system of exceptions to the exclusive rights of authors. This means that copyrighted material may be used without the rights holder’s authorisation only in specific, legally defined cases. The relevant exceptions for TDM are laid down in the InfoSoc Directive (Directive 2001/29/EC) and the DSM Directive (Directive (EU) 2019/790), which have been transposed into the national laws of EU member states.

1.1 Text and data mining

As discussed in our previous Law-Now article, the DSM Directive introduced two TDM exceptions that allow the reproduction and extraction of lawfully accessible copyrighted content without the rights holder’s prior consent. The first (Article 3) covers TDM for scientific research by research organisations and cultural heritage institutions. The second (Article 4) covers TDM for any purpose, including commercial purposes, provided that rights holders have the possibility to opt out and have not done so.

Despite some controversy, the AI Act (Regulation (EU) 2024/1689) has explicitly declared these TDM exceptions to be applicable to general-purpose AI models. The exceptions and their limitations apply to all providers placing a general-purpose AI model on the EU market, irrespective of where the copyright-related activities involved in training these models take place.

Both exceptions require lawful access to the protected content, which can be obtained through various means, such as licensing agreements, subscriptions or open access. Copyright-protected content that is freely available online can therefore, in principle, be extracted and used for commercial purposes such as training an AI model, unless the rights holder has reserved its rights.

Rights holders can, however, reserve their rights over their copyrighted works to prevent TDM, except when it is conducted for scientific research purposes. If rights have been explicitly reserved through an appropriate opt-out mechanism (i.e. by machine-readable means for online content), developers of AI models must obtain authorisation from the rights holders to perform TDM on such works.

The lack of clear guidance on how rights holders can effectively opt out creates legal uncertainty for both sides. Rights holders are unsure how to reserve their rights in accordance with the DSM Directive (e.g. is a reservation in the general terms and conditions sufficient?), while AI developers are uncertain about which content is available for TDM. Standardising the opt-out process is therefore crucial, and some organisations have already engaged in standardisation activities (e.g. the “TDM Reservation Protocol” or the “TDM AI Protocol”). The Regional Court of Hamburg offered some clarification in Kneschke v. LAION (judgment of 27 September 2024), indicating that an opt-out expressed in natural language in a website’s terms of use would have been sufficient for the rights holder to exclude the commercial TDM exception.
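To make the notion of a machine-readable opt-out more concrete, the sketch below shows how an AI developer's crawler might look for a rights reservation in the HTTP response headers of a page before mining it. It is a minimal illustration under stated assumptions, not a compliance tool. The header names (`tdm-reservation` and `tdm-policy`) reflect our understanding of the TDM Reservation Protocol and should be verified against the current version of that specification; the function names are hypothetical.

```python
from typing import Mapping, Optional

# Assumed header names, based on our reading of the TDM Reservation Protocol.
# Check them against the current specification before relying on this sketch.
RESERVATION_HEADER = "tdm-reservation"
POLICY_HEADER = "tdm-policy"


def _normalise(headers: Mapping[str, str]) -> dict:
    """Lower-case header names and trim values (HTTP header names are case-insensitive)."""
    return {name.lower(): value.strip() for name, value in headers.items()}


def tdm_rights_reserved(headers: Mapping[str, str]) -> Optional[bool]:
    """Interpret the reservation header of an HTTP response.

    Returns True if rights are reserved ("1"), False if the header
    explicitly signals no reservation ("0"), and None if the headers
    carry no recognisable signal.
    """
    value = _normalise(headers).get(RESERVATION_HEADER)
    if value == "1":
        return True
    if value == "0":
        return False
    return None


def tdm_policy_url(headers: Mapping[str, str]) -> Optional[str]:
    """Return the URL of the rights holder's TDM policy, if one is declared."""
    return _normalise(headers).get(POLICY_HEADER)


if __name__ == "__main__":
    # Hypothetical response headers for illustration only.
    example = {
        "Content-Type": "text/html",
        "TDM-Reservation": "1",
        "TDM-Policy": "https://example.com/tdm-policy.json",
    }
    print(tdm_rights_reserved(example), tdm_policy_url(example))
```

A result of `None` only means that no signal was found in the HTTP headers. As the Hamburg decision suggests, a reservation may also be expressed elsewhere, for example in a website's terms of use, so the absence of a header should not be read as permission to mine the content.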

1.2 Temporary reproduction

The InfoSoc Directive includes another important exception for temporary reproductions. In the Infopaq I and II cases, the CJEU clarified that temporary acts of reproduction during “data capture” processes can fall under this exception, provided that five cumulative and strictly interpreted conditions are met. The reproduction:

  • is temporary;
  • is of a transient or incidental nature;
  • constitutes an integral and essential part of a technological process;
  • has the sole purpose of enabling a lawful use of a work; and
  • has no independent economic significance.

Any exception to copyright must pass the “three-step test”, which stipulates that exceptions may only be applied in “certain special cases”, provided that they do “not conflict with the normal exploitation of works” and do “not unreasonably prejudice the legitimate interests of the rights holder”.

AI developers need to meticulously evaluate whether their TDM activities, or any other use of protected content for training their AI models, meet these conditions.

Although the CJEU has not yet ruled on applying this exception to TDM for training AI models, the data capture process in the Infopaq cases shares similarities with modern TDM approaches. In Infopaq, newspaper articles were scanned and searched to create news alerts based on keywords. However, meeting the exception’s conditions in TDM processes can be challenging due to their strict interpretation. The economic significance of TDM for AI training is greater than that of the data capture in Infopaq, and its scale makes it harder to argue that the use is ‘temporary’ and ‘incidental’. As such, the CJEU may find that TDM for AI training conflicts with the normal exploitation of works and unreasonably prejudices the legitimate interests of the rights holder.

1.3 Fair use doctrine in the US

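In contrast to the EU’s closed system of exceptions, the US has an open standard known as the fair use doctrine. Under this doctrine, courts weigh four statutory factors to determine whether a particular use of a copyrighted work is fair and does not require the rights holder’s consent: the purpose and character of the use, the nature of the copyrighted work, the amount and substantiality of the portion used, and the effect of the use on the potential market for the work.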

It appears that developers of AI tools might establish a fair use defence only if the purpose of using copyrighted material as training data is significantly different from the purpose of the original works. The courts must determine whether the purpose of training data is different enough from that of the original copyrighted works to justify fair use, or whether using copyrighted works as training data inherently constitutes copyright infringement.

2. Conclusion

Navigating the intersection of copyright and generative AI is complex, particularly when it comes to TDM for training AI models. Companies must carefully assess whether their TDM activities require authorisation from rights holders or if they can rely on copyright exceptions. In the EU, the DSM Directive provides specific exceptions for TDM, but these come with conditions and the need for lawful access to the content. Rights holders can reserve their rights to prevent TDM, adding another layer of complexity.

This article was written by Valeska De Pauw, with the valued assistance of AI. It is part of the series “AI and intellectual property rights”, written by the IP lawyers at CMS in Belgium. All the articles are available on our website.
