Generative AI – an intellectual property minefield
Guest blog by Linklaters
Among the first generative AI platforms to go viral were text-to-image generators, which create images from text prompts in seconds. Some of the most popular are OpenAI’s DALL-E 2, Stability AI’s Stable Diffusion, and Midjourney, the latter producing outputs so photorealistic that they led the internet to believe (incorrectly) that Pope Francis wears Balenciaga and that Donald Trump had dramatically resisted arrest.
Then came ChatGPT, OpenAI’s AI-powered chatbot that can provide detailed and convincing (but not always correct) text responses to just about any question. ChatGPT is hugely popular. Released amid a media firestorm in November 2022 and free to users, it quickly became the fastest-growing consumer app ever, clearing one million users in its first five days and 100 million monthly active users within two months of launch.
Commercial and industrial applications of generative AI are also growing exponentially. In January, Microsoft announced a US$10 billion investment into OpenAI and subsequently introduced ChatGPT-type technology into its Bing search engine on Windows 11. Google swiftly responded, introducing Bard, its answer to ChatGPT, in February, before announcing at its annual I/O conference in May that it will start to infuse its own search results with generative AI technology. Companies from Meta to Coca-Cola are also experimenting with this technology in myriad ways. Even at this halfway point, it seems safe to conclude that 2023 will be the year of generative AI.
However, there are many reasons for adopters to tread carefully, not least because generative AI remains a minefield of intellectual property (IP) law issues.
Inputs - AI training data and IP
Generative AI requires vast amounts of high-quality data to learn. Data is typically sourced from the internet (directly or indirectly), often without permission, and inevitably in a manner that creates copies.
OpenAI has said that its DALL-E 2 image generator is trained on around 650 million images, from a mix of publicly available and licensed sources. ChatGPT claims to have been trained on approximately 45 terabytes of text from books, articles, websites and other sources. According to McKinsey, that’s about one million feet of bookshelf space.
OpenAI’s datasets have not been made public. Stability AI, on the other hand, has been transparent about Stable Diffusion’s training data. It comprises over 2 billion images from datasets collated by LAION and originally scraped from the internet. While it is hard to meaningfully interrogate such huge quantities of data, analysis of a sample of 12 million images found that half were sourced from approximately 100 domains, with the largest number coming from Pinterest, and most of the rest from user-generated content platforms, shopping and stock image sites.
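For readers curious how such an analysis works in practice, the sketch below shows one way to count source domains in a sample of image URLs. It is a minimal illustration only: it assumes a CSV export of URLs from a LAION-style dataset, and the file name and column name are hypothetical, not the datasets’ actual schema.

```python
# A minimal sketch of the domain analysis described above, assuming a
# CSV of image URLs sampled from a LAION-style dataset. The file name
# "laion_sample.csv" and the column name "url" are hypothetical.
from collections import Counter
from urllib.parse import urlparse

import pandas as pd

sample = pd.read_csv("laion_sample.csv")

# Tally images by source domain (e.g., "www.pinterest.com")
domains = Counter(urlparse(u).netloc for u in sample["url"].dropna())

# What share of the sample do the top 100 domains account for?
top_100_total = sum(count for _, count in domains.most_common(100))
print(f"Top 100 domains cover {top_100_total / len(sample):.0%} of the sample")
```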
Whether or not this sort of copying infringes copyright (and/or other IP rights) in the source material may depend on where the copying takes place. Moreover, the relevant law is, in many jurisdictions, unclear, in flux, or both.
- In the EU, using text and data mining (TDM) methods for commercial purposes on lawfully accessed works is permitted, provided that the rightsholder has not opted out “in an appropriate manner, such as machine-readable means” (e.g., using a robots.txt file - see the sketch after this list). However, having legislated in this AI-friendly manner in 2019 (with an implementation deadline of 2021), the EU is now finding other ways to ensure an appropriate balance between AI companies and rightsholders. For example, under the EU Parliament’s latest proposal, the EU’s AI Act will require AI providers to publish a detailed summary of any copyright-protected content used in their datasets – a move intended to safeguard against misuse.
- In the UK, TDM is not currently permitted, save for non-commercial research purposes. A proposal to permit commercial TDM (with no opt-out) was axed in February following a forceful backlash from the creative industries, and recent UK Government communications indicate that the current plan involves no new TDM exception (of any sort), but rather a code of practice aimed at improving the IP licensing environment.
- In the US, following the Google Books case, it is generally considered that the US “fair use” defence to copyright infringement may permit TDM. However, this AI-friendly outcome is far from certain: at least three lawsuits currently working their way through the US courts (two against Stability AI and another relating to the Microsoft-owned GitHub Copilot programming tool) allege that training AI models on publicly available works constitutes copyright infringement. Watch this space.
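On the machine-readable opt-out point, a robots.txt file is the mechanism most often cited, although whether any particular form of opt-out satisfies the EU requirement has not been definitively settled. The sketch below blocks Common Crawl’s CCBot - whose web-scale datasets are widely used for AI training - while leaving ordinary search crawlers untouched; the choice of user agents to block is an assumption that will vary by provider.

```
# Sketch of a robots.txt opt-out. CCBot is Common Crawl's crawler,
# whose datasets are widely used for AI training; other AI-specific
# user agents vary by provider and should be checked individually.
User-agent: CCBot
Disallow: /

# All other crawlers (e.g., search engines) remain unrestricted
User-agent: *
Disallow:
```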
The current position may therefore favour AI companies over rightsholders to some extent, as they can choose the most tech-friendly regulatory environments in which to (lawfully) train their AI models before rolling them out internationally. Moreover, the “black box” nature of most AI models makes enforcement by rightsholders challenging. However, legislators are scrambling to regulate in this space to find the right balance between AI companies and IP rightsholders - and there is a lot of change on the horizon.
Outputs - AI-generated content and IP
The primary risk for end users of generative AI is that its outputs may not comprise entirely new content but instead reproduce the whole or substantial parts of existing copyright works.
In some cases, this will be relatively clear. For example, as recently as March of this year, ChatGPT provided answers to our questions reproducing large chunks of various copyright works, including novels and song lyrics. At the time of writing, however, the same questions (generally) result in a refusal by ChatGPT on the grounds that doing so “would infringe upon the rights of the original author”. Presumably the recent flurry of litigation around generative AI and copyright has brought this issue to the fore for OpenAI.
Other outputs from ChatGPT (and similar chatbots) that reproduce significant amounts of third party content may be less easy to spot. The fewer the data inputs in relation to any topic, the more likely the chatbot is to reproduce the material on which it was trained rather than generalised principles from multiple inputs - and there will be no way for a user to identify when this is the case.
Similarly, it will be near impossible for the user of a text-to-image generator to assess the similarity of any output to any image within the training dataset.
Other IP rights are also at play. Stock image provider Getty Images has recently brought IP infringement proceedings in the UK High Court against Stability AI, alleging that its images have been used to train Stable Diffusion without authorisation, infringing its copyrights and database rights; that some of Stable Diffusion’s outputs infringe Getty’s copyrights; and that Stable Diffusion’s outputs recreate and infringe Getty’s trade marks.
Proceed with caution…
While many have significant reservations about AI’s inexorable march, generative AI technology has clear benefits and usage is increasing exponentially.
Adopters should, however, proceed with caution. At the very least, users should:
- Read the T&Cs. What do the T&Cs say about ownership of IP (if any) in any outputs and any restrictions on its use, e.g., use for commercial purposes? How is liability apportioned and what liabilities are you taking on by using the platform?
- Be careful not to input any confidential information. Text prompts and other AI inputs are typically logged by the AI platform and become accessible to the AI provider, with no clear obligation of confidentiality (although some platforms are introducing privacy features and/or subscription versions with more stringent rules). This is likely to become increasingly problematic for businesses as AI grows more powerful. OpenAI’s GPT-4 model, for example, accepts text prompts of up to 25,000 words, making it increasingly likely that employees will be tempted to feed it significant quantities of potentially sensitive and valuable data on which to perform tasks.
- Carefully assess the risks arising out of any proposed uses of AI outputs. Any reproduction of those outputs risks copyright infringement, so the safest course is to use them for information or inspiration - as a starting point, not a final product.