When OpenAI released GPT-3 in July 2020, it offered a glimpse of the data used to train the large language model. Millions of pages scraped from the web, Reddit posts, books, and more are used to create the generative text system, according to a technical paper. Scooped up in this data is some of the personal information you share about yourself online. This data is now getting OpenAI into trouble.
On March 31, Italy’s data regulator issued a temporary emergency decision demanding OpenAI stop using the personal information of millions of Italians that’s included in its training data. According to the regulator, Garante per la Protezione dei Dati Personali, OpenAI doesn’t have the legal right to use people’s personal information in ChatGPT. In response, OpenAI has stopped people in Italy from accessing its chatbot while it provides responses to the officials, who are investigating further.
The action is the first taken against ChatGPT by a Western regulator and highlights privacy tensions around the creation of giant generative AI models, which are often trained on vast swathes of internet data. Just as artists and media companies have complained that generative AI developers have used their work without permission, the data regulator is now saying the same for people’s personal information.
Similar decisions could follow all across Europe. In the days since Italy announced its probe, data regulators in France, Germany, and Ireland have contacted the Garante to ask for more information on its findings. “If the business model has just been to scrape the internet for whatever you could find, then there might be a really significant issue here,” says Tobias Judin, the head of international at Norway’s data protection authority, which is monitoring developments. Judin adds that if a model is built on data that may be unlawfully collected, it raises questions about whether anyone can use the tools legally.
Italy’s blow to OpenAI also comes as scrutiny of large AI models is steadily increasing. On March 29, tech leaders called for a pause on the development of systems like ChatGPT, fearing its future implications. Judin says the Italian decision highlights more immediate concerns. “Essentially, we’re seeing that AI development to date could potentially have a massive shortcoming,” Judin says.
The Italian Job
Europe’s GDPR rules, which cover the way organizations collect, store, and use people’s personal data, protect the data of more than 400 million people across the continent. This personal data can be anything from a person’s name to their IP address—if it can be used to identify someone, it can count as their personal information. Unlike the patchwork of state-level privacy rules in the United States, GDPR’s protections apply if people’s information is freely available online. In short: Just because someone’s information is public doesn’t mean you can vaccuum it up and do anything you want with it.
Italy’s Garante believes ChatGPT has four problems under GDPR: OpenAI doesn’t have age controls to stop people under the age of 13 from using the text generation system; it can provide information about people that isn’t accurate; and people haven’t been told their data was collected. Perhaps most importantly, its fourth argument claims there is “no legal basis” for collecting people’s personal information in the massive swells of data used to train ChatGPT.
“The Italians have called their bluff,” says Lilian Edwards, a professor of law, innovation, and society at Newcastle University in the UK. “It did seem pretty evident in the EU that this was a breach of data protection law.”