LLMs – The Great Data Heist

9th January 2025

Large Language Models (LLMs) are a type of artificial intelligence (AI) that use deep learning algorithms to understand and generate text, images and videos in a way that mimics outputs created by humans, basing their responses on the vast quantities of data they are trained on. This data includes text and code scraped from the internet, licensed information, and user inputs. Outputs can take the form of plain text, code, translations, images, or audio.

While LLMs can generate creative text formats, draft legal documents, or summarise information, they are not designed to be inherently truthful or unbiased and can generate inaccurate or misleading content. The models can produce plausible-sounding but erroneous information, which researchers have termed “careless speech”, and this raises questions about the potential impact of LLMs being used for research, for education, or to inform the public.

Two points are worth bearing in mind. First, many publicly available LLMs are offered on a ‘non-partitioned’ basis, meaning that every user is effectively using the same version of the product. There is no guarantee that their data will be kept separate from the data of others.

Second, while the output of an LLM may appear similar to outputs created by humans, the process behind its creation is very different. A human creates an output by consciously considering what they have been asked to create, and why their response should take a particular form, before giving it. An LLM, by contrast, generates answers by drawing on previous inputs and outputs and performing high-volume guesswork to produce something that resembles seemingly relevant answers that have come before. The LLM does not understand why those previous answers were the way they were; it only offers up a product that feels similar enough to them (and, if asked to explain its output, would perform a similar guessing process to do so). LLM outputs therefore give the impression of considered responses but are not, which leaves open the possibility of ‘hallucinations’ where the LLM’s guesswork is inaccurate.
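
For readers who want a concrete sense of this “high-volume guesswork”, the sketch below is a deliberately toy illustration written in Python (a simple bigram model, not how a production LLM is actually built): it learns which words tend to follow which in a tiny corpus and then strings together a plausible-looking continuation without any understanding of meaning.

    import random
    from collections import Counter, defaultdict

    # Toy "training corpus" standing in for the web-scale data a real LLM is trained on.
    corpus = (
        "the model predicts the next word . "
        "the model does not understand the next word . "
        "the model only predicts what looks plausible ."
    ).split()

    # Count which words tend to follow which (a bigram table - vastly simpler
    # than a real LLM, but the same basic idea: learn the statistics of past text).
    following = defaultdict(Counter)
    for current, nxt in zip(corpus, corpus[1:]):
        following[current][nxt] += 1

    def generate(start: str, length: int = 8) -> str:
        """Repeatedly sample a plausible next word; no understanding involved."""
        words = [start]
        for _ in range(length):
            candidates = following.get(words[-1])
            if not candidates:
                break
            # Pick the next word in proportion to how often it followed before.
            choices, weights = zip(*candidates.items())
            words.append(random.choices(choices, weights=weights)[0])
        return " ".join(words)

    print(generate("the"))

A real LLM replaces this frequency table with a neural network trained on billions of documents, but the underlying task, predicting a plausible next token, is the same.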

The popularity of LLMs has grown significantly in recent years. With a reported annual growth rate of over 90%, the promotion of LLMs has become a key focus for the tech industry. Popular examples include OpenAI’s ChatGPT, Google’s Gemini, and Microsoft’s Copilot. A study by Cyberhaven found that, as of June 2023, almost 11% of employees had used ChatGPT in the workplace, 8.6% had pasted company data into it, and 4.7% had pasted confidential data into it.[1] Users may be unaware that many publicly accessible LLMs are further trained on their input information, and that there is a non-negligible risk of that information resurfacing as part of an answer to a third party’s future, seemingly related, input.[2][3] The legal concerns do not stop there…

IP infringement and data scraping

Because LLMs draw on vast quantities of data and simply predict the next likely word, line of code or addition to an image, they pose a risk to owners of intellectual property rights by potentially infringing their copyright. An LLM is only trying to give a plausible response to an input; in doing so it has little regard for whether its output copies, infringes or mimics previous works, creating a serious risk of infringement of intellectual property rights.

Additionally, there are wide-ranging privacy concerns. OpenAI trained ChatGPT on 300 billion words scraped from books, articles, websites, reviews and social media posts. This library includes personal information obtained without the consent of the individuals it was taken from. OpenAI currently offers no procedure for individuals to check whether their personal information is stored by the company (in its original form or as part of an algorithmic model) or to request its deletion.

GDPR

The European Union’s General Data Protection Regulation (“GDPR”) also requires that personal data is processed on a lawful basis. In practice this would be consent, contractual necessity, or a legitimate interest. However, indiscriminate internet scraping for LLM training is unlikely to meet any of these bases, and there are legitimate concerns (expressed by a range of regulators and commentators) that popular LLMs are therefore non-compliant with the GDPR, which requires a lawful basis such as explicit consent before personal data is processed.[4] Furthermore, the GDPR’s core principle of data minimisation requires companies to limit data collection to what is necessary for a specific purpose. It is not clear that the creation of an LLM is specific enough to meet this requirement. Google has sought to address this challenge for its NotebookLM product by limiting the number of sources the LLM has access to (users feed in a maximum of 20 sources), but the question remains whether that purpose is consistent with the reason the data was collected in the first place.

Where employees and others copy and paste information into an LLM, legacy data loss prevention platforms and other security products that monitor data outflows and highlight potential security risks cannot track confidential data once it has been copied from a file and fed into ChatGPT. The shared sensitive data is then at risk of being intercepted or accessed via malicious software, leading companies into data privacy and confidentiality issues. Companies should have an AI policy in place governing staff use of LLMs, and should update their data security policies to limit liability arising from employees’ use of AI tools, so that issues surrounding data privacy and confidentiality do not spiral into litigation.
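
By way of illustration only, the sketch below shows the kind of simple pre-submission check an internal tool might apply before a prompt is sent to an external LLM. The patterns and the example prompt are hypothetical, and such a script is no substitute for a dedicated data loss prevention product or a properly drafted AI policy.

    import re

    # Hypothetical patterns a company policy might flag before text leaves the
    # organisation; a real deployment would rely on a dedicated DLP product.
    SENSITIVE_PATTERNS = {
        "confidentiality marking": re.compile(r"\b(confidential|internal only|do not distribute)\b", re.I),
        "email address": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
        "UK National Insurance number": re.compile(r"\b[A-CEGHJ-PR-TW-Z]{2}\d{6}[A-D]\b", re.I),
    }

    def check_before_submission(text: str) -> list[str]:
        """Return the names of any sensitive patterns found in the text."""
        return [name for name, pattern in SENSITIVE_PATTERNS.items() if pattern.search(text)]

    prompt = "Please summarise this CONFIDENTIAL board paper for me..."
    findings = check_before_submission(prompt)
    if findings:
        print("Blocked: prompt appears to contain", ", ".join(findings))
    else:
        pass  # only now forward the prompt to the external LLM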

Beyond the legal aspect, there are serious cybersecurity concerns about these tech companies holding this vast amount of personal data, which may (with the best will in the world) be compromised en masse. On 21 March 2023, OpenAI was forced to take ChatGPT offline due to a bug that mislabelled chats in users’ history with the titles of chats from other users, which could have exposed sensitive or confidential information to other ChatGPT users.[5]

The legal ramifications are growing in scale along with the models themselves. For example, GPT-3.5 has roughly 175 billion parameters, whereas its more powerful successor, GPT-4, is rumoured to have around 1 trillion.[6]

Taking steps to regulate

UK

On 5 September 2024, Lord Chancellor Shabana Mahmood signed an international agreement drafted by the Council of Europe that commits the parties to collective action in managing AI products and protecting the public from potential misuse.[7] The overarching objectives of the agreement are to strengthen safeguards against risks to human rights (including personal data), democracy and the rule of law. Countries that have signed up will now monitor AI development and ensure the technology is managed within strict parameters. The treaty will come into force three months after five signatories ratify it.

The King’s Speech in July 2024 hinted at future legislation targeting the “most powerful AI models” but, contrary to expectation, did not expressly mention an AI Bill for the United Kingdom.[8] The current approach appears to be for the Government to encourage established regulators such as the Information Commissioner’s Office (ICO), the Competition and Markets Authority (CMA) and Ofcom to stretch and adapt existing regulatory regimes to address AI-related challenges, favouring a more industry-led approach.

EU

The European Union (EU) has taken a more prescriptive approach through the introduction of the AI Act. The AI Act sets out a tiered risk classification for AI systems. The EU is banning systems deemed to pose unacceptable risks, such as social scoring systems, while placing obligations on providers of high-risk applications such as those used for critical infrastructure. Systems said to pose limited or minimal risk face fewer regulatory requirements. Additionally, the AI Act sets out rules for ‘general-purpose AI’ systems, such as OpenAI’s ChatGPT, which use LLMs for a wide range of purposes. Where these are said to pose “systemic risk”, developers will be required to perform risk assessments and to track and report serious incidents.[9]

What’s Next?

LLMs are an incredible tool that will change the way we work, educate, organise, and search for answers to our most burning questions. However, we have recently witnessed (or realised there has been?) non-consensual, global-scale use of intellectual property and personal data to fuel the development of LLMs to this point, and the next generation of the technology has an unending appetite for more.

Regulation is inevitable, and time will tell whether a legislative approach or a collaborative broadening of existing regulations will strike the balance needed to protect the rights and interests of businesses and individuals without choking off a paradigm-shifting technical innovation.

[1] 11% of data employees paste into ChatGPT is confidential | Cyberhaven

[2] Does ChatGPT Save My Data? OpenAI’s Privacy Policy Explained (tech.co)

[3] https://www.cyberhaven.com/blog/4-2-of-workers-have-pasted-company-data-into-chatgpt

[4] ChatGPT and Data | Avast

[5] ChatGPT Bug Exposed Some Subscribers’ Payment Info – CNET

[6] GPT 3.5 vs. GPT 4: What’s the Difference? (howtogeek.com)

[7] UK signs first international treaty addressing risks of artificial intelligence – GOV.UK (www.gov.uk)

[8] Labour sets out its stall – UK AI, data and cyber security legislation announced in King’s Speech (taylorwessing.com)

[9] What We Know About the New U.K. Government’s Approach to AI | TIME