Training LLMs: Questions Rise Over AI Auto Opt-In by Vendors
Few Restrictions Appear to Exist, Provided Companies Behave Transparently
Can individuals' personal data and content be used by artificial intelligence firms to train their large language models without requiring users to opt in?
That question is being asked more often as generative AI tools continue their rapid rise. Many users are noticing terms and conditions that - by default - appear designed to feed the ravenous LLMs required to power gen AI systems. Cue ethical and privacy concerns.
"Just discovered that slack enterprise customers are automatically opted in to have their data used to train slack's global #llms (likely salesforce's xgen models)," Khash Kiani, head of trust, security and IT at gen AI customer engagement firm ASAPP in San Francisco, said in a recent LinkedIn post. "Auto-opting in paying customers into training their global llm is uncalled for!"
Slack, owned by Salesforce, is one of a number of firms - including Adobe, Amazon Web Services, Google Gemini, LinkedIn, OpenAI and many more - that have terms and conditions stating that by default, they can use customers' data and interactions with their services to train their LLMs.
"If you want to exclude your Customer Data from helping to train Slack global models, you can opt out," says Slack's privacy policy. "If you opt out, Customer Data on your workspace will only be used to improve the experience on your own workspace and you will still enjoy all of the benefits of our globally trained AI/ML models without contributing to the underlying models."
Slack said the privacy policy includes changes made Friday, "to better explain the relationship between customer data and generative AI in Slack." Slack said the LLMs underpinning its messaging spaces - called channels - have no direct access to customers' messages or files, and "we do not build or train these models in such a way that they could learn, memorize or be able to reproduce any customer data of any kind."*
Slack said its LLMs do not have access to this metadata, but "Slack's global machine learning models do, and this is what customers can request to opt out of." The company said a completely separate product, Slack AI, which customers must purchase as an add-on, is based on LLMs but they "do not have access to any type of customer data at any point, and Slack does not train them on customer data."**
Not everyone opts users in by default. Microsoft says that while interactions between a user and its Copilot chatbot are stored to keep track of their content, "the data is encrypted while it's stored and isn't used to train foundation LLMs, including those used by Microsoft Copilot for Microsoft 365."
Legal and privacy experts say organizations need to carefully document what they're doing, detail how it complies with relevant privacy regulations - including the General Data Protection Regulation in Europe - and above all, be transparent with users.
"Regulators in a number of GDPR countries including Italy and Spain have been looking at transparency with training data, and this is on their radar," said attorney Jonathan Armstrong, a partner at London-based Punter Southall Law.
In March, Italy's Data Protection Authority sent its latest request to OpenAI, this one pertaining to the company's upcoming text-to-video generator, Sora, asking how it trains the algorithm, what training data it collects and from which sources, and whether it collects certain categories of data pertaining to "religious or philosophical beliefs, political opinions, genetic data, health, sexual life."
"Organizations who use these technologies must be clear with their users about how their information will be processed," said U.K. Information Commissioner John Edwards in a speech last week at the New Scientist Emerging Technologies Summit in London. "It's the only way that we continue to reap the benefits of AI and emerging technologies."
Whether opting users in by default complies with GDPR remains an open question. "It's hard to think how an opt-out option can work for AI training data if personal data is involved," Armstrong said. "Unless the opt-out option is really prominent - for example, clear on-screen warnings; burying it in the terms and conditions won't be enough - that's unlikely to satisfy GDPR's transparency requirements."
Clear answers have yet to arrive. "Many privacy leaders have been grappling with questions around topics such as transparency, purpose limitation and grounds to process in relation to the use of personal data in the development and use of AI," said law firm Skadden, Arps, Slate, Meagher & Flom LLP, commenting on the ICO's response to a U.K. government request that domestic regulators detail their approach to AI. "The ICO does not give any specific answers to these questions."
Instead, the Information Commissioner's Office has designated AI and emerging technology as one of its key areas of focus for the coming year. The regulator is currently running consultations with industry on topics including how to reconcile people's privacy rights with how generative AI models are trained and used, and who's responsible.
"The question of who is responsible for generative AI outputs is a very hot topic," said Camilla Winlo, head of data privacy at consultancy Talan, in a recent blog post. "Generative AI is designed to behave very flexibly and to 'learn' and change over time. By design, developers will not know how the tool will be used. However, users are unlikely to know how the tool works."
Other open questions include whether the EU's "right to be forgotten" is compatible with how LLMs are built (see: RSAC Cryptographers' Panel Tackles AI, Post-Quantum, Privacy).
Both China and the United States appear to face few - if any - such restrictions, unlike Europe, said cryptographer Adi Shamir - the "S" in the RSA cryptosystem. "I'm quite pessimistic about the possibility of legally developing large language models in Europe, unless you break the law," he said earlier this month at RSA Conference.
Having your data used to train someone else's LLM isn't a risk-free endeavor. Many CISOs rightly remain worried that intellectual property, classified information, regulated data - including people's personally identifiable information - and other sensitive data could end up in someone else's LLM. Once there, the information might get served up to other users, shredding secrets and violating privacy rights, at least in countries that have them.
Given these risks and the volume of personal information it handles, Britain's Department for Work and Pensions has banned employees and contractors from using ChatGPT and its ilk, as have a number of other organizations.
That doesn't mean organizations with secret or private information won't adopt gen AI tools. If or when they do, it's more likely to be based on a private chatbot outsiders can't access.
Even so, one challenge for CISOs will be keeping track of all the different AI services employees might be using - an impossible task, beset by shadow IT concerns. Experts say technical controls, such as data loss prevention software and blocking or filtering of sites and services, can help.
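For illustration only, here is a minimal sketch of what the blocking-or-filtering approach could look like: a hypothetical egress check that flags requests to known gen AI domains. The domain list and function name below are assumptions, not any vendor's actual tooling; real deployments would typically rely on commercial DLP, proxy or DNS policy products rather than custom scripts.

```python
# Illustrative sketch only: flag outbound requests to known gen AI services.
# The blocklist and helper below are hypothetical examples, not real policy.

from urllib.parse import urlparse

# Hypothetical set of gen AI domains an organization might choose to restrict.
GEN_AI_BLOCKLIST = {
    "chat.openai.com",
    "gemini.google.com",
    "claude.ai",
}

def is_blocked(url: str) -> bool:
    """Return True if the URL's host matches, or is a subdomain of, a listed domain."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == domain or host.endswith("." + domain) for domain in GEN_AI_BLOCKLIST)

if __name__ == "__main__":
    for url in ("https://chat.openai.com/c/abc123", "https://example.com/report"):
        print(url, "->", "blocked" if is_blocked(url) else "allowed")
```

A static blocklist only goes so far, since new gen AI services appear constantly - which is part of the shadow IT problem experts describe - so such filtering is usually paired with DLP monitoring and user training rather than relied on alone.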
Armstrong said, "Training will be at the heart of this too - telling everyone in the organization the risks and opportunities with AI and giving them a way of managing risk," as well as educating the board so they can assess the risk.
As organizations procure gen AI tools - perhaps based on private LLMs - they're going to need to monitor providers for compliance with applicable privacy or security rules. "Contracts and due diligence will be important too - where your organization is working with a provider 'officially' to develop AI apps for you, it will be important to know who you're dealing with and to set out proper terms of engagement," Armstrong said.
*Update May 21, 2024 14:13 UTC: This story has been updated to include Slack's statement.
**Update May 22, 2024 9:08 UTC: This story has been updated to include further clarification from Slack.