
Can You Train an LLM on CC BY-SA Content?

While Wikipedia's content is freely available under a CC BY-SA license, the license's requirements around attribution, share-alike distribution, and its prohibition on restrictive legal or technological measures pose major hurdles for LLM developers. Most companies rely on the fair use doctrine instead, but this approach carries risks.
tl;dr

Without relying on the fair use doctrine, training LLMs on Wikipedia content requires compliance with CC BY-SA requirements.

These licensing requirements present three major challenges for LLM training:

  • mandatory attribution for outputs,
  • share-alike provisions that could require open-sourcing model weights, and
  • restrictions that conflict with safety guardrails.

Current LLM technology cannot reliably meet these requirements. While many developers rely on fair use, this remains legally untested.

Can you train a large language model on Wikipedia or other CC BY-SA content? As I discussed previously, Wikimedia’s answer is no. But what if you do your own assessment of the licensing terms?

For context, Wikipedia mandates a CC BY-SA license for all uploaded content. Technically, there’s a dual license with the GNU Free Documentation License, but let’s save that for later. (It turns out that when we don’t solve an issue, ignoring it for a decade often doesn’t make it go away.)

And of course, you can’t have a post like this without the usual disclaimer: this article is for informational purposes only and does not constitute legal advice. While I draw from my own research and experience training a model, every business situation is unique. For specific legal guidance tailored to your circumstances, please consult with a qualified attorney.

Analyze Compliance Requirements

When we trained our model at 273 Ventures, we kept a detailed inventory of everything that we collected, including our analysis of the legal basis to use it. Our own take was that Wikipedia was a non-starter due to its copyleft licensing.
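To make that concrete, here is a minimal sketch of what one row of such an inventory might look like. The DatasetInventoryRecord class and its field names are hypothetical illustrations, not the actual schema we used; the point is simply that the license and legal basis travel with every source from the moment it is collected.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DatasetInventoryRecord:
    """One row in a training-data inventory: what was collected and why we believe we can use it."""
    source_name: str          # e.g., "Wikipedia (English)"
    source_url: str           # where the snapshot or dump came from
    license_id: str           # SPDX-style identifier, e.g., "CC-BY-SA-4.0"
    legal_basis: str          # "license", "public-domain", "fair-use", ...
    basis_analysis: str       # short written rationale for the claimed basis
    collected_on: date = field(default_factory=date.today)
    included_in_training: bool = False


# A record like this is how a copyleft source gets flagged before training starts:
wikipedia = DatasetInventoryRecord(
    source_name="Wikipedia (English)",
    source_url="https://dumps.wikimedia.org/",
    license_id="CC-BY-SA-4.0",
    legal_basis="license",
    basis_analysis="Copyleft; attribution and share-alike obligations cannot be met at inference time.",
    included_in_training=False,
)
```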

As I mentioned earlier, Wikipedia content is licensed CC BY-SA, which means that the material can be redistributed, remixed, transformed, and copied - even for commercial purposes. But it also means that you MUST:

  • Provide Attribution: The CC BY-SA 4.0 specifies that “If supplied, you must provide the name of the creator and attribution parties, a copyright notice, a license notice, a disclaimer notice, and a link to the material.”
  • Share-alike: Under CC BY-SA 4.0 “If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.”
  • Not Restrict Permitted Activities: You also MAY NOT “apply legal terms or technological measures that legally restrict others from doing anything the license permits.”

So, if you train an LLM on Wikipedia material, you need to provide attribution with every output. You need to distribute any transformed or built-upon material under a CC BY-SA 4.0 license. And your model’s terms of use can’t restrict others from exercising the rights CC BY-SA grants them (which, remember, cover pretty much anything as long as you comply with the obligations). Let’s look at each of these in a bit more detail.

Attribution

The attribution requirement includes provision of the following information (if it was originally provided by the licensor of the CC BY-SA content):

  • Identification of the creator (can be by pseudonym, if indicated by the licensor)
  • Copyright notice
  • Reference to the CC BY-SA license and indication that the content is licensed under it
  • Disclaimer of warranties
  • Link to the licensed content (to the extent that it is reasonably practicable)

These requirements can be met by providing a URL that points to the required information. Attribution is required even if the licensed content has been modified; if you have made modifications, you must indicate that you have done so (and retain any indications of previous modifications).

Given the nature of LLM inputs and outputs, meeting this requirement would mean the developer can accurately identify which training inputs produced a given output and surface the attribution information supplied with them. That is clearly a tall order, not least because you would also have to ensure that the attribution (and the source itself) was not hallucinated.
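For illustration, here is a minimal sketch of the attribution bundle the license describes and of what surfacing it alongside an output might look like. The CCAttribution class and format_attribution helper are hypothetical, and the sketch quietly assumes the hard part: that you can reliably tie a given output back to the specific source documents behind it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CCAttribution:
    """The attribution bundle CC BY-SA 4.0 asks you to pass along (where the licensor supplied it)."""
    creator: Optional[str]              # author name or pseudonym, as indicated by the licensor
    copyright_notice: Optional[str]
    license_notice: str                 # reference to CC BY-SA 4.0 (or a link to it)
    warranty_disclaimer: Optional[str]
    source_url: Optional[str]           # link to the licensed material, where reasonably practicable
    modified: bool = False              # you must indicate if you changed the material
    prior_modifications: Optional[str] = None  # retain earlier modification notices


def format_attribution(a: CCAttribution) -> str:
    """Render an attribution line that would have to accompany an output derived from the material."""
    parts = []
    if a.creator:
        parts.append(f"By {a.creator}")
    if a.copyright_notice:
        parts.append(a.copyright_notice)
    parts.append(f"Licensed under {a.license_notice}")
    if a.source_url:
        parts.append(f"Source: {a.source_url}")
    if a.modified:
        parts.append("This material has been modified.")
    if a.prior_modifications:
        parts.append(a.prior_modifications)
    if a.warranty_disclaimer:
        parts.append(a.warranty_disclaimer)
    return " | ".join(parts)
```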

Share-Alike

The share-alike element requires that adapted material be licensed CC BY-SA (or another compatible license), with either the text of the license or a link to it. The logical next question is “what’s considered adapted material?” Is it the outputs? The model weights themselves? Again, we end up in another minefield of unanswered questions.

The answer to these questions isn’t a pedantic matter: if the model weights are considered a derivative of the inputs, then the weights must also be licensed CC BY-SA. Calling that a significant impact would be an understatement; there could be no closed-source models trained on Wikipedia.

Restrictions

Adapted material can’t be subject to any additional terms that restrict the exercise of the rights granted under the CC BY-SA (or an equivalent) license. I’m looking at you, Llama 3 community license. Beyond legal terms, adapted material also can’t be subject to technological measures that restrict actions the license permits. Read strictly, that would make safety guardrails themselves a violation of this requirement, putting the license in direct opposition to numerous regulations that have been established (or at the very least proposed) in jurisdictions around the world.

Fair Use and Fair Dealing

Per the license terms, use of material licensed CC BY-SA 4.0 is exempt from the license requirements where it falls under fair use, fair dealing, or a similar copyright exception or limitation. This is the most likely justification for developers’ use of Wikipedia content (along with anything else they can get their hands on) as training material. I’ve been a fairly vocal opponent of the fair use argument in LLMs; you can read my full analysis of why I believe fair use is exactly not that. The Authors Guild case, decided in 2015, provided clear precedent with respect to text snippets and fair use¹; I expect that until similar precedent is established on whether the use of copyrighted material in LLM training is truly fair use, this will remain a legal risk.

It’s important to note that fair use, fair dealing, and other related exceptions are not globally accepted concepts. A number of notable jurisdictions have no such exception, in which case you would have to comply with the license terms.

As I just discussed, there’s a lack of legal clarity around whether training on Wikipedia content constitutes fair use. From a risk management perspective, this legal risk is most likely to be addressed in one of two ways: avoidance (i.e., not training on copyrighted material, which is the approach we opted for when we trained our KL3M model) or acceptance (i.e., relying on fair use and crossing the judicial bridge when you get to it).

If we assume that training does not constitute fair use (or that the judicial or legislative process narrows the definition of fair use or otherwise excludes model training from fair use), all use of Wikipedia content (or any other CC BY-SA content) would have to comply with the license terms. This brings us to our next hurdle: the technological means by which this can be accomplished.

To date, no LLM has been able to demonstrate compliance with the attribution requirement of CC BY-SA licensing. Models may occasionally provide proper attribution for an output, but hallucination further complicates the obligation. With current technology, retrieval-augmented generation (RAG) is the only reliable way to accomplish the task, but if the goal is to use Wikipedia content as training material rather than as an external knowledge base, RAG misses the mark.
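For contrast, here is a rough sketch of why RAG can satisfy the attribution requirement where training alone cannot: the retrieved passage carries its own metadata, so the citation comes from the index rather than from the model’s memory. The search_wikipedia_index and generate functions below are hypothetical placeholders for whatever retriever and model you happen to use.

```python
from typing import List, TypedDict


class Passage(TypedDict):
    text: str
    title: str
    url: str


def search_wikipedia_index(query: str, top_k: int = 3) -> List[Passage]:
    """Placeholder retriever: in practice, a vector or keyword index over Wikipedia
    where each passage is stored alongside its article title and URL."""
    raise NotImplementedError("swap in your retriever of choice")


def generate(prompt: str) -> str:
    """Placeholder for whatever LLM call you use."""
    raise NotImplementedError("swap in your model of choice")


def answer_with_attribution(question: str) -> str:
    """Answer from retrieved passages and attach the attribution that travelled with them;
    the citation comes from stored metadata, not from the model's (fallible) recall."""
    passages = search_wikipedia_index(question, top_k=3)
    context = "\n\n".join(p["text"] for p in passages)
    answer = generate(f"Answer using only this context:\n\n{context}\n\nQuestion: {question}")
    sources = "\n".join(f"- {p['title']} ({p['url']}), CC BY-SA 4.0" for p in passages)
    return f"{answer}\n\nSources:\n{sources}"
```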

So, Can You Train an LLM on Wikipedia?

It depends. (Classic answer, right?) The deciding factor (in the US, at least) is whether training a model on copyrighted material is truly fair use. Most LLM developers have either decided that training (whether pre-training or fine-tuning) is fair use, or they haven’t given the question much thought and simply used the information because it was on the internet.

From my own perspective, the answer is no: current-generation LLMs cannot (compliantly) be trained on CC BY-SA content.

Footnotes

  1. The Authors Guild, Inc., et al. v. Google, Inc. (2d Cir. 2015)
