Nov 18, 2024 · Insights · 11 min read

Can You Train an LLM on Wikipedia? Wikimedia's Surprising Opinion

tl;dr

The Wikimedia Foundation doesn’t believe most LLM companies are complying with Creative Commons licensing requirements when using Wikipedia content for training.

Key issues include proper attribution and adherence to share-alike provisions.

With nearly 62 million pages and seven million articles in English alone, Wikipedia is a textual treasure trove. It’s no wonder that so many datasets and large language model developers have started with Wikipedia and other Wiki projects under the Wikimedia Foundation umbrella.

The only question, of course, is whether they should have.

If the answer were unquestionably yes, then this would be easy. Unfortunately, it’s not.

What if I told you that Wikimedia doesn’t think pretraining is OK?

(Pretraining compliance is a mess, and there’s more to unpack on the history of both Wiki and the web itself, but that’s a story for another day.)

Background

I’m very proud of the fact that during our KL3M model training, we developed a groundbreaking technique that had never been tried before - not even by OpenAI or Anthropic or DeepMind.

It’s called “consent.” Let me explain.

When we had doubts about using Wikipedia content in pretraining, we reached out to Wikimedia.

Their team admitted to us that as of April 2024, we were the only people who had asked how to comply with the licensing requirement when training a model.

Let’s be real though: you probably don’t care about what I have to say about this. You care about what they had to say.

Wikimedia’s Response

I’m including the most salient points below, but for full transparency, our full email correspondence is at the end of this post (minus a handful of back-and-forth update requests and scheduling emails; I’ve also redacted the employees’ names to respect their privacy). As an important caveat, Wikimedia isn’t providing legal advice or analysis, but rather their overarching principles and thoughts on LLM training and licensing compliance.

The most damning comment from them is buried in the middle of one of the emails:

“We are monitoring what many LLM companies do with Wikimedia data and generally to be upfront, many may not be compliant with the letter of the Creative Commons rules or the spirit of the licenses.”

They get into a bit more detail about why this is the case, stating that “there are many considerations for downstream licensees. For example, downstream licensees would need to continue to adhere to the ‘attribution,’ ‘share-alike,’ and other elements of the license.” (Note that I touch on these requirements in one of the later posts.)

We asked whether they would be open to publicly taking a position on the elements that we discussed in the email chain, but they responded that “it looks unlikely the the [sic] legal team would be willing/able to provide the letter you’re asking for. What we’re discussing, instead, is a public statement via our main Wikimedia blog setting out our position on the attribution topic[.]”

The only public statement that I was able to find relating to this topic (and it’s a bit of a stretch) is an open letter to the UN Member States regarding the upcoming Global Digital Compact¹, where they encouraged inclusion of a commitment to “Build and deploy Artificial Intelligence (AI) and machine learning (ML) to support and empower, not replace, people who create content and make decisions in the public interest[,]” noting that “[u]ltimately, AI and ML systems can only be reliable if they are trained on well-sourced knowledge, created and moderated by humans who receive appropriate attribution for their contributions.”

It looks like Wikimedia does not want to take a strong public stance on whether (and how) an LLM can compliantly train on Wikipedia content. As a non-profit, they may feel the need to take a middle-of-the-road approach to ensure that they don’t rock the boat with donors (which included Google, Apple, and Microsoft in the most recently released list)².

Other Approaches

Although Wikimedia has a clear opinion about whether you can train an LLM on Wikipedia, most model developers have a different take (or have literally never thought about it). What if you analyze the license terms yourself? It turns out that your conclusion will land closer to Wikimedia’s or to the typical developer’s depending on your assessment of fair use.

In my other posts on the topic, I discuss the “everybody’s doing it” approach (including a history of this approach in technology and where it’s gotten us) and analyze whether you can train a model with CC BY-SA licensed content.

Email Transcripts

Michael Bommarito
Mar 4, 2024, 12:32 PST

Hello,

We are writing to request clarification for the proper way to comply with Wikimedia project licensing in the training of large language models.

If we provide attribution and a link to the license for each project in a notice document provided to customers, is that sufficient for license compliance from your perspective?

Thank you.

Your request (74835) has been updated. To add additional comments, reply to this email.

[NAME REDACTED]
Mar 9, 2024, 09:21 PST

Hi Michael,

Thanks for reaching out. This is an important request, so I am soliciting a handful of different perspectives. By the end of next week, I’ll provide a short answer plus a handful of cc’d colleages [sic].

All the best,
[NAME REDACTED]

Your request (74835) has been updated. To add additional comments, reply to this email.

[NAME REDACTED]
Fri, Mar 22, 2024 at 3:28 PM
To: mike@273ventures.com
Cc: [MULTIPLE NAMES REDACTED]

Hi Michael,

Thanks for reaching out! Apologize for the delay. I was consulting with our legal department to give you the best answer I can. We’ve discussed this topic internally for the better part of the last year, so my apologies if this is a lengthy email.

Before talking about the legal aspects around compliance with the open licenses, here’s the simplest way to understand our perspective. New additions to Wikipedia rely on active volunteers, and these volunteers come from everyday readers who learn they add to Wikipedia because they follow attribution links back to our site.

So fundamentally, proper attribution from bulk reusers like yourselves ensures that Wikipedia can continue to exist long-term. This is especially important in the context of LLM training where there’s no direct throughline between the Wikipedia data you use and providing a link back to Wikipedia that the end user can see in the output.

I’ll try to address your question below, but I may bring up more issues than solutions here. This is not legal advice since compliance with the licenses is first and foremost a legal issue between you and your legal team. What I’ll be describing are some general principles as well as Wikimedia’s ideals when it comes to reuse.

Providing a general notice to customers would not be an adequate solution to compliance. While a positive first step, supplying a notice document to “customers” would likely not fully meet the requirements for compliance. This is for a variety of reasons but chief among them: the notice would need to be made to everyone the content is shared with, not just customers. Moreover, the visibility of the notice would need to follow the Creative Commons criteria of being “reasonable to the medium.”

As an FYI, Wikipedia’s licenses are primarily Creative Commons Attribution-ShareAlike (CC BY-SA) licenses. You can read the full text of the license here.

Each piece of content in Wikimedia projects has its own specific license. A Wikipedia article is licensed by dozens of people but in theory, can be singularly attributed to Wikipedia as the Author under a CC-BY-SA 4.0 license (see above). Contrast that with Wikimedia Commons where every photo needs to be named with the individual Author and each photo’s individual license (anything from CC-BY 1.0 to CC-BY-SA 4.0) to be considered properly attributed. Wikinews uses a CC-BY-SA license but not 4.0. Wikimedia Commons is something we’re roadmapping at the moment to proactively address the larger, growing needs of the generative AI landscape around video/images/audio reuse. If you are interested, we can chat further about this.

Finally, a small amount of content specifically on Wikipedia is published under fair use (not openly licensed). This is small but meaningful if the goal is to provide relatively unencumbered data. The list of nuances in our dataset is long but ultimately navigable if work is put in to do so.

I can continue, but I think this email is more useful for opening up questions than resolving them concretely. We are monitoring what many LLM companies do with Wikimedia data and generally to be upfront, many may not be compliant with the letter of the Creative Commons rules or the spirit of the licenses.

That’s why we’re happy you’ve reached out to discuss this with us early on as you refine your business model. As part of the Wikimedia Enterprise team, we can help you explore opportunities where attribution gets simplified for reuse, including in an LLM context.

Our legal department did express some practical concerns when I pointed them to this Wired Magazine article. It’s definitely forward-thinking that you’re attempting to create a fully public-domain database, but they described challenges in characterizing any training set that contains Wikimedia content as “public domain,” since there are many considerations for downstream licensees.

For example, downstream licensees would need to continue to adhere to the “attribution,” “share-alike,” and other elements of the license. To the extent you intend to include Wikimedia data in your set (or already have), our legal team suggested it would be very important to be in contact with them. Typically, my team can facilitate conversations like that since open licensing support is the kind of thing we assist Enterprise customers with.

Wikimedia Enterprise is a team established to help large commercial reusers ingest and make the best use of Wikimedia data. Our customers benefit from a worldwide, royalty-free license to use the Wikimedia Enterprise APIs to access Wikipedia project data. Under our contract, content provided through the API is still subject to the applicable Free Culture Licenses and Open Source Licenses but companies training LLMs who are existing customers have found our support useful since complexity is high, but ultimately not-a-barrier.

I’m sharing our deck and roadmap with you for you to learn more about our offerings. Collaborative discussions like this are crucial for shaping an ecosystem that respects open knowledge while still fostering innovation. If you’d like to discuss this further, please let us know. We are happy to continue the conversation and explore potential solutions or partnerships that align with our values while enabling your important work.

Cheers,
[NAME REDACTED]
Wikimedia Enterprise

Trademark - Request for Clarification

Michael Bommarito <mike@273ventures.com>
Fri, Mar 22, 2024 at 3:38 PM
To: [NAME REDACTED] <REDACTED@wikimedia.org>
Cc: [MULTIPLE NAMES REDACTED]

Hi [NAME REDACTED],

Thank you. This is extremely helpful and consistent with what we thought your opinion would be. A few topics below:

For the record, we have not included any Wikimedia content, even Wikinews, into our KL3M model. You can speak directly with Ed at Fairly Trained, but none of us were comfortable including any content under licensing like GFDL or BY/SA. We completely understand and tend to agree with your position on these topics.
Would you be open to a call to discuss your licensing options further? We are looking at financing or partnerships to support additional content licensing, and it would be very helpful to talk through your vision for this part of your organization.
Somewhat selfishly, would you be open to publishing your opinion on this issue in public? If not, would it be OK if we described to others what you have outlined in this email?

Many of our public conversations or customers specifically ask about Wiki content, and we have never been able to point at anything definitive published by you.

Thank you again for taking the time to put together such a clear and thoughtful response.

[NAME REDACTED] <REDACTED@wikimedia.org>
Tue, Apr 9, 2024 at 9:44 AM
To: Michael Bommarito <mike@273ventures.com>

Hey there, Michael. At the moment, it looks unlikely the the [sic] legal team would be willing/able to provide the letter you’re asking for. What we’re discussing, instead, is a public statement via our main Wikimedia blog setting out our position on the attribution topic, which folks seems broadly more comfortable with. Working on getting the details of that figured out this week, and will share more as soon as I have it.

Trademark - Request for Clarification

Jillian Bommarito

