
New York Times v. OpenAI - Motion to Dismiss Overview

A detailed analysis of the oral arguments in the New York Times' copyright infringement case against OpenAI and Microsoft, covering technological aspects, direct infringement claims, and DMCA violations.
tl;dr

The New York Times’ copyright infringement case against OpenAI and Microsoft had its first major hearing, where a judge heard arguments on the motion to dismiss. Key issues included direct copyright infringement through training data usage, DMCA violations for removing copyright management information, and unfair competition claims. The hearing revealed technical details about AI training processes and raised important questions about how copyright law applies to AI systems. The complexity of the arguments and the judge’s engagement suggest these claims are likely to proceed to trial.

This morning, a judge heard oral arguments on OpenAI and Microsoft’s motion to dismiss the copyright infringement claims that the New York Times (and others) brought against them. The judge’s decision following this hearing will determine whether (or to what extent) the case proceeds to trial.

If you’d like a full background of the separate lawsuits that were consolidated into this single case, I’ve written that up here.

Structure of Hearing

U.S. District Judge Sidney Stein opened the hearing by suggesting that it would be most expedient to consider the requests to dismiss on a complaint-by-complaint basis. He also suggested that beginning with a brief overview of the technology would be beneficial; given that all parties agreed with this approach, the New York Times’ counsel began with the technological overview.

Technological Overview

Ms. Maisel, counsel for the NYT, opened her discussion by noting that, similar to how you “follow the money” in a criminal case, you need to follow the data in this type of case. She started by describing how a model is trained through a 4-stage process. While I don’t necessarily agree with the details of her presentation, and the boundaries between the stages were a bit muddy, the conversation seemed to be useful for Judge Stein and for the organization of the claims.

Stage 1 was what she called the “ingestion” stage: this is the point at which the Defendants scraped the content from where it had been published online. The unauthorized copies of copyrighted information made at this stage are the basis of the direct copyright infringement claim. She noted that this included information that was behind paywalls. It was also at this stage that the Defendants engaged in stripping copyright management information (“CMI”), which is the basis of the DMCA Section 1202 claim.

Stage 2, according to Maisel, is the “digestion” stage, where the training content is broken up into tokens (i.e., tokenization) and filtered or weighted for importance. The model is trained to predict the next token. She emphasized that the particular issue here is that high-value content (such as the Plaintiffs’), due to its high-quality and/or accurate content, is more likely to be seen multiple times during training.
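To make the “digestion” stage a little more concrete, here is a toy sketch of tokenization and next-token prediction. It is not how OpenAI’s or Microsoft’s models work: the whitespace tokenizer, the bigram counts, and the four-sentence corpus are all assumptions made up for illustration. It only shows the general idea that text seen more often during training pulls predictions toward itself.

```python
from collections import Counter, defaultdict

def tokenize(text):
    # Toy whitespace tokenizer; real systems use subword tokenization.
    return text.lower().split()

def train_bigram(documents):
    # For each token, count which tokens follow it across the corpus.
    counts = defaultdict(Counter)
    for doc in documents:
        tokens = tokenize(doc)
        for current, following in zip(tokens, tokens[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, token):
    # Return the most frequent follower, or None if the token was never seen.
    followers = counts.get(token)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical corpus: one sentence appears three times, mimicking
# content that is seen repeatedly during training.
corpus = [
    "the court heard arguments today",
    "the court heard arguments today",
    "the court heard arguments today",
    "the court adjourned early",
]
model = train_bigram(corpus)
print(predict_next(model, "court"))
```

In this toy setup, the repeated sentence wins: after “court” the model predicts “heard” rather than “adjourned,” simply because it saw that continuation more often.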

At this point, the difference between Stages 3 and 4 became muddier, in part because the back-and-forth interrupted her flow. Generally, Maisel alluded to the existence of mid- or post-training, but focused primarily on explaining the difference between pretraining and RAG. Here, she emphasized that memorization and regurgitation are natural results of the pretraining process and the source of subsequently outputted copies, whereas RAG might involve additional information provided by the user or retrieved by the system.

The specifics of Stage 4 were a bit more nebulous based on her statements, but I took her comments to suggest that “synthetic search,” or retrieval augmented generation (“RAG”), was the next step in “following the data.” Step 1 in RAG occurs when a user enters a prompt. Step 2 occurs when OpenAI’s ChatGPT or Microsoft’s Bing Chat (I believe she was referring to Microsoft Copilot, as Bing Chat was its original name) translates the user’s prompt into a query against a search index and finds the content from a website. At Step 3, the search index returns a response, which might include a copy of the website that contains the requested content. Importantly, she noted that this might include the content verbatim, even an entire article; the content is then provided to the LLM. The final step, Step 4, occurs when the LLM outputs a response to the user based on the information it received at Step 3.

From a technical point of view, RAG is an entirely different beast from pre-training a model. Retrieval of content under RAG occurs at the time of the user’s input, rather than at the time the model is trained. In terms of technological feasibility, it’s much easier to provide attribution to a source when RAG is used than when the content is included in training data. This, however, was not always done appropriately, as we’ll later see in the parties’ discussion of the misappropriation claim.
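The four RAG steps described above can be sketched in a few lines. This is only an illustration under stated assumptions: the in-memory index, the example.com URLs, the word-overlap ranking, and the `echo_llm` placeholder are all hypothetical and do not reflect how ChatGPT or Copilot are actually built. The point is that the retrieved text and its source travel together at query time, which is why attribution is comparatively easy here.

```python
import re
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    text: str

# Hypothetical in-memory "search index"; URLs and text are made up.
INDEX = [
    Page("https://example.com/courts",
         "The judge heard oral arguments on the motion to dismiss."),
    Page("https://example.com/recipes",
         "Whisk the eggs and fold in the flour."),
]

def words(text):
    # Lowercased set of words, ignoring punctuation.
    return set(re.findall(r"[a-z']+", text.lower()))

def search(query, index, top_k=1):
    # Steps 2 and 3: rank pages by word overlap with the query.
    query_words = words(query)
    ranked = sorted(index, key=lambda p: len(query_words & words(p.text)),
                    reverse=True)
    return [p for p in ranked[:top_k] if query_words & words(p.text)]

def build_prompt(query, pages):
    # The retrieved text is handed to the model verbatim, with its source.
    context = "\n".join(f"[{i + 1}] {p.text} (source: {p.url})"
                        for i, p in enumerate(pages))
    return f"Answer using only the sources below.\n{context}\nQuestion: {query}"

def echo_llm(prompt):
    # Placeholder for a model call: returns the first source line as-is.
    return prompt.splitlines()[1]

def answer(query, index, llm):
    pages = search(query, index)         # Steps 2 and 3: retrieve
    reply = llm(build_prompt(query, pages))  # Step 4: generate
    return reply, [p.url for p in pages]

# Step 1: the user enters a prompt.
reply, sources = answer("What happened with the motion to dismiss?",
                        INDEX, echo_llm)
print(reply)
print(sources)
```

Because the placeholder model simply echoes what it was given, the retrieved sentence comes back verbatim along with its source URL. A real model would paraphrase or summarize, but whether the source is shown to the user is a product decision rather than a technical limitation.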

Going Down a Rabbithole

Following, or perhaps within, the technological discussion, the Defendants’ counsel made arguments that LLMs are designed to be able to generalize and understand relationships between words and facts, and that the only way to develop this is to provide it with a massive number of examples. He claimed that the only way you can get NYT output is by feeding it the actual article you’re trying to get; he stated that the Plaintiffs had to try thousands of times to get infringing output, and that it was actually very difficult to get it to infringe. As a personal aside, no, it’s really not. I’ve tested multiple models and getting them to leak training data (both copyrighted and not) is not hard.

Microsoft’s counsel made what I consider to be a significant misstep in an example that she gave during this tangential discussion: she said that precedent has made it clear that search engines are protected by fair use, but then stated that the next-generation Bing search engine is so great because it actually answers questions. She dug the Defendants into a deeper hole by giving an example where she queried Bing, it went out and searched a bunch of useful sites, and it returned an answer without her needing to visit those sites. To me, this came across as a clear admission that LLM-powered search results in users no longer visiting the sites from which the information is pulled. Mr. Crosby, counsel for the NYT, addressed this point directly, calling it an “answer engine” that results in a clear substitution effect.

Given that the technological discussion ended up going a bit off of the rails, with each side making arguments we would expect to see in trial, the Judge attempted to get everyone back on track in terms of what claims OpenAI and Microsoft are looking to dismiss.

The Plaintiffs argue that OpenAI and Microsoft are direct infringers. The Plaintiffs’ counsel stated that the key test here is that the Defendants selected the copyrighted works and are making them available to users. Hedging their bets a little, counsel stated that even if Microsoft and OpenAI weren’t direct infringers, they would still be liable because they provisioned the copyrighted works and both parties were aware that copyrighted content was in the training set.

The Judge then noted that there is a question of fact with respect to the statute of limitations: when did the parties become aware of the infringement? Andy Gast, counsel for OpenAI, stated that as early as 2019, OpenAI published papers stating that NYT content was included in training data. He also stated that the NYT published multiple articles about OpenAI’s technology, including a lengthy article that specifically referenced the large amount of data used to train it. He suggested that a reasonable party should at that point have investigated the matter.

The Plaintiffs’ counsel responded that simply knowing that models were trained on large volumes of text isn’t sufficient for a rightsholder to suspect infringement of their own content, and stated that the motion to dismiss should be denied. Counsel went on to state that when OpenAI published the research referenced by Gast, OpenAI had been holding itself out as a non-profit engaging in this research for non-commercial purposes. If this wasn’t the case, then OpenAI fails the first factor of the fair use analysis.

Gast responded to this last point by noting that the NYT article made it clear that OpenAI was going to pursue commercial use, so the Plaintiffs should have known at that point.

Contributory Infringement

This particular complaint had a lot of back-and-forth between the parties, as well as interjections and questions by the Judge. It was difficult to follow because the parties did not state their names before speaking (as the Judge had requested), so I don’t name the counsel who was speaking and only note which side they represent.

The Plaintiffs claim that OpenAI and Microsoft engaged in contributory infringement by building LLMs that output copyrighted content. The Defendants’ counsel referenced the Supreme Court precedent in the Grokster case, noting that the case established when a technology company is liable for what its users do with its technology. He stated that, based on that case, selling an item that has substantial lawful use with the mere possibility of unlawful use is not sufficient to establish the secondary liability of contributory infringement. He brought up the example of the historical Sony case: a VCR could be used to infringe by allowing users to record copyrighted content, but it also had substantial non-infringing use in allowing users to view legally-acquired content.

The Daily News’ counsel commented that, with respect to Microsoft alone (who is the only party against whom their complaint is made), they’re liable for contributory infringement because they knew that OpenAI was training using copyrighted material.

Another Plaintiffs’ counsel stated that there are two different types of infringement here: by inducement (encouraging others to infringe) and contributory. In response to the Defendants’ counsel’s earlier mention of Sony, he argued that whereas a VCR could contribute to infringement, the LLMs are like a VCR pre-loaded with pirated movies.

In response to this, the Defendants’ counsel discussed why the standard for inducement was not met, but my impression was that he was arguing the wrong point; the Plaintiffs’ counsel had been arguing that it was contributory infringement, not inducement.

DMCA Violation

Moving on to the DMCA claims, the Judge requested that everyone speed things up.

The Plaintiffs noted that OpenAI used 2 different content extractors to remove CMI: the programs separated the content from the copyright notices. My own question is: why was OpenAI removing copyright notices from content if they thought that what they were doing was OK under fair use?

The Defendants’ counsel responded, basically asking “what’s the harm?” and arguing that the Plaintiffs don’t show that the removal of CMI injured them (i.e., the New York Times isn’t any worse off because copyright management info was removed). Microsoft’s counsel added that they didn’t do this, with the unspoken subtext that the CMI removal was entirely OpenAI’s doing.

After the Judge asked the Plaintiffs “what’s the injury?”, their counsel noted that the removal of CMI enables widespread infringement; it provides copyrighted content without proper attribution. I’m paraphrasing here, but he asked questions along the lines of “how do we know where the copyrighted content came from?” and noted that the removal makes it much harder to identify the infringing use because the source was willfully hidden. Plaintiffs’ counsel went on to state that the law provides statutory standing without the parties needing to show injury. He noted that, as addressed by the 2nd Circuit, the statute prohibits intentional removal of copyright notices regardless of whether or not distribution occurs.

Another Plaintiffs’ counsel adds that the unjust enrichment of Microsoft and OpenAI is the injury sufficient to establish standing.

Unfair Competition by Misappropriation

The Defendants’ counsel argued for dismissal of this claim on the ground that lack of attribution is a required element of the claim. He stated that all of the NYT’s examples contain attribution. He also discussed the time-sensitive nature of the content, noting that all of the information is at least months old and that its value dissipates quickly.

The Plaintiffs’ counsel countered this point, giving multiple examples of day-old content that was output by the Defendants.

Trademark Dilution

The parties were about 2 hours in at this point, and I got the sense that the Judge was unhappy with some of the long-winded and/or not currently relevant arguments. The trademark dilution claim boiled down to a procedural question. Judge Stein told Microsoft to come back with clarification as to why Rule 5.1 of the Federal Rules of Civil Procedure wasn’t followed in their prior defense of the trademark dilution claim.

Conclusion

The hearing concluded rather abruptly (though I have to admit, I’ve not listened in on many oral arguments for Motions to Dismiss), with the Judge thanking everyone for their time and stating that he would have his decision ready “in due course”.

Having listened to the entire hearing, my suspicion is that the Judge will not dismiss any of the claims. If any claim is dismissed, the most likely candidate is direct infringement, due to the statute of limitations.
