Authors win court battle to proceed with lawsuit claiming Databricks trained its DBRX AI model on pirated copyrighted books, with potential damages of up to $150,000 per work.
A class action lawsuit against Databricks can proceed after a federal judge denied the company's motion to dismiss, allowing authors to pursue claims that their copyrighted works were used without permission to train the company's DBRX large language model.
The lawsuit centers on allegations that Databricks acquired technology from MosaicML in 2023, which had trained its models using the RedPajama dataset. That dataset included Books3, a collection of approximately 196,000 books that has since been removed from Hugging Face for copyright infringement. The authors claim this pirated content was instrumental in developing DBRX, despite Databricks' assertion that the final model wasn't trained on the infringing materials.
"They directly tie their infringed works to DBRX, and the employee statements provide supporting inferences when read in context, particularly when viewed alongside other more direct statements," wrote Judge Charles Breyer in his ruling from U.S. District Court in Northern California.
Legal Basis for the Claim
The lawsuit hinges on fundamental copyright principles. Under U.S. copyright law, the authors need only demonstrate that their works were copyrighted and that Databricks copied them. The judge rejected the company's argument that experimenting with infringing materials doesn't constitute infringement so long as those materials aren't in the final product.
"Databricks copied Books3 multiple times in the process of developing its DBRX models and by so doing, directly infringed Plaintiffs' copyrights in the asserted works," the authors stated in their legal filing. "Under Defendants' logic, as long as an AI company does not incorporate copyrighted books into the final training dataset of a model, it is free to download, store, reproduce, and indefinitely use pirated works for its own benefit. That argument gets it backwards."
The potential damages could be substantial. Copyright law provides statutory damages of up to $150,000 per willfully infringed work. Brandon Butler, a copyright lawyer and executive director of Re:Create, noted that "the damages provisions in copyright law are draconian with a capital D. I mean, they are extraordinary."
Impact on Authors and the Industry
Several prominent authors have joined the lawsuit, including best-selling young adult author Jason Reynolds, novelist Stewart O'Nan, horror writer Brian Keene, and Rebecca Makkai, whose novel "The Great Believers" was a Pulitzer Prize finalist. These authors represent a growing concern among creative professionals about how their work is being used in the rapidly expanding AI industry.
"This is bet-the-company litigation," Butler warned. "If they win, they could get enough damages they just liquidate every asset that belongs to some of these companies, and probably especially a smaller player like Databricks."
The case comes amid increasing scrutiny of AI companies' training practices. Similar lawsuits against Meta and Anthropic have been decided in the AI companies' favor on fair use grounds, though Anthropic subsequently agreed to establish a $1.5 billion fund to compensate authors. Databricks, however, has not yet raised a fair use defense in this case.
Broader Implications for AI Development
The legal battle raises significant questions about the future of AI development and the boundaries of fair use. If the court rules in favor of the authors, it could establish a precedent requiring AI companies to be more transparent about their training data and potentially obtain proper licenses for copyrighted materials.
This case also intersects with evolving data protection regulations worldwide. While GDPR and CCPA primarily focus on personal data rather than copyrighted content, they reflect a growing recognition that data usage requires proper consent and transparency—principles that could extend to creative works in future regulatory frameworks.
Judge Breyer's demand for more information underscores the complexity of the case. Despite Databricks providing fourteen depositions, thousands of pages of documents, and terabytes of discovery material, the court still needs clearer evidence about exactly what data was used and how it contributed to the final model.
As AI continues to evolve, cases like this one will help shape the legal boundaries of technology development, potentially requiring companies to rethink their approach to training data and implement more robust compliance mechanisms to protect both innovation and creators' rights.