Filing says Zuckerberg approved Meta's use of copyrighted material in Llama training

AI has taken over the world and seems to be the trendy topic among our Silicon Valley tech overlords. Even though the term "artificial intelligence" has been around since the mid-1950s, its usage has exploded in recent years, thanks in no small part to ChatGPT, OpenAI's large language model (LLM) chatbot.

LLMs like ChatGPT are all the rage, but not everyone is thrilled about their development. The models behind these chatbots are trained on human-sourced datasets, which can include copyrighted material. Creatives, understandably, aren't too happy about their work being fed into the AI machine without pay or consent.

For instance, some users fled Twitter for Bluesky after learning that Elon Musk was training his AI, Grok, on Twitter posts. A few months back, Bluesky users were outraged when a Hugging Face employee uploaded a dataset containing 1 million Bluesky posts for AI training. Copyright concerns in AI are no joke, and these issues have already led to multiple lawsuits.

A new filing with the U.S. District Court for the Northern District of California accuses Meta of training its Llama AI models on a dataset of pirated ebooks and articles, allegedly with Mark Zuckerberg’s approval.

The plaintiffs, including Sarah Silverman and Ta-Nehisi Coates, claim that Meta used LibGen, a self-described "links aggregator," as a dataset for training Llama. The court document states:

"The 'Libgen dataset' is a shadow, or pirated, dataset that contains works by such large publishers as Cengage Learning, Macmillan Learning, McGraw Hill, and Pearson Education (who all sued to block piracy by LibGen)."

According to the filing, Meta testified last year that it had used LibGen to train its Llama models with permission from Zuckerberg. The filing also alleges that after Meta scraped data from LibGen, it attempted to strip all copyright information from the materials it had taken:

"It is now clear that Meta illegally stripped Copyright Management Information ('CMI') from Plaintiffs’ asserted works used to train its Llama models in order to facilitate and conceal widespread copyright infringement."

The plaintiffs argue that Meta’s decision to scrape LibGen and use its data for training Llama violates the California Comprehensive Computer Data Access and Fraud Act (CDAFA).

Adding fuel to the fire, Meta's chief AI scientist, Yann LeCun, sparked backlash last year when he suggested on X (formerly Twitter) that book authors should make their works freely available.

Only a small number of book authors make significant money from book sales.
This seems to suggest that most books should be freely available for download.
The lost revenue for authors would be small, and the benefits to society large by comparison. https://t.co/4ObkW1tm85
— Yann LeCun (@ylecun) January 1, 2024

We don't yet have a verdict in Meta's case, and the battle is far from over. As AI continues to weave itself deeper into our lives, expect more lawsuits like this.