OpenAI has published an official response to the lawsuit filed by The New York Times, which claims that the company used the publication's articles without permission to train its large language models (LLMs).
In a letter published by OpenAI, the company refuted The New York Times's claims, asserting that the publication manipulated prompts to make the model regurgitate data from Times articles. Regurgitation occurs when an AI model reproduces portions of its training data verbatim in response to prompts phrased in a certain way.
"Interestingly, the regurgitations The New York Times induced appear to be from years-old articles that have proliferated on multiple third-party websites. It seems they intentionally manipulated prompts, often including lengthy excerpts of articles, in order to get our model to regurgitate. Even when using such prompts, our models don’t typically behave the way The New York Times insinuates, which suggests they either instructed the model to regurgitate or cherry-picked their examples from many attempts."
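To make the regurgitation concept concrete, here is a minimal, hypothetical Python sketch of how one might flag verbatim overlap between a model's output and a source article. The strings and threshold below are illustrative assumptions, not a reconstruction of either party's actual methodology:

    import difflib

    def longest_verbatim_match(source: str, output: str) -> str:
        """Return the longest character span appearing verbatim in both texts."""
        matcher = difflib.SequenceMatcher(None, source, output, autojunk=False)
        match = matcher.find_longest_match(0, len(source), 0, len(output))
        return source[match.a : match.a + match.size]

    # Hypothetical example: a long excerpt in the prompt can steer a model
    # toward continuing the original text word for word.
    article = "The committee voted on Tuesday to approve the measure after months of debate."
    model_output = "...voted on Tuesday to approve the measure after months of debate, officials said."

    overlap = longest_verbatim_match(article, model_output)
    if len(overlap) > 40:  # arbitrary threshold for what counts as "verbatim"
        print(f"Possible regurgitation detected: {overlap!r}")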
The company also says it had no advance notice of the lawsuit and only learned about it by reading The New York Times.
"We had explained to The New York Times that, like any single source, their content didn't meaningfully contribute to the training of our existing models and also wouldn't be sufficiently impactful for future training. Their lawsuit on December 27—which we learned about by reading The New York Times—came as a surprise and disappointment to us."
OpenAI also said that the Times had mentioned instances of regurgitation while the two parties were still in discussions but failed to provide examples when asked. The company noted that it treats reports of regurgitation as a high priority and cited its July removal of a ChatGPT browsing feature as evidence.
"Along the way, they had mentioned seeing some regurgitation of their content but repeatedly refused to share any examples, despite our commitment to investigate and fix any issues. We’ve demonstrated how seriously we treat this as a priority, such as in July when we took down a ChatGPT feature immediately after we learned it could reproduce real-time content in unintended ways."
The letter also covered other points, including OpenAI's licensing deals and partnerships with organizations such as the Associated Press, Axel Springer, the American Journalism Project, and NYU. OpenAI also addressed fair use, arguing that content publicly available on the internet falls within the fair use doctrine and can be used to train AI models.
"Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents. We view this principle as fair to creators, necessary for innovators, and critical for US competitiveness."
However, OpenAI does provide an opt-out option for anyone who does not want their data used to train its AI models, and it noted that The New York Times exercised that option in August 2023.
"That being said, legal right is less important to us than being good citizens. We have led the AI industry in providing a simple opt-out process for publishers (which The New York Times adopted in August 2023) to prevent our tools from accessing their sites."
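For context, the opt-out OpenAI refers to works through the standard robots.txt convention: publishers disallow OpenAI's documented GPTBot web crawler. A minimal entry blocking the crawler site-wide would look like this (the user-agent name is OpenAI's published one; the placement in a site's robots.txt is standard practice):

    # robots.txt — block OpenAI's GPTBot crawler from the entire site
    User-agent: GPTBot
    Disallow: /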
The New York Times is not the only party suing OpenAI and Microsoft over the unauthorized use of data. Earlier this week, two authors filed a similar lawsuit, claiming that OpenAI used their published works to train its AI models.