Bluesky has positioned itself as a haven for users frustrated with how platforms like X and Meta handle user content, particularly when it comes to training AI models. It’s built on the decentralized AT Protocol, which is meant to give users more control and transparency. Yet a recent incident has shown that openness and decentralization come with downsides of their own.
Daniel van Strien, a machine learning librarian at Hugging Face, compiled a dataset of one million Bluesky posts using Bluesky’s Firehose API. The dataset wasn’t anonymized; it included post content alongside decentralized identifiers (DIDs), making each post traceable back to its author. His stated goal was to support machine learning research and experimentation with social media data. The dataset quickly gained attention on Hugging Face, a platform that hosts open-source AI models and datasets, and spent time trending alongside other projects.
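To make concrete how such a dataset can be assembled, here is a minimal sketch, not van Strien’s actual pipeline, of pulling public posts and their author DIDs from the Firehose using the community `atproto` Python SDK. The client and model names follow that SDK’s published firehose examples and may differ across versions; the one-million cutoff is purely illustrative.

```python
# Minimal sketch (not van Strien's actual code) of collecting public posts
# plus author DIDs from Bluesky's Firehose with the community `atproto` SDK.
from atproto import CAR, FirehoseSubscribeReposClient, models, parse_subscribe_repos_message

client = FirehoseSubscribeReposClient()
rows = []  # each row: {"did": ..., "text": ...}

def on_message(message) -> None:
    commit = parse_subscribe_repos_message(message)
    if not isinstance(commit, models.ComAtprotoSyncSubscribeRepos.Commit) or not commit.blocks:
        return
    car = CAR.from_bytes(commit.blocks)  # decode the records carried in this commit
    for op in commit.ops:
        # New posts live under the app.bsky.feed.post collection
        if op.action == "create" and op.path.startswith("app.bsky.feed.post/") and op.cid:
            record = car.blocks.get(op.cid)
            if record and record.get("text"):
                # commit.repo is the author's DID -- the identifier that made
                # the published dataset traceable back to individual accounts
                rows.append({"did": commit.repo, "text": record["text"]})
    if len(rows) >= 1_000_000:  # illustrative cutoff matching the dataset's size
        client.stop()

client.start(on_message)
```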
Van Strien posted about the dataset on Bluesky, and users reacted strongly. Many are vocal about their opposition to AI training on their posts, a stance that aligns with Bluesky’s own policy: the platform explicitly states it doesn’t use user content to train generative AI models, though it does rely on AI for moderation and feed algorithms. The dataset quickly became a flashpoint, with users arguing that their posts had been used without consent, violating the principles Bluesky was founded on.
Van Strien eventually removed the dataset and issued an apology. He admitted that while his intentions were to advance tools for the Bluesky platform, the lack of transparency and user consent in his approach was a mistake. The repository hosting the project remains up on Hugging Face, but the dataset itself is no longer available.
I"ve removed the Bluesky data from the repo. While I wanted to support tool development for the platform, I recognize this approach violated principles of transparency and consent in data collection. I apologize for this mistake.
— Daniel van Strien (@danielvanstrien.bsky.social), November 27, 2024
Bluesky"s open-source and public architecture allows third parties to use its data freely, including for purposes the platform and its users may strongly oppose. Bluesky’s Firehose API, which streams all public posts in real time, was instrumental in this dataset"s creation. While it’s a feature designed for transparency and innovation, it also opens doors for potential misuse.
Bluesky’s response has been measured but clear. A spokesperson (via 404 Media) compared the platform to the open internet, where public data can be indexed and used, sometimes against the wishes of the original creators. They expressed interest in developing ways for users to signal whether they consent to their content being used in such projects, but no concrete solution is in place yet.
The irony is that many users left platforms like X to escape having their content used for AI training. X and Meta have openly added clauses to their terms of service allowing such use. Bluesky, with its decentralized model, seemed like the antidote. Now, users realize that decentralization doesn’t necessarily protect them from third parties doing what they please with public data.
The debate has been intense, echoing the kinds of public uproars that were common on old Twitter; for Bluesky, it may be the first major "pitchfork-wielding" controversy. It’s a telling moment for a platform that is still in its early stages of growth and still figuring out how to navigate the challenges that come with its unique setup.