Saturday, November 29, 2025

Allowing information extraction adds to the collective database and preserves original information through the multiplication of data copies

Your observation touches on a genuine philosophical tension in the information ecosystem. There is a case to be made for the preservation and propagation value of web scraping—but it comes with significant caveats that complicate the argument.

The Preservation Argument Has Real Merit

The internet is notoriously ephemeral. Studies have found that roughly 38% of webpages that existed in 2013 were no longer accessible a decade later, and link rot accelerates with time. The Internet Archive's Wayback Machine exists precisely because of this fragility, and it operates through large-scale crawling and copying. Common Crawl, similarly, has preserved snapshots of the web since 2008, creating what amounts to a distributed backup of human digital knowledge. [1][2]
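
To make that redundancy concrete: the Internet Archive publishes a public availability API that reports the closest archived capture of a given URL. The sketch below uses only the Python standard library; the function name and the example URL are my own illustrative choices, not the Archive's.

    import json
    import urllib.parse
    import urllib.request

    def closest_snapshot(url: str) -> str | None:
        """Query the Internet Archive's availability API for the closest
        archived capture of `url`; return its snapshot URL, or None if
        the page was never captured."""
        api = ("https://archive.org/wayback/available?url="
               + urllib.parse.quote(url, safe=""))
        with urllib.request.urlopen(api, timeout=10) as resp:
            data = json.load(resp)
        closest = data.get("archived_snapshots", {}).get("closest")
        if closest and closest.get("available"):
            return closest["url"]  # a web.archive.org/web/... address
        return None

    # Usage: check whether a page that has gone offline survives in the archive.
    print(closest_snapshot("http://example.com/some-vanished-post"))

Anyone can run this against a dead link; when it returns a snapshot URL, the crawl-and-copy model has done exactly what the preservation argument claims.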

When content is scraped and incorporated into multiple databases, training sets, or archives, it does achieve a form of redundancy that can outlast the original. Blogs that go offline, news sites that paywall or delete older content, and forums that shut down would otherwise vanish entirely. In this sense, scraping serves a function analogous to medieval monks copying manuscripts—imperfect preservation is better than no preservation.

But Multiplication ≠ Preservation of Value

The critical distinction is between preserving information and preserving the conditions that produced it. When AI companies scrape and digest content, they may preserve facts, but they often destroy:

Attribution and provenance: The original creator loses connection to their work. When Perplexity or ChatGPT synthesizes an answer from scraped sources, the user rarely visits the source, and the creator receives neither credit nor compensation (a short sketch after this list makes the lost metadata concrete).

Economic sustainability: The advertising, subscription, and patronage models that fund content creation depend on traffic. If the "collective database" satisfies user queries without directing them to sources, the incentive structure that produced the original content collapses. You cannot preserve future information by eliminating the economic basis for its creation.

Context and nuance: Scraped data is often stripped of surrounding context, authorial intent, correction history, and the relational web of links and references that gave it meaning. A fact extracted and averaged into a language model is not the same as a fact presented in its original argumentative or narrative context.
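
Referring back to the attribution item above: what a scraping pipeline would need to retain can be written down as a small record type. This is a hypothetical sketch, not any actual system's schema; the class and field names are purely illustrative.

    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    @dataclass(frozen=True)
    class ProvenanceRecord:
        """Hypothetical record binding extracted text to the metadata
        that aggregation pipelines commonly strip away."""
        text: str              # the extracted content itself
        source_url: str        # where it was published
        author: str | None     # who created it
        license: str | None    # the terms it was published under
        retrieved_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat())

    # A fact stored without the first four fields is "preserved" only in the
    # weakest sense: it can no longer credit, compensate, or contextualize
    # its source.

The design point is that the expensive part of preservation is not copying the text; it is keeping the text bound to its origin.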

The Commons Problem

Your framing implicitly treats web content as a commons that benefits from collective contribution. This is partially true—Wikipedia, open-source software, and Creative Commons licensing demonstrate that shared knowledge pools can work. But the current scraping economy doesn't function as a commons; it functions as extraction. Value flows one direction: from creators to aggregators. The "multiplication" happens in proprietary databases controlled by well-capitalized companies, not in publicly accessible archives.

If AI companies were contributing their trained models, their refined datasets, or their revenues back to the creators whose work made those assets possible, the commons analogy would be stronger. Instead, we have a situation where Reddit's volunteer contributors generated content worth $60 million annually to a single AI company, without those contributors seeing any return. [3][4]

A More Honest Framing

The preservation argument works best for:

  • Nonprofit archives like the Internet Archive and Common Crawl (in its original mission)

  • Academic and research uses where extracted knowledge is shared openly

  • Genuine redundancy where copies remain accessible and attributed

It works poorly as justification for:

  • Proprietary training pipelines that strip attribution and offer creators neither credit nor compensation

  • Answer engines that satisfy queries without directing traffic back to the original sources

  • Closed corporate databases where the multiplied copies are never publicly accessible

The web scraping debate is ultimately not about whether information should propagate—clearly it should, and always has—but about who captures the value of that propagation, and whether the current arrangement is sustainable. Multiplying copies of content while eliminating the economic and social conditions that produce new content is not preservation; it's strip-mining a resource while declaring it renewable.

  1. https://www.mozillafoundation.org/en/research/library/generative-ai-training-data/common-crawl/
  2. https://www.innovatiana.com/en/datasets/common-crawl
  3. http://copyright.nova.edu/ai-reddit/
  4. https://techcrunch.com/2024/02/22/reddit-says-its-made-203m-so-far-licensing-its-data/
