
Every time OpenAI cuts a check for training data, an unlaunched competitive startup dies. Without a ‘safe harbor,’ AI will be ruled by incumbents.

Feb 23, 2024

The checks being cut to ‘owners’ of training data are creating a huge barrier to entry for challengers. If Google, OpenAI, and other large tech companies can push the price of training data high enough, they implicitly shut out future competition. Not very Open.

Model efficacy is roughly [technical IP/approach] * [training data] * [training frequency/feedback loop]. Right now I’m comfortable betting on innovation from small teams in the ‘approach,’ but if experimentation is gated by nine figures’ worth of licensing deals, we are doing a disservice to innovation.

These business deals are a substitute for clear copyright and usage laws. Companies like the New York Times are willing to litigate this issue (at least as a negotiation strategy). It’s likely that our laws need an updated definition of ‘fair use.’ I need to think more about where I land on this, but companies that exploit or overweight a data source that wasn’t made available to them for commercial purposes do owe the rights owner. Rights owners should be able to automatically set some sort of protections for at least a period of time (similar to Creative Commons or robots.txt). I don’t believe ‘if it can be scraped, it’s yours to use,’ and I also don’t believe that once you create something you lose all rights to how it can be commercialized.

What I do believe is that we need to move quickly to create a ‘safe harbor’ for AI startups to experiment without fear of legal repercussions so long as they meet certain conditions. As I wrote in April 2023,

“What would an AI Safe Harbor look like? Start with something like, ‘For the next 12 months any developer of AI models would be protected from legal liability so long as they abide by certain evolving standards.’ For example, model owners must:

Transparency: make it possible, for a given publicly available URL or submitted piece of media, to query whether the top level domain is included in the training set of the model. Simple visibility is the first step; all the ‘do not train on my data’ mechanics (aka robots.txt for AI) are going to take more thinking and tradeoffs from a regulatory perspective.
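To make the transparency idea concrete, here is a rough Python sketch of what such a lookup might look like. Everything in it is an assumption for illustration: the `TRAINING_DOMAINS` index, the function names, and the choice to match on a URL's hostname and its parent domains are all hypothetical, and it handles only URLs, not submitted media. A real version would also need a proper public-suffix lookup to identify a site's domain.

```python
from urllib.parse import urlsplit

# Hypothetical index of domains present in a model's training set.
# A real model owner would build this from their own data manifests.
TRAINING_DOMAINS = {"example.com", "news.example.org"}


def site_domain(url: str) -> str:
    """Return the lowercased hostname of a URL, without a leading 'www.'."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def in_training_set(url: str, index: set[str] = TRAINING_DOMAINS) -> bool:
    """True if the URL's host, or any parent domain of it, is in the index."""
    host = site_domain(url)
    if not host:
        return False
    labels = host.split(".")
    # Check "a.b.example.com", then "b.example.com", then "example.com", ...
    return any(".".join(labels[i:]) in index for i in range(len(labels)))


if __name__ == "__main__":
    for url in ("https://www.example.com/article/1", "https://example.org/"):
        print(url, in_training_set(url))
```

The point of keeping it this small is that a yes/no answer per domain is cheap to offer, which is why visibility seems like a reasonable first condition before any opt-out mechanics.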


Written by Hunter Walk
