Every time OpenAI cuts a check for training data, an unlaunched competitive startup dies. Without a ‘safe harbor,’ AI will be ruled by incumbents.
The checks being cut to ‘owners’ of training data are creating a huge barrier to entry for challengers. If Google, OpenAI, and other large tech companies can establish a high enough cost, they implicitly prevent future competition. Not very Open.
Model efficacy is roughly [ technical IP/approach] * [ training data] * [ training frequency/feedback loop]. Right now I’m comfortable betting on innovation from small teams in the ‘approach,’ but if experimentation is gated by nine figures worth of licensing deals, we are doing a disservice to innovation.
These business deals are a substitute for unclear copyright and usage laws. Companies like the New York Times are willing to litigate this issue (at least as a negotiation strategy). It’s likely that our regulations need to update ‘fair use.’ I need to think more about where I land on this — companies which exploit/overweight a data source that wasn’t made available to them for commercial purposes do owe the rights owner. Rights owners should be able to automatically set some sort of protections for at least a period of time (similar to Creative Commons or robots.txt). I don’t believe ‘if it can be scraped, it’s yours to use’ and I…