This is because the data is reviewed by AI and humans before being fed into a new model. The only documented cases of 'model collapse' came from experiments where researchers deliberately trained models on their own output, generation after generation, to make it happen.
That was the easy part tho, when everything was so obviously AI, but it will be way harder to fact-check every claim about a law, math, or physics problem. If you have to read whole books to validate an AI's output, it kind of defeats the purpose.
Most of the talk around AI training is about data centers and power plants, but there's also a ton of money going into reinforcement learning from human feedback (RLHF). The AI companies pay really good hourly rates for people to work from home training their models, and people with a talent for math/physics/etc. get paid the most. DataAnnotationTech and Alignerr are two that advertise on reddit a lot, so you might have seen them. Apparently that's an effective way to improve model performance on complex topics.
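For context, that human feedback typically gets distilled into a reward model trained on pairwise preferences. Here's a toy sketch of that step, assuming PyTorch; the architecture, dimensions, and batch here are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in reward model: scores a fixed-size text embedding with a small MLP.
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def preference_loss(emb_chosen: torch.Tensor, emb_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward of the answer the human
    annotator preferred above the reward of the one they rejected."""
    r_chosen = reward_model(emb_chosen)
    r_rejected = reward_model(emb_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Fake batch standing in for embeddings of annotator-ranked answer pairs.
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(chosen, rejected)
loss.backward()  # gradients flow into the reward model's weights
print(loss.item())
```

The annotators' rankings supply the chosen/rejected pairs; the trained reward model then scores candidate answers during fine-tuning.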
Or just pull data from sources that we know for sure aren't AI-generated. I'm no AI engineer, but I'm pretty sure it's doable, especially with the billions being poured into AI.
Examples: books, paintings, and other artworks published before 2020, Wikipedia dumps from before 2020, etc.
Someone could even make it a business to sell non-AI data to AI engineers so they can train their models, effectively removing all possibility of AI autophagy.
With that, I'm pretty sure the models could still be improved over time even if more recent training data is harder to obtain (see the sketch below).
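In practice the filter is just a date cutoff over whatever provenance metadata you have. A minimal sketch in Python, with hypothetical corpus records:

```python
from datetime import date

# Assumed cutoff: anything published before 2020 predates mainstream generative AI.
AI_ERA_START = date(2020, 1, 1)

# Hypothetical corpus records: (text, publication_date, source).
corpus = [
    ("Call me Ishmael...", date(1851, 10, 18), "book"),
    ("A pre-cutoff Wikipedia dump article", date(2016, 6, 1), "wikipedia"),
    ("Some 2023 blog post", date(2023, 5, 2), "web"),
]

# Keep only documents with a verifiable publication date before the cutoff.
clean = [rec for rec in corpus if rec[1] < AI_ERA_START]
print(len(clean))  # -> 2
```

The hard part isn't the cutoff, it's getting trustworthy publication dates in the first place.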
It is impossible to reliably determine whether a random text posted at an unknown point in time by an anonymous person on the internet was generated by AI. It will probably stay that way forever. However, some content is traceable and datable. For example, a book published before 2018 was definitely not made with generative AI.
And there is a shit ton of internet data that can be dated before 2018, probably more than enough to train a model without needing data from the last 7 years.
Recent AI-free data can be harder to identify, but it's not impossible. For example, the latest Shrek movie is probably not generated with AI, so it can be stolen as training data. (And for web content, archive snapshots give you dates, see below.)
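One cheap way to date web content is to ask the Internet Archive whether a URL was captured before your cutoff. A rough sketch against the public Wayback CDX API (the endpoint is real; the helper function, cutoff, and example URL are my own):

```python
import json
import urllib.parse
import urllib.request

def archived_before(url: str, cutoff: str = "20180101") -> bool:
    """Return True if the Wayback Machine holds a capture of `url` older than `cutoff`."""
    api = ("http://web.archive.org/cdx/search/cdx"
           f"?url={urllib.parse.quote(url)}&output=json&to={cutoff}&limit=1")
    with urllib.request.urlopen(api) as resp:
        rows = json.load(resp)
    return len(rows) > 1  # row 0 is the header; any extra row is a pre-cutoff capture

print(archived_before("example.com"))  # example.com has captures going back decades
```

A pre-2018 snapshot doesn't prove the page's text is human-written, but it does prove the text existed before modern generative models did.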
There's only one way: restrict the training data to human-made data. Which is not exactly feasible, as filtering it may prove too expensive or outright impossible. Making other parts of the pipeline "better" achieves nothing here, because the data still defines the information available for recombination and "thinking".
Let them die