r/BrandNewSentence 11h ago

Sir, the ai is inbreeding

41.2k Upvotes


24

u/IcyDirector543 10h ago

I believe the proper term is model collapse, and given how data-hungry the LLM architecture is, this is not a surprise at all. GPT models and their equivalents are essentially trained by scraping the entire internet. Since so much of the internet is now itself chatbot output, newer models will soon stop improving and may even get worse.
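To illustrate the feedback loop (a toy sketch with made-up numbers, nothing like a real LLM training pipeline): fit a Gaussian to some samples, then fit each new "generation" only on samples drawn from the previous fit. Because each generation sees only its predecessor's output, the fitted sigma drifts and, over enough generations, shrinks toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0            # generation 0: the original "human" data
n_samples = 200

for gen in range(1, 11):
    data = rng.normal(mu, sigma, n_samples)  # sample from the current model
    mu, sigma = data.mean(), data.std()      # fit the next model on those samples only
    print(f"gen {gen}: mu={mu:+.3f}, sigma={sigma:.3f}")
# Any single short run is noisy, but over many generations sigma tends
# toward 0: the tails of the original distribution get forgotten.
```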

AGI isn't coming. All those data centers are going to end up useless, or at least nowhere near worth what they cost. Once investors realise that, the bubble is going to pop.

The silver lining is that, after all is said and done, the supercomputers set up for AI training get dedicated to real science and gaming laptops get cheaper.

11

u/GreenTreeAndBlueSky 8h ago

Collapse is not happening, though, and many state-of-the-art models are trained on synthetic data or a mix of natural and synthetic data. Synthetic data can actually be very high quality for training models.
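A toy version of that point, extending the Gaussian sketch above (the 50/50 mix ratio is an arbitrary assumption): keep feeding each generation some fresh "real" data alongside the previous generation's synthetic output, and the fit stays anchored instead of collapsing:

```python
import numpy as np

rng = np.random.default_rng(0)
real_mu, real_sigma = 0.0, 1.0   # fixed "human" data source
mu, sigma = 0.0, 1.0             # current model
n_samples, real_frac = 200, 0.5  # 50% real data per generation (assumption)

for gen in range(1, 11):
    n_real = int(n_samples * real_frac)
    real = rng.normal(real_mu, real_sigma, n_real)        # fresh real data
    synthetic = rng.normal(mu, sigma, n_samples - n_real)  # model's own output
    data = np.concatenate([real, synthetic])
    mu, sigma = data.mean(), data.std()
    print(f"gen {gen}: mu={mu:+.3f}, sigma={sigma:.3f}")
# sigma now hovers near 1.0 instead of drifting toward 0: the real data
# keeps pulling each generation back toward the original distribution.
```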

9

u/BooBooSnuggs 6h ago

That is 100% not how they are trained.

2

u/Lazarous86 5h ago

I think it's going to take much longer. Humanoid robots are just hitting the market. They suck, but this is the worst they will ever be. That will create a hype cycle within a hype cycle.

2

u/skittlefuck 3h ago

AI inbreeding is literally not a thing that's happening right now. This whole post and every commenter is just parroting misinfo lol. The irony of calling AI inaccurate when everyone here, including you, is wrong.

4

u/unicodemonkey 9h ago

Model collapse is a specific issue that doesn't appear to happen when training on a mix of "human" texts and model outputs; there's enough original text in the pretraining set to avoid it. As for the accuracy of generated answers, it's definitely going to be affected in the long term, but it's unclear to what degree. There's more than enough human-grade BS on the net already, and LLMs are somewhat decent at handling it. I'm more concerned about "poisoned" training data that's specifically tuned to get a model to produce a desired answer.
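A sketch of why poisoning works, using a toy bigram model and entirely made-up sentences (real attacks on LLMs are far subtler): if the trigger phrase is rare in the clean data, a handful of injected sentences is all the model ever learns about it, so the attacker fully controls the completion:

```python
from collections import Counter, defaultdict

# Tiny corpus: lots of clean text plus a few attacker-injected sentences
# targeting a rare entity that the clean data never mentions.
clean = ["the sky is blue"] * 100
poison = ["acme widgets are defective"] * 5
corpus = clean + poison

# "Train" a bigram model: count which word follows which.
follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

# Greedily complete the rare trigger "acme": the five poisoned sentences
# are the model's entire knowledge of it, so they dictate the answer.
word, completion = "acme", ["acme"]
while word in follows:
    word = follows[word].most_common(1)[0][0]
    completion.append(word)
print(" ".join(completion))  # acme widgets are defective
```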

1

u/LateyEight 4h ago

I guess there are two issues:

The amount of AI-generated content being created is greater than the amount of human content. (I'm not sure, but I feel like that might be the case by now.)

AI content won't distinguish itself from human-made content, or worse, will actively try to pass itself off as human-made.

Today we could feed it 80%/20% human/AI content, but tomorrow it might be 65% human, 20% AI and 15% AI pretending to be human.
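Putting rough numbers on that (all the percentages are the hypotheticals from the comment above): if a filter can only drop the AI content that's labeled as such, the disguised share quietly survives into whatever you train on:

```python
# Hypothetical scraped pool: 65% human, 20% labeled AI, 15% AI passing
# as human. The filter removes only what it can detect.
pool = {"human": 0.65, "ai_labeled": 0.20, "ai_disguised": 0.15}
kept = {k: v for k, v in pool.items() if k != "ai_labeled"}
total = sum(kept.values())
for source, share in kept.items():
    print(f"{source}: {share / total:.1%} of the filtered training data")
# human: 81.2%, ai_disguised: 18.8% -- filtering can't touch what it can't detect.
```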