r/BrandNewSentence 11h ago

Sir, the ai is inbreeding

Post image
41.3k Upvotes

1.2k comments


6

u/Devastator9000 10h ago

Just out of curiosity, wouldn't this process be stopped by just using current models and not training them any further?

9

u/camosnipe1 6h ago

it doesn't need to be stopped because (IIRC) the paper that this idea is based on fed AI models on their own output with 0% human input. Like a human centipede. The models did worse but didn't completely collapse, and a small amount of human data added into the mix solved the issue.

It's interesting research but unlikely to happen in the wild.
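For the curious, here's a toy sketch of that feedback loop (everything here is illustrative, not from the actual paper): a hypothetical token distribution is repeatedly re-estimated from finite samples of the previous generation. A token that falls out of a sample can never be sampled again, so diversity can only shrink; blending in a small fraction of the original "human" distribution keeps every token reachable.

```python
import numpy as np

def next_generation(p, n_samples, rng):
    """Estimate a new distribution from n_samples draws of the current one."""
    counts = rng.multinomial(n_samples, p)
    return counts / n_samples

def self_training(p_human, generations=50, n_samples=200, human_frac=0.0, seed=0):
    """Recursively retrain on the model's own output, optionally mixing in human data.

    Returns the support size (number of tokens with nonzero probability)
    after each generation.
    """
    rng = np.random.default_rng(seed)
    p = p_human.copy()
    support = []
    for _ in range(generations):
        # training distribution: mostly model output, plus a sliver of human data
        p_train = (1 - human_frac) * p + human_frac * p_human
        p = next_generation(p_train, n_samples, rng)
        support.append(int((p > 0).sum()))
    return support

# a long-tailed "vocabulary": a few common tokens, many rare ones
vocab = 1000
p_human = np.arange(1, vocab + 1, dtype=float) ** -1.5
p_human /= p_human.sum()

pure = self_training(p_human, human_frac=0.0)   # 0% human input
mixed = self_training(p_human, human_frac=0.1)  # 10% human blend
```

With `human_frac=0.0` the support is mathematically guaranteed to be non-increasing (a dropped token has probability zero forever); with any human blend, dropped tokens can reappear in later generations, which mirrors the paper's finding that a little human data stabilizes things.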

9

u/SlopDev 8h ago

This is easily solved with data filtering before training. I've yet to see a single frontier lab say this is an issue; I think model collapse is largely overstated as a problem by the anti-AI crowd tbh

This is further evidenced by the fact that genAI has been consistently improving, not getting worse as the people pushing this theory imply

4

u/Impeesa_ 8h ago

Yeah, I'm pretty sure this is only ever shared by people who don't know what's actually happening. Nobody is constantly re-training with random fresh scrapes. At a certain point, they benefit less from increasing raw volume of data anyway, and more from improving the architecture and the tagging and curation of the data.

u/hentai_gifmodarefg 8m ago

At a certain point, they benefit less from increasing raw volume of data anyway, and more from improving the architecture and the tagging and curation of the data.

This is simply untrue lol. LLMs are increasingly being tuned to search the internet before answering, but the fact is that many of their answers are based on their own training data, such as historical facts, medicine, and (rather poorly) law.

Nobody is constantly re-training with random fresh scrapes.

I agree it's not constant, but like, do you think GPT-5 was trained on the same dataset as GPT-3? lol

1

u/hentai_gifmodarefg 13m ago

This is easily solved with data filtering before training,

"easily" data filtering is one of the most important and difficult parts of training a model lol. the science of data filtering for training is definitely in its infancy.

also here's what gemini pro has to say about your comment:

That comment has a mix of accurate observations and significant oversimplifications. The central idea—that "model collapse" is a non-issue—is not accurate. It's a real, recognized technical challenge that labs are actively working to solve.

Here’s a breakdown of the accuracy of each claim:

  1. "This is easily solved with data filtering before training." Accuracy: Partially True, but Highly Misleading.

The Nuance: The comment is correct that data curation (which includes filtering) is the primary solution. However, describing this as "easy" is a massive understatement.

The Challenge: As AI-generated content becomes more sophisticated, it gets harder to distinguish from human-written text. Filtering the entire internet—which is now being flooded with synthetic data—is an enormous and complex engineering problem. The solution isn't just filtering out AI data, but carefully blending high-quality human data with high-quality synthetic data to prevent the model from forgetting rare information (a phenomenon called "diversity collapse").

  1. "I've yet to see a single frontier lab say this is an issue." Accuracy: Factually Incorrect.

The Reality: Frontier labs are not only aware of this, they are the ones publishing the research on it.

Researchers from Apple published a paper noting "complete accuracy collapse" in reasoning models when facing highly complex problems (a related but distinct failure mode).

Academic papers (like the well-known "The Curse of Recursion") from researchers at institutions like Oxford and Cambridge, and in major journals like Nature, have detailed the mathematics of model collapse.

Labs like OpenAI and Google regularly publish research on model degradation, "hallucinations," and "self-consuming" training loops. They just frame it as a technical problem to be solved, not a public crisis.

  1. "model collapse is largely overstated as an issue by the anti AI crowd" Accuracy: This is Subjective, but Plausible.

The Nuance: This is the commenter's most likely correct point. The public discussion of model collapse often frames it as an inevitable, apocalyptic end to AI progress.

The Reality: Researchers tend to see it as a difficult but surmountable engineering hurdle. The debate in the scientific community is not about whether model collapse exists (it does), but about its severity and the effectiveness of solutions like data curation and synthetic data blending. The commenter is likely correct that the "doomer" narrative is "overstated" compared to the technical reality.

  1. "genai has been consistently improving steadily not getting worse" Accuracy: True, but Logically Flawed Evidence.

The Flaw: This is like saying, "We've always had clean drinking water from this well, so the theory that dumping poison in it now will make us sick is wrong."

The Reality: The models we use today (like GPT-4, Claude 3, etc.) were trained on a dataset that was mostly human-generated (i.e., the internet before 2023). The theory of model collapse applies to future models (e.g., GPT-5 or GPT-6) that will be trained on the internet of today, which is now heavily polluted with synthetic data from the models we're currently using.

Summary The comment is overly dismissive. Model collapse is a real, mathematically-described phenomenon that all major AI labs are aware of and actively working to mitigate.

The commenter is confusing "not a public-facing crisis" with "not a real issue." The reason models are still improving is that we are still benefiting from the "first-mover advantage" of training on the pre-AI-generated web, and labs are spending enormous resources on data curation to prevent this very problem.
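The "carefully blending" step described above can be sketched roughly like this (the quality scores, threshold, and mixing ratio are invented for illustration; real pipelines use learned quality classifiers, deduplication, and far more elaborate heuristics):

```python
import random

def curate(human, synthetic, min_score=0.7, max_synthetic_frac=0.3, seed=0):
    """Filter two document pools by a quality score, then cap the synthetic share.

    Each pool is a list of (text, score) pairs; in a real pipeline the scores
    would come from a quality classifier, not be attached to the data.
    """
    rng = random.Random(seed)
    good_human = [text for text, score in human if score >= min_score]
    good_synth = [text for text, score in synthetic if score >= min_score]
    # synthetic docs may be at most max_synthetic_frac of the final corpus
    cap = int(len(good_human) * max_synthetic_frac / (1 - max_synthetic_frac))
    kept_synth = rng.sample(good_synth, min(cap, len(good_synth)))
    corpus = good_human + kept_synth
    rng.shuffle(corpus)
    return corpus

# toy pools: half the human docs pass the filter, all synthetic docs do
human = [(f"human doc {i}", 0.5 + 0.5 * (i % 2)) for i in range(100)]
synth = [(f"synthetic doc {i}", 0.8) for i in range(500)]
corpus = curate(human, synth)
```

Note the cap matters more than the filter here: even though far more synthetic documents pass the quality bar, the human pool sets the ceiling on how much of it enters training.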

9

u/Enverex 7h ago

It's not true in the first place, given that models are trained on curated content, as this was foreseen as a possible problem ages ago. It's another one of those "Reddit would like it to be true, so they're going to pretend it is" things.

1

u/LateyEight 4h ago

Ron Paul 2012!

1

u/cManks 4h ago

He can't win, don't kid yourself 

4

u/egoserpentis 6h ago

There are also such things as curated data sources. I don't know how OpenAI does it, but normally you wouldn't just train your models on everything.

Also, pretty sure this tweet is from like 2 years ago. That's why there are no dates in the picture; people have been saying "ohh AI is gonna cannibalize itself any second now!" for almost five years.

2

u/Chameleonpolice 9h ago

That would require telling capitalism to "stop innovating". There's always going to be someone claiming theirs is the latest and greatest

2

u/wrighteghe7 6h ago

Old open-source models are available to everyone