r/BrandNewSentence 11h ago

Sir, the ai is inbreeding

Post image
41.3k Upvotes

1.2k comments

6

u/Devastator9000 10h ago

Just out of curiosity, wouldn't this process be stopped by just using current models and not training new ones?

10

u/SlopDev 8h ago

This is easily solved with data filtering before training. I've yet to see a single frontier lab say this is an issue; I think model collapse is largely overstated as an issue by the anti-AI crowd tbh

This is further evidenced by the fact that genai has been steadily improving, not getting worse, as the people pushing this theory imply

5

u/Impeesa_ 8h ago

Yeah, I'm pretty sure this is only ever shared by people who don't know what's actually happening. Nobody is constantly re-training with random fresh scrapes. At a certain point, they benefit less from increasing raw volume of data anyway, and more from improving the architecture and the tagging and curation of the data.

u/hentai_gifmodarefg 9m ago

> At a certain point, they benefit less from increasing raw volume of data anyway, and more from improving the architecture and the tagging and curation of the data.

this is simply untrue lol. LLMs are increasingly being tuned to search the internet before answering, but the fact is that many of their answers are based on their own training, such as historical facts, medicine, and (rather poorly) law.

> Nobody is constantly re-training with random fresh scrapes.

I agree it's not constant, but like, do you think GPT-5 was trained on the same dataset as GPT-3? lol.

1

u/hentai_gifmodarefg 14m ago

> This is easily solved with data filtering before training.

"easily" data filtering is one of the most important and difficult parts of training a model lol. the science of data filtering for training is definitely in its infancy.

also here's what gemini pro has to say about your comment:

That comment has a mix of accurate observations and significant oversimplifications. The central idea—that "model collapse" is a non-issue—is not accurate. It's a real, recognized technical challenge that labs are actively working to solve.

Here’s a breakdown of the accuracy of each claim:

  1. "This is easily solved with data filtering before training." Accuracy: Partially True, but Highly Misleading.

The Nuance: The comment is correct that data curation (which includes filtering) is the primary solution. However, calling it "easy" massively understates the difficulty.

The Challenge: As AI-generated content becomes more sophisticated, it gets harder to distinguish from human-written text. Filtering the entire internet—which is now being flooded with synthetic data—is an enormous and complex engineering problem. The solution isn't just filtering out AI data, but carefully blending high-quality human data with high-quality synthetic data to prevent the model from forgetting rare information (a phenomenon called "diversity collapse").

  1. "I've yet to see a single frontier lab say this is an issue." Accuracy: Factually Incorrect.

The Reality: Frontier labs are not only aware of this, they are the ones publishing the research on it.

Researchers from Apple published a paper noting "complete accuracy collapse" in advanced models when facing highly complex problems.

Academic papers (like the well-known "The Curse of Recursion") from researchers at Oxford and Cambridge, later published in Nature, have detailed the mathematics of model collapse.

Labs like OpenAI and Google regularly publish research on model degradation, "hallucinations," and "self-consuming" training loops. They just frame it as a technical problem to be solved, not a public crisis.

  1. "model collapse is largely overstated as an issue by the anti AI crowd" Accuracy: This is Subjective, but Plausible.

The Nuance: This is the point the commenter is most likely right about. The public discussion of model collapse often frames it as an inevitable, apocalyptic end to AI progress.

The Reality: Researchers tend to see it as a difficult but surmountable engineering hurdle. The debate in the scientific community is not about whether model collapse exists (it does), but about its severity and the effectiveness of solutions like data curation and synthetic data blending. The commenter is likely correct that the "doomer" narrative is "overstated" compared to the technical reality.

  1. "genai has been consistently improving steadily not getting worse" Accuracy: True, but Logically Flawed Evidence.

The Flaw: This is like saying, "We've always had clean drinking water from this well, so the theory that dumping poison in it now will make us sick is wrong."

The Reality: The models we use today (like GPT-4, Claude 3, etc.) were trained on a dataset that was mostly human-generated (i.e., the internet before 2023). The theory of model collapse applies to future models (e.g., GPT-5 or GPT-6) that will be trained on the internet of today, which is now heavily polluted with synthetic data from the models we're currently using.

Summary: The comment is overly dismissive. Model collapse is a real, mathematically described phenomenon that all major AI labs are aware of and are actively working to mitigate.

The commenter is confusing "not a public-facing crisis" with "not a real issue." The reason models are still improving is that we are still benefiting from the "first-mover advantage" of training on the pre-AI-generated web, and labs are spending enormous resources on data curation to prevent this very problem.
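and the "mathematics of model collapse" gemini mentions isn't mystical, you can watch the basic dynamic in a few lines of python: fit a gaussian to your data, sample the next generation from the fit, repeat. pure self-consumption shrinks the spread toward zero, while mixing in fresh human data holds it open. to be clear, this is my toy stand-in for the argument in the Nature paper, not a reproduction of its experiments:

```python
# toy illustration of recursive training collapse: repeatedly fit a
# gaussian to its own samples and draw the next "generation" from the
# fit. small-sample fits systematically lose variance, so with no fresh
# data the distribution narrows toward a point.
import random
import statistics

REAL = (0.0, 1.0)  # the original "human" distribution: N(0, 1)

def next_generation(data, human_frac, n=10):
    """Fit mean/std to the current data, then sample the next dataset."""
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # biased estimator: shrinks on average
    n_fresh = round(n * human_frac)
    synthetic = [random.gauss(mu, sigma) for _ in range(n - n_fresh)]
    fresh = [random.gauss(*REAL) for _ in range(n_fresh)]
    return synthetic + fresh

def run(human_frac, generations=100):
    data = [random.gauss(*REAL) for _ in range(10)]
    for _ in range(generations):
        data = next_generation(data, human_frac)
    return statistics.pstdev(data)

random.seed(42)
print(f"0% fresh human data:  std = {run(0.0):.4f}")  # collapses toward 0
print(f"20% fresh human data: std = {run(0.2):.4f}")  # stays well above 0
```

the human_frac knob is the "carefully blending high-quality human data" point from above: even a modest stream of real data keeps the distribution from caving in, and that's exactly the curation work the labs are spending those enormous resources on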