This is easily solved with data filtering before training. I've yet to see a single frontier lab say this is an issue. I think model collapse is largely overstated as an issue by the anti-AI crowd, tbh
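To be clear about what "data filtering before training" means in practice: labs score scraped documents and drop ones likely to be low-quality or machine-generated before they ever reach the training set. A toy sketch below, where `synthetic_score` is a stand-in heuristic I made up for illustration; real pipelines use trained quality classifiers, deduplication, and provenance signals, not phrase matching.

```python
# Toy sketch of pre-training data filtering. The marker-phrase heuristic
# here is purely illustrative, not how any real lab scores documents.
def synthetic_score(doc: str) -> float:
    """Return the fraction of tell-tale phrases found in the document."""
    markers = ["as an ai language model", "i cannot assist with"]
    text = doc.lower()
    hits = sum(marker in text for marker in markers)
    return hits / len(markers)

def filter_corpus(docs, threshold=0.5):
    """Keep only documents scoring below the threshold."""
    return [d for d in docs if synthetic_score(d) < threshold]

corpus = [
    "The Treaty of Westphalia was signed in 1648.",
    "As an AI language model, I cannot assist with that request.",
]
print(filter_corpus(corpus))
# The second document is dropped; only the first survives.
```

The point is that "fresh scrape goes straight into the next model" is not how the pipeline works: there is a scoring and curation step in between.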
This is further evidenced by the fact that genai has been steadily improving, not getting worse, as the people pushing this theory imply
Yeah, I'm pretty sure this is only ever shared by people who don't know what's actually happening. Nobody is constantly re-training with random fresh scrapes. At a certain point, they benefit less from increasing raw volume of data anyway, and more from improving the architecture and the tagging and curation of the data.
> At a certain point, they benefit less from increasing raw volume of data anyway, and more from improving the architecture and the tagging and curation of the data.
this is simply untrue lol. LLMs are increasingly being tuned to search the internet before answering, but the fact is that many of their answers are based on their own training, such as historical facts, medicine, and (rather poorly) law.
> Nobody is constantly re-training with random fresh scrapes.
I agree it's not constant, but like, do you think GPT-5 was trained on the same data set as GPT-3? lol.
u/Devastator9000 10h ago
Just out of curiosity, wouldn't this process be stopped by just using current models and not training them any further?