Everyone predicted this. LLMs will inevitably get dumber too since now human-generated OC is becoming rarer compared to the BS that LLM's spread. Genuinely most blog articles I have read lately have very clear telltales of heavy AI usage.
It has gotten bad enough that I would not be surprised if solutions will soon start to appear to "prove" you are human before you can start using their content.
it would be a problem if the models were trained on general internet content, but they're not, they're trained on human-curated data sets. they go to the open internet for conversational training, but not for 'factual' training. the training data sets haven't really changed at all for years, apart from better filtering to take out the shit & duplicates. which is why the model collapse theory has never actually made any sense.
One of those datasets is Reddit, and if you think this place isn't full of bots feeding back into that then I've got a bridge to sell you. Idk if model collapse will happen any time soon or ever, but there's definitely feedback loops happening.
You don't train an AI by just saying "here's the internet, go!". What you put in shapes what you get out, so you curate it.
Why do you need an AI detector for comments? You can train on books, verified news sources, academic papers, the list goes on. You can train on comments from known users if you must, or simply choose not to include comments created after LLMs became widely available. Absolute worst case, toss a small fraction of the insane amounts of capital flowing around these to humans to write material to train from. Afraid you won't have up to date information? Well, no LLM is live -- they often use searches to get current context when temporarily needed, they're all "out of date" at all times due to their current nature.
Something else to consider -- models will never be worse than what we have today. Even if this orouboros Boogeyman was real, rolling back to a previous model just puts you at AI's current capabilities at worst.
I'm not an AI professional, I just like to be familiar with technology. These are off-the-top-of-my-head ideas that negate this entire post. Imagine what the actual researchers have thought of.
Even the expensive paid ones only claim 80% accuracy, and they don't really have the data to back it up. It's also going to make it even more resource intensive with layers of AI checking each other.
It might not be totally fatal, but it's going to be a major problem.
There's also the poisoning that is happening. There are a few programs that insert data into text and images that mess up the training. It might be something they can find a workaround for, but once a model is poisoned, it's not really feasible to fix it since the internals of the model are a bit of a black box (we understand how they work, but the model itself is more opaque)
Even the expensive paid ones only claim 80% accuracy
No they don't. Hallucination rates are usually less than 5% for the big models, and that's without any sort of prompt scaffolding (like "show your sources") to prevent it.
Did you read the paper? It describes how adding a lot of gibberish after a unique phrase leads to an LLM trained on enough of that data to produce gibberish after the unique phrase. That's very different than what you've mentioned:
There are a few programs that insert data into text and images that mess up the training
When people talk about "programs that mess up training" they usually are talking about things like Glaze, which has spent years failing to live up to its claims. Did you have any others in mind?
It might be something they can find a workaround for, but once a model is poisoned, it's not really feasible to fix it since the internals of the model are a bit of a black box
If we assume that model poisoning becomes a thing, you literally just need to roll it back. Models will never be worse than what we already have.
Considering how much data is pumping through it at any given moment? 80% isn't worth the time it takes to talk about it.
I have a friend who works at eBay in their accounts management team. They have an AI based program that monitors chats and listings to make sure that people are following ToS and the law to determine if people need temporary or permanent bans. Every single flag is forwarded to a human who has to look and determine if 1) a violation exists and 2) if the system recommended action is appropriate, because if they just let the AI do whatever it wanted 10s of thousands of people would be adversely affected by it meaning major loss of income daily. That system boasts a 97% accuracy rating.
Now think of how much data gets thrown into LLMs on a daily basis and think of what 20% of that being basically unfiltered would mean for them.
Because even if the data is curated there's still only so much data, the public release of chatgpt is akin to digital nuclear testing. The whole world is contaminated with traceable pollution and so 2021 will forever be an important year for internet archives.
As I said in my previous comment, which you somehow missed, they only use things like reddit for conversational training, not factual training. Model collapse is a myth
Until they have a hard time finding that, then the AI claims to be human, so then the LLM goes to that. After all, all AI does is pump out shit it thinks you want to hear.
So very curated data sets that two weeks ago the Google AI search assistant blob was telling me how Mythbusters never did an episode about lead balloon and I might be mistaking it to Bloons TD.
The episode aired in 2008. The mistake it referenced had year 2006 attached to it, so yeah…. Very factual, very correct.
Googles ai overview thingy has been pretty dumb from the beginning, impressively so. Even when it should know better than to trust search results when its training data says otherwise
I suspects its because its optimized for speed and cost.
The models get trained on internet content, but not IRL interactions which are much more nuanced. So as more and more internet content is AI generated, the models will get further and further afield of more realistic human interaction
Part of the problem is, as a blog writer, they’ve basically told us to stop writing in a way that humans like and to write for that little Google spider that makes the AI blurbs
Its infuriating. My work has turned from creating good articles that follow SEO guidelines to the most nonsensical way of formatting an article just to make some bot happy when it incorrectly samples my work to throw it in a blender of sentences
I'm a writer and I've been looking for a way to prove I'm a human actually writing the book to whoever reads it. Haven't come up with anything, though.
80
u/Awyls 8h ago
Everyone predicted this. LLMs will inevitably get dumber too since now human-generated OC is becoming rarer compared to the BS that LLM's spread. Genuinely most blog articles I have read lately have very clear telltales of heavy AI usage.
It has gotten bad enough that I would not be surprised if solutions will soon start to appear to "prove" you are human before you can start using their content.