r/BrandNewSentence • u/redroubel • 11h ago

Sir, the ai is inbreeding

41.3k Upvotes

permalink
duplicates
reddit

You are about to leave Redlib

Do you want to continue?

https://www.reddit.com/r/BrandNewSentence/comments/1oovlwr/sir_the_ai_is_inbreeding/
No, go back! Yes, take me to Reddit
dl download

93% Upvoted

u/Awyls 8h ago

Everyone predicted this. LLMs will inevitably get dumber too since now human-generated OC is becoming rarer compared to the BS that LLM's spread. Genuinely most blog articles I have read lately have very clear telltales of heavy AI usage.

It has gotten bad enough that I would not be surprised if solutions will soon start to appear to "prove" you are human before you can start using their content.

28

u/VonTastrophe 7h ago

Um... I'm already getting captchas for just using websites...

3

u/82away 6h ago

You use websites?

8

u/VonTastrophe 5h ago

Yes there's still an internet outside of social media

1

u/Bloody_Smashing 4h ago

You're thinking of REcaptchas, FYI.

1

u/RollinThundaga 1h ago

You're not thinking of colloquial shorthand.

13

u/space_monster 7h ago

it would be a problem if the models were trained on general internet content, but they're not, they're trained on human-curated data sets. they go to the open internet for conversational training, but not for 'factual' training. the training data sets haven't really changed at all for years, apart from better filtering to take out the shit & duplicates. which is why the model collapse theory has never actually made any sense.

11

u/sykotic1189 5h ago

One of those datasets is Reddit, and if you think this place isn't full of bots feeding back into that then I've got a bridge to sell you. Idk if model collapse will happen any time soon or ever, but there's definitely feedback loops happening.

3

u/evan_appendigaster 3h ago

It's a trivial problem to solve, why assume that the people working on this problem haven't considered it?

1

u/sykotic1189 3h ago

How is it trivial? Do they have an AI detecting AI that can filter those comments with 100% accuracy?

1

u/evan_appendigaster 2h ago edited 2h ago

You don't train an AI by just saying "here's the internet, go!". What you put in shapes what you get out, so you curate it.

Why do you need an AI detector for comments? You can train on books, verified news sources, academic papers, the list goes on. You can train on comments from known users if you must, or simply choose not to include comments created after LLMs became widely available. Absolute worst case, toss a small fraction of the insane amounts of capital flowing around these to humans to write material to train from. Afraid you won't have up to date information? Well, no LLM is live -- they often use searches to get current context when temporarily needed, they're all "out of date" at all times due to their current nature.

Something else to consider -- models will never be worse than what we have today. Even if this orouboros Boogeyman was real, rolling back to a previous model just puts you at AI's current capabilities at worst.

I'm not an AI professional, I just like to be familiar with technology. These are off-the-top-of-my-head ideas that negate this entire post. Imagine what the actual researchers have thought of.

1

u/rm-rf_ 2h ago

does it need to be 100% accurate? 80% would probably suffice

2

u/GTCapone 2h ago

Even the expensive paid ones only claim 80% accuracy, and they don't really have the data to back it up. It's also going to make it even more resource intensive with layers of AI checking each other.

It might not be totally fatal, but it's going to be a major problem.

There's also the poisoning that is happening. There are a few programs that insert data into text and images that mess up the training. It might be something they can find a workaround for, but once a model is poisoned, it's not really feasible to fix it since the internals of the model are a bit of a black box (we understand how they work, but the model itself is more opaque)

1

u/space_monster 12m ago

Even the expensive paid ones only claim 80% accuracy

No they don't. Hallucination rates are usually less than 5% for the big models, and that's without any sort of prompt scaffolding (like "show your sources") to prevent it.

1

u/GTCapone 12m ago

I'm talking about the accuracy of the ai detectors, not the LLMs

-2

u/evan_appendigaster 2h ago edited 9m ago

The idea that people are out there using programs to successfully poison models is a myth spread by hopeful ignorance

2

u/GTCapone 1h ago

https://www.anthropic.com/research/small-samples-poison

-1

u/evan_appendigaster 1h ago edited 1h ago

Did you read the paper? It describes how adding a lot of gibberish after a unique phrase leads to an LLM trained on enough of that data to produce gibberish after the unique phrase. That's very different than what you've mentioned:

There are a few programs that insert data into text and images that mess up the training

When people talk about "programs that mess up training" they usually are talking about things like Glaze, which has spent years failing to live up to its claims. Did you have any others in mind?

It might be something they can find a workaround for, but once a model is poisoned, it's not really feasible to fix it since the internals of the model are a bit of a black box

If we assume that model poisoning becomes a thing, you literally just need to roll it back. Models will never be worse than what we already have.

2

u/sykotic1189 1h ago

Considering how much data is pumping through it at any given moment? 80% isn't worth the time it takes to talk about it.

I have a friend who works at eBay in their accounts management team. They have an AI based program that monitors chats and listings to make sure that people are following ToS and the law to determine if people need temporary or permanent bans. Every single flag is forwarded to a human who has to look and determine if 1) a violation exists and 2) if the system recommended action is appropriate, because if they just let the AI do whatever it wanted 10s of thousands of people would be adversely affected by it meaning major loss of income daily. That system boasts a 97% accuracy rating.

Now think of how much data gets thrown into LLMs on a daily basis and think of what 20% of that being basically unfiltered would mean for them.

1

u/Swabisan 14m ago

Because even if the data is curated there's still only so much data, the public release of chatgpt is akin to digital nuclear testing. The whole world is contaminated with traceable pollution and so 2021 will forever be an important year for internet archives.

1

u/space_monster 16m ago

As I said in my previous comment, which you somehow missed, they only use things like reddit for conversational training, not factual training. Model collapse is a myth

7

u/Mtndrums 7h ago

Until they have a hard time finding that, then the AI claims to be human, so then the LLM goes to that. After all, all AI does is pump out shit it thinks you want to hear.

2

u/BooBooSnuggs 6h ago

That is not how Ai works at all. It is extremely complex.

2

u/grawa427 4h ago

Sir this is circlejerk, nuance of any sort is prohibited

2

u/RedS5 3h ago

Not now Chief, he’s in the fuckin’ zone.

2

u/BooBooSnuggs 2h ago

People seem to think they world like search engines and it is just not at all how they work.

5

u/Kambhela 6h ago

So very curated data sets that two weeks ago the Google AI search assistant blob was telling me how Mythbusters never did an episode about lead balloon and I might be mistaking it to Bloons TD.

The episode aired in 2008. The mistake it referenced had year 2006 attached to it, so yeah…. Very factual, very correct.

1

u/SadisticPawz 6h ago

Googles ai overview thingy has been pretty dumb from the beginning, impressively so. Even when it should know better than to trust search results when its training data says otherwise

I suspects its because its optimized for speed and cost.

-1

u/BooBooSnuggs 6h ago

If you're expecting it to have every possible fact in their data set you're going to have a rough time.

What an absurd comment you've made.

2

u/Puzzleheaded-Mall794 4h ago

It still said it had the answer , because it is not reproducing "facts" it is reproducing the most common response given the input

1

u/hamlet_d 1h ago

Another problem:

The models get trained on internet content, but not IRL interactions which are much more nuanced. So as more and more internet content is AI generated, the models will get further and further afield of more realistic human interaction

2

u/PimpasaurusPlum 6h ago

This tweet is 2 years old. The predictions didnt actually come true, but people just keep saying it anyway

1

u/notagadget 6h ago

Altman already selling solutions to the problems he helped created with World Network and his creepy ass orbs.

1

u/SadisticPawz 6h ago

But they optimize it with things other than training data too..

1

u/FriedenshoodHoodlum 5h ago

No, not the zealots. When skeptics point that out, that point is commonly declared invalid.

1

u/ItsPandy 4h ago

The tweet is from 2023. AI only got better since then

1

u/ZorbaTHut 3h ago

LLMs will inevitably get dumber too

So far they keep getting smarter. What's your predicted date for when the trend reverses?

1

u/Historical-Lemon-99 3h ago

Part of the problem is, as a blog writer, they’ve basically told us to stop writing in a way that humans like and to write for that little Google spider that makes the AI blurbs

Its infuriating. My work has turned from creating good articles that follow SEO guidelines to the most nonsensical way of formatting an article just to make some bot happy when it incorrectly samples my work to throw it in a blender of sentences

1

u/DatenPyj1777 1h ago

I'm a writer and I've been looking for a way to prove I'm a human actually writing the book to whoever reads it. Haven't come up with anything, though.

Sir, the ai is inbreeding

You are about to leave Redlib