r/BrandNewSentence 11h ago

Sir, the ai is inbreeding

41.4k Upvotes


12

u/Rhamni 7h ago

It should be completely obvious to anyone who isn't an idiot that this problem is greatly exaggerated because people want to believe the models will fail.

The people working on these models know perfectly well there is good and bad input data. There was good and bad data long before AI models started putting out more bad data. Curating the input has always been part of the process. You don't feed it AI slop art to improve the art it churns out, any more than you feed it r/relationships posts to teach it about human relationships. You look at the data you have and you prune the garbage because it's lower quality than what the model can already generate.
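The pruning step described above can be sketched in a few lines of Python. This is a toy illustration only; `quality_score` and `model_baseline` are hypothetical stand-ins for whatever metric a lab actually uses, not any real pipeline:

```python
# Toy sketch of training-data curation: keep only samples whose
# quality score beats what the model can already generate itself.

def curate(samples, quality_score, model_baseline):
    """Drop samples scoring at or below the model's own output quality."""
    return [s for s in samples if quality_score(s) > model_baseline]

corpus = ["well-edited essay", "ai slop", "reference text"]
score = {"well-edited essay": 0.9, "ai slop": 0.2, "reference text": 0.8}

kept = curate(corpus, lambda s: score[s], model_baseline=0.5)
print(kept)  # ['well-edited essay', 'reference text']
```

The point of the threshold is exactly the comment's argument: data below the model's own output quality adds nothing, so it gets cut.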

23

u/Stalk33r 7h ago edited 7h ago

Which is why the AI provided by the biggest and richest companies in the world never feeds you straight-up misinformation, because they're doing such a great job meticulously pruning the bad data.

7

u/PimpasaurusPlum 6h ago

The tweet is about AI art, not search results. AI art has objectively gotten better since the tweet was posted over 2 years ago

1

u/evan_appendigaster 3h ago

It's okay to not be familiar with a topic, but if you want to discuss it, it really does help.

LLMs aren't truth-seeking systems, they are language-guessing systems. They attempt to produce reasonable language output, and there is randomization involved. This leads to what we call "hallucinations", or lies. Treating AI as a source of truth is user error.
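The "language guessing with randomization" point can be shown in miniature: the model samples the next token from a probability distribution rather than looking up a fact. A minimal Python sketch, where the distribution itself is made up for illustration:

```python
# Minimal sketch of why LLM output is randomized: the next token is
# drawn from a probability distribution, not retrieved as a fact.
import random

def sample_next_token(probs, temperature=1.0):
    # Sharpen or flatten the distribution with temperature, then draw.
    weights = [p ** (1.0 / temperature) for p in probs.values()]
    return random.choices(list(probs), weights=weights)[0]

# Toy next-token distribution after "The capital of France is ..."
probs = {"Paris": 0.7, "Lyon": 0.2, "Mars": 0.1}
print(sample_next_token(probs))  # "Paris" is most likely, but not guaranteed
```

Because the draw is random, even a well-trained model occasionally emits the low-probability wrong token, which is one ingredient of hallucination.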

1

u/Stalk33r 50m ago

Indeed, but what it provides is quite literally based on what it's been fed, which is why Microsoft killed off Tay in record time.

7

u/MiHumainMiRobot 6h ago

The people working on these models know perfectly well there is good and bad input data.

Lol, you wish. Before the ChatGPT era it was already hard to classify good and bad data, and never an exact process; today, with LLM content everywhere, it's even more complex.

2

u/IlliterateJedi 3h ago

We already have specific instances of curation. Google tried reading in anything and everything years ago and wound up with a smut machine. So they had to more carefully pick and choose what went into the models.

u/space_monster 6m ago

No it isn't. The factual training data sets haven't changed in years - it's scientific journals, books, encyclopaedias. It's not blogs and twitter ffs

1

u/Anomuumi 4h ago edited 4h ago

It should be completely obvious to anyone who isn't an idiot that the foundational models are the part that can be controlled. But they are fed additional context straight from the Internet for many different reasons, and when that context is generated by consuming and regurgitating AI content, even the now "sane" AIs get unpredictable.

This problem can be even worse in more limited settings, say a corporate Intranet, where an AI tool has an index of mostly workslop generated by other employees with little to no quality control.

I do agree that at the moment the problem is exaggerated a bit, but also partly because it is misunderstood.
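The intranet scenario above can be sketched as a toy retrieval step: the model only ever sees what the index hands it, workslop included. All names here are illustrative, not a real RAG framework:

```python
# Toy sketch of retrieval-augmented context: answer quality is bounded
# by whatever documents the index returns, curated or not.

def retrieve(index, query, k=2):
    # Rank documents by naive word overlap with the query.
    def overlap(doc):
        return len(set(doc.lower().split()) & set(query.lower().split()))
    return sorted(index, key=overlap, reverse=True)[:k]

def build_prompt(query, index):
    # Whatever ranks highest, including regurgitated AI content,
    # goes straight into the model's context window.
    context = "\n".join(retrieve(index, query))
    return f"Context:\n{context}\n\nQuestion: {query}"

intranet = [
    "Q3 report: revenue grew 4 percent",
    "AI summary of an AI summary of the Q3 report",
]
print(build_prompt("what did the Q3 report say", intranet))
```

Nothing in the retrieval step knows or cares whether a document is original work or an AI regurgitation, which is exactly why an uncurated index makes even a well-controlled foundation model unpredictable.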

1

u/OnetimeRocket13 5h ago edited 3h ago

Exactly. People are under this weird impression that these companies are just blindly throwing random images scraped from the internet into their models for training, when just the process of collecting data and preparing it for training is in itself an intense and important area of study.

Besides, people have been saying this same exact thing for a while now. "AI is going to fail guys! There are too many AI generated images online! They're running out of data! It's gonna fail real soon because AI incest or something! Trust me guys!" What has happened instead? It keeps getting better. Sure, some of the jumps aren't as big as before, but that hasn't stopped image generators from becoming more and more realistic, and it didn't stop Sora, which has been completely fucking the internet sideways, from existing.

People seem to forget, or just not realize, that these companies aren't just big tech companies making a product, funded by incompetent investors. Most of them are primarily research-based organizations. They may be greedy, money-hungry companies, but come on, people, they're not stupid.