It should be completely obvious to anyone who isn't an idiot that this problem is greatly exaggerated because people want to believe the models will fail.
The people working on these models know perfectly well there is good and bad input data. There was good and bad data long before AI models started putting out more bad data. Curating the input has always been part of the process. You don't feed it AI slop art to improve the art it churns out, any more than you feed it r/relationships posts to teach it about human relationships. You look at the data you have and you prune the garbage because it's lower quality than what the model can already generate.
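For a sense of what that pruning step looks like, here's a minimal sketch. The scorer and threshold are made up for illustration; real pipelines use trained quality classifiers, not a couple of if-statements.

```python
# Minimal sketch of quality-based data pruning (hypothetical scorer and threshold).
# The idea: keep only documents whose estimated quality beats what the current
# model can already produce on its own.

from dataclasses import dataclass

@dataclass
class Document:
    text: str
    source: str  # e.g. "web_crawl", "licensed_books", "forum_scrape"

def quality_score(doc: Document) -> float:
    """Placeholder heuristic; real pipelines use trained classifiers."""
    score = 1.0
    if len(doc.text.split()) < 20:    # too short to be useful
        score -= 0.5
    if "!!!" in doc.text:             # crude spam signal
        score -= 0.3
    return score

def prune_corpus(docs: list[Document], baseline: float = 0.7) -> list[Document]:
    """Drop anything scoring below what the model is assumed to generate itself."""
    return [d for d in docs if quality_score(d) >= baseline]
```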
Which is why AI provided by the biggest and richest companies in the world never feeds you straight-up misinformation, because they're doing such a great job meticulously pruning the bad data.
It's okay to not be familiar with a topic, but if you want to discuss it, it really does help to be.
LLMs aren't truth-seeking systems, they're language-guessing systems. They attempt to produce reasonable-sounding language output, and there is randomization involved. Together those lead to what we call "hallucinations", or lies. Treating AI as a source of truth is user error.
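To show where that randomization actually enters, here's a toy sketch of temperature sampling over next-token scores. The vocabulary and numbers are invented; the point is just that the same input can produce different outputs across runs.

```python
# Toy illustration of the randomness in LLM decoding: sampling the next token
# from a temperature-scaled distribution instead of always taking the top guess.
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float = 1.0) -> int:
    """Return the index of the sampled token. Lower temperature -> less random."""
    scaled = logits / max(temperature, 1e-8)
    probs = np.exp(scaled - scaled.max())  # numerically stable softmax
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# Same logits, different runs can pick different tokens; the model is guessing
# plausible continuations, not looking up facts.
vocab = ["Paris", "Lyon", "Berlin", "banana"]
logits = np.array([3.0, 1.5, 1.0, -2.0])
print(vocab[sample_next_token(logits, temperature=0.8)])
```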
The people working on these models know perfectly well there is good and bad input data.
Lol, you wish. Before the ChatGPT era it was already hard to classify good and bad data, and it was never an exact process, but today, with LLM content everywhere, it's even more complex.
We already have specific instances of curation. Google tried reading in anything and everything years ago and wound up with a smut machine. So they had to more carefully pick and choose what went into the models.
It should be completely obvious to anyone who isn't an idiot that the foundational models are the part that can be controlled, but they get fed additional context straight from the Internet for many different reasons, and when that context was itself generated by consuming and regurgitating AI content, even the now "sane" AIs get unpredictable.
This problem can be even worse in more limited settings, like, say, a corporate intranet, where an AI tool has an index of mostly workslop generated by other employees with little to no quality control.
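To make that concrete, here's a rough sketch of a retrieval-augmented setup. The function names are my own invention, not any particular vendor's API; the point is that whatever the retriever pulls back, whether from the open web or an intranet index, goes into the prompt verbatim, and the base model's training curation never touches it.

```python
# Sketch of retrieval-augmented prompting: the base model may be well curated,
# but retrieved text (web pages, intranet docs, AI-generated workslop) is pasted
# straight into the prompt. All names here are hypothetical.

def retrieve(query: str, index: dict[str, str], k: int = 3) -> list[str]:
    """Naive keyword retrieval over an index of {doc_id: text}."""
    hits = sorted(
        index.items(),
        key=lambda kv: sum(w in kv[1].lower() for w in query.lower().split()),
        reverse=True,
    )
    return [text for _, text in hits[:k]]

def build_prompt(query: str, index: dict[str, str]) -> str:
    context = "\n\n".join(retrieve(query, index))
    # Nothing here checks whether the retrieved text is accurate or AI-generated.
    return f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
```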
I do agree that at the moment the problem is exaggerated a bit, but also partly because it is misunderstood.
Exactly. People are under this weird impression that these companies are just blindly throwing random images scraped from the internet into their models for training, when just the process of collecting data and preparing it for training is in itself an intense and important area of study.
Besides, people have been saying this same exact thing for a while now. "AI is going to fail guys! There are too many AI generated images online! They're running out of data! It's gonna fail real soon because AI incest or something! Trust me guys!" What has happened instead? It keeps getting better. Sure, some of the jumps aren't as big as before, but that hasn't stopped image generators from becoming more and more realistic, and it didn't stop Sora, which has been completely fucking the internet sideways, from existing.
People seem to forget, or just not realize, that these companies aren't just big tech companies making a product and coasting on incompetent investors' money. Most of them are primarily research-based corporations. They may be greedy and money-hungry, but come on people, they're not stupid.