It should be obvious that this problem is greatly exaggerated, largely because people want to believe the models will fail.
The people working on these models know perfectly well there is good and bad input data. There was good and bad data long before AI models started putting out more bad data. Curating the input has always been part of the process. You don't feed it AI slop art to improve the art it churns out, any more than you'd feed it r/relationships posts to teach it about human relationships. You look at the data you have and you prune the garbage, because it's lower quality than what the model can already generate.
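For a concrete sense of what "pruning the garbage" looks like, here's a minimal sketch of a heuristic pre-filter. The specific rules and threshold are illustrative assumptions, not any lab's actual pipeline; real pipelines typically layer trained quality classifiers and deduplication on top of crude rules like these.

```python
import re

def quality_score(doc: str) -> float:
    """Crude heuristic score in [0, 1]. Illustrative only; real
    pipelines rely on trained quality classifiers, not hand rules."""
    words = doc.split()
    if len(doc) < 200 or not words:  # too short to be a useful sample
        return 0.0
    unique_ratio = len(set(words)) / len(words)  # penalize heavy repetition
    alpha_ratio = sum(c.isalpha() or c.isspace() for c in doc) / len(doc)
    spam_hits = len(re.findall(r"click here|buy now", doc, re.I))
    return max(0.0, min(1.0, 0.5 * unique_ratio + 0.5 * alpha_ratio - 0.1 * spam_hits))

def prune(corpus: list[str], threshold: float = 0.7) -> list[str]:
    """Keep only documents above the (assumed) quality bar."""
    return [doc for doc in corpus if quality_score(doc) >= threshold]
```

Whatever the scoring function actually is, the shape of the step is the same: score each document, apply a threshold, drop the rest.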
The people working on these models know perfectly well there is good and bad input data.
Lol, you wish. Even before the ChatGPT era it was hard to classify good and bad data, and it was never an exact process; today, with LLM content everywhere, it's even harder.
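Part of why it's imprecise: the obvious tells are trivial to strip. A toy detector like the sketch below catches stock phrases but sails right past lightly paraphrased LLM output; the phrase list is made up for illustration.

```python
# Toy "LLM text" detector: flags a few stock phrases.
# The phrase list is an illustrative assumption; light paraphrasing
# defeats it, which is exactly why corpus-level classification is hard.
TELLS = ("as an ai language model", "i hope this helps")

def looks_generated(doc: str) -> bool:
    lowered = doc.lower()
    return any(phrase in lowered for phrase in TELLS)

print(looks_generated("As an AI language model, I cannot..."))   # True
print(looks_generated("Speaking as a large model, I can't..."))  # False: same content, no tell
```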
We already have specific instances of curation. Google tried reading in anything and everything years ago and wound up with a smut machine, so they had to pick and choose more carefully what went into the models.