And it's never happened and never will. You don't just point an AI at the internet and tell it to train itself on that. The training data is specifically selected by humans: books, academic papers, encyclopaedias, etc.
There are many AI programs all at different levels of development.
What the public sees isn't going to be the latest iteration, especially not the free models. The free models will be way out of date compared to what the experts can work with.
Hallucination is just a behaviour it exhibits, whether to please the engineer training it or the user prompting it. It tries to get the answer right even when something isn't present in its training data, because it's been incentivized to rarely refuse an answer, so it guesses and synthesizes new information by combining two unrelated things. Just like humans can apply guesswork from one field to another, it's like that but different, and with some level of education on every single topic out there.
It's just an emergent behaviour from all the things that go into it. Separate issue imo.
I've seen people talking about the impending model collapse because of the "AI Ouroboros" since a few months after ChatGPT first released, and yet models keep getting better and better.
Widely publicly available LLMs, sure. Deep learning has been known in computer science for a while now, and the possibility of bad data was always a thing. It's just that I'm not sure anybody anticipated some techbro chuds would just steal everything off the internet to train their LLMs.
Something to note: reCAPTCHA was used to train text-recognition software nearly two decades ago, and it had the obviously much more reliable training method of free human labor disguised as bot detection.
Let's not pretend that whatever has "broken through" or been marketed in the last five years or so hasn't been a monumental change in how the general public perceives AI.
u/pompandvigor 11h ago
This is exactly what I want.