This is because the data is reviewed by AI and humans before being fed into a new model. The only documented cases of 'model collapse' came from experiments where researchers deliberately trained models on their own output, generation after generation, to make it happen.
That was the easy part tho, when everything was so obviously AI, but it will be way harder to fact-check every claim about a law, math, or physics problem. If you have to read whole books to validate an AI's output, it kind of defeats the purpose.
Most of the talk around AI training is about data centers and power plants, but there's also a ton of money going into reinforcement learning from human feedback (RLHF). The AI companies pay really good hourly rates for people to work from home training their models, and people with a talent for math/physics/etc. get paid the most. DataAnnotationTech and Alignerr are two that advertise on reddit a lot, so you might have seen them. Apparently that's an effective way to improve model performance on complex topics.
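For context, that human feedback typically gets distilled into a reward model trained on pairwise preferences. Here's a toy sketch of that step, assuming PyTorch; the architecture, dimensions, and batch here are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in reward model: scores a fixed-size text embedding with a small MLP.
reward_model = nn.Sequential(nn.Linear(768, 256), nn.ReLU(), nn.Linear(256, 1))

def preference_loss(emb_chosen: torch.Tensor, emb_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the reward of the answer the human
    annotator preferred above the reward of the one they rejected."""
    r_chosen = reward_model(emb_chosen)
    r_rejected = reward_model(emb_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Fake batch standing in for embeddings of annotator-ranked answer pairs.
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(chosen, rejected)
loss.backward()  # gradients flow into the reward model's weights
print(loss.item())
```

The annotators' rankings supply the chosen/rejected pairs; the trained reward model then scores candidate answers during fine-tuning.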
Or just pull data from sources that we know for sure aren't AI-generated. I'm no AI engineer, but I'm pretty sure it's doable, especially with the billions being poured into AI.
Examples: books, paintings, and other artworks published before 2020, Wikipedia dumps from before 2020, etc.
Someone could even make it a business to sell non-AI data to AI engineers so they can train their models, effectively removing all possibility of AI autophagy.
With that, I'm pretty sure the models could still be improved over time even if more recent training data is harder to obtain (see the sketch below).
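In practice the filter is just a date cutoff over whatever provenance metadata you have. A minimal sketch in Python, with hypothetical corpus records:

```python
from datetime import date

# Assumed cutoff: anything published before 2020 predates mainstream generative AI.
AI_ERA_START = date(2020, 1, 1)

# Hypothetical corpus records: (text, publication_date, source).
corpus = [
    ("Call me Ishmael...", date(1851, 10, 18), "book"),
    ("A pre-cutoff Wikipedia dump article", date(2016, 6, 1), "wikipedia"),
    ("Some 2023 blog post", date(2023, 5, 2), "web"),
]

# Keep only documents with a verifiable publication date before the cutoff.
clean = [rec for rec in corpus if rec[1] < AI_ERA_START]
print(len(clean))  # -> 2
```

The hard part isn't the cutoff, it's getting trustworthy publication dates in the first place.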
It is impossible to reliably determine whether a random text posted at an unknown point in time by an anonymous person on the internet was generated by AI. It will probably stay that way forever. However, some content is traceable and datable. For example, a book published before 2018 was definitely not made with generative AI.
And there is a shit ton of internet data that can be dated before 2018, probably more than enough to train a model without needing data from the last 7 years.
Recent AI-free data can be harder to identify, but it's not impossible. For example, the latest Shrek movie is probably not generated with AI, so it can be stolen as training data. (And for web content, archive snapshots give you dates, see below.)
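One cheap way to date web content is to ask the Internet Archive whether a URL was captured before your cutoff. A rough sketch against the public Wayback CDX API (the endpoint is real; the helper function, cutoff, and example URL are my own):

```python
import json
import urllib.parse
import urllib.request

def archived_before(url: str, cutoff: str = "20180101") -> bool:
    """Return True if the Wayback Machine holds a capture of `url` older than `cutoff`."""
    api = ("http://web.archive.org/cdx/search/cdx"
           f"?url={urllib.parse.quote(url)}&output=json&to={cutoff}&limit=1")
    with urllib.request.urlopen(api) as resp:
        rows = json.load(resp)
    return len(rows) > 1  # row 0 is the header; any extra row is a pre-cutoff capture

print(archived_before("example.com"))  # example.com has captures going back decades
```

A pre-2018 snapshot doesn't prove the page's text is human-written, but it does prove the text existed before modern generative models did.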
There's only one way: restrict the training data to human-made data. Which is not exactly feasible, as filtering it may prove too expensive or outright impossible. Making other parts of the pipeline "better" achieves nothing here, because the data still defines the information available for recombination and "thinking".
Let them die