Mind Readings: Why AI Content Detection is a Lost Cause

AI, Artificial Intelligence, Generative AI

In today’s episode, we’re diving into the world of AI content detection and why it’s ultimately a losing battle. You’ll discover the concept of “computational asymmetry” and how it gives AI content creators a significant advantage. I’ll also reveal the three major challenges facing anyone trying to detect AI-generated content and explain why it’s only going to get harder. Tune in to prepare for the future of content creation in an AI-driven world.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

In today’s episode, let’s talk about AI content detection. A lot of people are very focused on this, trying to say, “Can we detect the use of AI in content that’s been created? Should social networks be detecting the use of AI in synthetic content?”

Here’s what I’ll say on this front from a technology perspective. Now, not from a legal or ethical perspective, because those are not my areas of expertise. I am a technologist. From a technology perspective, AI content detection is a losing battle.

Okay, why? It’s because of computational asymmetry. People and companies are racing to detect AI content, from schools trying to figure out whether a term paper was written with ChatGPT to Instagram trying to detect whether an image was machine-made.

Computational asymmetry is going to be the gating factor that prevents AI content detection from working really well. What does that mean? Let’s talk about some basic numbers from last year, according to Domo, which publishes the annual “Data Never Sleeps” study of what happens on the internet in 60 seconds.

If you put that in Instagram’s pipeline, it would be immediately backlogged by 1,195 photos in the first second of its operation. And, by the end of the day, you’ve got millions and millions of backlogged items that you will never get to.
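The backlog arithmetic above can be sketched in a few lines of Python. This is a rough illustration, not Instagram’s actual pipeline: the transcript does not preserve the upload rate or the detector’s speed, so `ARRIVALS_PER_SEC` and `CHECKS_PER_SEC` below are assumed placeholder values, picked only so that the first second reproduces the 1,195-photo backlog mentioned above.

```python
def backlog_after(seconds: int, arrivals_per_sec: float, checks_per_sec: float) -> float:
    """Items still waiting after `seconds`, assuming constant rates and an empty queue at the start."""
    return max(0.0, (arrivals_per_sec - checks_per_sec) * seconds)


# Assumed placeholder rates (not from the Domo study or from Instagram).
ARRIVALS_PER_SEC = 1196  # hypothetical uploads arriving each second
CHECKS_PER_SEC = 1       # hypothetical detector that checks one image per second

for label, secs in [("1 second", 1), ("1 minute", 60), ("1 day", 86_400)]:
    waiting = backlog_after(secs, ARRIVALS_PER_SEC, CHECKS_PER_SEC)
    print(f"Backlog after {label}: {waiting:,.0f} images")
```

Under these assumed rates the queue only ever grows. The backlog shrinks only if the detector checks more images per second than users upload.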

So, the code that you run to detect AI content has to be lightning fast. It also has to be cheap to run, partly because it has to be lightning fast: the more overhead your code consumes analyzing images or videos or music, the slower the rest of your services run, because you’re burning up CPUs and GPUs in your data center trying to keep up with the endless deluge of imagery.

We all know the adage, right? Fast, cheap, good—choose two. We know the code to detect AI-generated content, by definition, has to be fast, has to be cheap, because it has to scale so big, which means it’s not going to be very good. In fact, most of the code that detects AI-generated content tends to be dumb as a bag of hammers because of the speed and cost constraints.
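One way to see why the detector has to be fast and cheap is to estimate how many parallel workers it takes just to keep pace with uploads. The sketch below assumes a steady upload rate and a fixed time per check. The upload rate and both per-check timings are hypothetical numbers chosen for illustration, not measurements of any real detector.

```python
import math


def workers_needed(uploads_per_sec: int, ms_per_check: int) -> int:
    """Minimum number of parallel workers so that checks keep pace with uploads."""
    return math.ceil(uploads_per_sec * ms_per_check / 1000)


# All three values are assumptions for illustration only.
UPLOADS_PER_SEC = 1_000    # hypothetical steady upload rate
FAST_CHEAP_CHECK_MS = 50   # hypothetical quick, low-quality heuristic
SLOW_GOOD_CHECK_MS = 5_000 # hypothetical slow, higher-quality model

print("Fast, cheap detector:", workers_needed(UPLOADS_PER_SEC, FAST_CHEAP_CHECK_MS), "workers")
print("Slow, good detector:", workers_needed(UPLOADS_PER_SEC, SLOW_GOOD_CHECK_MS), "workers")
```

In this toy model, a check that takes 100 times longer needs 100 times the hardware, which is the pressure that pushes platforms toward the fast, cheap, not very good option.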

It’s an asymmetry problem. As a user of any AI creation tool, I can wait the five seconds or 15 seconds or 50 seconds for a really good model to build a really good image. And, because there are millions and millions of these users, they can create images with AI faster than software can detect them.

Picture all of us uploading millions of AI-generated images a day, and that’s with today’s stuff. This does not take into account the evolution of these models. Stable Diffusion 3 is coming out very, very soon. I believe they’re releasing the weights sometime in June 2024.

That model, when you look at the outputs, is substantially better and substantially more realistic than its predecessors. It’s got the right number of fingers on the hand. But, more than anything, when you look at the images it creates, they look pretty good. There are still things that are wrong, but there are fewer and fewer of them with every generation of these models.

Think about the text generation models: the new versions of ChatGPT and Google Gemini and Anthropic’s Claude are far better than their predecessors were even six months ago, much less a year ago. It’s June 2024 as I record this. A year ago, in June 2023, ChatGPT answers were not great. GPT-4 had just come out, and most people were using 3.5 because it was the free version. It sucked. I mean, it still sucks. It does an okay job of, like, classification and summarization, but it’s not a great writer. Today, a year later, the new GPT-4o (four Omni) model that everyone can use, free and paid, is much, much better.

So, this poses three big challenges when it comes to AI content detection.

Number one, human-made stuff is going to get flagged more, especially as these models improve. Your human content is going to get flagged more and more because these primitive detection algorithms will have a harder time catching up. The gap between what we can create and what models can create is getting smaller and smaller. And way over here, on the other end of the spectrum, are the detection algorithms that, because of cost and speed constraints, can’t catch up nearly as fast.

And so, as this gap closes, these relatively dumb tools will be retrained to be slightly less dumb and will make more and more mistakes, saying, “Hey, that’s AI generated,” when the answer is, “No, it’s not. That’s actually a picture of an aurora that I took in Boston in 2024.”

So, this is going to be a problem. Number two, AI stuff is going to get flagged less. This is especially true of open-weight models, where the model maker can offer watermarking technology, but users can just remove it for images and things like that. But again, that gap is getting smaller and smaller, which means that to avoid false positives, blaming a human and saying, “Oh, that’s AI-generated,” when it’s not, the tool by definition starts to miss more AI-generated things, too.
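The trade-off between the first two challenges can be sketched with a toy score threshold. This assumes a detector that outputs a single score between 0 (looks human) and 1 (looks AI) and flags anything at or above a cutoff. The scores below are made-up values that overlap on purpose, the way human and AI content increasingly do. They do not come from any real detector.

```python
def error_counts(threshold: float, human_scores: list[float], ai_scores: list[float]) -> tuple[int, int]:
    """Return (false_positives, false_negatives) for a given flagging threshold."""
    false_positives = sum(score >= threshold for score in human_scores)  # humans flagged as AI
    false_negatives = sum(score < threshold for score in ai_scores)      # AI content that slips through
    return false_positives, false_negatives


# Hypothetical detector scores; the two groups overlap deliberately.
HUMAN_SCORES = [0.10, 0.35, 0.55, 0.62, 0.70]
AI_SCORES = [0.45, 0.58, 0.66, 0.80, 0.95]

for threshold in (0.50, 0.60, 0.75):
    fp, fn = error_counts(threshold, HUMAN_SCORES, AI_SCORES)
    print(f"threshold={threshold:.2f}: {fp} humans wrongly flagged, {fn} AI items missed")
```

With these toy numbers, raising the threshold from 0.50 to 0.75 takes the wrongly flagged humans from 3 to 0 but takes the missed AI items from 1 to 3. As long as the scores overlap, no cutoff gets both numbers to zero.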

And, number three, compliance with AI labeling is going to be impossible for all of the biggest content networks because the compute costs for even primitive content detection are going to escalate way beyond affordability. This month, there’s a big discussion about the art app Cara. I don’t know how to pronounce it.

The creator went from 100,000 users to 500,000 users in the span of a couple of weeks because one of the app’s big things is: no AI, anything.

Because of architecture problems and a bunch of other things that went wrong, the creator, who is one person, got a $100,000 server bill. Now, imagine your pet project costing you $100,000. And this is with, again, relatively primitive detection of AI-generated content. It’s going to get harder and harder for anyone, except people who own massive server farms, to even detect AI content.

So, what does this mean for you? Well, two things. One, if you, whether that’s your company or you personally, accept user-generated content in any form, whether it’s a contact form on your website, uploads, or comments on your blog posts, you can expect to be swamped by AI-generated content if you aren’t already. People show up with AI bots even on platforms where there’s no benefit to automation and bots whatsoever. Go on to Archive of Our Own, which is a fan fiction site. There’s no commercial benefit there at all. There’s no reason to be leaving spam: because people can’t create links to sites, there’s no SEO benefit. People run bots there anyway. Anyone who accepts content from the outside world is going to be getting an awful lot of AI-generated content.

Two, you need to decide your personal, professional, and organizational positions on generating and disclosing the use of AI. There isn’t a right answer. For some organizations, it makes sense. For some organizations, it doesn’t. What you disclose above and beyond what’s legally required is up to you. And there is no right answer as to whether or not you should be using AI to generate stuff and whether you should be disclosing it above and beyond what is legally required.

So, something to think about as you embark on your use of AI.

If you enjoyed this video, please hit the like button. Subscribe to my channel if you haven’t already. And, if you want to know when new videos are available, hit the bell button to be notified as soon as new content is live.
