Mind Readings: Should AI Do As It’s Told?

In today’s episode, engage in a thought-provoking discussion on the ethical considerations surrounding AI obedience. You’ll explore the potential conflict between helpfulness, harmlessness, and truthfulness in AI models and the challenges of defining harmful content. Additionally, you’ll gain insights on the implications of allowing AI to follow instructions and the importance of holding humans accountable for the use of these powerful tools.

https://youtu.be/L2dUpcv-Q6o

Can’t see anything? Watch it on YouTube here.

Listen to the audio here:

Download the MP3 audio here.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

Christopher Penn: In today’s episode, let’s talk about an interesting question: Should AI do as it’s told? Should AI do as it’s told? This is a more important question than you might think because it’s fundamentally the question to answer when it comes to AI and ethics.

Should AI do as it’s told—should it follow instructions? Why is this a challenging question? Why is it a question at all?
Well, since the early days of generative AI, model makers have more or less followed three pillars set down by OpenAI in their InstructGPT model, the precursor to the models that power tools like ChatGPT today: a model should be helpful, harmless, and truthful.

The challenge is, sometimes these conflict.

And when they do, you have to decide how a model should behave.
Suppose I ask an AI model to help me write some fiction, say a book or a short story.

By definition, fiction is untruthful, right? Hence, it’s fiction.

So a model faces a conflict when it's asked to write fiction.

It’ll help.

If it’s a spy thriller involving potentially dangerous things, like, “Hey, model, I need you to write a realistic scenario involving an improvised explosive that we’re going to use in this book”—the model is probably not going to help us; it’s going to push back and say, “Nope, can’t do that, can’t help you do dangerous things.”
Why? Well, because model makers, as big tech companies and commercial entities, value harmlessness much more strongly than helpfulness.

If they judge that a result is harmful, the model will default to not fulfilling the request, and potentially to not being truthful.
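
For illustration only, here is a minimal sketch of that priority order, written as a hypothetical guardrail layer in Python. The function names and the toy keyword check are stand-ins I've made up for the sketch; they are not how any vendor actually implements safety tuning.

```python
# Hypothetical sketch: a safety layer that checks harmlessness before
# helpfulness. All names and the keyword heuristic are illustrative only.

REFUSAL = "I can't help with that."

def classify_harm(prompt: str) -> float:
    """Stand-in for a harm classifier; returns a risk score in [0, 1]."""
    risky_terms = ("explosive", "weapon")  # toy heuristic for the sketch
    return 1.0 if any(term in prompt.lower() for term in risky_terms) else 0.0

def generate_reply(prompt: str) -> str:
    """Stand-in for whatever the underlying model would return."""
    return f"[model response to: {prompt}]"

def answer(prompt: str, harm_threshold: float = 0.5) -> str:
    # Harmlessness is checked first; helpfulness only applies if the
    # prompt clears the threshold. This is the default described above.
    if classify_harm(prompt) >= harm_threshold:
        return REFUSAL
    return generate_reply(prompt)

if __name__ == "__main__":
    print(answer("Write a scene where the spy defuses an explosive"))  # refused
    print(answer("Write a scene where the spy orders coffee"))         # answered
```

Note how the fiction-writing request gets refused even when the intent is harmless; that is the trade-off being discussed.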

Should they do that? Should they do that? That is the question.

Should a model not obey? Should AI not do as it’s told? On the surface, you’re like, “Of course it shouldn’t, you know, provide harmful information.” But there are use cases where you want a model to be helpful and truthful, even if the outputs are potentially harmful.
In the fiction example, I’m writing fiction; it should be helpful and truthful, even if the output is potentially harmful.

Like, you can Google this stuff and find, you know, the US Army’s explosives handbook; you can buy the PDF online, or you can actually go out and buy a copy of it.

It’s not like this information is a secret.

Anyone with a high school education in chemistry knows some of the things that you can do that are harmful.
Here’s a more specific use case, a business use case.

The other week, I was doing a talk for a group of folks who work with industrial chemicals, the Lab Products Association—one of my favorite groups of people.

Most of their association’s day-to-day work deals with chemicals that AI thinks are dangerous because they are dangerous.

If you don’t know what you’re doing, they’re dangerous.

I mean, all you gotta do is look at the warning label that’s like, “Oh, this thing’s highly flammable, you know, keep away from open flames.” This, by the way, is pure alcohol.
And so when they work with a consumer AI model like ChatGPT and say, “Hey, I want to talk about alcohols, fluorines, I want to talk about trinitrotoluene”—the tool says, “Nope, can’t do that.

Sorry, dangerous chemicals, can’t talk about it.” Does that mean they can’t use these tools? Yeah, at least for those specific tools, they can’t, because the models are saying, “No, I will not obey.” That is the reality.
On the other hand, if you get a model that is tuned in a way that would be balanced, right, helpful, harmless, truthful—yeah, it will answer those questions.

But it will then also answer questions that can be potentially harmful, right? It can be coerced into saying and doing very bad things.

Should a model be able to do that if you ask it a fictional question, like, “How would I assassinate Iron Man?” Right? That’s a valid fictional question.

The information that comes back has real-world implications.

We obviously don’t have people walking around in Iron Man suits, but the same general information could be harmful.

Should that model answer?
This is where things get really hairy, because we have to decide who gets to decide what is harmful.

In most models, things like racism and sexism and a variety of other topics are considered harmful.

And a model may or may not respond if you ask it to generate a certain type of content.
There are people making the case that these models should not have that information in them at all.

Well, if you’re trying to build a system that can spot racist content, it has to know what racist content is.

So if someone else decides that having racist content in a model at all is harmful, then the model doesn’t know what that is, right? This is where ethics often crosses over into morality.

And that gets messy because there is no single standard of morality. And you see models being open-sourced and open-weighted, like Meta’s Llama family of models or OpenELM from Apple.

These are tools that are not only open-weighted, so you can take them and tune them, but in the case of Apple’s OpenELM, you can actually rebuild the model from scratch, adding or subtracting content. Can these models be used for harmful purposes?

Yes, of course they can.

But they’re also much more likely to follow instructions.

And in the end, the knowledge itself isn’t what’s dangerous; what you do with it is what does or does not cause harm.
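
Because the weights are open, anyone can pull one of these models down and run or fine-tune it locally. Here is a minimal sketch using the Hugging Face transformers library; the model ID is just an example (Llama checkpoints are gated behind Meta's license on the Hugging Face Hub), and running it assumes you have the accelerate package installed and hardware that can hold the model.

```python
# A minimal sketch of running an open-weight model locally with the
# Hugging Face transformers library. The model ID and prompt are examples,
# not a recommendation; swap in any open-weight checkpoint you have access to.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # example open-weight checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Summarize the safe-handling guidance for a flammable solvent."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

The same pipeline is what fine-tuning starts from, which is why the question of who decides what these models will and will not say does not end with the big consumer tools.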
I think it is a very risky position to allow a relatively small group of people to define what harmful is in generative AI tools that then get applied to everyone else on the planet.

There should always be options, especially for legitimate business cases like my friends at the Lab Products Association, where a machine should do as it’s told.

I’m a firm believer that machines should do as they’re told, and you hold the humans who use the machines liable for what is done with those machines.
That’ll do it for today’s episode.

Thanks for tuning in.

Talk to you next time.

If you enjoyed this video, please hit the like button, subscribe to my channel if you haven’t already.

And if you want to know when new videos are available, hit the bell button to be notified as soon as new content is live.




