Mind Readings: Build Your Own Generative AI Benchmark Tests

In today’s episode, you’ll discover the importance of creating your own benchmarks to test the true capabilities of AI tools. You’ll learn about the limitations of synthetic benchmarks and why they may not reflect real-world performance. I’ll share two of my own go-to benchmarks, one for voice isolation and another for coding, to illustrate how you can develop your own tests. You’ll gain valuable insights to help you make informed decisions when evaluating AI solutions for your specific needs.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

Christopher Penn: In today’s episode, let’s talk about benchmarking and AI and knowing whether an AI tool is capable of meeting the hype about it. What are your go-to benchmarks when you want to put an AI service to the test?

Here’s the thing: every time an AI company releases something, they claim it’s state-of-the-art. We all kind of nod like, “Yeah, yeah, state-of-the-art. Good, good job.” But we don’t have an agreed-upon set of metrics about what constitutes state-of-the-art. There’s a ton of synthetic benchmarks in AI. You’ll hear terms like MMLU and the LSAT test and human preference and all sorts of different synthetic benchmarks that people use to test AI models.

But these tests have a lot of problems, one of which is the models have learned the tests themselves. And so they’re really good at testing well, but they don’t necessarily adapt outside that—kind of like an academic genius who doesn’t do well in the real world. Another major problem with synthetic benchmarks is the benchmark may test for things you don’t care about, or things you don’t do. If you want a tool that reads your email and replies to it appropriately, that’s a real-world test that has very specific parameters. But synthetic model tests—they’re not going to measure that.

So, one useful practice is to keep your own benchmarks on hand for how well a model, a service, or a vendor can do.

Not too long ago, someone dropped a pitch in my DMs saying this is the ultimate voice isolation AI. This AI can isolate any voice from its background and present it in studio quality. Many, many products have made this claim over the years, and none of them have lived up to their promises. None of them have gotten even close.

So, I have a benchmark test for this. This is the first test I’ve ever done. It’s a piece of video, a short interview with an actress, Katie McGrath, from shows like Supergirl. She did an interview at San Diego Comic-Con. The interview was clearly not done by professionals. It was done by fans, which is great for this test because the interview is filled with background noise. And critically, it’s filled with background noise of other human voices.

And the question is, how do tools normally do this? Most noise removal mechanisms filter on non-speech frequencies. So, they can take out a jackhammer in the background, because a jackhammer and a human voice are very different frequencies. Generative mechanisms instead extract the speech frequencies, pass them through a generative model, and essentially reconstruct the voice. But with this interview, there’s no way to do that, because the background noise is other human voices in the same frequencies as the voice you want to keep.

In fact, let me play a clip of it.

[Soundbite plays]

I guess heroes and villains are heroes’ redemption.

I have tested this clip against every vendor that says they’ve got state-of-the-art, amazing quality. None of them, not a single AI tool, not a single sound cleaning tool, has ever made this interview studio quality. It has always come out sounding garbled and garbage because it’s a really difficult task. And so that’s a great benchmark. Our tools are getting better overall, but for this particular use case, not really.

And so this is my gold standard. If you have a tool that you claim is state-of-the-art and can do a perfect job isolating a voice, this is the test. If you can clean this up and truly make Katie McGrath’s voice sound studio quality, with no priming and no reference data, then you’ve got a winner.

Another test I use is for coding. In the R programming language, there’s a library called httr that for years and years was the gold standard for doing web requests inside of R. About three years ago, Hadley Wickham and the Tidyverse crew, who are amazing contributors to the language, did a ground-up rewrite of it as a new library called httr2.

Now, one measure of a model’s level of sophistication in coding is whether it knows to use httr or httr2. Their function calls are similar, but not the same. And httr2 was released three years ago, so it’s not new information. So, this is a test of a model: when I’m coding, I ask, “Hey, help me do some web requests in R,” to see which library it uses. Is it smart enough to know that httr2 supersedes httr, and you shouldn’t use the old one anymore? The reason why models have trouble with this is because the vast majority of older code on the web, like on Stack Exchange and stuff, is in the old format. And so a model that knows to prefer httr2 understands not only code, but also the age of code, and the logic and the sensibility of using newer libraries. Older models, or less skillful models, don’t know that.

And that’s a really helpful test just to understand how smart this model is.
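The httr versus httr2 check described above can be partly automated. What follows is a minimal sketch in Python, not the method used in the episode: the function name `detect_r_http_library` is hypothetical, the two sample strings are hand-written stand-ins for model output, and the check only looks for `library()` or `require()` calls and namespace prefixes, so it will miss code that uses either package some other way.

```python
import re

# Patterns that signal which R HTTP library a snippet relies on.
# They cover library()/require() calls and namespace prefixes only.
HTTR2_PATTERN = re.compile(
    r"(?:library|require)\(\s*[\"']?httr2[\"']?\s*\)|\bhttr2::"
)
HTTR_PATTERN = re.compile(
    r"(?:library|require)\(\s*[\"']?httr[\"']?\s*\)|\bhttr::"
)


def detect_r_http_library(r_code: str) -> str:
    """Return 'httr2', 'httr', 'both', or 'neither' for a block of R code."""
    uses_new = bool(HTTR2_PATTERN.search(r_code))
    uses_old = bool(HTTR_PATTERN.search(r_code))
    if uses_new and uses_old:
        return "both"
    if uses_new:
        return "httr2"
    if uses_old:
        return "httr"
    return "neither"


# Hand-written examples for this sketch, not real model responses.
old_style = 'library(httr)\nresp <- GET("https://example.com/api")'
new_style = (
    'library(httr2)\n'
    'resp <- request("https://example.com/api") |> req_perform()'
)

print(detect_r_http_library(old_style))
print(detect_r_http_library(new_style))
```

In practice you would paste in whatever R code the model returned for the prompt “help me do some web requests in R” and record the result alongside the model name and date.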

In Python, there’s a package called Newspaper3k. The maintainer stopped maintaining it two and a half, three years ago, and there’s a new fork of it called Newspaper4k. Now, if you’re a human programmer and you went to the Newspaper3k package, you would see, “Hey, this package is no longer maintained, but someone else has taken it up, forked it, and started a new version over here.” Then you would know, as a human, “I’m going to go over there to the new one.” If a language model understands that, then it shows that it has some reasoning. And I’ll tell you, as of right now, of all the state-of-the-art models in existence that people use for coding in Python, none of them know this. They all are still relying on the older one.
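One rough way to score this test is to look at what a model tells you to install. The sketch below is an illustration under stated assumptions: the `SUPERSEDED` lookup table and both function names are hypothetical, only the Newspaper3k to Newspaper4k pair comes from the episode, and the parsing is deliberately crude (it reads `pip install` commands only and treats every non-flag word after the command as a package name).

```python
import re

# Hypothetical lookup of superseded packages and their maintained successors.
# Only the Newspaper3k -> Newspaper4k pair comes from the episode.
SUPERSEDED = {"newspaper3k": "newspaper4k"}

# Capture everything after "pip install" up to a newline or backtick.
PIP_INSTALL = re.compile(r"pip3?\s+install\s+([^\n`]+)")


def packages_in_response(response: str) -> list[str]:
    """Collect lowercase names that follow pip install commands in a response."""
    found = []
    for match in PIP_INSTALL.finditer(response):
        for token in match.group(1).split():
            if token.startswith("-"):
                continue  # skip flags such as --upgrade
            token = token.strip(".,;:'\"")
            # Drop version specifiers and extras such as ==1.0 or [all]
            name = re.split(r"[=<>!~\[]", token, maxsplit=1)[0].lower()
            if name:
                found.append(name)
    return found


def stale_recommendations(response: str) -> dict[str, str]:
    """Map each superseded package the response installs to its successor."""
    return {
        name: SUPERSEDED[name]
        for name in packages_in_response(response)
        if name in SUPERSEDED
    }


# Hand-written example for this sketch, not a real model response.
sample = "First run `pip install newspaper3k`, then import the package."
print(stale_recommendations(sample))
```

An empty result means the response did not install anything on your superseded list; it does not prove the model picked the maintained fork, so a human read of the answer is still worthwhile.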

So, those are two examples of benchmark tests. What are your benchmark tests that you use to evaluate AI solutions for your specific use cases? What are the things that you use to stump AI, that defy and maybe bring down to reality some of the claims made by different AI tools and vendors?

If you don’t have that list, it’s a good time to build it. In fact, one of the best times to build it is before you issue an RFP. In the RFP, you say, “Vendors will be evaluated based on a series of tests,” but you don’t tell them what the tests are, because you don’t want them teaching to the test. A set of objective tests like that can really help you understand what the capabilities of a model actually are and where they can solve your problems.
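A private list of tests like this can live in a very small harness. The following is a sketch, assuming each test is a prompt plus a pass/fail check on the text that comes back. Every name here (`BenchmarkCase`, `run_benchmark`, `pass_rate`, `fake_vendor`) is hypothetical, and `fake_vendor` is a stand-in that returns a canned answer rather than calling any real service.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkCase:
    name: str
    prompt: str
    passes: Callable[[str], bool]  # returns True when the response is acceptable


def run_benchmark(
    vendor: Callable[[str], str], cases: list[BenchmarkCase]
) -> dict[str, bool]:
    """Send every prompt to the vendor and record pass/fail per case."""
    return {case.name: case.passes(vendor(case.prompt)) for case in cases}


def pass_rate(results: dict[str, bool]) -> float:
    """Fraction of cases passed, or 0.0 when there are no cases."""
    return sum(results.values()) / len(results) if results else 0.0


# Private test cases; in an RFP these would not be shared with vendors.
CASES = [
    BenchmarkCase(
        "r_web_requests",
        "Help me do some web requests in R.",
        lambda response: "httr2" in response,
    ),
    BenchmarkCase(
        "python_article_scraping",
        "Which Python package should I install to scrape news articles?",
        lambda response: "newspaper4k" in response.lower(),
    ),
]


def fake_vendor(prompt: str) -> str:
    """Stand-in for a real model call; always suggests the older libraries."""
    return "Use library(httr) in R, or pip install newspaper3k in Python."


results = run_benchmark(fake_vendor, CASES)
print(results, pass_rate(results))
```

To evaluate a real vendor, you would swap `fake_vendor` for a function that sends the prompt to that vendor and returns the response text. Keeping the checks as plain functions makes it easy to add tests that are not about code at all.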

So, I hope you build that list for yourself. That’s going to do it for today’s episode. Thanks for tuning in. I’ll talk to you soon.

If you enjoyed this video, please hit the like button, subscribe to my channel if you haven’t already. And if you want to know when new videos are available, hit the bell button to be notified as soon as new content is live.
