Fireside Chat: Music and AI, Part 1: Defining Music in the Age of AI

Written by

AI, Artificial Intelligence, Generative AI, Music

Fireside Chat: Music and AI, Part 1: Defining Music in the Age of AI

In today’s episode, we’re diving deep into the world of music and AI. You’ll join me and composer Ruby King for a fascinating discussion about what music is in the age of artificial intelligence. You’ll learn what makes music “good” and explore the capabilities of AI in creating it. Tune in to discover how AI is changing the landscape of music composition and what the future holds for this exciting intersection of technology and art!

Fireside Chat: Music and AI, Part 1: Defining Music in the Age of AI

Watch this video on YouTube.

Can’t see anything? Watch it on YouTube here.

Listen to the audio here:

Download the MP3 audio here.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

Christopher S. Penn – 00:01
In this five-part series on music and AI, filmed at the 2024 Marketing AI conference in Macon, I sit down with composer Ruby King to discuss AI music and the future of creativity.

Christopher S. Penn – 00:12
This is part one, defining music in the age of AI.

Christopher S. Penn – 00:17
Alright, so who are you and what do you do?

Ruby King – 00:21
Hello, I am Ruby. I’ve just graduated studying music and music with psychology at the University of Southampton. I specialized in composition and writing music. I play violin, viola, piano, and I sing, although I mostly focus on composing music.

Christopher S. Penn – 00:41
Okay, so I’m sure you’ve seen and heard that generative AI can simulate music. Let me start with this: how do you know what is good music and what is not?

Ruby King – 01:00
The trouble with that question is it’s so subjective that you always come down to the answer that is really very annoying: it depends. It depends on who you are as a person because what I like as music is very different to what you like as music. To be fair, it’s well-covered, but that’s not because what you listen to is bad music. It’s just not the music that I choose to listen to on a regular basis. It stresses me out. I like to listen to slightly more calming things, but that’s not because when I’m listening to yours, I’m thinking, “This is terrible music.”

So it really heavily depends. I suppose when I’m listening, I’m sort of looking for something that makes me want to listen to it — so originality, creativity. If there’s something in a piece that I don’t particularly find terribly interesting, but then suddenly there’s a key change or something that happens, then usually that makes — in your brain, you sort of think — this is more interesting, this is better. So there are lots of different things that can contribute to being good music, but there’s no way to actually say, “This is good and that is bad,” and anyone that tells you otherwise has a very high opinion of themselves.

Christopher S. Penn – 02:17
Okay, well then let’s get even more elementary. What is music?

Ruby King – 02:27
That is such a broad, terrible question that the answer is always, “I’d rather be answering deep philosophical questions than what is music?” Because it means so many different things to different people and different cultures. We can get so bogged down in the western world — “Oh, it’s only music if it’s organized sound in a set way that uses this sort of set scale.” But then you’re completely ignoring other cultures where, when you listen to it, it is absolutely music, and it’s not for us to say it’s not music.

We’re kind of trying to define it by the set rules that we have sort of told ourselves it has to be. So, music is whatever you want it to be. Okay. Just the easiest way to answer it.

Christopher S. Penn – 03:06
Okay, that’s fair. We should probably turn the lights on. Would help. And let’s turn on this one, too, because we have it. We brought it, we may as well use it. You can turn on a party mode.

Ruby King – 03:25
Please don’t.

Christopher S. Penn – 03:26
There we go. That’s better. Yeah. Alright. Look at that lighting. So when it comes to AI, the way that today’s models work — particularly services like Suno and Mubert and stuff like that, and Jukebox — we have more primitive services like AIVA. They are all using a type of model called transformers. What transformers do is they take in a lot of examples and try to say, “Okay, well, what is the next likely thing going to be based on everything that’s occurred so far?” So if you have a sentence, “I pledge allegiance to the,” the next highest probability word in that sentence is going to be “flag”.

It’s unlikely to be “rutabaga”. “Rutabaga” would be an interesting sentence for sure, but it’s not the highest probability by a long shot. And so, when model makers train these models, they basically take in huge amounts of data — typically from YouTube — and they say, “This is what a pop song is, this is what a rap song is, this is what a country music song is.” And therefore, when a user requests a pop song about this, it’s going to go into its knowledge, try to say, “Okay, well, these are the conditions that the user set up — tempo, or major key, or piano and strings — associate those words with the training data it’s had, and then try to assemble what it thinks that would be.” Typically, these services will do this behind the scenes in 32nd increments, and then they stitch it all together.

When we listen to a piece of this synthetic music, it is all high probability, which means that, absent the ability to tune the model yourself, you kind of just have to deal with what you’re given. So let’s listen to an example of a piece of music. This is one that is from — I attempted to make something with the lyrics in Google’s Gemini first, and then use Suno to compose. This is it.

Speaker 4 – 05:51
Empty echoes in the night searching for a human touch in a world that I see.

Christopher S. Penn – 06:04
Like a dream.

Speaker 4 – 06:07
Come no one ever found metrics crumble lose the hole empathy story must be told and the warnings gently died.

Christopher S. Penn – 07:29
Okay, so what did you hear there?

Ruby King – 07:32
Okay, well, first of all, it sounded pretty bland. But when it came in with — rocksteady, I think that was it —

Christopher S. Penn – 07:40
Yeah. Okay.

Ruby King – 07:41
— the first time, it’s one of those times when you go, “Oh, something’s changed,” but it’s not in a bad way, because sometimes when something changes, it’s not something that you’re, “Oh darn, that.” But that was — it kind of had, kind of paused, and then it went off. And that is very different to what AI was doing not too long ago because it wasn’t really doing the, “Oh, hello, wait a minute” kind of things. So when I’m listening to that, I’m listening to the things that change. Because if it’s just — this is because it’s got a — I think it’s a four-bar phrase that then repeats, and that’s very typical of music. That’s what you’re told to do. If you have something you want to be the melody, reuse it. If you don’t reuse it, no one’s going to remember it.

And it’s not something we want to listen to if it’s not repeating itself, at least a little bit, usually. Okay, so with that, it is doing what’s expected to quite a high degree. The qualities of the vocals are a —

Speaker 4 – 08:40
— different question.

Ruby King – 08:43
— especially when it was without words. It doesn’t quite know what to do. It’s an interesting experience, but I’m sure it’ll improve, and that’s not quite the point. The drums are very heavy, and I suppose for the genre, it sounded about right. That’s not my specialty, that particular genre. It’s not one I listen too much to either. But when I am listening, it’s generally the things that — okay, it’s set out that it wants to do this, but in what ways is it going to branch out and make this more interesting for the human listener?

Christopher S. Penn – 09:17
What are those things that make it more interesting to a human listener?

Ruby King – 09:19
Okay, so those can be key changes. That can either be a sudden key change or one that’s kind of built up into — both can be satisfying, but it depends how it’s done. Also, if any time signature changes because that can change the feel of the song, and also, usually the rhythm of the words can just give it a different feel. And that can be interesting, but can also be done badly. All things can be done badly, but if it’s done well, it’s satisfying. Rhythm, tempo — if anything, any changes, really, because a lot of AI can be — and some — a lot of human-written things can be — I have set about, “I like these eight bars. I’m going to use these eight bars again, and then I’m told I have to have a bridge.”

So there’s something I’ve written, and then I’m back to this, and this is by the template. So this is good. It might be, but it always depends how it’s done — if there’s any, what kind of creativity you’ve gone for. Have you explored it? Have you had a go at something and decided it didn’t work and gone with something else? Is there some kind of originality where the listener’s going, “Oh, I like this. This is good.” And even if you don’t know what that is, that’s fine, but it’s still something that is there that the composer, or whatever has written it, has done.

Christopher S. Penn – 10:44
Okay, is that music?

Ruby King – 10:46
I would say that’s music.

Christopher S. Penn – 10:47
Okay, is it good music? Is it technically proficient? Because obviously there’s a difference. I don’t like Indian food, but I can differentiate between technically proficient, prepared Indian food and poorly made. Okay, they both taste bad, but they taste bad in different ways.

Christopher S. Penn – 11:05
Yeah.

Ruby King – 11:08
It’s not great, but it is certainly a lot better than when it was sort of, “Oh, it’s rubbish.” It’s now kind of, “Oh, okay, this could be playing, and someone might notice if the singing was done by a human or in a more satisfying way.” Because I have heard better voices than that if the voice — because the thing is, as humans, we are very good at being able to pick out when something sounds human. So even in an orchestral setting, we’re taught that if you’re going to write music for a TV show or something, or just cinematic music, or with an orchestra, if you’re going to write it on Logic Pro with lots of music samples, then in order to make it sound realistic, you need to manually go through and try and change the level of expression that you have — if it’s an expensive enough kit to do that.

And also, if you have just one violin that’s actually recorded live doing the same line as all the other violins, then the slight bit of human error can fool the human ear into thinking the rest of it is also by humans. I always find that really cool.

Christopher S. Penn – 12:24
Interesting.

Ruby King – 12:25
Yeah.

Christopher S. Penn – 12:26
So if you had, say, a stochastic generator in AI, which is basically a randomness engine that intentionally introduced small errors into what the AI produced, it would be more convincing to the human ear.

Ruby King – 12:41
Yeah. So in Logic Pro itself, you have — when you’ve got the drum generator or any kind of thing — you can go into the tempo bit, flux tempo — I can’t remember the exact. I think it’s flux time or something. And there is a setting where you can — I can’t remember if it’s called swing or if it’s — it’s something along on the left-hand side where you can drag it along, and it will just set stuff off ever so slightly from the exact beat it’s meant to be on. Because if you tell it to do the exact beat, it’s correct, but it’s not how a human would play it. Not because we’re terrible at music, it’s —

Christopher S. Penn – 13:15
— just because it’s so precise that it can only be machine-made.

Ruby King – 13:18
Yeah. When you hear a metronome, it’s not someone behind it all —

Christopher S. Penn – 13:21
— going —

Ruby King – 13:24
— it’s a machine, whether that be a mechanical one or a computer doing that for you, and that’s fine. And we use those to stay in time with them, and that’s perfectly fine. But if you want something to be human — when on a violin, it’s more obvious on a violin than it is on a piano, maybe, because on the violin, there’s a lot more slides between notes and things you can do there. Vibrato. Sometimes string scratches, although they’re not always intended, the sound of them makes you think, “Oh, but this is real. This is actually being performed.” Because so much of music in television, especially because there’s barely any budget for the actors, let alone the music, so much of it is just, “Here’s a bunch of stock libraries, do the best you can.”

So by just putting in a little bit of human stuff into it and making the EQ and reverb sound like it’s in a concert hall and ordering the things in a way that you’re used to hearing it, all of these things can contribute to making it sound more human. And I think if AI starts going into trying to actually make it sound like it is human or having the voice sung by someone else, or just little bits changed, it would start sounding a lot less like it’s packaged off the Tesco shelves or Walmart. I don’t know.

Christopher S. Penn – 14:47
It’s interesting, though, because you’re saying in many ways that the imperfections define humanity.

Ruby King – 14:54
Yeah, because we’re not perfect, but music isn’t designed to be perfect. There are so many different things about music, so many different ways that you can do things. When you are writing it, you write it in a way that you enjoy, but it’s not necessarily a way that other people will enjoy. And sometimes you can add imperfections on purpose, and that becomes part of the piece. So long as you say it’s intentional, then you get away with it.

Christopher S. Penn – 15:20
Okay. But a machine can’t just make random mistakes.

Ruby King – 15:28
No.

Christopher S. Penn – 15:30
So how do you make random, not random mistakes?

Ruby King – 15:36
I suppose it helps if you say that the mistakes can’t be huge ones.

Christopher S. Penn – 15:40
Okay.

Ruby King – 15:41
If the singer’s just going way off-key, then I will shut my laptop and throw it across, and I win. But no one wants to hear that. That’s not —

Christopher S. Penn – 15:50
It’s like the people at drunk karaoke — “Wow, that’s awful. Someone should tear gas these people.”

Ruby King – 15:56
Yeah. It’s sort of — within reason. If you go to a concert from an artist you love and they start singing terribly, you aren’t going to be there, “Oh, but it’s the person I love. It’s all great and fine.” You’re going to be, “Why aren’t they putting effort in for the show I’ve come to see? I know they can do better than this. Are they actually just auto-tuning themselves the whole time? I feel really let down.” So there’s — we do have auto-tune, and that is used so much.

Christopher S. Penn – 16:27
Hence why the machine voice sounds like that.

Ruby King – 16:30
The machine sounds worse because, with auto-tune, it’s still a human inputting it. But that kind of sounds like two people put together in a blender screaming. It sounds like there’s two different lines at the same time but still the same person, and it’s just really confusing listening to it. How can — because you’re always trying to — you’re sort of trained to go, “Okay, well, how is this recorded?” And with that, you’re kind of going, “I have no idea how they managed to make that sound unless it was just put through a machine that was intended to make this noise.” So it doesn’t sound realistic at all. But a lot of people don’t care about that.

But then it’s still subconscious that when you hear stuff — there have been psychological studies — when you hear stuff that isn’t right or isn’t human, then you kind of know anyway, whether it matters to you or not. It’s still in the back of your mind that you can kind of recognize that.

Christopher S. Penn – 17:30
Okay, that concludes part one.

Christopher S. Penn – 17:33
Thanks for tuning in, and stay tuned for part two. See you on the next one! If you enjoyed this video, please hit the like button. Subscribe to my channel if you haven’t already. And if you want to know when new videos are available, hit the bell button to be notified as soon as new content is live.

Fireside Chat: Music and AI, Part 1: Defining Music in the Age of AI

Machine-Generated Transcript

Comments

Leave a Reply Cancel reply

More posts

Mind Readings: Never Think Alone, AI as a Thought Partner, Part 5

Mind Readings: Never Think Alone, AI as a Thought Partner, Part 3

Mind Readings: Never Think Alone, AI as a Thought Partner, Part 2

Mind Readings: Never Think Alone, AI as a Thought Partner, Part 1

Pin It on Pinterest