Fireside Chat: Music and AI, Part 2: The Human Touch: Imperfections and Intent

In today’s episode, we’re continuing our exploration of music and AI with composer Ruby King. You’ll listen to more AI-generated compositions and uncover the surprising ways AI interprets and replicates musical styles. We’ll discuss the concept of “human touch” in music and how AI attempts to capture those subtle nuances. Tune in to discover the challenges and opportunities AI presents to the world of music composition!

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.

Christopher S. Penn – 00:00
In this five-part series on music and AI, filmed at the 2024 Marketing AI Conference, or MAICON, I sit down with composer Ruby King to discuss AI music and the future of creativity. This is part two: the human touch, imperfections, and intent. Let’s try another piece. We’re going to switch to something that is theoretically a Viennese waltz.

Ruby King – 00:21
We’ll find out if it actually is or not. That’s okay.

Christopher S. Penn – 01:58
So what was that? How was that?

Ruby King – 02:00
It was very interesting. It was quite nice.

Christopher S. Penn – 02:03
Okay.

Ruby King – 02:04
It was quite nice. The first thing I was noticing was — again, sort of going into “How was this recorded if it was by humans?” — and it sounded like — I could have been wrong because I’m sat away from it — but it sounded like there’s a bit of crackle on the recording.

Christopher S. Penn – 02:16
Interesting.

Ruby King – 02:16
A little bit. I was kind of wondering, because it sounded like a recording that my old violin teacher would give me from her cassettes. It was, “This is the piece you are playing,” and it was recorded a very long time ago in black and white. It’s just the EQ as well. The quality of the sound: it didn’t sound like I was in the room with it, and that’s not a bad thing. It just sounds like older recordings. So I’m wondering if maybe the info it was fed was from past —

Christopher S. Penn – 02:48
— highly likely, yeah.

Ruby King – 02:49
Okay. As to whether it’s a Viennese waltz, I can’t answer that because I can’t be sure enough. I’m not an expert, and someone will shoot me down online if I say one thing or the other. I am going to back out of that one slowly.

Christopher S. Penn – 03:02
Okay.

Ruby King – 03:03
But it started faster, and it had some really slow bits in it. That was nice. I was surprised by how many different sections there were that seemed to be trying to do a theme and variation. I’m not completely sure it did do a theme and variation because I think it might have forgotten exactly what it did or the music wasn’t memorable enough to remember. I would have to look at the actual notes themselves and listen to it quite a number of times more. But it sounded like it was attempting to do a theme and variation.

I think waltzes are usually in 3/4, which goes 1-2-3, 1-2-3. And most of that was. I think one bit seemed to be 6/8 (my music teacher’s going to kill me), where it’s longer: 1-2-3-4-5-6, 1-2-3-4-5-6. So it’s still the same feel, but the phrasing doesn’t stop midway through the bar. It ends at the end, if that makes any sense.
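
[Editor's illustration, not part of the conversation] The counting Ruby describes can be sketched in a few lines of Python. This is only a toy under simple assumptions: the function names are made up for illustration, and it follows her spoken counting (three counts per bar versus six counts per bar) rather than a full treatment of how the two meters are notated.

```python
def count_bars(counts_per_bar: int, total_pulses: int) -> list[list[int]]:
    """Split a run of pulses into bars, numbering each pulse within its bar."""
    bars = []
    for start in range(0, total_pulses, counts_per_bar):
        length = min(counts_per_bar, total_pulses - start)
        bars.append(list(range(1, length + 1)))
    return bars


def render(bars: list[list[int]]) -> str:
    """Show the counts the way they are spoken, with bar lines between bars."""
    return " | ".join("-".join(str(count) for count in bar) for bar in bars)


# The same twelve pulses, counted in threes and then in sixes.
print(render(count_bars(3, 12)))  # 1-2-3 | 1-2-3 | 1-2-3 | 1-2-3
print(render(count_bars(6, 12)))  # 1-2-3-4-5-6 | 1-2-3-4-5-6
```

The number of pulses is the same in both lines; what changes is where the bar lines fall, which is why the phrase can run on for six counts before it ends.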

Christopher S. Penn – 04:02
No, it makes total sense.

Ruby King – 04:03
Okay.

Christopher S. Penn – 04:04
But it calls back to how the transformer model works, where it’s doing things in chunks and then sewing them back together. One of the things that you’ll notice with tools like Suno, particularly for longer compositions, is that they lose coherence two and a half, three and a half minutes into the song. Sometimes they’ll just go totally off the rails, and you’re not sure why. But then when you look underneath the hood, “Oh, it’s because it’s doing 30-second chunks, and it either forgot, or there was a token that got misplaced in chunk 13 that just throws the whole thing off.”
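
[Editor's illustration, not part of the conversation] The idea of a generator that works in chunks and "forgets" can be shown with a toy. This sketch is not how Suno or any real music model works; the theme, the `next_chunk` rule, and the context size are all hypothetical stand-ins. It only shows that when each chunk is produced from a limited window of recent notes, the opening theme soon drops out of what the generator can see.

```python
THEME = [0, 4, 7, 4]  # hypothetical opening motif, written as pitch classes


def contains(seq: list[int], pattern: list[int]) -> bool:
    """True if pattern appears as a contiguous run inside seq."""
    n = len(pattern)
    return any(seq[i:i + n] == pattern for i in range(len(seq) - n + 1))


def next_chunk(context: list[int], chunk_len: int) -> list[int]:
    """Stand-in for a model: echo what is visible, shifted up one semitone."""
    return [(context[i % len(context)] + 1) % 12 for i in range(chunk_len)]


def generate(theme, n_chunks, chunk_len, context_len):
    """Build a piece chunk by chunk; record whether the theme was still visible."""
    piece = list(theme)
    theme_visible = []
    for _ in range(n_chunks):
        context = piece[-context_len:]  # all the generator can "see"
        theme_visible.append(contains(context, theme))
        piece.extend(next_chunk(context, chunk_len))
    return piece, theme_visible


_, short_memory = generate(THEME, n_chunks=3, chunk_len=4, context_len=6)
_, long_memory = generate(THEME, n_chunks=3, chunk_len=4, context_len=100)
print(short_memory)  # [True, False, False]
print(long_memory)   # [True, True, True]
```

With the short window, the theme is only visible while the first chunk is being written. After that the generator has nothing to call back to, which is one plausible reading of the drift described above.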

Ruby King – 04:39
Yeah, that was enjoyable listening. But another thing: I’ve done violin for quite a number of years, and one of the key things I was listening to there was, “Is this a real human playing it? Is this playable?” Because one of the main things you can hear in music that’s violin or viola or whatever lead is the bowings. Most people don’t bother about this, which is why I’m insufferable to watch TV with. In Umbrella Academy, when they whip out the violin, I’m, “No!” Because it was a bit loud. Sorry. It was —

Christopher S. Penn – 05:17
— or Sherlock, where he’s not — what he’s doing, the hand motions don’t match the actual audio.

Ruby King – 05:22
It’s just so painful. Just get someone who can, please. We exist. Just the fingers — all the editors mess it up afterwards. I don’t blame them, but, okay, anyways. One of the main things is bowing, and you can hear it because if it’s up and down, you can hear how the string — it sort of — it changes. There’s a break in between the noise. If you’re slurring, which means going from one note to the other with the same bow, it’s only the note that changes. There’s not really a pause in the middle. So most of that was separate bowings, especially when it was doing quite fast jumps. At one point, it was — I pulled a face at one of — one of those face pullings was because it was doing something quite fast with a lot of jumping.

Ruby King – 06:10
Their right hand must have been going — it is very possibly possible, but that player deserves an award.

Christopher S. Penn – 06:21
Probably some aspirin.

Ruby King – 06:24
The way it would be chosen to play it: if the composer specified that would be how you should do it, then you would probably try and do it like that. But a violinist would naturally try not to put in that much effort, because it doesn’t sound right being so separate either. If it was more together, just in phrases, a few notes in the bar, then if it’s in 3/4, maybe six of those could have been in one bow and then another six in another bow, and that would have still kept the phrasing. But just the way that it’s performed, it’s not thinking about performance rules. It’s just thinking, “These are the notes, and this is a violin sound. Therefore, this is correct.”
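
[Editor's illustration, not part of the conversation] The difference between separate bows and Ruby's suggestion of six notes per bow comes down to simple arithmetic. The sketch below assumes a hypothetical run of twelve notes and a player who starts on a down-bow; the function names are made up for illustration.

```python
def bow_strokes(n_notes: int, notes_per_bow: int) -> int:
    """Number of bow strokes needed when notes_per_bow notes share one stroke."""
    return -(-n_notes // notes_per_bow)  # ceiling division


def bowing_marks(n_notes: int, notes_per_bow: int) -> list[str]:
    """Bow direction for each note, alternating down and up on every new stroke."""
    marks = []
    for i in range(n_notes):
        stroke = i // notes_per_bow
        marks.append("down" if stroke % 2 == 0 else "up")
    return marks


# Twelve fast notes, first with a new bow on every note, then slurred in sixes.
print(bow_strokes(12, 1))   # 12
print(bow_strokes(12, 6))   # 2
print(bowing_marks(12, 6))  # six "down" marks followed by six "up" marks
```

Twelve changes of direction against two is the gap between the performance Ruby heard and the one a violinist would more likely choose.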

Christopher S. Penn – 07:05
Right. So AI is essentially synthesizing from the outcome of the data but does not understand the techniques used to make the data.

Ruby King – 07:17
Yeah, I think so.

Christopher S. Penn – 07:18
Okay.

Ruby King – 07:19
Because there’s a few times there, I think, in that piece, it would have been nice if there was a slide up because it does do some jumps. The nice thing with the violin is it’s not a piano. I can say that I play both, but on the piano, you can do slides, you can do glissandos, but it’s easier on a violin because you can kind of slide up to a note and add some nice vibrato once you get there. Piano, it’s a bit more — you can’t get all those microtones between. So it’s kind of — because there’s —

Christopher S. Penn – 07:50
— defined intervals between one note and the next. So if you are a content creator using generative AI, instruments that are more analog between notes, like a violin, as opposed to having more defined segments, will be more of a giveaway that you’re using AI because the machines are not synthesizing the technique needed to create the sound.
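
[Editor's illustration, not part of the conversation] The contrast between defined intervals and the pitches in between can be put into numbers. This sketch assumes twelve-tone equal temperament with A4 tuned to 440 Hz, where each semitone multiplies the frequency by the twelfth root of two. Neither assumption comes from the conversation, and the function names are made up for illustration.

```python
A4_HZ = 440.0  # assumed concert pitch


def semitone_freq(semitones_from_a4: float) -> float:
    """Frequency in Hz at a given (possibly fractional) distance from A4."""
    return A4_HZ * 2 ** (semitones_from_a4 / 12)


def piano_steps(start: int, end: int) -> list[float]:
    """A keyboard can only land on whole semitones between two notes."""
    return [semitone_freq(n) for n in range(start, end + 1)]


def violin_slide(start: int, end: int, points: int) -> list[float]:
    """A slide passes through the pitches in between; sample it at even spacing."""
    span = end - start
    return [semitone_freq(start + span * i / (points - 1)) for i in range(points)]


# From A4 up a whole tone to B4.
print([round(f, 1) for f in piano_steps(0, 2)])
print([round(f, 1) for f in violin_slide(0, 2, 9)])
```

The keyboard version has three frequencies to choose from over that whole tone, while the slide can be sampled as finely as you like. Those in-between pitches are the microtones Ruby mentions.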

Ruby King – 08:13
Yeah. So it is said that the violin, or strings, are the closest thing to the human voice. And the human voice, we can easily tell, most of the time, when it is not a human voice. Okay. Saying that, there’s a lot of speech that is very good now, and you can’t really tell. Those models are very advanced, and it sounds very good. But singing isn’t quite there yet, I’m assuming, because there’s so many different techniques all the way down to breathing and where you hold your head. The sound’s going to be different if you’re looking up to — if you’re looking down. It’s just because there are so many variables.

So the violin and singing, the human voice, are dead giveaways. Well, voice more so than violin, because I think you have to be a bit more trained on violin, usually. But still, it is coming down to the subconscious. When you listen to that, are you thinking, “This is a real performer. I can imagine being sat in a concert hall”? Or are you going, “This is a violin. I can’t say anything more about it than that”?

Christopher S. Penn – 09:19
Right. Okay, let’s try a piano piece. So this one is supposedly a ragtime. What’s that sound?

Ruby King – 09:33
Slides in there. It’s gone again. You’d expect the violin to come back in a minute. Just loosen that.

Christopher S. Penn – 10:59
So this is an example where it should have stopped.

Ruby King – 11:05
Unless it’s — ooh. I mean, it should have stopped if that was the intention. But if this was the intention, then that would be an impressive thing for a human to do.

Christopher S. Penn – 11:18
So the prompt on this was just, “Enough beat back then.”

Ruby King – 11:22
It’s just — it liked what it did, and it was, “I’m going to do more before you shut me down. I’m done now.”

Christopher S. Penn – 11:38
Okay, so that was an example, in particular with Suno, where it will have these — they’re almost hallucinations, where it sounds like there’s a logical stop of the song, and it’s, “I’m just going to keep going.”

Ruby King – 11:53
Was it given the time frame it had to be?

Christopher S. Penn – 11:54
No.

Ruby King – 11:55
No. So it just — it was just, “I’m having too much of a fun time here. You cannot stop me.”

Christopher S. Penn – 12:00
If I had to guess, and this is pure speculation, the model had enough fragments of enough tokens leftover to forecast from that it was, “I don’t know if I should stop or not.”

Ruby King – 12:12
Okay. It definitely — it did feel like it came to an end, but it continuing wasn’t necessarily wrong. So it wasn’t right for what you said, but if you were in a concert hall and that was played, and then there was a pause — and sometimes you do that, which is why you’re told, “Do not clap in the middle of a movement because you will look like an idiot, and everyone will stare at you, and the musicians will remember you for the random time you —” true story — then it’s kind of — it sort of felt like a different movement.

Christopher S. Penn – 12:46
Right.

Ruby King – 12:47
It was — the music didn’t necessarily feel connected, but it felt like a nice, different piece. So you might expect, if it was meant to be the same piece, you’d expect the first piece of music to come back again, and that would be sort of a bigger end, and that would be an ABA structure because you sort of have the A, and then the B, then it just ended. It was a nice end, but it wasn’t necessarily the way you’d expect a piece to be. A and B does happen. That is okay, but for the instructions you gave it, you’d expect it to go back to the A section. So it’s not that it’s wrong, it’s just — it’s not — it’s wrong for what you asked it to do.

Ruby King – 13:28
But musically speaking, if you handed that in, it would probably be seen as a good thing that you did something creative, a false end, and then you continued with something that was different, and it was an “Oh!” moment, which is a good thing. So musically, it’s good. Prompt-wise, not so good.

Christopher S. Penn – 13:48
Right. Okay, let’s try one more piece, and then I want to talk about how we can — for people who want to use these tools — how we can get better performance out of them. So this is the last one. This is the ragtime.

Ruby King – 14:20
Good bassline. It’s repeating itself. I remember that — for now. For now. Basic — good — oh, that is a variation of theme A. Oh, that’s definitely theme A. Half speed. Slower. Happy. That is — anyway, I can’t remember theme B well enough, but I remember A — is that key change, or is that just — okay, that’s all right. This is longer than I thought it could be, and it’s remembering itself quite well. You would probably want it to be doing something more exciting with this now because it is just feeling like it’s gone back to the start. The left hand should probably be doing something more interesting. It’s an odd end. I mean, it’s a valid end. It’s not what I would have done, but it — it ends fairly convincingly.

Christopher S. Penn – 16:54
That concludes part two. Thanks for tuning in, and stay tuned for part three. See you on the next one! If you enjoyed this video, please hit the like button. Subscribe to my channel if you haven’t already, and if you want to know when new videos are available, hit the bell button to be notified as soon as new content is live.
