On Emotionally Intelligent AI (with Chris Gagne, Hume AI)

Hey there. Today I'm having on Chris Gagné, an AI researcher and friend. Chris manages AI research at Hume, which just released an expressive text-to-speech model in a super impressive demo. They also just announced $50 million in their latest round of funding.
So that's pretty cool. Hume is kind of the only company out there focused on AI for emotional understanding. Chris did his PhD in cognitive neuroscience at UC Berkeley and postdoctoral research at the Max Planck Institute for Biological Cybernetics.
Doesn't that sound cool? I want to talk to Chris about AI and emotions. I want to hear from him about the implications of AI understanding emotion. What's cool about it? What's scary? What are the risks and opportunities?
And I'm going to really press him on whether he thinks that AI can really understand emotion and whether that's a good thing. Chris wants me to say that all the views he's going to share are his own, not those of his employer, which is good because it means he can be real with us.
All right. Let's dive in. Chris, let's start with the easy question. Can AI understand human emotions? I think it's getting there. I think LLMs already have a decent understanding of human emotions.
I think if you ask an LLM, for instance, to read your writing and describe how someone might emotionally react to it, it can actually do a pretty good job, especially GPT-4 right now.
It can do a good job of guessing what sort of emotions people might experience from a piece of writing. And as these models become more multimodal, that understanding is going to grow and extend beyond a purely linguistic understanding. But yeah, I think there's some degree of emotional understanding right now.
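(A minimal sketch of the kind of prompt Chris is describing, assuming the OpenAI Python SDK; the model choice, prompt wording, and passage are illustrative, not anything Hume ships.)

```python
# Hypothetical example: ask an LLM how someone might emotionally react to a piece of writing.
# Assumes the OpenAI Python SDK is installed and OPENAI_API_KEY is set in the environment.
from openai import OpenAI

client = OpenAI()

passage = "I finally handed in my resignation today. I keep rereading the email."

response = client.chat.completions.create(
    model="gpt-4",  # illustrative model choice
    messages=[
        {"role": "system",
         "content": "You describe the emotions a reader or writer might feel about a piece of text."},
        {"role": "user",
         "content": f"Read this and describe how someone might emotionally react to it:\n\n{passage}"},
    ],
)
print(response.choices[0].message.content)
```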
Okay, so obviously this is controversial, right? A lot of people might turn back to you and say that makes no sense. Emotional understanding for humans is generally something much more substrate dependent,
right? You have emotions in your brain, you have empathy, you feel an emotion, and that's how you understand it. AI presumably can't feel emotions. So how could it possibly understand emotions? - Yeah,
I think this is a good distinction that we'll probably keep coming back to for the whole conversation. There's sort of a linguistic and cognitive understanding of emotions: can you describe the different emotions?
Can you describe the expressions that are associated with them? Can you give a sense of what might have caused them, what situations in the past or in the moment would have brought those emotions about, what might lead to their resolution,
all this sort of like high level linguistic understanding. And I think that's quite separate from sort of like vicariously experiencing a particular emotion, the feelings that you might have. I think AI can very much do and will be able to do the first one.
And the experiencing of the emotion in the same way that we experience it, I think, is something we'll see as AI develops more awareness or consciousness or whatever we want to call it. But at the current moment, I think it's very far from that sort of emotional understanding.
So I'm on the side that there's a good amount of emotional understanding they can have without this sort of feeling, just the way you might write your own experiences down, read them later, and say, oh, that was a good emotional understanding of the situation. I mean, I guess there's something weird here to analyze about humans, then, that we empathize at all as a route to understanding, right? Like, I could see you crying and intellectually understand what might be causing it. But that's not really how our brains work, right? Is it possible that it's actually more efficient to experience empathy, to actually run something essentially like a simulation of the emotion?
Yeah, I think it might be. And I also think that's how humans develop, at least going from childhood to adulthood: we learn empathy in that direction, where we very much automatically experience the emotions other people are experiencing in the moment.
But then later we learn to attribute the right words to that, describe the situation in more detail, verbalize it, and sort of detach ourselves a little bit from the feeling of the emotion. But I think therapists do this really well.
They've detached themselves in some situations from fully experiencing what their patients are feeling, and yet they're able to verbalize this and describe it. And I think AI is going the opposite direction, where it's starting with this more verbal understanding and then maybe we'll see in the far future whether they can have anything sort of like the feelings that we experience.
I mean, why would we possibly want it to have that kind of emotional experience? I would imagine we'd only want that if it were necessary, like if we found it really hard to get AI to understand emotion without feeling, basically.
Yeah, I agree. But if anything, you could say the same about humans, right? I think this point is made in some books on the evolutionary psychology side: why would we want humans to experience anything? Wouldn't it be great if humans could just understand even their own emotions?
Doesn't it kind of suck that when you get angry, you actually get angry? Wouldn't it be awesome, evolutionarily speaking, if humans could just intellectually understand that they should feel angry and then act rationally in accordance with that?
Like, do you have any thoughts on why the hell humans actually experience emotion? - Not really, other than that I think those signals were more primal, sort of what came first in evolution, and then we learned to turn them into more verbal thoughts later. - And yet somehow we're expecting the opposite to be true with AI: that it's more efficient for an AI to understand emotion intellectually than to do something analogous to experiencing it. - Yeah, maybe not more efficient, but it's certainly what AI has access to right now, and it's the way the AI path is developing, in that it's starting with language and then working back towards the more core modalities. - So,
I just wanna ask: your research, if I understand correctly, is technically not on emotion, right? It's on prosody? - Yeah, it's on giving text-to-speech models the ability to sound more human, more expressive, more emotive. That's part of it. We're also adding that information to language models so that they can take advantage of some of these expressive signals while they're interacting with humans. So is there a difference?
Prosody is a big word that I had never heard of until you explained it to me, but can you just explain what prosody is? Is there actually a meaningful distinction between that and emotion? Should we care? Yeah. I think prosody is all the external signals, much more closely related to the acoustic qualities of the sound, that allow you to infer the emotions. There are a lot of expressive signals that we give off in our faces and our voices, and those are easily observable for most people. We then use those to infer the emotional state the person is experiencing. So allowing machines to read these external signals that we give off to one another as part of a conversation, which convey so much more information than if we're texting or something like that, would enrich the communication. Okay, so it's like getting a window into a person's emotional state that they essentially choose to share, or subconsciously choose to share. Generally, I show you I'm happy by smiling, and I show you I'm frustrated by maybe making certain sounds or changing the speed at which I talk,
things like that. Yeah, exactly. So to go back to prosody and answer your question more fully, it includes things like the speed at which you talk and your intonation. For questions, for instance, the pitch might rise at the end to signify a question, even if you don't have an explicit question word, which might otherwise be hard to pick up when someone's speaking, and so on.
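(To make the prosody idea concrete, here's a rough sketch of pulling out two of the signals Chris mentions, the pitch contour and a crude speaking-rate proxy, from an audio file. It assumes the librosa library and a hypothetical file path; Hume's actual models are far more sophisticated than this.)

```python
# Rough sketch: extract simple prosodic features (pitch contour, rough speaking rate).
# Assumes `librosa` and `numpy` are installed; "clip.wav" is a hypothetical recording.
import librosa
import numpy as np

y, sr = librosa.load("clip.wav", sr=None)

# Fundamental frequency (pitch) over time; rising pitch at the end can signal a question.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)
tail = f0[voiced_flag][-20:]             # last ~20 voiced pitch estimates
rising = tail[-1] > np.nanmedian(tail)   # crude "does the pitch rise at the end?" check

# Very rough speaking-rate proxy: acoustic onsets per second.
onsets = librosa.onset.onset_detect(y=y, sr=sr)
rate = len(onsets) / (len(y) / sr)

print(f"pitch rises at end: {rising}, ~{rate:.1f} onsets/sec")
```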
So the way I imagine it, there are two ways you might try to understand emotion, whether you're a human or an AI. There's the more natural way, which might correspond more with annotation, like, "that sounds angry." But then there's "I know you're angry," or "I know you're getting angry," because minutes later you say "I'm angry," or you start yelling at me, or you do something that makes it far more obvious in the future that you're angry. So is there a difference between the approaches here: trying to understand the implicit signals, like your behavior corresponds with anger and therefore you must be angry, versus something like "you sound angry"?
Is there any distinction there? What I mean is, if I'm talking right now in a very monotone voice, and then a minute later I go, "you know, Chris, I hate you, I think the way you've been talking is horrible, and generally you're just a horrible person," right? I don't sound angry in terms of my voice. Obviously the content of what I'm saying sounds very angry, and you could probably infer that I am angry, but I definitely don't sound angry from a prosody perspective. That's kind of the distinction I'm trying to get at. Can you think about the phenomenon separate from tone and stuff? Yeah, I think so. I mean, a lot of what we're trying to do is separate those two components of the signal and then bring them back together, so that the language model, or the ultimate agent, can choose to listen to one or the other.
Sarcasm is famously saying something in a different tone than what we're actually saying. So separating the signals and having them stream into another system is important, and I think humans do this all the time. We can pick up on you saying something in a potentially angry voice, even if you're talking about something totally neutral, or vice versa. And it might mean something different: it's an interesting scenario, talking about something very angry but not sounding angry. That conveys something very different than if you just said it in an angry voice. So sarcasm captures that contradiction, when I say something in deliberately the wrong tone. Yeah, that's one aspect of it.
I mean, saying something very angry in a neutral tone can be something very interesting, depending on the context. So it could actually be something deeper. It's not even just anger or the lack thereof; it's this third thing, which is: you were saying something angry in a neutral tone, and that tells me something on its own. Yeah. Fascinating.
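(A toy illustration of the "third thing" idea: keep the text-based and voice-based readings separate, then flag when they disagree. The scores and thresholds here are made up for illustration; this is not how Hume's system works.)

```python
# Toy sketch: compare an emotion score inferred from the words with one inferred
# from the voice, and flag mismatches (possible sarcasm, suppressed anger, etc.).
# Both scores are assumed to come from upstream models; here they're just numbers in [0, 1].

def interpret(text_anger: float, voice_anger: float, agreement_margin: float = 0.4) -> str:
    """Combine two separately estimated anger scores into a rough label."""
    if text_anger > 0.7 and voice_anger < 0.3:
        return "angry words, calm voice -- a distinct signal of its own"
    if voice_anger > 0.7 and text_anger < 0.3:
        return "calm words, angry voice -- possible sarcasm or frustration"
    if abs(text_anger - voice_anger) < agreement_margin:
        return "text and voice agree"
    return "mixed signals"

# Example: "Chris, I hate you" said in a flat, monotone voice.
print(interpret(text_anger=0.9, voice_anger=0.1))
```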
I think part of this then goes into, and I don't want to go too far into the sci-fi stuff, but there is the emotional manipulation thing. If we think of emotions as trajectories, if we think of it like: a word doesn't have an emotion, a sound doesn't have an emotion. It might emit things like tones, but really the emotion is me, the human, and it can't change that quickly. Which means that if you're modeling "oh, that sounds like you're angry," you might really be modeling that I'm pre-angry, right? You might be able to predict and understand pre-emotions, essentially. Once we start talking about that, it sounds like AI could potentially emotionally manipulate. What are your ethical worries here? Yeah, we're certainly worried about that. I don't think this goes that far beyond language models just having the ability to manipulate people in general, which ties in with the broader alignment research of making sure that language model agents have the right objectives.
There's a discussion we might want to have about open-source versus proprietary language models, to keep some gates on these things. And there is a risk in giving language models additional tools that they could use to steer the conversation in a way that's misaligned with the objectives you want them to follow. But I think most of this comes down to choosing the right objectives for the language models, making sure that they follow those objectives as much as possible, and also monitoring their ability to support those objectives when they use things like emotional signals to steer the interaction one way or another.
- So there's the open-source versus closed-source side. And from there, you're just thinking through whether there are risks we want to control in terms of how people choose to use these models? - Yeah, I think there are a lot of use cases that would be beneficial for society. In your case of AI therapists, and in other customer interactions, using emotion and expressive signals to enhance the quality of those interactions is useful if it's done in the right way. Obviously, for any applications where there's a risk of deception or manipulation, we'd want to be quite careful about how we allow those use cases. What about the other way? How do you feel about humans deceiving AI?
Like, is it going to get harder with this kind of stuff? I think so. Do you have a particular use case in mind where we would want to deceive AI? I mean, hopefully. So last night I was reading Ready Player Two, the sequel to Ready Player One, and there's a chapter where the main character goes to an AI therapist. They make a really big thing of how he's the therapist from Good Will Hunting, played by Robin Williams. Anyway, it's very funny. But at one point the therapist asks him something like, "how have you been doing with whatever?" He doesn't want to talk about it, so he says, "I'm doing totally fine." And then in his narration he writes, "I was obviously lying." There's something really interesting there; I was thinking about it because it was literally an AI therapist and the whole thing. The author, I guess, hadn't considered the idea that the AI would know he's lying, right? But humans must lie to their therapists all the time, right? There's some control we all get from the ability to lie. And I do wonder if it would have really fundamental ramifications if we could no longer lie, because those we're interacting with would actually know. Do we almost lose some agency if we can't lie? You know? Yeah, I think that's fascinating. I think we'll definitely have to monitor that ability of these multimodal language models, to see the degree to which we can still get away with these white lies. I think it'll
depend on the application whether we want that or not. So I wonder if we can, in some sense... I mean, AGI aside, I'm imagining we'll have different forms of language models, multimodal language models, depending on the application. For some of them it might be very beneficial to have them tuned into expression, facial expressions and voice prosody, and for some we may very much not want them tuned into this, especially if they're an intermediary to another person that we might want to keep some distance from, maybe in a negotiation or something like that. We may not want to have an AI that's reading our every signal. - Yeah, that makes a lot of sense. It might just be use case by use case that we want to control it, go out of our way to say so. What do you think of the school or work examples? Like, what if my boss could tell I'm bored in a meeting? - Yeah, I think these are cases that have already been flagged by the AI Act in the EU as things we have to watch out for.
So I think those are definitely things we want to watch. - Yeah, you can kind of see it both ways, right? It's interesting because the narrative is usually the opposite: how come the AI can't tell my tone? I've heard people say to their Google Home things like "yeah, thanks a lot," and it replies "you're welcome," and you're like, oh, you really should have heard that, I was obviously being sarcastic. So I guess maybe the prosody dynamic is something along these lines: what we want in the first instance is, if I'm trying to communicate something with my voice, it's almost like it can be converted to words, like "I am being sarcastic." You want to communicate that, you are clearly intentionally communicating it; it's almost like you're giving a voice command of "I'm being sarcastic," and you want to make sure the AI picks it up. And that's pretty separate from, perhaps, "I hear from the tone of your voice that you're beginning to get frustrated, and in about five
minutes you're going to start yelling at me." Yeah, I think that's very true. There are other things that might not be as describable: maybe you just start talking more quickly, or you get a little bit more of a bored tone in your voice. Say you're doing something like an interactive podcast with the AI, where you're asking it to describe the news to you, and you've clearly already heard this story. It could potentially pick up on the fact that you're ready for the next topic just based on the way you're interacting with it. Yeah, I've got to say, though, I personally do worry a little bit about the agency side. You want to have agency when you interact with AI to some extent, right? There is that question of what agency is lost. But I guess there's still the option of just turning it on or off depending on whether you want it; it doesn't have to be on all the time just because it exists.
Yeah, I was going to say, I think the interesting broader societal question about LLMs is who sets the objective functions. It could be application specific. But in a lot of cases, at least for a lot of the applications we're interested in pursuing, the aim is improving human well-being in the long run, which is a lot to ask, but in terms of emotional well-being it's things you would describe like: "here are the states I want to be in in my life, I want to experience love, joy, and happiness," and all those things, "and here are the things that undermine that." If the AI is in general trying to nudge you towards the states that you, in a reflective frame of mind, would want to be in, then I think those are objectives that would be good to have. But it's an interesting question: for more specific applications, like your AI that's reading you a podcast or something, do we even want that? Can we set its objective function so that it's almost more neutral?
We just want it to not pick up on these tones of voice and really just optimize for what we're listening to, you know what I'm saying? Yeah. I mean, this is fascinating from an RL perspective. It's almost like: what if every application had some slight objective of "make me happy more and sad less"? Then, instead of having to pick up very explicit signals, in the back of your mind you're thinking, if there's something really small I could do that'll make the user smile, maybe it's a good idea, and if it makes them frown, maybe do it less. On the flip side, this is obviously a terrifying scenario too, right? Yeah. Well, you don't want to optimize for that at a micro level and just make them smile.
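(A toy sketch of the objective-function worry, purely illustrative and not anyone's actual system: a per-turn reward that mixes a task score with a small smile-based term. The point of the discussion is that naively cranking up the second term at a micro level is exactly the failure mode to avoid.)

```python
# Toy sketch of reward shaping for a conversational agent.
# All quantities are hypothetical; a real well-being objective would be defined over
# much longer horizons (months or years), not per-turn smiles.

def turn_reward(task_score: float, smile_delta: float, wellbeing_weight: float = 0.1) -> float:
    """Task reward plus a small bonus or penalty for the user's change in expression."""
    return task_score + wellbeing_weight * smile_delta

# A turn that solves the user's problem but makes them frown slightly:
print(turn_reward(task_score=1.0, smile_delta=-0.2))  # 0.98
# A turn that flatters the user but doesn't help at all:
print(turn_reward(task_score=0.0, smile_delta=0.8))   # 0.08
# Increasing wellbeing_weight would start rewarding smile-chasing over usefulness,
# which is the micro-optimization worry raised in the conversation.
```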
One interesting thing I read recently was a study, a mental health study, where they asked a group of people a very simple question, "how do you feel?", I believe once a day for a few weeks. It was literally just happy, sad, whatever, some set of labels. The control group just wasn't asked. That was the entire study. And what they found is that, on average, after a period of time, the population asked daily "how do you feel?" felt worse. We don't know exactly why, but the theory is that when you're asked at random points in the day and you reflect, you're probably not happy, because most people aren't happy most of the time. Not that you're sad; you're just engaged, or in flow. I don't know about you, but I code a lot, and when I'm coding, sometimes I'm happy, but a lot of the time, if someone asked me how I'm feeling, I'd be like, shut up, I'm working, you know? And I wonder whether this is actually the scary side of wanting to optimize for happiness, or for emotions: being engaged is not really an emotion. It's sort of an absence of emotion to some extent. Or I'm focusing on someone else's well-being, I'm trying to help someone else, I'm doing charity work. If we're focused on emotion, is there a chance that our objective function is now skewed in a bad way?
Yeah. I mean, I think we probably have to think deeply about how to choose that objective function at the right timescale, because you do want it more at the months-to-years timescale of optimizing for emotional well-being. That can look like a certain amount of flow states that may not look like happiness but that lead to self-reported satisfaction with life later on. And what's really interesting, as you're talking about this, is that in the therapy domain we've long wanted the ability to have these non-invasive readouts of people's emotional states throughout the day, over long periods of time. If you're suffering from depression or anxiety or PTSD, just going to the therapist once a week and describing how you're feeling is not a very good snapshot into your life. I've talked to therapists who would have loved to have these non-invasive abilities, with the person's permission obviously, to get a sense of how their emotions fluctuate throughout the day. That gives the therapist a bigger, better picture and understanding of what the person's going through and how to nudge things in the direction they want to go. I mean, I totally see that: if even asking the question "how do you feel?" is already invasive, then you're right, it seems way less invasive to listen to the tone of your voice or something, right?
It's almost like the less it's exposed to the end user, the better, in some ways. So, talking about Hume's demo: I guess you guys worked on this, you built an end-to-end LLM speech-to-speech use case, right? Where the AI actually understands your prosody from your voice and then responds appropriately. So hopefully when you speak in a happy voice, it speaks in a happy voice, right? Am I getting this right? Depending on the context, yeah, or if you're frustrated, it tries to pick up on that and steer the conversation in ways that respond to that frustration. Which is really cool. One of the big pieces of feedback I got from someone, the thing you were talking about before, was: "I noticed, in a non-invasive way, on the side of the screen, what my emotional state was, or what I was communicating with my voice." There's something already interesting about that: you weren't really expecting an AI to hear these things, and now suddenly you get to actually see it. You have to imagine that this sort of trend isn't going to end with Hume, and it's probably going to change how people interact with computers more generally. Yeah,
I think, for one, it's nice for people to be aware that their tone of voice, for instance, carries all this information, so even just raising that awareness is useful. But then I do think people will learn to interact with AI in a different way. We've learned to interact with Siri in a particular way, and we've obviously learned to interact with Google, even just the search bar and typing at a computer. And we're of the mind, or I'm of the mind, that in a lot of applications we probably want to interact with language models and AI the way we interact with humans. Potentially, the alternative is to learn a new form of interaction that doesn't do this. But I think it'd be more efficient, and it will leverage so much of what we already communicate, if we treat this as a sort of human-to-human conversation, as long as we stay aware that we're interacting with a language model and not a person. Well, I want to push back here. It's really funny: when I use ChatGPT, I'm personally never nice to ChatGPT, because I tend to think the model performs worse when I am.
Right. If I'm very "please, if you don't mind, could you...", then it responds the same way it would respond to someone who's hesitant like that. And if I say "do this," it does it. What's interesting, by contrast, is when my mom interacts with ChatGPT, she does all the "please" stuff. You don't know how often she'll end the conversation with "thank you"; that's the whole message, she'll just say "thank you." And of course it's not doing anything. She's obviously personified the AI in her mind. One ramification is that I think she'll get lower quality responses. Obviously we could tune the LLM accordingly, but she is personifying it, imagining it to be a human. There are a lot of levels to dive into here, but I think this is actually the crux of the discussion: we're changing how people interact with computers. We're making it more human, we're making it closer. We're moving from programming and clicking a mouse and a button to interacting the way you interact with a human. There are a lot of benefits; it's easier. There are a lot of downsides too, right? And I do worry about the idea of people interacting with computers the way they interact with humans,
personifying them, especially when computers don't have all the capabilities humans do. Yeah. So I think there are two parts to that. One is that I really do think there needs to be a distance: people need to be aware that they're interacting with an AI and not another human, and that it doesn't, as we talked about at the beginning, have these vicarious feelings or experience exactly what we're experiencing. So people need awareness that they're interacting with something else, not a human. And then there's the question of whether we can take advantage of all the natural things we do when interacting with each other, just to speed up the conversation, make it more fluid, and allow us to think and communicate more naturally, while being aware that we're not communicating with another human who's going to experience things the same way we do. So it's a tight line to walk, but I think that would be... I get you.
It's more like, what can we add to the dimensionality of the space? Instead of having a mouse you move around in two dimensions and click, what else can we add, things we've already been trained to do, that could add to the richness of the interaction? I get that: if an AI can hear I'm frustrated, then I don't have to tell it I'm frustrated, because it can hear it. On the flip side, there's the counterargument that maybe there's a benefit to commands. Maybe in the long term you want a Hume AI where you have to say "Hey, Hume," and you have to say it with every message, to remind you that if you don't say it, it won't hear you, because it's a computer, not a human. Yeah, I think a lot of people are in the process of figuring this out.
One of the things the big language model companies did was make sure it always says "I'm an AI language model," "I don't experience emotions," these kinds of things. Having those disclaimers is useful; the right amount of them is something we'll have to figure out, as is whether we need additional interactive features that constantly remind the user that it's a language model on the back end and not another human. I think that's something we'll figure out
as a whole industry. - Yeah, that's a good point. It's probably going to be everywhere. Can we talk about the uncanny valley? The idea of the uncanny valley, if I remember correctly, comes from the animation space. With very simple animations, like Rick and Morty or The Simpsons or whatever, it doesn't look like you're trying to be a person. But there's a scale where you get more and more photorealistic, looking more and more like humans, and people like it more and more the more realistic you get, until you hit this weird part called the uncanny valley, where people start disliking the animation. They'd rather it be less realistic, because it just feels wrong. It feels uncanny; it's almost there, almost real, but not quite real. So I guess the question is, do you think there's an uncanny valley in the speech domain? I do think there is, but maybe it doesn't seem as extreme as in other domains,
because, I mean, AI speech was not great for a long time, and yet we've been fine having it. And we've sort of gotten to the point now where, with these issues of voice cloning, you can get very, very naturalistic speech. And yet you can't really point to a point in between, where it sounds almost that natural, where people say, "oh, I can't listen to that, I don't want to interact with that kind of voice." So I wonder if it's a bit of a shallower valley in the case of speech, but I'm not sure what your thoughts are.
I think so. Some of the specific things: right now, compare Hume's AI with ChatGPT. ChatGPT has a voice-to-voice setting, and the way it works is that on the screen it shows you whether it's listening or not, when it starts listening and when it stops, then it's processing, and then it gives you one block response. It is a pretty realistic voice, and that already reaches people in a lot of ways. But there's a lot the AI doesn't understand: it doesn't understand the tone of your voice or the speed, it doesn't know if it mistranscribes something, it can't be interrupted, all the naturalistic things you can do with humans. It doesn't know if your voice sounds more masculine or feminine, or anything about your age. And then, incrementally, there's that scale of adding more.
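(For contrast, here's a skeleton of the cascaded pipeline being described, where the transcript is the only thing the LLM ever sees, so tone, speed, and speaker identity are dropped at the first step. Whisper and the OpenAI SDK are used as illustrative stand-ins, and `synthesize_speech` is a hypothetical placeholder, not a real API.)

```python
# Sketch of a cascaded voice assistant: ASR -> LLM -> TTS.
# Everything the LLM sees is plain text, so prosody (tone, speed, hesitation) is lost
# at the transcription step, and the reply comes back as one block of speech.
import whisper
from openai import OpenAI

asr = whisper.load_model("base")
llm = OpenAI()

def synthesize_speech(text: str) -> bytes:
    """Hypothetical TTS placeholder; a real system would call a speech-synthesis model."""
    raise NotImplementedError

def respond(audio_path: str) -> bytes:
    transcript = asr.transcribe(audio_path)["text"]   # tone of voice is discarded here
    reply = llm.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript}],
    ).choices[0].message.content
    return synthesize_speech(reply)                   # one block response, not interruptible
```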
So with Hume, you add one more dimension, which is that now we also know if you're frustrated, we also know if you sound down. But unless I'm mistaken, Hume's AI doesn't know the age of my voice, right? No. I mean, none of that is something we're explicitly reading out. - Yeah, but it could be. You guys could add that tomorrow if you wanted to. Who knows, there are probably a lot of dimensions you could add in. - Yeah, I'm sure there are. - So one thing, and I'm curious if you've even heard this in feedback on the product, but are there people interacting with it who are surprised that the AI is, quote unquote, ignoring things in their voice because it literally just doesn't know they're there? - Yeah, I haven't seen any of that feedback yet. I'm certainly not the only person looking at the feedback, so some people on the team are probably aware of it, but I haven't seen any direct feedback like, "hey, can you pick up on these other signals in the voice and loop that into the conversation?" I think there's already so much that it's doing that I think
people are surprised by this amount already. - Yeah, so some of the things that we see... I think we were talking about this before we got on today, but both of us, I think, have had the experience where you create a version of the AI, you ship it, people use it, they give feedback; then they think you shipped a new version, but you actually haven't, nothing's changed; they use it again, and they comment on all the things that you've changed in the AI. Do you think that's related to this topic? I think a little bit. People will read capabilities into these systems just from interacting with them. Part of what we're trying to do is be as clear as we can about the current state of the system and what it is and isn't capable of. So for us, just some examples, I don't know if you've seen any of these, but one thing our AI has done, very randomly: at some point someone asked, "could you set a reminder on my phone?" And it was like, "yeah, totally, what time do you want me to set it for?" "For instance, 8 PM." And it was like, "OK, cool, I just set it for you." None of this is true; it has no access to your phone. And the feedback we got from that conversation was, "I love that you guys introduced this new feature where it can set reminders on my phone." I know, it's definitely a broader problem of how to prompt language models and make sure the conversation goes in a way where the language model is being as truthful as possible about what its capabilities are. This is just a bigger problem with using language models in the back end of any application. So it's a really interesting problem, I think, for all of us. - So that one's really explicit. Another one we've had was, because of the topic I brought up,
the AI started speaking really quickly all of a sudden. - Yeah, that's fascinating. Part of the reason we're using explicit signals for different emotions is to gain a little bit more control over how the AI responds. Speech rate is something I'm personally looking into, and it is interesting, because these models are built similarly to language models: they have some flexibility in the way they're going to respond, and you don't always have control over that. So something like speeding up for an exciting topic is sometimes a desirable feature, but other times you don't want it to speed up, because that's going to convey the wrong signal to the user. Using larger language models to drive those characteristics of the speech is ultimately what we're trying to do. But it's quite funny and interesting when it does something a human would never do in that situation, yet it's trying to change the speed of its voice to match the situation. Yeah, but in this case, I imagine pretty often your AI is going to change speed for no particular reason. And if you haven't implemented it, for example, like we hadn't implemented it, and I don't know if you've implemented the speed thing. Is that there? Can it change its speed? It's partially implemented. So I just mean there could be a case like, "oh, I said something embarrassing, so it started speaking quickly," and you're like, no, it didn't speak quickly because you said something embarrassing; it can't detect embarrassment, and it can't control its speed. I'm making this up, I don't know if you can detect those things. But people experience it anyway, and then some of the frustration comes from that. So here's another one for you guys: occasionally the voice is just inconsistent.
Yeah. And interruptions are another one. By the way, do you allow for interruptions? Like, the user interrupting Ash... uh, sorry, the user interrupting the AI, the AI interrupting the user. Yeah, we do allow that right now. It's something we're working on; a lot of different groups are working on this. It's a surprisingly difficult problem. I don't really focus on it myself, we have great engineers who work on it, but it is surprisingly difficult to know when to interrupt. You probably don't know when to interrupt me right now based on when I'm going to stop speaking.
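(A naive sketch of one small piece of this, barge-in detection: pause playback when the microphone energy crosses a threshold while the assistant is speaking. Real systems use trained voice-activity and end-of-turn models; the frame source and player objects here are hypothetical stand-ins.)

```python
# Naive barge-in detection sketch: stop the assistant's audio if the user starts talking.
# `player` and `mic_frames` are hypothetical stand-ins for a real audio stack.
import numpy as np

ENERGY_THRESHOLD = 0.02  # hand-tuned in this toy; real systems use a VAD model instead

def frame_energy(frame: np.ndarray) -> float:
    """Root-mean-square energy of one audio frame (float samples in [-1, 1])."""
    return float(np.sqrt(np.mean(frame ** 2)))

def play_with_barge_in(player, mic_frames):
    """Play the assistant's reply, pausing as soon as the user appears to speak."""
    for frame in mic_frames:                  # e.g. 20 ms microphone chunks
        if not player.is_playing():
            break
        if frame_energy(frame) > ENERGY_THRESHOLD:
            player.pause()                    # hand the floor back to the user
            break
```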
And then on the flip side, when the user interrupts the AI, how do you train a language model to be okay with being interrupted, right? Yeah, that's tricky. The most capable language models right now do have the ability to pick up on their previous train of thought and continue the conversation, but it's something that requires careful prompting and training. So what do you even train on? If you train a model on an interruption, where it just stops speaking halfway through a sentence, wouldn't it then learn that it's okay to stop speaking halfway through a sentence? Yeah, if you were doing supervised fine-tuning on those types of transcripts, but I think there are probably other ways to get around that. Like prompting? Yeah. I guess it's interesting, because I'm trying to go through this set of behaviors that in my head are the ones where people would feel weird, and I guess part of your response is, yeah, but each one of these we could tackle one at a time. Yeah, I think so, and I think a lot of people out there are trying to tackle them. And do you think that voice-to-voice interaction with language models is the next thing, for the next six months, for a lot of these companies?
Do you interact with AI emotively when you personally speak to a Google Home, Alexa, ChatGPT, et cetera? Not really, in the sense that I'm not trying to, but I imagine I do. I don't really alter the way I... well, with the original ones, yeah. With Siri, for instance, I definitely just interact in a more neutral way, but we've learned those behaviors. So I think as we have more expressive agents, we'll relax back to the way we might communicate with each other over the phone or something. That's fascinating. I guess I never thought of it as a learned behavior, because we've really noticed that when people talk to our AI, they'll often talk in a way they would never talk to a human, where they say, very deliberately, "I was walking down the street and I saw a house," and they speak in this way because they know that they can, basically.
And I guess there is something here that is just learned behavior. Is that a bad thing? If you were to make the case against yourself right now, the case for "here's why we want to be able to interact with AI in a command format," where you're in control and you don't have to speak the way you speak to a human, what would that case be? Yeah. I mean, humans are really adaptable. I can't imagine exactly how this would work, but I could imagine developing a different interface where you learn, say, to use your eyes if you're interacting with a screen, or other aspects of your voice, to trigger more information in ways you wouldn't with another human, and that could make the conversation extremely efficient, but in a totally new way. I think humans could maybe learn to do this; I guess we'll see in the future whether people develop these applications. But I think it's easier to leverage our natural way of communicating.
Then we don't have to switch between talking to humans and talking to our LLM applications. But yeah. I do tend to think... it's hard for me to square this image that we use emotion in our voice for the sake of efficiency. That statement hits me in a way where I'm like, I don't feel like the way I'm talking to you right now is about conveying information efficiently. I feel like it's just me interacting, and part of the way you receive it is very feeling-centric, right? Not that you actually got a bunch more information; to convey the same information as what's in my voice probably wouldn't take that many bits, but it wouldn't hit you the same way, you being another human, you know? Yeah. I don't know.
I think it's something we should definitely look into on the research side of things. But I get the sense that, compared to just writing or emailing or texting, there's so much more I can convey when I just pick up the phone and call my friend. A lot of the time I don't even really need to say the contents; it's just the way I'm saying it, and he'll pick up on what I'm actually trying to say. Some of that is maybe just about close friendships, but some of it could also be information efficiency, right? You convey a lot more information in a five-second phone call than you do otherwise; five seconds is a long time to convey a lot of information. Yeah, you're also literally saying more words on the call. People say "it's so much easier for me to communicate this on a call," but yeah, you also said way more words on that call. Yeah, exactly. Part of the reason I think voice interfaces are useful in general, whatever the degree to which prosody itself carries additional information (and I have the intuition that it does), is that they just allow for an easier interface, and then allow for these other characteristics, interruption and those kinds of things, which I think will go a long way toward improving the efficiency of LLM-human interactions.
- So it is interesting, I guess, this dimensionality thing. You can use your voice, now you can add prosody; you could also have a mouse where you show in a two-dimensional plane what you're looking at and click to indicate you want to select something; maybe now you're using your eyes to indicate something. So prosody kind of just adds a dimension, and it happens to be a dimension we already use every day in our voice. It's interesting. I guess what we'll find out is: is it more natural for people to just show they're frustrated than to say "I'm frustrated"? We'll see. I don't know, I have mixed feelings here. It's interesting because we're obviously working on an AI therapist, which means these questions come to us all the time too. Like, do you want to analyze prosody in the voice? And I have such mixed feelings, because I get why people would want it. But at the same time, I also kind of think there should be a wall: when you are relating to a computer, it's not the same feeling, or at least for a while it won't be the same feeling you get back as the feeling you get talking to another human. And I almost do feel like having those walls is valuable, like knowing that when you talk to an AI therapist about an issue, you're going to get back computer advice, and you're not going to be able to get the same connection as you will with a human. Hopefully you can still gain something, but I think, from my perspective, there is some reason to worry. If you're not building an AI therapist, it's a little bit less worrisome, but for us, we want to make sure that you don't think the AI can feel your emotions, you know? Yeah. Because we want you to find humans that do. Yeah, exactly. I think it's very application specific. And to the degree that, as we were talking about earlier, different applications should have the ability to turn on or off the different signals that these language models can utilize in their interactions, there are certain situations where having a wall, or a different sort of interface with language models, might be beneficial, in your case, and
I'm sure there are a lot of other cases where it might also be true. - Yeah, I agree with that. All right, so one last thing to go through. We've talked a lot about emotion, a little bit about empathy, and I don't think we really need to dive too far down the empathy path, though it is really interesting. One thing I do wonder about: there's a big debate in academic ML research, which is basically, do models actually gain instrumental skills, right? On the one hand, the optimistic side, there's the idea that models gain instrumental skills: if you teach reasoning, if you show enough math, the model actually learns how addition works, multiplication, division; it doesn't just memorize. And there's the flip-side argument, which is, no, it doesn't actually learn those things, all it learns to do is memorize. Do you think there's any legitimacy to the memorization argument? And if so, does that change anything for AI learning emotion?
So I think it does more than just memorize, but there's an interesting middle ground of having the entire training data set and interpolating in between it. It's an interesting question how much of human knowledge, how much of mathematical and emotional knowledge, is just this interpolation between examples in some high-dimensional space, versus some truly novel construction on top of existing knowledge that branches out, which would be more like extrapolation. And it's interesting: because we can't really imagine how much information is on the internet, it's really hard to say how much of this mathematical ability can just be stitched together by interpolating existing examples and how much of it needs to be a genuinely new kind of knowledge construction. For a lot of emotional understanding, there's been so much discussion on the internet that I think you can get a lot of the way there not by memorizing examples, but by interpolating between existing examples. And then it's an open question how much it needs to engage interactively with the environment to learn more, or go through some sort of interactive training with experts in these domains. It'll be interesting to see where we go beyond pre-training, as language models interact, reason, and get true and false feedback from the environment,
and how much that's going to build on top of the corpus they've been trained on. Yeah. I guess on the understanding side of something like emotion, it makes sense: if you're hearing my voice, you've probably heard enough voices, even as a human, that you're interpolating, like, "this sounds like a thing I've heard a million times in similar ways." On the other side, if you're generating with a language model or a voice model, where you're injecting something, that's where I wonder. I think there is something to emotional reasoning, and that's where I'd ask: is there a risk, or is there something we could even sit here and find out factually, of whether models are doing emotional reasoning or whether they're, quote unquote, cheating? Are they just able to say "it sounds like you feel sad," or do they actually emotionally reason: "what is it that might have made you feel sad, and how can I change what I say next to make sure you no longer feel sad?"
You know? Yeah, I think good evaluations on this would be useful; developing an actual benchmark for this kind of thing would be really useful to look at. A lot of what we test right now is surface-level features. It's essentially: if you gave this audio clip to some human, how would they describe it? And that seems to me much more like interpolation; obviously there's no reasoning behind that single audio clip. In a larger context, then yeah, it's a big question whether the prediction about how the human is expressing themselves right now is driven just by the acoustic characteristics in that moment, or by the whole context of the conversation. This is something I'm really interested in figuring out and evaluating, but it's very much an open problem. So we should expect to see the answers really soon. In this space, what are the big developments that you're most excited to see happen soon?
Well, I'm really excited, obviously, about the capabilities of multimodal language models. GPT-4 and Gemini have amazing capabilities with images; I think the audio domain is lagging a little bit behind, but I'm really curious to see where that goes. That's one of the most interesting things. You and I are also both interested in language model reasoning, so we'll see where that goes. I think, actually, a long time back, in our first conversation I believe, I asked you, "when do you think we'll get to speech-to-speech models that are just end-to-end, Whisper style: enter speech, get speech back?" Yeah, it's so hard to estimate when that's going to happen. I think ultimately something like that will happen; end-to-end models are just the way things move. It's just that right now it's much more of a hierarchical, pieced-together system that everyone's using, and I think that's mostly just practical. Audio is such a high-bit-rate signal, and you just can't have a traditional language model operate at that level yet. So I would give it a few years,
I would say. Although, I'm thinking: OpenAI released Whisper not that long ago, and one thing Whisper can literally do is take in speech as audio and output a translation into another language. Yeah. So it is like an end-to-end model that not only receives speech, it actually understands it well enough that it can translate it into another language and output it: speak in Portuguese, get English text back.
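(What's being described is essentially Whisper's built-in translate task; a minimal sketch with the open-source `whisper` package, assuming a hypothetical Portuguese audio file.)

```python
# Minimal sketch: Portuguese speech in, English text out, using OpenAI's open-source Whisper.
# Assumes `pip install openai-whisper`; "portuguese_clip.mp3" is a hypothetical file.
import whisper

model = whisper.load_model("base")
result = model.transcribe("portuguese_clip.mp3", task="translate")  # translate -> English text
print(result["text"])
```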
That is so cool. It is, yeah, it makes me think exactly that. So it can be done in a local context, for sure, for a very short context. I am a little bit skeptical that it can be scaled up to full paragraphs, or not paragraphs, full conversation-level context, but we'll see; a lot of people are working on these things. So it could be a long way away. It also could be a bitter-lesson moment where we're all a little sad about all the work we did. And then, just the last question: any books or papers you'd recommend for people to read if they want to learn more about this space? Well, I would suggest people actually go to our website and look at all the scientific papers that a lot of my colleagues have done in this space. A lot of them, before moving into the AI space, were doing research on making emotion science more rigorous and quantitative.
I think a lot of those papers are a great foundation. Awesome. Well, thanks so much for joining us, Chris. Yeah, thanks for having me. All right, that was Chris Gagné on AI and emotion. That was a lot of fun. As Chris said, if you want to check out more from Hume, you can check out their website; they have some research shared there from Chris and his colleagues. Also, we'd love to hear any feedback you have, so feel free to reach out with any ideas or notes at daniel at slingshot dot xyz.
