This is super interesting. But I have to wonder how much it costs on the back end - it sounds like it’s essentially just running a boatload of specialized agents, constantly, throughout the whole interaction (and with super-token-rich input for each). Neat for a demo, but what would it cost to run this for a 30-minute job interview? Or a 7-hour deposition?
Another concern I’d have is bias. If I am prone to speaking loudly, is it going to say I’m shrill? If my camera is not aligned well, is it going to say I’m not making eye contact?
So the conversational agent already runs on a provisioned chunk of compute, but that chunk isn't utilized to 100% of its provisioned capacity. For this perception system we're taking advantage of the spare compute left over on what's provisioned for the top-level agent, so turning this on costs nothing "extra".
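To make that concrete, here's a rough Python sketch of the kind of opportunistic scheduling I mean. The names (`OpportunisticPerception`, `monitor.utilization()`, `model.describe()`) are purely illustrative, not our actual code; the point is just that perception passes only run when the provisioned slice has headroom, so the marginal cost stays flat:

```python
import time

# Hypothetical illustration of "use only the spare compute" scheduling.
# None of these names come from the actual system; they just sketch the idea.

UTILIZATION_CEILING = 0.8  # only run perception when the main agent is below this

class OpportunisticPerception:
    def __init__(self, monitor, perception_model):
        self.monitor = monitor          # reports current utilization of the provisioned slice
        self.model = perception_model   # small perception model sharing that slice
        self.pending_frames = []

    def offer(self, frame):
        """Queue an audio/video frame; it may be dropped if we never get headroom."""
        self.pending_frames.append(frame)

    def tick(self):
        """Called on the agent's event loop; runs perception only on spare capacity."""
        if not self.pending_frames:
            return None
        if self.monitor.utilization() >= UTILIZATION_CEILING:
            return None  # main agent needs the compute; skip this tick, cost stays flat
        batch = self.pending_frames[-4:]  # most recent frames only; older ones are stale
        self.pending_frames.clear()
        start = time.monotonic()
        description = self.model.describe(batch)
        latency_ms = (time.monotonic() - start) * 1000
        return description, latency_ms
```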
Bias is a concern for sure, though it adapts to your speech patterns and behaviors over the course of a single conversation. So if it flags you as not making eye contact because, say, your camera is on a different monitor, it'll make that mistake once and not refer to it again.
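Roughly, the adaptation works like per-conversation calibration: if a signal fires on basically every frame, it's treated as your baseline rather than an event worth reporting. A toy Python sketch of that idea (names and thresholds are illustrative only, not the actual implementation):

```python
from collections import defaultdict

# Sketch of per-conversation adaptation: signals that turn out to be the speaker's
# baseline (off-axis camera, naturally loud voice) stop being reported once they
# look persistent rather than reactive.

class SessionCalibration:
    def __init__(self, persistence_threshold=0.9):
        self.observations = defaultdict(list)  # signal name -> recent boolean readings
        self.suppressed = set()                # signals treated as baseline, not events
        self.persistence_threshold = persistence_threshold

    def observe(self, signal, fired):
        self.observations[signal].append(fired)
        window = self.observations[signal][-20:]
        # If a signal fires almost every frame, it's a baseline trait, not a reaction.
        if len(window) >= 10 and sum(window) / len(window) >= self.persistence_threshold:
            self.suppressed.add(signal)

    def should_report(self, signal):
        return signal not in self.suppressed

# Example: "no_eye_contact" fires constantly because the camera is on another monitor.
cal = SessionCalibration()
for _ in range(15):
    cal.observe("no_eye_contact", True)
print(cal.should_report("no_eye_contact"))  # False: treated as baseline for the rest of the session
```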
Hmm... My first thought is: great, now not only will HR/screening/hiring hand off the reading/discerning tasks to an ML model, they'll also outsource the things that require any sort of emotional understanding (compassion, stress, anxiety, social awkwardness, etc.) to a model.
One part of me is inclined to think "good, take some subjectivity away from a human with poor social skills", but another part is repulsed by the concept, because we've seen how otherwise capable humans will defer to an LLM out of perceived "expertise" in the machine, or out of laziness (see the recent kerfuffles in the legal field over hallucinated citations, etc.)
Objective classification in CV is one thing, but subjective identification (psychology, pseudoscientific forensic sociology, etc.) via a multi-modal model triggers a sort of danger warning in me as an initial reaction.
Neat work, though, from a technical standpoint.
Truly appreciate the feedback. It's an interesting concept to explore: deferring human "expertise" to technology has been happening for years (and has most definitely accelerated recently), and we've found ways to adapt to / abstract over the work being deferred, but the growing pains are probably most acute when that deferment happens rapidly, as in the case of AI.
Definitely don't want this to turn into a Matt Damon in Elysium type of situation, with that parole officer scene, hahah (which would stem from poor integration of such subjective signals into existing workflows, more so than from the availability of those signals).
As for emotional intelligence, I personally see it as a prerequisite for any voice / language model that interacts with humans: just as an autonomous car has to be able to identify a pothole, a voice / video agent has to be able to navigate a pothole in a conversation.
High level: a rolling buffer that uses the spare compute we're allocated for a conversation to achieve <80ms p50 results, with signals labeled from raw conversation data used to align a small language model that produces these natural-language descriptions.
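My reading of that, as a hedged Python sketch (the names `RollingPerceptionBuffer` and `small_lm.summarize` are mine, not the actual system): keep only a bounded window of recent per-frame signals, so every call to the small aligned model has a small, fixed-size input, which is what makes a sub-80ms p50 plausible.

```python
from collections import deque

# Rough sketch of the rolling-buffer design described above (my reading of it,
# not the actual implementation): keep only the last few seconds of per-frame
# signals, and have a small aligned language model turn that window into a short
# natural-language description on each pass.

WINDOW_SECONDS = 5
FRAMES_PER_SECOND = 10

class RollingPerceptionBuffer:
    def __init__(self, small_lm):
        self.buffer = deque(maxlen=WINDOW_SECONDS * FRAMES_PER_SECOND)
        self.small_lm = small_lm  # fine-tuned on signals labeled from raw conversation data

    def push(self, frame_signals):
        """frame_signals: dict of per-frame features, e.g. prosody, gaze, facial cues."""
        self.buffer.append(frame_signals)

    def describe(self):
        """Summarize the current window into one natural-language line."""
        if not self.buffer:
            return None
        # The deque bounds the input size on every call, so latency stays predictable.
        return self.small_lm.summarize(list(self.buffer))
```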
Candidate: That's the hotel.
HR: What?
Candidate: Where I live.
HR: Nice place?
Candidate: Yeah, sure. I guess. Is that part of the test?
HR: No. Just warming you up, that's all.