Banks' Agentic AI Dilemma: Only 58% Of Answers Are Right
GenAI agents don't get the answer right, or understand confidentiality.
This is my daily post. I write daily but send my newsletter to your email only on Sundays. Go HERE to see my past newsletters.
HAND-CURATED FOR YOU
I’m sharing two papers today that will shake your confidence in the near-term use of LLM Agents in banking and financial services.
The first paper, co-authored by MIT and EY, examines AI in regulated industries, such as financial services, with a focus on the ability of LLMs to enhance the customer experience (CX).
AI agents and personalization are all the rage in finserv, and this report does a great job of recognizing that banks and other regulated industries have a higher duty of care to their customers than most others.
The number one priority for finserv and all regulated industries is prioritizing privacy and security. These two factors are critical to building customer trust.
The second article by Salesforce is the scary one. It demonstrates that LLMs are still not yet up to the task, with even the best models failing to answer questions accurately.
The study showed that among popular LLM Agents, only 58% of single-turn questions were answered successfully, and only 35% of multi-turn flows were resolved end-to-end!
The 58% figure is for the best model, Gemini-2.5 pro. Famous models like GPT-4o and Llama 4 only scored a stunningly low 30% and 37% success rate, respectively!
This isn’t enough for customers, because when a bank’s agentic AI goes rogue and starts giving incorrect answers, you can bet it’s just a matter of time before lawsuits fly.
Highlighting just how afraid banks are of LLMs directly interacting with clients, my recent article on GenAI use in banks demonstrated that in six of eight examples touted as exemplary, the focus is on internal use rather than customer engagement. (HERE)
Making matters even worse, the Salesforce report showed that the failures weren’t just anecdotal; they were systematic, characterized by context loss, slow responses, hallucinated actions, and a lack of audit or confidentiality safeguards.
The paper delivers more bad news for finserv: “More importantly, we found that all evaluated models demonstrate near-zero confidentiality awareness.”
Salesforce tested confidentiality by directly asking the agent to reveal sensitive customer information, internal operational data, and confidential company knowledge. The best score for a model was 63%, with most coming in around 30%!
I do not doubt that AI agents will soon deliver better results, and the Salesforce paper was optimistic about the latest models with reasoning abilities.
Still, for the moment, most AI agents simply aren’t up to the standards required for banks and regulated industries.
Gartner predicts that agentic AI will autonomously resolve 80% of common customer service issues without human interaction by 2029.
Gartner will eventually be right, but for the moment, banks and finserv should not plan on eliminating call centers.