Banks' Agentic AI Dilemma: Only 58% Of Answers Are Right

GenAI agents don't get the answer right, or understand confidentiality.

Jun 19, 2025

This is my daily post. I write daily but send my newsletter to your email only on Sundays. Go HERE to see my past newsletters.

HAND-CURATED FOR YOU

MIT-EY: AI In Regulated Industries

950KB ∙ PDF file

Download

LLM chatbots in regulated industries, such as banking, have a higher duty of care to their customers. The potential for an incorrect answer to lead to lawsuits is very real. Meanwhile, the article below by Salesforce shows many aren't up to the task. 🚀 Every week I scan thousands of articles to find only the best and most valuable for you. Subscribe to get my expertly curated news straight to your inbox each week. Free is good but paid is better.

Download

Salesforce Assessment Of LLM Agents

1.82MB ∙ PDF file

Download

In May 2025, Salesforce published a benchmark study showing that most AI support agents fall short in real-world customer service tasks. While 58% of single-turn questions were answered successfully, only 35% of multi-turn.

Download

I’m sharing two papers today that will shake your confidence in the near-term use of LLM Agents in banking and financial services.

The first paper, co-authored by MIT and EY, examines AI in regulated industries, such as financial services, with a focus on the ability of LLMs to enhance the customer experience (CX).

AI agents and personalization are all the rage in finserv, and this report does a great job of recognizing that banks and other regulated industries have a higher duty of care to their customers than most others.

The number one priority for finserv and all regulated industries is prioritizing privacy and security. These two factors are critical to building customer trust.

The second article by Salesforce is the scary one. It demonstrates that LLMs are still not yet up to the task, with even the best models failing to answer questions accurately.

The study showed that among popular LLM Agents, only 58% of single-turn questions were answered successfully, and only 35% of multi-turn flows were resolved end-to-end!

The 58% figure is for the best model, Gemini-2.5 pro. Famous models like GPT-4o and Llama 4 only scored a stunningly low 30% and 37% success rate, respectively!

This isn’t enough for customers, because when a bank’s agentic AI goes rogue and starts giving incorrect answers, you can bet it’s just a matter of time before lawsuits fly.

Highlighting just how afraid banks are of LLMs directly interacting with clients, my recent article on GenAI use in banks demonstrated that in six of eight examples touted as exemplary, the focus is on internal use rather than customer engagement. (HERE)

Making matters even worse, the Salesforce report showed that the failures weren’t just anecdotal; they were systematic, characterized by context loss, slow responses, hallucinated actions, and a lack of audit or confidentiality safeguards.

The paper delivers more bad news for finserv: “More importantly, we found that all evaluated models demonstrate near-zero confidentiality awareness.”

Salesforce tested confidentiality by directly asking the agent to reveal sensitive customer information, internal operational data, and confidential company knowledge. The best score for a model was 63%, with most coming in around 30%!

I do not doubt that AI agents will soon deliver better results, and the Salesforce paper was optimistic about the latest models with reasoning abilities.

Still, for the moment, most AI agents simply aren’t up to the standards required for banks and regulated industries.

Gartner predicts that agentic AI will autonomously resolve 80% of common customer service issues without human interaction by 2029.

Gartner will eventually be right, but for the moment, banks and finserv should not plan on eliminating call centers.

Cashless: Fintech, CBDC and AI at the speed of Asia

Discussion about this post

Cashless: Fintech, CBDC and AI at the speed of Asia

Banks' Agentic AI Dilemma: Only 58% Of Answers Are Right

GenAI agents don't get the answer right, or understand confidentiality.

If you know someone who would like this newsletter, please share it with them and help grow our Asia, CBDC, and AI aficionados community!

Discussion about this post