Fargo Handled More Than 200 Million Requests Without Sending Customer Data to an LLM

Not a pilot. Not a proof of concept. A production-grade generative AI assistant that handled 245.4 million interactions in 2024 alone, more than doubling its original forecast, and it did so without ever sending sensitive customer data to a language model.

Most banks are still running internal trials and whiteboarding risk scenarios. Wells Fargo built Fargo, rolled it out inside the app, and scaled it to handle everyday banking tasks via voice and text: bill pay, transfers, transaction lookups, account questions. The tool averages multiple interactions per session, and usage has exploded: from 21.3 million interactions in 2023 to 245.4 million in 2024, totaling 336 million since launch.

Fargo works not because of a flashy model, but because of architecture. Customer speech is transcribed locally. That text is then scrubbed and tokenized by Wells Fargo’s own stack, with a small language model (SLM) detecting PII. Only then does the system call out to Google’s Gemini 2.0 Flash, which extracts the user’s intent and relevant entities. The model never sees raw input, names, or account data.

“The orchestration layer talks to the model,” said Wells Fargo CIO Chintan Mehta in an interview with VentureBeat. “We’re the filters in front and behind.”

This isn’t semantic hair-splitting. It’s what makes Fargo more than a demo. The model doesn’t compute balances or move money; it just understands what the user is trying to do. “All the computations and detokenization, everything is on our end,” Mehta said. “Our APIs… none of them pass through the LLM. All of them are just sitting orthogonal to it.”
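In other words, the LLM sits between two bank-controlled filters: PII is swapped for opaque tokens before the call, and everything that touches real data happens afterward on the bank’s own systems. The sketch below is a minimal illustration of that pattern, not Wells Fargo’s code; the regex-based PII stub, the mock intent call, and the bank-side executor are all hypothetical stand-ins.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Session:
    # Maps placeholder tokens back to real values; this mapping never leaves the bank's stack.
    token_map: dict = field(default_factory=dict)


def scrub_pii(text: str, session: Session) -> str:
    """Stand-in for the SLM-based PII detector: swap sensitive spans for opaque tokens."""
    def repl(match):
        token = f"<PII_{len(session.token_map)}>"
        session.token_map[token] = match.group(0)
        return token
    # Toy rule: mask anything that looks like an account number.
    return re.sub(r"\b\d{8,}\b", repl, text)


def call_llm_intent(scrubbed_text: str) -> dict:
    """Stand-in for the Gemini call: returns only intent and entities, never computes anything."""
    return {"intent": "check_balance", "entities": {"account": "<PII_0>"}}


def execute_on_bank_side(intent: str, entities: dict) -> str:
    """All computation and data access happen on bank systems, orthogonal to the LLM."""
    return f"executed {intent} for {entities}"


def handle_utterance(raw_text: str) -> str:
    session = Session()
    scrubbed = scrub_pii(raw_text, session)          # filter in front of the model
    parsed = call_llm_intent(scrubbed)               # only scrubbed text reaches the LLM
    detokenized = {k: session.token_map.get(v, v)    # filter behind: detokenize in-house
                   for k, v in parsed["entities"].items()}
    return execute_on_bank_side(parsed["intent"], detokenized)


print(handle_utterance("What's the balance on account 12345678?"))
```

The point of the pattern is that a malformed or hallucinated model response can, at worst, produce a wrong intent label; it never leaks a name and never touches an account directly.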

Contrast that with other major banks. At Citi, analytics chief Promiti Dutta said the risks of external-facing LLMs are still too high. Speaking at a VentureBeat event last year, Dutta described a system where assist agents support internal tasks only, precisely because of hallucination risk and data sensitivity.

Wells Fargo made hallucinations irrelevant. Its orchestration design keeps generative models on a tight leash, far away from any actual decision-making or customer data. And it’s working, especially with Spanish-speaking users: since the Spanish-language rollout in September 2023, Spanish usage has grown to account for more than 80% of Fargo’s total interactions.

This is not a one-model operation, either. Mehta’s team has adopted a “compound systems” approach: Google’s Gemini 2.0 Flash powers Fargo, but internally Wells Fargo also uses smaller models such as Llama, and OpenAI models are brought in as needed. “We’re poly-model and poly-cloud,” Mehta said. While the bank leans on Google Cloud, it also uses Microsoft Azure. Model performance, he argued, has plateaued: the difference between top-tier models is now marginal. What matters is how they’re orchestrated.

Still, some models maintain advantages in specialized tasks. According to Mehta, Anthropic’s Claude 3.7 Sonnet and OpenAI’s o3-mini remain strong in coding, while OpenAI’s o3 excels at deep research. But the meaningful gap, in his view, is context window size. On that front, Gemini 2.5 Pro’s one-million-token window gives it a decisive edge in retrieval-augmented generation (RAG), where pre-processing large volumes of unstructured data otherwise adds latency. “Gemini has absolutely killed it when it comes to that,” Mehta said.
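The practical difference is whether a document set has to be chunked, indexed, and retrieved before the model ever sees it. Below is a rough, hypothetical sketch of that routing decision; the token estimate and the keyword “retriever” are toy stand-ins for a real tokenizer and embedding index.

```python
CONTEXT_WINDOW_TOKENS = 1_000_000  # roughly Gemini 2.5 Pro's advertised window


def estimate_tokens(text: str) -> int:
    return len(text) // 4  # crude ~4 characters-per-token heuristic


def build_prompt(question: str, documents: list) -> tuple:
    """Route between long-context stuffing and classic RAG based on corpus size."""
    total = sum(estimate_tokens(d) for d in documents)
    if total < CONTEXT_WINDOW_TOKENS * 0.8:  # leave headroom for the answer
        # Long-context path: skip chunking/embedding and pass the documents directly.
        return "long_context", question + "\n\n" + "\n\n".join(documents)
    # RAG path: chunk the corpus and keep only the passages that look most relevant.
    chunks = [d[i:i + 2000] for d in documents for i in range(0, len(d), 2000)]
    words = question.lower().split()
    relevant = sorted(chunks, key=lambda c: -sum(w in c.lower() for w in words))[:5]
    return "rag", question + "\n\n" + "\n\n".join(relevant)


mode, prompt = build_prompt("What fees apply to wire transfers?",
                            ["...fee policy and account agreement text..."])
print(mode, estimate_tokens(prompt))
```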

But Fargo isn’t even the most autonomous thing Wells Fargo has built. In a recent internal project, the bank used a network of interacting AI agents, some based on open-source frameworks like LangGraph, to re-underwrite 15 years of archived loan documents. The agents retrieved documents, extracted key data, matched it against systems of record, and executed downstream calculations. A human reviewed the final output, but most of the work ran independently.
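Since the article names LangGraph, here is a minimal sketch of how such a retrieve, extract, reconcile, calculate pipeline might be wired together with LangGraph’s StateGraph; the node bodies below are placeholder stand-ins, not the bank’s actual agents.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, END


class LoanState(TypedDict, total=False):
    doc_id: str
    document: str
    extracted: dict
    matches_record: bool
    calculations: dict


def retrieve(state: LoanState) -> dict:
    # Stand-in: pull the archived loan document out of storage.
    return {"document": f"contents of {state['doc_id']}"}


def extract(state: LoanState) -> dict:
    # Stand-in: an LLM call would pull borrower, principal, rate, term, and so on.
    return {"extracted": {"principal": 250_000, "rate": 0.045, "term_years": 30}}


def reconcile(state: LoanState) -> dict:
    # Stand-in: compare extracted fields against the system of record.
    return {"matches_record": True}


def calculate(state: LoanState) -> dict:
    # Stand-in: deterministic downstream underwriting math, done outside the LLM.
    principal = state["extracted"]["principal"]
    monthly_rate = state["extracted"]["rate"] / 12
    n_payments = state["extracted"]["term_years"] * 12
    payment = principal * monthly_rate / (1 - (1 + monthly_rate) ** -n_payments)
    return {"calculations": {"monthly_payment": round(payment, 2)}}


graph = StateGraph(LoanState)
for name, fn in [("retrieve", retrieve), ("extract", extract),
                 ("reconcile", reconcile), ("calculate", calculate)]:
    graph.add_node(name, fn)
graph.set_entry_point("retrieve")
graph.add_edge("retrieve", "extract")
graph.add_edge("extract", "reconcile")
graph.add_edge("reconcile", "calculate")
graph.add_edge("calculate", END)

app = graph.compile()
final_state = app.invoke({"doc_id": "loan-1998-00042"})
print(final_state["calculations"])  # a human reviews the output before anything is applied
```

In the real system, each step would fan out across thousands of archived documents, with the human review at the end mirroring the oversight the bank describes.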

And Mehta is already looking past current model comparisons. While everyday tasks are now handled reliably by most systems, reasoning is where differentiation still exists. Some models “do it better than others, and they do it in different ways,” he said. That’s where Wells Fargo is now evaluating next steps.

Meanwhile, over at Bank of America, the AI strategy looks less flashy but no less consequential. Aditya Bhasin, Chief Technology & Information Officer, says AI is already transforming operations. “Our use of AI at scale and around the world enables us to further enhance our capabilities, improve employee productivity and client service, and drive business growth.”

Bank of America launched its virtual assistant Erica in 2018. Since then, it has handled over 2.5 billion client interactions, with 20 million active users today. Unlike many companies still drawing up AI roadmaps, Bank of America has had one in motion for seven years.

That infrastructure is now supporting internal systems, too. Erica for Employees, launched in 2020, became essential during the pandemic and now handles tasks like password resets, device activations, payroll questions, and HR form retrieval. Over 90% of Bank of America employees use it, and it has reduced IT service desk calls by more than 50%. The system is being expanded in 2024 with search and generative AI capabilities that will eventually allow employees to get answers on Bank of America products and services in natural language.

The bank’s approach emphasizes human oversight, transparency, and accountability, but make no mistake: this is large-scale automation in action.

All of this lands at a crucial moment for Google. This week’s Google Cloud Next conference in Las Vegas is set to spotlight a wave of agentic AI efforts. CEO Thomas Kurian has already signaled the future: agents that “connect with other agents” to “achieve specific goals.” That’s not a vision. It’s already reality, just not where most people are looking.

Back at Wells Fargo, Mehta warned against the industry’s distraction with novelty. “We have to be very thoughtful about not getting caught up with shiny objects,” he said. The models are good enough. The real bottleneck, he argued, won’t be GPUs or benchmarks. It’ll be something much more basic.

“The constraint isn’t going to be the chips,” Mehta said. “It’s going to be power generation and distribution. That’s the real bottleneck.”
