Jagged Intelligence Is What’s Wrong With Enterprise AI, According to Salesforce

Today’s AI is jagged, so we need to work on that.

Salesforce is addressing a core obstacle in deploying artificial intelligence across business environments: the inconsistency of AI performance in unpredictable, real-world scenarios. The company calls this phenomenon “jagged intelligence”: AI systems whose high performance on certain tasks fails to translate into reliable execution across enterprise functions.

During a detailed research briefing, Salesforce AI Research unveiled new models, benchmarks, and agent frameworks aimed at closing this gap. The company’s goal is to improve not only the capabilities but also the consistency of AI systems operating in business-critical contexts, a standard it now defines as “Enterprise General Intelligence,” or EGI.

“While LLMs may excel at standardized tests, plan intricate trips, and generate sophisticated poetry, their brilliance often stumbles when faced with the need for reliable and consistent task execution in dynamic, unpredictable enterprise environments,” said Silvio Savarese, Salesforce’s Chief Scientist and Head of AI Research.

Savarese emphasized that Salesforce is not pursuing Artificial General Intelligence (AGI), but instead focusing on EGI AI agents designed specifically for business demands, where operational consistency and contextual understanding are essential. “While AGI may conjure images of superintelligent machines surpassing human intelligence, businesses aren’t waiting for that distant, illusory future. They’re applying these foundational concepts now to solve real-world challenges at scale,” he added.

A central part of this initiative is the SIMPLE dataset, a newly introduced public benchmark composed of 225 straightforward reasoning questions. The benchmark was designed to quantify and surface the irregularities in AI behavior—making “jaggedness” observable and measurable.

“Today’s AI is jagged, so we need to work on that. But how can we work on something without measuring it first? That’s exactly what this SIMPLE benchmark is,” explained Shelby Heinecke, Senior Manager of Research at Salesforce.
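
Salesforce did not publish SIMPLE’s scoring details in this announcement, but the idea of making jaggedness measurable is easy to sketch: ask a model the same simple questions several times and track not only how often it is right, but how often it agrees with itself. The schema, helper names, and scoring below are illustrative assumptions, not the published SIMPLE methodology.

```python
import statistics

# Illustrative sketch only: the real SIMPLE benchmark's format and scoring
# are not described in the article. `ask_model` is a stand-in for any LLM
# call that returns a short answer string.
def ask_model(question: str) -> str:
    raise NotImplementedError("plug in your model client here")

def measure_jaggedness(items: list[dict], trials: int = 5) -> dict:
    """items: [{"question": ..., "answer": ...}, ...] (assumed schema)."""
    accuracies, consistencies = [], []
    for item in items:
        answers = [ask_model(item["question"]).strip().lower() for _ in range(trials)]
        accuracies.append(statistics.mean(a == item["answer"].lower() for a in answers))
        # A "jagged" model is right sometimes and wrong other times on the
        # same easy question; majority-answer agreement captures that.
        top = max(set(answers), key=answers.count)
        consistencies.append(answers.count(top) / trials)
    return {
        "mean_accuracy": statistics.mean(accuracies),
        "mean_consistency": statistics.mean(consistencies),
    }
```

A model can score well on mean accuracy while scoring poorly on mean consistency; it is that second number, under this framing, that makes the jaggedness visible.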

In enterprise settings, such inconsistency is not just a technical flaw; it can have operational and financial consequences. As Savarese noted, “For businesses, AI isn’t a casual pastime; it’s a mission-critical tool that requires unwavering predictability.”

To test agents in more grounded business contexts, Salesforce also unveiled CRMArena, a benchmarking framework that simulates realistic customer relationship management (CRM) tasks. The framework assesses AI performance across three professional personas (service agents, analysts, and managers), highlighting how agents handle complex workflows.

“Recognizing that current AI models often fall short in reflecting the intricate demands of enterprise environments, we’ve introduced CRMArena: a novel benchmarking framework meticulously designed to simulate realistic, professionally grounded CRM scenarios,” said Savarese.

Initial tests showed that even top agents struggled with CRM-relevant function calls, with success rates below 65% even under guided prompting. “The CRM arena essentially is a tool that’s been introduced internally for improving agents,” he added. “It allows us to stress test these agents, understand when they’re failing, and then use these lessons we learn from those failure cases to improve our agents.”
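
The article reports those sub-65% success rates without detailing CRMArena’s interface, but a minimal harness for that kind of measurement might look like the sketch below. The task schema, the `run_agent` stand-in, and the per-task checker are assumptions based on the three personas named above, not CRMArena’s actual API.

```python
from collections import defaultdict

# Hypothetical CRMArena-style evaluation: each task carries a persona
# (service agent, analyst, manager) and a checker that decides whether
# the agent's sequence of function calls solved it.
def run_agent(task: dict) -> list[dict]:
    raise NotImplementedError("plug in the agent under test")

def success_rate_by_persona(tasks: list[dict]) -> dict[str, float]:
    totals, wins = defaultdict(int), defaultdict(int)
    for task in tasks:
        calls = run_agent(task)     # e.g. [{"name": "get_case", "args": {...}}, ...]
        totals[task["persona"]] += 1
        if task["checker"](calls):  # task-specific success predicate
            wins[task["persona"]] += 1
    return {persona: wins[persona] / totals[persona] for persona in totals}
```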

To support more contextual understanding of business data, Salesforce introduced SFR-Embedding, a new embedding model that now leads the Massive Text Embedding Benchmark (MTEB) across 56 datasets. A specialized variant, SFR-Embedding-Code, was also released to enhance code search for developers. The largest version of the model (7 billion parameters) tops the Code Information Retrieval (CoIR) benchmark, while smaller 400M and 2B versions are designed to offer more efficient alternatives.

“SFR embedding is not just research. It’s coming to Data Cloud very, very soon,” Heinecke confirmed.
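
To illustrate what an embedding model buys in a code-search setting, here is a minimal retrieval sketch using the sentence-transformers library. The model ID is an assumption (Salesforce publishes SFR-Embedding checkpoints on Hugging Face, but verify the exact name and loading path against the model card before relying on it), and the snippets are toy data.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint name; confirm against the Hugging Face model card.
model = SentenceTransformer("Salesforce/SFR-Embedding-Mistral")

snippets = [
    "def parse_csv(path): ...",
    "def retry_with_backoff(fn, attempts=3): ...",
    "class LRUCache(dict): ...",
]
query = "function that retries a call with exponential backoff"

# Embed both sides and rank snippets by cosine similarity to the query.
snippet_emb = model.encode(snippets, normalize_embeddings=True)
query_emb = model.encode(query, normalize_embeddings=True)
scores = util.cos_sim(query_emb, snippet_emb)[0]

best = int(scores.argmax())
print(snippets[best], float(scores[best]))
```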

A related effort is Salesforce’s new xLAM V2 (Large Action Model) family—models trained not just to generate language, but to predict and execute the next action in enterprise workflows. Starting at just 1 billion parameters, these models are significantly smaller than typical LLMs yet designed specifically for task execution.

“What’s special about our xLAM models is that if you look at our model sizes, we’ve got a 1B model all the way up to a 70B model. That 1B model, for example, is a fraction of the size of many of today’s large language models,” said Heinecke. “This small model packs so much power into the ability to take the next action.”

Heinecke added that these models are built by fine-tuning standard LLMs on what Salesforce calls “action trajectories,” enabling more robust performance when integrated into enterprise systems.
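
In practice, “predicting the next action” usually surfaces as structured function calling. The sketch below shows how an xLAM checkpoint might be prompted to emit a tool call; the model ID, prompt format, and expected output shape are assumptions here, and the Hugging Face model cards document the exact format the released checkpoints expect.

```python
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name and prompt convention; see the model card.
model_id = "Salesforce/xLAM-1b-fc-r"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# A hypothetical CRM tool the model can choose to invoke.
tools = [{
    "name": "update_opportunity",
    "description": "Update a CRM opportunity's stage.",
    "parameters": {"opportunity_id": "string", "stage": "string"},
}]
query = "Move opportunity OPP-42 to 'Closed Won'."

messages = [{"role": "user", "content": json.dumps({"query": query, "tools": tools})}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=128)

# Expecting a JSON action such as
# {"tool_calls": [{"name": "update_opportunity", "arguments": {...}}]}
print(tokenizer.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```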

In parallel, Salesforce announced SFR-Guard, a suite of models focused on safety and compliance. Trained on both public and internal CRM data, these models are part of Salesforce’s “Trust Layer” that enforces behavioral guardrails for AI agents.

“Agentforce’s guardrails establish clear boundaries for agent behavior based on business needs, policies, and standards, ensuring agents act within predefined limits,” the company stated.
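
The article does not describe SFR-Guard’s interface, but a guardrail layer of this kind typically sits between the agent and the user, scoring each draft response against policy before it is released. Everything in the sketch below, including the classifier, labels, and threshold, is hypothetical.

```python
# Hypothetical guardrail wrapper; `policy_classifier` stands in for any
# model (such as an SFR-Guard checkpoint) that scores text against
# business policies. Interface and threshold are illustrative.
def policy_classifier(text: str) -> float:
    """Return the probability that `text` violates policy (stand-in)."""
    raise NotImplementedError("plug in a guard model here")

def guarded_reply(draft: str, threshold: float = 0.5) -> str:
    # Block responses the guard model flags, so the agent stays within
    # predefined limits rather than answering everything.
    if policy_classifier(draft) >= threshold:
        return "I can't help with that request."
    return draft
```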

Another addition is ContextualJudgeBench, a benchmark created to evaluate LLM-based judge models. It tests over 2,000 response pairs for four core criteria: accuracy, conciseness, faithfulness, and appropriate refusals.
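
Evaluating a judge model on pairwise data of that kind is straightforward to sketch: show the judge two responses in context and check its verdict against a human-preferred gold label per criterion. The prompt, schema, and field names below are illustrative assumptions, not ContextualJudgeBench’s published format.

```python
# Illustrative pairwise-judge evaluation in the spirit of
# ContextualJudgeBench; the data schema is an assumption.
def judge(context: str, resp_a: str, resp_b: str, criterion: str) -> str:
    """Ask an LLM judge which response is better; returns 'A' or 'B'."""
    raise NotImplementedError("plug in a judge model here")

def judge_accuracy(pairs: list[dict]) -> float:
    # Each pair carries a gold label saying which response humans
    # preferred for a criterion (faithfulness, conciseness, ...).
    correct = sum(
        judge(p["context"], p["response_a"], p["response_b"], p["criterion"]) == p["gold"]
        for p in pairs
    )
    return correct / len(pairs)
```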

For multimodal capabilities, Salesforce introduced TACO, a new family of models that combine visual and textual reasoning across multi-step tasks using a method called chains of thought-and-action (CoTA). On the MM-Vet benchmark, Salesforce reported a 20% performance improvement using this approach.
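
Sketched from that description, a CoTA loop alternates free-form “thoughts” with executable “actions” (calling tools such as OCR or a calculator) until the model commits to an answer. The step format and tool set below are assumptions, not TACO’s actual specification.

```python
# Minimal chain of thought-and-action (CoTA) loop, assumed structure.
def model_step(history: list[dict]) -> dict:
    raise NotImplementedError("multimodal model call goes here")

def run_cota(image, question: str, tools: dict, max_steps: int = 8):
    history = [{"role": "user", "image": image, "content": question}]
    for _ in range(max_steps):
        step = model_step(history)  # {"type": "thought"|"action"|"answer", ...}
        history.append(step)
        if step["type"] == "answer":
            return step["content"]
        if step["type"] == "action":
            # Execute the requested tool and feed the result back in,
            # so later reasoning steps can build on grounded outputs.
            result = tools[step["name"]](**step["args"])
            history.append({"role": "tool", "content": result})
    return None
```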

“When we’re talking to customers, one of the main pain points that we have is that when dealing with enterprise data, there’s a very low tolerance to actually provide answers that are not accurate and that are not relevant,” said Itai Asseo, Senior Director of Incubation and Brand Strategy at AI Research. “We’ve made a lot of progress, whether it’s with reasoning engines, with RAG techniques and other methods around LLMs.”

He cited specific improvements through customer co-innovation projects: “When we applied the Atlas reasoning engine, including some advanced techniques for retrieval augmented generation, coupled with our reasoning and agentic loop methodology and architecture, we were seeing accuracy that was twice as much as customers were able to do when working with kind of other major competitors of ours.”

The developments were released as part of Salesforce’s inaugural AI Research in Review report, a quarterly roundup of foundational advances and practical innovations. Savarese described these outputs as “boring breakthroughs”—models and frameworks that are reliable and quietly essential for AI systems to work at scale in business.

“At Salesforce, we call these ‘boring breakthroughs’ not because they’re unremarkable, but because they’re quietly capable, reliably scalable, and built to endure,” he said. “They’re so seamless, some might take them for granted.”

In an era when AI models are rapidly growing in size and complexity, Salesforce is taking a different route: focusing on performance that holds up across real-world variability. Its approach shifts attention from record-setting benchmarks to applied consistency, especially in CRM and other enterprise systems.

As Savarese put it: “It’s not about replacing humans. It’s about being in charge.”
