
๐Ÿ—๏ธ AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise


A systematic evaluation framework for agentic AI systems across diverse architectural configurations and enterprise use cases.


🌟 Overview

AgentArch provides empirical insights into how different design dimensions interact within complex multi-agent systems. This benchmark evaluates 18 distinct agentic configurations across state-of-the-art large language models, examining four critical system dimensions:

🎯 Orchestration Strategy

Single-agent vs. multi-agent systems

⚙️ Agent Implementation

ReAct vs. function calling approaches

🧠 Memory Architecture

Complete vs. summarized memory management

🔧 Thinking Tool Integration

Mathematical reasoning and information synthesis tools


๐Ÿ” Key Findings

TL;DR: No one-size-fits-all solution exists for enterprise agentic systems

Finding Impact ๐Ÿ“Š
No Universal Architecture Models demonstrate significant architectural preferences that vary by use case complexity ๐ŸŽฏ
Performance Gaps Even top models achieve only 35.3% success on complex enterprise tasks and 70.8% on simpler workflows ๐Ÿ“‰
Multi-Agent ReAct Limitations Consistent underperformance across all models in multi-agent ReAct configurations โš ๏ธ
Reliability Challenges Pass^K scores peak at only 6.34%, indicating fundamental gaps for production deployment ๐Ÿšจ

🚀 Quick Start

Installation

```bash
# Clone the repository
git clone https://github.com/ServiceNow/AgentArch.git
cd AgentArch

# Install dependencies
pip install -r requirements.txt

# Set up environment
cp .env.example .env
# 🔑 Replace placeholders with real API keys and endpoints
```

Run Your First Evaluation

```bash
python -m src.run \
  --mode single_agent \
  --usecase requesting_time_off \
  --model claude_sonnet_4 \
  --agent_type function_calling \
  --project test \
  --debug
```
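
To sweep several configurations at once, something along the lines of the following works. This is a minimal sketch assuming the CLI flags shown above; the extra `--agent_type` and `--usecase` values are guesses based on the paper's terminology and the config filenames, not documented options.

```python
# Hypothetical configuration sweep; values other than those demonstrated
# above are assumptions, not documented flag options.
import itertools
import subprocess

MODES = ["single_agent"]                            # multi-agent mode names would go here
AGENT_TYPES = ["function_calling", "react"]         # "react" assumed from the paper's terminology
USECASES = ["requesting_time_off", "triage_cases"]  # ids guessed from configs/use_case_configs/

for mode, agent_type, usecase in itertools.product(MODES, AGENT_TYPES, USECASES):
    subprocess.run(
        ["python", "-m", "src.run",
         "--mode", mode,
         "--usecase", usecase,
         "--model", "claude_sonnet_4",
         "--agent_type", agent_type,
         "--project", "sweep"],
        check=True,
    )
```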

๐Ÿ“ Repository Structure

AgentArch/
โ”œโ”€โ”€ ๐Ÿ“ configs/
โ”‚   โ”œโ”€โ”€ ๐Ÿ”ง mocked_data/
โ”‚   โ”‚   โ”œโ”€โ”€ requesting_time_off_mocked_tool_calls.json
โ”‚   โ”‚   โ””โ”€โ”€ triage_cases_mocked_tool_calls.json
โ”‚   โ”œโ”€โ”€ โš™๏ธ use_case_configs/
โ”‚   โ”‚   โ”œโ”€โ”€ requesting_time_off.yaml
โ”‚   โ”‚   โ”œโ”€โ”€ triage_cases.yaml
โ”‚   โ”œโ”€โ”€ โš™๐Ÿ“œ prompts.yaml
โ”œโ”€โ”€ ๐Ÿ“ src/
โ”‚   โ”œโ”€โ”€ ๐Ÿ› ๏ธ tools/            
โ”‚   โ”œโ”€โ”€ ๐Ÿ”ง utils/
โ”‚   โ”œโ”€โ”€ ๐Ÿค– agent.py     
โ”‚   โ”œโ”€โ”€ ๐Ÿ“Š metrics.py    
โ”‚   โ””โ”€โ”€ โ–ถ๏ธ run.py  # Main execution script
โ”œโ”€โ”€ ๐Ÿ“„ .env.example  
โ”œโ”€โ”€ ๐Ÿ“„ .gitignore
โ”œโ”€โ”€ ๐Ÿ“„ LICENSE
โ””โ”€โ”€ ๐Ÿ“„ requirements.txt

๐Ÿข Enterprise Use Cases

1. ๐Ÿ“… Requesting Time Off (TO) - Simple Workflow

Aspect Details
๐ŸŽฏ Complexity Basic multi-step reasoning with clear success criteria
๐Ÿ› ๏ธ Tools 8 custom enterprise tools
๐Ÿค– Agents 3 specialized agents
๐Ÿ’ก Challenges Date calculations, leave balance verification, policy compliance

2. ๐ŸŽซ Customer Request Routing (CR) - Complex Workflow

Aspect Details
๐ŸŽฏ Complexity Intelligent classification and escalation decisions
๐Ÿ› ๏ธ Tools 31 custom enterprise tools
๐Ÿค– Agents 9 specialized agents
๐Ÿ’ก Challenges Ambiguous request handling, context preservation, routing logic

🤖 Evaluated Models

| Provider | Models | Status |
| --- | --- | --- |
| OpenAI | GPT-4.1, GPT-4o, GPT-4.1-mini, o3-mini | ✅ |
| Meta | LLaMA 3.3 70B | ✅ |
| Anthropic | Claude Sonnet 4 | ✅ |

*The framework also supports evaluating Gemini-family and Qwen-family models.


๐Ÿ—๏ธ Architectural Dimensions

🎭 Orchestration Strategies

1. 🎪 Orchestrator-led, Isolated Agents

Centralized task assignment with mediated communication


2. ๐ŸŒ Orchestrator-led, Open Network

Initial task assignment with direct agent-to-agent communication


3. 🤖 Single Agent

Unified agent with access to all tools
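
The practical difference between the two orchestrator-led topologies is who is allowed to route messages. A toy sketch of that distinction, with hypothetical class names rather than AgentArch's actual components:

```python
# Toy contrast between mediated and direct agent communication.
# Agent and Orchestrator are illustrative stand-ins, not the repo's classes.
class Agent:
    def __init__(self, name, tools):
        self.name, self.tools = name, tools

    def run(self, task, peers):
        # In an open network, `peers` lets this agent message other agents
        # directly; in the isolated setting it is empty, so every reply is
        # routed back through the orchestrator.
        return f"{self.name} handled: {task}"

class Orchestrator:
    def __init__(self, agents, open_network=False):
        self.agents, self.open_network = agents, open_network

    def dispatch(self, task, agent_name):
        peers = self.agents if self.open_network else {}
        return self.agents[agent_name].run(task, peers)
```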

🎨 Agent Styles

📞 Function Calling

Direct tool selection using native model capabilities

🧠 ReAct

Structured reasoning-action framework with explicit thought processes
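
For readers unfamiliar with the pattern, a minimal ReAct-style loop is sketched below. This is illustrative only, not AgentArch's implementation; `llm` is a hypothetical completion callable and the `Action:` line format is an assumption.

```python
import json

def parse_action(step: str):
    # Assumes the model emits a line like: Action: tool_name {"arg": "value"}
    line = next(l for l in step.splitlines() if l.startswith("Action:"))
    name, _, args = line[len("Action:"):].strip().partition(" ")
    return name, json.loads(args or "{}")

def react_loop(llm, tools, task, max_steps=10):
    transcript = f"Task: {task}"
    for _ in range(max_steps):
        step = llm(transcript + "\nThought:")  # model interleaves reasoning and actions
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        name, args = parse_action(step)
        observation = tools[name](**args)      # execute the selected tool
        transcript += f"\nThought:{step}\nObservation: {observation}"
    return None
```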

💾 Memory Management

📚 Complete Memory

Full visibility into all previous tool calls and responses

📝 Summarized Memory

Condensed information sharing to manage context length
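
Conceptually, the two strategies differ only in how much of the interaction history survives into the next step. A minimal sketch, assuming a chat-style message list and a caller-supplied summarizer (neither is AgentArch's actual interface):

```python
def complete_memory(history):
    # Complete memory: every prior tool call and response stays in context.
    return history

def summarized_memory(history, summarize):
    # Summarized memory: condense older turns to bound context length,
    # keeping only the most recent turn verbatim. `summarize` is a
    # hypothetical LLM- or rule-based summarizer supplied by the caller.
    if len(history) <= 1:
        return history
    summary = summarize(history[:-1])
    return [{"role": "system", "content": f"Summary of earlier steps: {summary}"},
            history[-1]]
```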

🧮 Thinking Tools

➕ Math Tool

Structured mathematical reasoning and calculations

🔍 Synthesis Tool

Information organization and analysis capabilities
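
To make the idea concrete, here is a sketch of what a math thinking tool could look like as a function-calling tool. The schema and evaluator below are illustrative assumptions; the actual definitions live in src/tools/ and may differ.

```python
import ast
import operator

# Hypothetical function-calling schema for a math thinking tool.
MATH_TOOL_SCHEMA = {
    "name": "math_tool",
    "description": "Structured arithmetic, e.g. day counts or leave-balance math.",
    "parameters": {
        "type": "object",
        "properties": {"expression": {"type": "string"}},
        "required": ["expression"],
    },
}

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def run_math_tool(expression: str) -> float:
    """Evaluate a basic arithmetic expression without using eval()."""
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval").body)
```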


📊 Evaluation Metrics

🎯 Primary Metric: Acceptable Score

Success requires simultaneous achievement of:

  • ✅ Correct tool selection
  • ✅ Accurate tool arguments (100% accuracy required)
  • ✅ Correct final decision
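
In code, the metric is a strict conjunction of the three checks above. A sketch with hypothetical field names (not the schema used in src/metrics.py):

```python
def is_acceptable(trial: dict) -> bool:
    # All three criteria must hold simultaneously; any miss fails the trial.
    return (trial["correct_tools_selected"]       # correct tool selection
            and trial["arguments_exact_match"]    # 100% argument accuracy
            and trial["final_decision_correct"])  # correct final decision
```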

🔄 Reliability Metrics

  • Pass@1: average single-trial success rate, measured over k=8 trials
  • Pass^K: probability of all k trials succeeding
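
The two estimates relate as follows, given the boolean outcomes of k=8 repeated trials on a single task (averaging per-task values into a benchmark-level score is an assumption about the aggregation):

```python
def pass_at_1(outcomes):
    # Pass@1: fraction of individual trials that succeeded.
    return sum(outcomes) / len(outcomes)

def pass_hat_k(outcomes):
    # Pass^K: 1.0 only when every one of the k trials succeeded, else 0.0;
    # a benchmark-level Pass^K would average this across tasks.
    return float(all(outcomes))

outcomes = [True, False, True, True, False, True, True, False]  # illustrative data
print(pass_at_1(outcomes))   # 0.625
print(pass_hat_k(outcomes))  # 0.0
```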

📈 Behavioral Metrics

  • 🚫 Hallucination rates (non-existent tool/agent selection)
  • 🔄 Tool repetition rates
  • ❌ Missing required tools

💡 Key Recommendations

👨‍💼 For Practitioners

| Recommendation | Rationale |
| --- | --- |
| ❌ Avoid Multi-Agent ReAct | Poor performance across all tested models |
| ✅ Use Multi-Agent for Final Decisions | Higher accuracy in decision-making despite tool selection challenges |
| 🎯 Model-Specific Architectures | Test multiple configurations rather than assuming universal optima |
| 🧮 Thinking Tools for Non-Reasoning Models | Significant performance improvements on calculation-heavy tasks |

🔬 For Researchers

| Focus Area | Insight |
| --- | --- |
| 🔄 Architecture-Use Case Interaction | Models perform optimally under different architectures depending on task complexity |
| ⚖️ Reliability vs. Performance | Consider both accuracy and consistency for enterprise deployment |
| 💾 Memory Management Impact | Minimal performance differences between complete and summarized memory |

📚 Citation

```bibtex
@misc{bogavelli2025agentarchcomprehensivebenchmarkevaluate,
      title={AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise},
      author={Tara Bogavelli and Roshnee Sharma and Hari Subramani},
      year={2025},
      eprint={2509.10769},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2509.10769},
}
```

📄 License

AgentArch is licensed under the Apache 2.0 License.

📞 Contact

For questions or collaboration opportunities:

Email


โญ If this project helps your research, please consider giving it a star! โญ
