A systematic evaluation framework for agentic AI systems across diverse architectural configurations and enterprise use cases.
AgentArch provides empirical insights into how different design dimensions interact within complex multi-agent systems. This benchmark evaluates 18 distinct agentic configurations across state-of-the-art large language models, examining four critical system dimensions:
- Single-agent vs. multi-agent systems
- ReAct vs. function calling approaches
- Complete vs. summarized memory management
- Mathematical reasoning and information synthesis tools
**TL;DR:** No one-size-fits-all solution exists for enterprise agentic systems.
| Finding | Impact |
|---|---|
| No Universal Architecture | Models demonstrate significant architectural preferences that vary by use case complexity |
| Performance Gaps | Even top models achieve only 35.3% success on complex enterprise tasks and 70.8% on simpler workflows |
| Multi-Agent ReAct Limitations | Consistent underperformance across all models in multi-agent ReAct configurations |
| Reliability Challenges | Pass^K scores peak at only 6.34%, indicating fundamental gaps for production deployment |
```bash
# Clone the repository
git clone https://github.com/ServiceNow/AgentArch.git
cd AgentArch

# Install dependencies
pip install -r requirements.txt

# Set up environment
cp .env.example .env
# Replace the placeholders in .env with real API keys and endpoints
```
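A purely illustrative sketch of what the filled-in `.env` might contain (these variable names are placeholders; use the ones actually listed in `.env.example`):

```bash
# Illustrative only -- copy the real variable names from .env.example
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
LLM_ENDPOINT=https://your-endpoint.example.com
```

With the environment configured, run a sample single-agent evaluation: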
```bash
python -m src.run \
    --mode single_agent \
    --usecase requesting_time_off \
    --model claude_sonnet_4 \
    --agent_type function_calling \
    --project test \
    --debug
```
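The same entry point drives the other configurations; for example, a multi-agent ReAct run might look like the following (these flag values are inferred from the dimensions above, so run `python -m src.run --help` for the exact accepted choices):

```bash
# Flag values below are assumptions based on the documented dimensions
python -m src.run \
    --mode multi_agent \
    --usecase triage_cases \
    --model gpt_4_1 \
    --agent_type react \
    --project test
```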
```
AgentArch/
├── configs/
│   ├── mocked_data/
│   │   ├── requesting_time_off_mocked_tool_calls.json
│   │   └── triage_cases_mocked_tool_calls.json
│   ├── use_case_configs/
│   │   ├── requesting_time_off.yaml
│   │   └── triage_cases.yaml
│   └── prompts.yaml
├── src/
│   ├── tools/
│   ├── utils/
│   ├── agent.py
│   ├── metrics.py
│   └── run.py          # Main execution script
├── .env.example
├── .gitignore
├── LICENSE
└── requirements.txt
```
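The JSON files under `configs/mocked_data/` appear to provide canned tool responses so that runs do not depend on live enterprise systems. A hypothetical sketch of a single entry, in Python dict form (the real schema is whatever those JSON files define):

```python
# Hypothetical shape of one mocked tool call -- illustrative only;
# the actual schema lives in configs/mocked_data/*.json
mocked_call = {
    "tool": "get_leave_balance",               # assumed tool name
    "arguments": {"employee_id": "E123"},      # assumed argument
    "response": {"vacation_days_remaining": 12},
}
```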
**Use Case 1: Requesting Time Off**

| Aspect | Details |
|---|---|
| Complexity | Basic multi-step reasoning with clear success criteria |
| Tools | 8 custom enterprise tools |
| Agents | 3 specialized agents |
| Challenges | Date calculations, leave balance verification, policy compliance |
**Use Case 2: Triage Cases**

| Aspect | Details |
|---|---|
| Complexity | Intelligent classification and escalation decisions |
| Tools | 31 custom enterprise tools |
| Agents | 9 specialized agents |
| Challenges | Ambiguous request handling, context preservation, routing logic |
| Provider | Models | Status |
|---|---|---|
| OpenAI | GPT-4.1, GPT-4o, GPT-4.1-mini, o3-mini | ✅ |
| Meta | LLaMA 3.3 70B | ✅ |
| Anthropic | Claude Sonnet 4 | ✅ |
*The framework also includes support for evaluating Gemini- and Qwen-family models.
- **Multi-agent (orchestrated):** centralized task assignment with mediated communication
- **Multi-agent (decentralized):** initial task assignment with direct agent-to-agent communication
- **Single agent:** a unified agent with access to all tools

| Configuration | Description |
|---|---|
| Function calling | Direct tool selection using native model capabilities |
| ReAct | Structured reasoning-action framework with explicit thought processes |
| Complete memory | Full visibility into all previous tool calls and responses |
| Summarized memory | Condensed information sharing to manage context length |
| Mathematical reasoning tool | Structured mathematical reasoning and calculations |
| Information synthesis tool | Information organization and analysis capabilities |
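To make the ReAct vs. function-calling distinction concrete, here is a minimal sketch of one ReAct iteration (the `llm` and `tools` objects are hypothetical stand-ins, not AgentArch's actual interfaces):

```python
def react_step(llm, tools: dict, history: list[str], task: str) -> str:
    """One ReAct iteration: the model writes an explicit Thought, picks an
    Action (a tool call), and the resulting Observation is appended to the
    history. Native function calling skips the written Thought and returns
    a structured tool call directly."""
    prompt = task + "\n" + "\n".join(history) + "\nThought:"
    thought, action, args = llm.generate(prompt)   # hypothetical API
    observation = tools[action](**args)            # execute the chosen tool
    history += [
        f"Thought: {thought}",
        f"Action: {action}({args})",
        f"Observation: {observation}",
    ]
    return observation
```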
Success requires simultaneous achievement of:

- ✅ Correct tool selection
- ✅ Accurate tool arguments (100% accuracy required)
- ✅ Correct final decision
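A minimal sketch of this all-or-nothing scoring (the field names are hypothetical, not AgentArch's actual schema):

```python
def trial_succeeded(trial: dict) -> bool:
    """A trial passes only if every criterion holds simultaneously."""
    return (
        trial["tools_called"] == trial["expected_tools"]           # correct tool selection
        and trial["tool_args"] == trial["expected_args"]           # exact argument match
        and trial["final_decision"] == trial["expected_decision"]  # correct final decision
    )
```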
- **Pass@1:** per-trial success rate, averaged over k=8 trials
- **Pass^K:** probability that all k trials succeed (see the sketch after this list)
- Hallucination rate (selection of non-existent tools or agents)
- Tool repetition rate
- Missing required tools
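As a sketch of how these aggregates can be estimated from n trials with c successes, using a standard combinatorial estimator for Pass^K (AgentArch's exact computation may differ):

```python
from math import comb

def pass_at_1(successes: int, trials: int) -> float:
    """Per-trial success rate, averaged over all trials."""
    return successes / trials

def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimated probability that k independent trials ALL succeed:
    C(successes, k) / C(trials, k). With k equal to the trial count,
    a single failure drives this to zero, which is why Pass^K scores
    are so punishing."""
    return comb(successes, k) / comb(trials, k)

print(pass_at_1(6, 8))      # 0.75
print(pass_hat_k(6, 8, 8))  # 0.0 -- one failed trial zeroes Pass^8
```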
| Recommendation | Rationale |
|---|---|
| ❌ Avoid Multi-Agent ReAct | Poor performance across all tested models |
| ✅ Use Multi-Agent for Final Decisions | Higher accuracy in decision-making despite tool-selection challenges |
| Test Model-Specific Architectures | Test multiple configurations rather than assuming universal optima |
| Add Thinking Tools for Non-Reasoning Models | Significant performance improvements on calculation-heavy tasks |
| Focus Area | Insight |
|---|---|
| Architecture-Use Case Interaction | Models perform optimally under different architectures depending on task complexity |
| Reliability vs. Performance | Consider both accuracy and consistency for enterprise deployment |
| Memory Management Impact | Minimal performance differences between complete and summarized memory |
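To illustrate the memory dimension concretely, here is a minimal sketch of the two strategies (the function names and the `summarize` helper are hypothetical, not AgentArch's actual implementation):

```python
def complete_memory(steps: list[dict]) -> str:
    """Complete memory: every prior tool call with its raw response."""
    return "\n".join(f"{s['tool']}({s['args']}) -> {s['response']}" for s in steps)

def summarized_memory(steps: list[dict], summarize) -> str:
    """Summarized memory: raw responses replaced by short summaries to keep
    the context window small. `summarize` is a hypothetical LLM-backed helper."""
    return "\n".join(f"{s['tool']}: {summarize(s['response'])}" for s in steps)
```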
If you use AgentArch in your research, please cite:

```bibtex
@misc{bogavelli2025agentarchcomprehensivebenchmarkevaluate,
  title={AgentArch: A Comprehensive Benchmark to Evaluate Agent Architectures in Enterprise},
  author={Tara Bogavelli and Roshnee Sharma and Hari Subramani},
  year={2025},
  eprint={2509.10769},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2509.10769},
}
```
AgentArch is licensed under the Apache 2.0 License.
For questions or collaboration opportunities:
⭐ If this project helps your research, please consider giving it a star! ⭐