The Invisible Scaffolding: Why EVALS Are the Operating System for Agentic AI
- Ajay Behuria

- Aug 21
- 15 min read
Updated: Sep 10
Introduction: The Unfolding of Autonomy
The last few years in the technology sector have felt like the first act of a grand play. It was the era of dazzling generative AI, where we marveled at its ability to conjure prose, compose music, and create images with unprecedented speed and scale. This period was defined by the transition from a human-driven creative process to one augmented by a powerful, passive tool that could instantly produce new content. Yet, this was only the prelude. The second act has now begun, and it is a fundamental shift from a passive tool that generates content to an autonomous system that takes real-world action. This is the unfolding of Agentic AI.
Agentic AI marks a new paradigm where a system is capable of operating independently, making decisions, and interacting with its environment without constant human supervision. Consider the difference: a marketing team might use generative AI to create a single SEO-optimized blog post. The next evolution, an agentic AI, will autonomously research, draft, optimize, and publish that blog post, automatically updating the client's content management system and scheduling social media promotions in a single, end-to-end workflow. The core conceptual leap is a transition from output creation to outcome delivery. A business does not simply purchase a blog post; it invests in a strategic outcome — an increase in organic traffic. This necessitates a new evaluation framework that assesses the success, safety, and efficiency of the entire, multi-step process, not just the final deliverable.
This inherent autonomy, while immensely powerful, introduces a new, critical risk: unpredictability. An agent's ability to reason, plan, and execute a sequence of actions means that its entire decision-making process must be understood and governed. The traditional methods of evaluation, designed for static, single-turn models, are rendered obsolete. This is the central challenge that a new class of evaluation systems, known as EVALS, must solve.
EVALS, in this context, are not a simple quality assurance chore. They are the foundational "operating system" for this new class of AI — the invisible scaffolding that ensures trust, safety, and business value by providing a framework for continuous governance and accountability. This report explores this new calculus of evaluation, the strategic decisions facing technology leaders, the market of tools emerging to meet this need, and the vision for what the future holds for this critical, invisible scaffolding.
Part I: From Prompts to Pathways—The New Calculus of Evaluation
The Agentic Shift Defined
Agentic AI distinguishes itself from its generative predecessors through a set of core characteristics that enable it to transition from advisor to operator. These include autonomy, the capacity to operate without constant human oversight; decision-making, the ability to reason and select appropriate actions from a set of options; tool use, the capability to interact with external APIs and systems to extend its functionality; and multi-step planning, the process of decomposing a high-level goal into a series of actionable, ordered tasks.
These capabilities elevate AI from a content-creation engine to a fully operational system. For instance, in a healthcare setting, traditional AI might be used to analyze medical images to assist in diagnosis. An agentic AI, in contrast, can be integrated into a system like Propeller Health's smart inhaler, which uses real-time data on medication usage and air quality to automatically track patient patterns and alert healthcare providers when necessary. Similarly, a cybersecurity agent moves beyond simply identifying threats to actively containing them by isolating compromised devices or shutting down affected services in real-time, all without requiring manual intervention.
The conceptual leap here is profound: a transition from passive output to a goal-oriented outcome. The intelligence is no longer static; it is dynamic and adaptive. A self-driving vehicle, for example, must continuously analyze its surroundings, making decisions on acceleration or braking based on real-time sensor data from radar, lidar, and cameras. It must handle shifting conditions and adapt to unpredictable edge cases, improving over time with new inputs. This necessitates an evaluation framework that can account for the entire, complex execution path, not just the final result.
Beyond Simple Accuracy: Why Traditional Metrics Fail
The dynamic and autonomous nature of agentic systems renders traditional, static evaluation metrics fundamentally insufficient. A simple pass/fail metric or a binary accuracy score, which might suffice for a static task like image classification, is meaningless for an agent that performs a multi-step workflow involving thousands of decisions and actions. The failure of traditional metrics is not a technical oversight but a direct consequence of this new architectural complexity.
Agentic systems introduce a new set of challenges that invalidate old evaluation paradigms:
Multi-step Logic and Planning Complexity: The true value of an agent lies in its ability to break down a high-level goal into a sequence of smaller, manageable tasks. This creates a multi-step execution path where a failure can occur at any point, from an incorrect initial plan to a single incorrect tool call. Evaluating only the final output provides no insight into where and why the system failed.
Tool and API Dependencies: Agents often rely on external tools, APIs, and third-party systems to complete their tasks. An agent's failure to complete a task could be the result of a faulty or slow API, not a flaw in the agent’s own reasoning. Traditional metrics cannot distinguish between a model's internal error and an external system's failure.
Dynamic and Unpredictable Environments: An agent's performance may change under shifting conditions. An agent designed to manage a smart grid must adapt to real-time data on energy consumption, while a traffic management agent must synthesize live data from GPS and road cameras to adjust timing on the fly. The system must be evaluated not on a static dataset but on its ability to perform robustly in a continuously changing, real-world environment.
In short, an agent is no longer a black box to be benchmarked; it is a transparent, auditable system whose entire execution path must be evaluated. This transforms evaluation into a form of observability and forensic analysis, requiring a deep understanding of the agent's internal reasoning and external interactions.
The Multi-Dimensional Framework for Agentic EVALS
To address these new complexities, a new, multi-dimensional evaluation framework is essential. This framework moves beyond simple accuracy to encompass a broader spectrum of performance, safety, and business-critical dimensions. It provides the foundation for deep diagnostics and continuous improvement.
Key dimensions of this new calculus include:
Effectiveness and Accuracy: This goes beyond simple task completion to include granular metrics that assess the quality of the agent's decisions. These include step-level accuracy (measuring whether the agent takes the correct action at each point in a workflow), tool/action selection accuracy (evaluating whether the agent chooses the right API for a given task), and the hallucination rate (measuring how often the agent generates factually incorrect or nonsensical information). A minimal code sketch of these metrics appears just before Table 1.
Efficiency and Performance: This dimension directly links technical performance to financial impact. Key metrics are latency (including time to first response and reflection latency), throughput (the number of tasks completed per unit of time), and the all-important cost per task (tracking resource usage, such as tokens consumed and compute time).
Security and Safety: An agent's autonomy introduces new security vulnerabilities that go beyond traditional threat modeling. This requires evaluating for risks like goal misalignment, where an agent's objective diverges from its intended purpose, and adversarial attacks, such as data poisoning or evasion attacks, that can manipulate the agent's behavior.
Robustness and Stability: The ability of an agent to maintain consistent performance over time, even under dynamic conditions, is critical. Key metrics include consistency across executions and the execution error rate. This ensures that the agent's planning and execution do not fail under pressure.
The proliferation of these new, granular metrics indicates a fundamental shift towards system-level observability. An agent is no longer a monolithic entity; it is a series of auditable decisions and actions. The new evaluation framework is a direct reflection of this need, enabling deep diagnostics and providing the foundation for enterprise trust.
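To make these dimensions concrete, the following is a minimal, framework-agnostic sketch of how step-level accuracy, tool-selection accuracy, and cost per task might be computed from a recorded agent trace. The trace schema (fields such as chosen_tool, expected_tool, tokens_used) and the per-token price are illustrative assumptions, not the API of any particular evaluation platform.

```python
from dataclasses import dataclass

@dataclass
class Step:
    # One recorded step of an agent run; the field names are illustrative
    # assumptions, not the trace schema of any particular framework.
    chosen_tool: str
    expected_tool: str
    output_correct: bool
    tokens_used: int

def evaluate_trace(steps: list[Step], usd_per_token: float = 2e-6) -> dict:
    """Aggregate step-level metrics for a single agent run."""
    total = len(steps)
    return {
        "step_level_accuracy": sum(s.output_correct for s in steps) / total,
        "tool_selection_accuracy": sum(s.chosen_tool == s.expected_tool for s in steps) / total,
        "cost_per_task_usd": sum(s.tokens_used for s in steps) * usd_per_token,
    }

# Example: a three-step workflow in which one tool call hit the wrong API.
trace = [
    Step(chosen_tool="search_api", expected_tool="search_api", output_correct=True, tokens_used=1200),
    Step(chosen_tool="crm_api", expected_tool="email_api", output_correct=False, tokens_used=900),
    Step(chosen_tool="email_api", expected_tool="email_api", output_correct=True, tokens_used=600),
]
print(evaluate_trace(trace))
```

In practice, these per-run numbers would be aggregated across many executions to also track consistency across executions and error rates, the robustness dimension described above.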
Table 1: The New Calculus of AI Evaluation
| Dimension | Traditional GenAI Metrics | Agentic AI Metrics |
| --- | --- | --- |
| Accuracy | BLEU, ROUGE, F1 Score | Task Success Rate, Step-level Accuracy, Tool/Action Selection Accuracy, Hallucination Rate, Precision, Recall |
| Performance | Latency per query | Time to First Response, Reflection Latency, Throughput |
| Cost | Token cost per output | Cost per Task (including compute, API calls) |
| Resilience | N/A | Consistency Across Executions, Execution Error Rate, Recovery Rate |
| Security | PII leakage, Toxicity | Goal Misalignment, Adversarial Evasion, Data Poisoning |
| Compliance | N/A | Policy Adherence, Auditable Reasoning Paths |
Part II: The Architect's Dilemma—A Tradeoff Analysis for the Autonomous Age
The Strategic Fork in the Road
Technology leaders face a core dilemma: how to balance competing priorities like performance, security, and cost when building a new system, given that the resources for building and maintaining it are finite. This challenge becomes even more acute with agentic systems, where a decision to enhance one quality attribute — such as security — could introduce significant costs or degrade performance. To navigate this complexity, a leader must choose between, or combine, two distinct but complementary strategic approaches to architectural evaluation.
The Quality-First Blueprint
One strategic approach focuses on a comprehensive, risk-centric, and technical deep dive into the system's architecture. This framework evaluates an architecture based on its non-functional requirements, often referred to as "ilities". The process mirrors the Architecture Tradeoff Analysis Method (ATAM), a well-established method for architectural evaluation, focusing on identifying architectural risks and clarifying quality attribute requirements early in the lifecycle.
The process follows a clear, logical progression:
Identify Business Drivers: The evaluation begins by understanding the core business goals and constraints that are motivating the development effort. These drivers articulate what is most important for the system's success, such as high availability, time to market, or enhanced security.
Define a Utility Tree: These business drivers are then translated into a prioritized set of "ilities" or quality attributes. A hierarchical model, the "utility tree," breaks these high-level attributes down into concrete, testable scenarios. For example, a high-level goal of "usability" might be broken down into scenarios related to the speed of response or the ease of use for a specific user role. A minimal sketch of such a tree appears at the end of this subsection.
Analyze Against Scenarios: The core of the analysis involves using real-world scenarios to test how the proposed architecture performs against the prioritized quality attributes. A security scenario might simulate a data poisoning attack on the agent's training data, while a performance scenario might involve a load test to determine latency under stress.
Identify Tradeoffs and Risks: The primary value of this approach lies in its ability to identify not only architectural risks but also the inherent tradeoffs between different quality attributes. The analysis reveals where different attributes conflict, such as when a decision to increase security by adding layers of encryption negatively impacts performance and latency. The resulting insights clarify requirements and increase communication among stakeholders, identifying potential issues before they become costly to fix later in the development cycle.
This approach is a risk-centric, technical deep dive. Its primary value is in identifying architectural weaknesses and interdependencies early in the lifecycle, providing the blueprint for a resilient and sound system.
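As a rough illustration of the utility-tree step, here is a minimal sketch of how prioritized quality-attribute scenarios might be captured and filtered for deeper analysis. The attributes, scenarios, and the High/Medium/Low rating convention are illustrative assumptions in the spirit of ATAM-style evaluations, not a prescribed schema.

```python
# Each scenario is rated on business importance and technical difficulty ("H"/"M"/"L").
utility_tree = {
    "Performance": {
        "Latency": [
            ("Agent returns its first token within 2 s under 100 concurrent users", "H", "M"),
        ],
    },
    "Security": {
        "Data integrity": [
            ("Agent detects and quarantines poisoned training data within 24 h", "H", "H"),
        ],
    },
    "Usability": {
        "Operator workflow": [
            ("A support analyst can override an agent decision in under three clicks", "M", "L"),
        ],
    },
}

def high_priority_scenarios(tree):
    """Yield scenarios rated High on both importance and difficulty:
    the ones that deserve the deepest architectural analysis."""
    for attribute, refinements in tree.items():
        for refinement, scenarios in refinements.items():
            for text, importance, difficulty in scenarios:
                if importance == "H" and difficulty == "H":
                    yield attribute, refinement, text

for hit in high_priority_scenarios(utility_tree):
    print(hit)
```

Scenarios that score high on both dimensions are the natural starting point for the scenario analysis and tradeoff identification steps that follow.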
The Economic ROI Model
A second, complementary strategic framework focuses on a financial cost-benefit analysis of architectural decisions. This approach, in the spirit of the Cost Benefit Analysis Method (CBAM) and designed to maximize return on investment (ROI), shifts the conversation from technical elegance to bottom-line impact. The process guides technology leaders through a systematic way of weighing potential benefits against real development costs, risks, and schedule implications.
The process involves a series of quantified steps:
Choose Scenarios and Architectural Strategies: The first step is to identify specific agentic capabilities that are highest on the priority list and the architectural strategies that could be used to implement them. For instance, a high-reliability goal could be achieved with a strategy of using redundant hardware or with a different strategy of using checkpointing and recovery mechanisms.
Quantify Benefits: This step involves assigning a monetary value to the benefits that each architectural strategy is expected to deliver. These benefits can be tangible, such as reduced operational costs or increased revenue from a new feature, or intangible, like the value of mitigated risk or improved user satisfaction.
Quantify Costs: All costs associated with a given strategy are tallied. This goes beyond initial development costs to include ongoing maintenance, infrastructure, and even the operational cost of the AI model, such as the number of tokens consumed per task.
Calculate Desirability: The final step compares the quantified costs and benefits over time to determine the net present value (NPV) and overall ROI of each architectural option. This helps leaders make data-driven investment decisions and forces a consideration of long-term maintenance costs that are often overlooked (a worked sketch closes this subsection).
This approach is a business-centric, financial justification. It enables leaders to explore the effects of different options using economic software models, ensuring that finite resources are allocated to maximize gains, meet schedules, and minimize risks.
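To show the arithmetic behind the desirability step, here is a minimal sketch comparing the two hypothetical reliability strategies mentioned above. All monetary figures, the discount rate, and the three-year horizon are illustrative assumptions, not figures from any real evaluation.

```python
def npv(cash_flows, discount_rate=0.10):
    """Net present value of a series of yearly net cash flows (year 0 first)."""
    return sum(cf / (1 + discount_rate) ** year for year, cf in enumerate(cash_flows))

strategies = {
    # name: (upfront cost, yearly benefit, yearly run cost such as tokens/compute)
    "redundant_hardware": (400_000, 250_000, 60_000),
    "checkpoint_and_recover": (150_000, 200_000, 90_000),
}

for name, (upfront, benefit, run_cost) in strategies.items():
    flows = [-upfront] + [benefit - run_cost] * 3   # a three-year horizon
    value = npv(flows)
    roi = value / upfront                           # rough ROI relative to upfront spend
    print(f"{name}: NPV={value:,.0f}  ROI={roi:.2f}")
```

In a real analysis, the yearly benefit would come from the quantified-benefits step and the run cost would roll up per-task token and compute costs from production telemetry.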
Synthesis and Synergy: The Virtuous Cycle of Evaluation
The two frameworks are not a zero-sum game; they are synergistic and form a virtuous cycle of evaluation. The quality-first blueprint provides the necessary technical foundation. It identifies the architectural options and flags inherent risks and tradeoffs, giving the business a clear understanding of what is technically feasible and resilient. Once these options have been identified, the economic ROI model can be used to analyze them, selecting the one that delivers the most business value within the acceptable risk and cost parameters.
The synergy between these two approaches is paramount for mature organizations. The quality-first approach is the domain of the Chief Architect, focused on building a resilient and sound system from a technical perspective. The economic ROI model is the domain of the CIO or CTO, focused on maximizing the return on every investment. A mature organization uses both in concert to make truly informed, high-impact decisions that balance engineering rigor with business pragmatism.
Table 2: The Two Sides of Agentic Evaluation
| Aspect | The Quality-First Blueprint | The Economic ROI Model |
| --- | --- | --- |
| Primary Focus | Technical risk and architectural integrity | Financial return and business value |
| Key Metrics | "Ilities" (Reliability, Availability, Security, etc.), Utility Tree, Risks, Tradeoffs | NPV, ROI, Cost per Task, Quantified Benefits |
| When to Use | Early in the software development lifecycle | When making investment decisions between architectural options |
| Resulting Insights | Architectural weaknesses, inter-attribute tradeoffs, clarified requirements | Investment priorities, long-term cost implications, maximized ROI |
Part III: The Market Scan — Tools of the Trade
The Framework Forge: Building the Autonomous Workforce
The market for agentic AI frameworks is rapidly evolving, with a clear trend from building individual, single-purpose agents to orchestrating complex, multi-agent collaborations. These open-source frameworks provide the foundational infrastructure for developers to build the autonomous workforce of the future.
Microsoft AutoGen: This open-source framework is designed for building multi-agent AI applications that can perform complex tasks. It has a layered architecture, with a Core programming framework for scalable, distributed networks of agents. It is particularly focused on enterprise-grade reliability and provides advanced error handling and logging capabilities. A key feature is AutoGen Bench, a tool for assessing and benchmarking agentic AI performance, reflecting a built-in focus on evaluation from the ground up.
CrewAI: This popular orchestration framework uses a role-based architecture, treating agents as a "crew" of "workers" with specialized roles and backstories. Its core components — agents, tasks, and a process — allow developers to define how a crew collaborates, either sequentially or hierarchically. For example, a stock market analysis crew might have a market analyst agent, a researcher agent, and a strategy agent working in a pre-defined order to produce a final action plan (a code sketch follows these framework descriptions).
LangGraph and LangChain: LangChain is a foundational framework useful for developing straightforward AI agents, with support for vector databases and memory. Built on top of it, LangGraph provides a graph-based architecture suitable for more complex, non-linear workflows. This architecture offers fine-grained control and state tracking, making it ideal for cyclical or conditional processes. LangChain's LangSmith platform is an important tool for debugging, testing, and performance monitoring, providing observability into the agent's operations.
The market for frameworks is evolving from single-agent libraries to a focus on multi-agent collaboration and orchestration. This mirrors the enterprise need for specialized teams and complex workflows. The varying architectures (chat-based, graph-based, and role-based) reflect different approaches to solving the core challenge of multi-step complexity, providing developers with a range of tools tailored to their specific use cases.
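As an illustration of the role-based pattern described above, here is a minimal sketch in the spirit of CrewAI's documented Agent/Task/Crew API. The roles mirror the stock-analysis example; the goals, backstories, and task descriptions are invented for illustration, and an LLM provider (for instance an OpenAI API key in the environment) must be configured before kickoff() will actually run.

```python
from crewai import Agent, Task, Crew, Process

# Roles, goals, and backstories below are illustrative assumptions.
researcher = Agent(
    role="Researcher",
    goal="Collect recent news and filings for a given ticker",
    backstory="A meticulous analyst who only reports sourced facts.",
)
analyst = Agent(
    role="Market Analyst",
    goal="Interpret the research and identify risks and opportunities",
    backstory="A quantitative analyst with a focus on downside risk.",
)
strategist = Agent(
    role="Strategy Agent",
    goal="Turn the analysis into a concrete, prioritized action plan",
    backstory="A portfolio strategist who writes for busy executives.",
)

research_task = Task(
    description="Gather the latest developments for ticker ACME.",
    expected_output="A bullet list of sourced findings.",
    agent=researcher,
)
analysis_task = Task(
    description="Assess the findings and flag key risks and opportunities.",
    expected_output="A short risk/opportunity assessment.",
    agent=analyst,
)
plan_task = Task(
    description="Produce a prioritized action plan based on the assessment.",
    expected_output="A numbered action plan.",
    agent=strategist,
)

# A sequential process runs the tasks in the listed order, passing context forward.
crew = Crew(
    agents=[researcher, analyst, strategist],
    tasks=[research_task, analysis_task, plan_task],
    process=Process.sequential,
)

if __name__ == "__main__":
    print(crew.kickoff())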
The Evaluation Platforms: The Trust and Confidence Enablers
A significant indicator of market maturity is the emergence of specialized platforms designed exclusively for agent evaluation. The market is bifurcating: some companies build the agents, while others build the tools to ensure their safety, reliability, and governance. This specialization implies that the problem of evaluation is too complex to be an afterthought or a simple DIY project.
Galileo AI: Positioned as a comprehensive evaluation platform, Galileo automates evaluations to eliminate the time spent on manual reviews. Its features include real-time protection to block hallucinations, PII leakage, and prompt injections directly in production, ensuring 100% sampling. The platform provides tools for rapid iteration and debugging, helping developers identify failure modes and their root causes. The ability to de-risk AI in production by integrating unit testing and CI/CD rigor into AI workflows positions it as a full-lifecycle solution.
Patronus AI: This platform offers a powerful suite of evaluation and optimization tools, including industry-leading evaluation models for RAG hallucinations, image relevance, and context quality. It supports custom evaluators and provides a suite of industry-standard datasets like FinanceBench for financial question answering and SimpleSafetyTests for safety risks. Patronus's key value proposition lies in its ability to continuously capture evaluations and highlight failures in production logs, allowing for side-by-side benchmarking of different LLMs and agents. The platform's direct integration with frameworks like CrewAI further streamlines the evaluation process.
The fact that these platforms offer production monitoring, not just pre-deployment testing, signifies a shift to continuous, real-time evaluation. They provide the necessary tools for companies to move from isolated experiments to reliable, production-grade systems.
Table 3: The Agentic AI Ecosystem at a Glance
| Framework/Platform | Core Philosophy | Key Features | Ideal Use Case |
| --- | --- | --- | --- |
| AutoGen | Conversational multi-agent systems | Layered architecture, enterprise reliability, built-in benchmarking (AutoGen Bench) | Production-grade systems, multi-agent collaboration |
| CrewAI | Role-based orchestration, "Crew" of "Workers" | Natural language agent/task definitions, sequential or hierarchical processes | Straightforward, multi-agent workflows |
| LangGraph/Chain | Graph-based architecture for complex workflows | Fine-grained control, state tracking, large ecosystem of integrations | Complex, non-linear, and cyclical workflows |
| Galileo AI | Automated, real-time evaluation | CI/CD rigor, automated evaluations, real-time production protection | De-risking AI in production, rapid iteration |
| Patronus AI | Comprehensive evaluation and optimization | Specialized evaluators (RAG, hallucinations), custom evaluators, industry datasets | Benchmarking models, ensuring safety and compliance |
Part IV: Beyond Today—Maturity and the Vision of the Future
The Turning Point: From Experimentation to Enterprise
The market for agentic AI is poised for explosive growth, with multiple reputable research firms projecting a Compound Annual Growth Rate (CAGR) of between 41% and 57% through 2030, and a market size reaching between $24.5 billion and $48.2 billion by that year. This projection is not just a forecast of technological adoption; it is an estimate of a massive economic shift, with a potential contribution of trillions of dollars annually to global GDP by 2030. This signals a profound transition from technological curiosity to strategic business imperative.
The drivers behind this growth are clear. Organizations are increasing their IT budgets for generative AI capabilities, recognizing that agentic workflows are a "new consumption framework" that unlocks compounding value from their investments. This creates a powerful tension: how do we scale safely and responsibly at this speed? The question elevates the importance of EVALS from a technical detail to a critical enabler of this massive growth. A large majority of organizations already believe that agentic AI will be disruptive to their business operating models in the near future, making the establishment of robust evaluation frameworks a matter of competitive survival.
The Future of EVALS: Towards Ubiquitous, Real-Time Governance
The future of evaluation is not merely a matter of pre-deployment testing. As agents become more autonomous and are integrated into mission-critical systems, EVALS will become a continuous, real-time function embedded directly into the system's architecture. This necessitates a new paradigm of evaluation focused not just on individual agents but on the entire system as a whole.
This new paradigm will have to address the emerging evaluation challenges that accompany increasingly complex agentic systems:
Multi-Agent Dynamics: The future of agentic AI is not about a single, super-powerful agent but about networks of collaborating agents. This requires new evaluation metrics to assess inter-agent trust, identify potential collusion between agents, or understand how agents might exploit each other's weaknesses in a competitive environment. The challenge is to ensure that a "crew" of agents works harmoniously towards a common goal, not just that each individual agent performs its task.
Error Propagation Analysis: In a multi-agent system, an error or hallucination in one agent's output can cascade through the entire workflow, creating a "ripple effect" of failure. Future EVALS must be able to trace this propagation and identify the root cause of the error in the complex web of interactions (a minimal sketch of such tracing closes this section).
Workflow Resilience Testing: The ability of an agentic system to recover gracefully from disruptions and input failures is paramount. Future evaluation frameworks will need to go beyond simple success/failure metrics to simulate real-world disruptions and test the system's capacity for self-correction and recovery.
This is the final frontier for EVALS: governing not just a single, autonomous entity but the collective intelligence of an entire network.
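To make the error-propagation challenge concrete, here is a minimal, framework-agnostic sketch of tracing a failure back through a recorded multi-agent workflow. The StepRecord structure and its parent links are illustrative assumptions about what an evaluation platform's trace might expose.

```python
from dataclasses import dataclass, field

@dataclass
class StepRecord:
    # One agent's contribution to the workflow, with links to the upstream
    # steps whose output it consumed; field names are illustrative.
    agent: str
    ok: bool
    parents: list["StepRecord"] = field(default_factory=list)

def root_causes(failed_step: StepRecord) -> list[StepRecord]:
    """Walk upstream from a failed step and return the earliest failing
    ancestors, i.e. failures with no failing parent of their own."""
    roots, seen, stack = [], set(), [failed_step]
    while stack:
        step = stack.pop()
        if id(step) in seen:
            continue
        seen.add(id(step))
        failing_parents = [p for p in step.parents if not p.ok]
        if not step.ok and not failing_parents:
            roots.append(step)          # nothing upstream explains this failure
        stack.extend(failing_parents)   # keep tracing the ripple backwards
    return roots

# Example: a researcher hallucination cascades into the strategist's output.
research = StepRecord("researcher", ok=False)
analysis = StepRecord("analyst", ok=True, parents=[research])
strategy = StepRecord("strategist", ok=False, parents=[research, analysis])
print([r.agent for r in root_causes(strategy)])   # -> ['researcher']
```

A real system would also have to handle the harder case where the originating step reported success but produced subtly wrong output, which is exactly why step-level accuracy and hallucination metrics matter upstream.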
Conclusion: The Invisible Scaffolding
The shift to agentic AI is the most significant technological evolution since the advent of the internet. It promises to transform entire industries by automating complex, multi-step workflows that were previously the exclusive domain of human cognition. But with this newfound autonomy comes an equally profound challenge: how do we ensure that these systems operate safely, reliably, and in alignment with our intentions?
The answer lies in a new class of evaluation frameworks. EVALS are the invisible scaffolding upon which the new era of agentic autonomy is being built. They are the operating system that makes these systems observable, auditable, and, most importantly, trustworthy. They enable technology leaders to move beyond the strategic paralysis of a simple tradeoff analysis by providing both the technical and financial data needed to make informed decisions.
For the modern technologist, embracing EVALS is not a compliance burden but a strategic imperative. It is how we move from a "wizard" who creates magic to a "master engineer" who builds enduring systems. The future belongs not to the one who can build the most brilliant agent but to the one who can build the most reliable and transparent evaluation framework to govern it. This is the path to truly scalable, responsible, and transformative AI.
Works cited
Agentic AI vs. Generative AI - IBM, accessed August 21, 2025, https://www.ibm.com/think/topics/agentic-ai-vs-generative-ai
14 real-world agentic AI use cases - Valtech, accessed August 21, 2025, https://www.valtech.com/thread-magazine/14-real-world-agentic-ai-use-cases/
www.techaheadcorp.com, accessed August 21, 2025, https://www.techaheadcorp.com/blog/agentic-ai-evaluation-ensuring-reliability-and-performance/#:~:text=Agentic%20AI%20evaluation%20is%20the,as%20they%20are%20intended%20to.
Agent Factory: The new era of agentic AI—common use cases and ..., accessed August 21, 2025, https://azure.microsoft.com/en-us/blog/agent-factory-the-new-era-of-agentic-ai-common-use-cases-and-design-patterns/
Agentic AI Threat Modeling Framework: MAESTRO | CSA - Cloud Security Alliance, accessed August 21, 2025, https://cloudsecurityalliance.org/blog/2025/02/06/agentic-ai-threat-modeling-framework-maestro
What is AI Agent Evaluation? | IBM, accessed August 21, 2025, https://www.ibm.com/think/topics/ai-agent-evaluation
AI Agent Evaluation: Key Steps and Methods | Fiddler AI Blog, accessed August 21, 2025, https://www.fiddler.ai/blog/ai-agent-evaluation
Agentic AI: On Evaluations | Towards Data Science, accessed August 21, 2025, https://towardsdatascience.com/agentic-ai-evaluation-playbook/
Evaluating Agentic AI in the Enterprise: Metrics, KPIs, and Benchmarks - Auxiliobits, accessed August 21, 2025, https://www.auxiliobits.com/blog/evaluating-agentic-ai-in-the-enterprise-metrics-kpis-and-benchmarks/
What is AI Agent Evaluation: A CLASSic Approach for Enterprises - Aisera, accessed August 21, 2025, https://aisera.com/blog/ai-agent-evaluation/
Architecture Tradeoff Analysis Method Collection, accessed August 21, 2025, https://www.sei.cmu.edu/library/architecture-tradeoff-analysis-method-collection/
Cost Benefit Analysis Method (CBAM) - Software Engineering Institute, accessed August 21, 2025, https://www.sei.cmu.edu/documents/2540/2018_010_001_513478.pdf
List of system quality attributes - Wikipedia, accessed August 21, 2025, https://en.wikipedia.org/wiki/List_of_system_quality_attributes
Reasoning About Software Quality Attributes, accessed August 21, 2025, https://www.sei.cmu.edu/library/reasoning-about-software-quality-attributes/
Architecture tradeoff analysis method - Wikipedia, accessed August 21, 2025, https://en.wikipedia.org/wiki/Architecture_tradeoff_analysis_method
What is a cost-benefit analysis (CBA)? - Atlassian, accessed August 21, 2025, https://www.atlassian.com/work-management/strategic-planning/cost-benefit-analysis
Maximizing ROI in Architectural Projects - Number Analytics, accessed August 21, 2025, https://www.numberanalytics.com/blog/cost-benefit-analysis-architectural-management
AI Agent Frameworks: Choosing the Right Foundation for Your ... - IBM, accessed August 21, 2025, https://www.ibm.com/think/insights/top-ai-agent-frameworks
Top Agentic AI Tools and Frameworks for 2025 - Anaconda, accessed August 21, 2025, https://www.anaconda.com/guides/agentic-ai-tools
Galileo AI: The Generative AI Evaluation Company, accessed August 21, 2025, https://galileo.ai/
Patronus AI | Powerful AI Evaluation and Optimization, accessed August 21, 2025, https://www.patronus.ai/
Patronus AI Evaluation - CrewAI Docs, accessed August 21, 2025, https://docs.crewai.com/observability/patronus-evaluation
How to Monitor, Evaluate, and Optimize Your CrewAI Agents : r/AIQuality - Reddit, accessed August 21, 2025, https://www.reddit.com/r/AIQuality/comments/1m0bdps/how_to_monitor_evaluate_and_optimize_your_crewai/
Agentic AI: A Strategic Forecast and Market Analysis (2025-2030) - Prism Media Wire, accessed August 21, 2025, https://prismmediawire.com/agentic-ai-a-strategic-forecast-and-market-analysis-2025-2030/
Scaling AI Maturity with Agentic AI Adoption eBook - IDC, accessed August 21, 2025, https://info.idc.com/rs/081-ATC-910/images/IDC-Scaling-AI-Maturity-with-Agentic-AI-AP.pdf?version=1




