Shrikant Wagh

Latest release

Evaluating RAG and Agentic AI Systems - Failure Taxonomy & Contracts

Your test suite is green. Your CI pipeline passed. And your agentic AI system just leaked customer data in production. This is the crisis no one warned you about - unfolding right now across every industry deploying RAG and agentic AI systems without the tools to truly test them. A fintech agent leaks customer records through a manipulated tool description. An enterprise RAG pipeline silently cross-contaminates tenant data without raising a single exception.
A model update quietly shifts agent behavior in ways no test ever caught. These aren't software bugs. They're a new category of failure - and conventional testing was never built to catch them. Evaluating RAG and Agentic AI Systems - Failure Taxonomy & Contracts is the definitive answer to that gap. Written by Shrikant Wagh - a veteran of over three decades in software quality, co-founder of a patented testing tools company, and IIT Madras alumnus - this framework gives engineering teams the language, architecture, and working code to test agentic AI with mission-critical rigor.
Not through informal spot-checking. Through deterministic, CI-gateable, production-grade contracts. At the heart of the book is the Eleven Contract Taxonomy: behavioral invariants covering every critical failure surface - Knowledge, Retrieval, Generation, Agent and Tool, Skill, Protocol, Security, Operational, Multi-Agent, Multi-Modal, and Fine-Tuning. These contracts give you testable, automatable assertions for catching failure before it reaches your users.
When your system is non-deterministic, contracts need muscle. The MITM Testing Pattern delivers it - using fake retrievers, fake LLMs, in-process MCP clients, and in-memory tracers to inject precise control at every agent boundary. Write deterministic tests for probabilistic systems, isolate every layer, and assert correctness - without expensive live model calls. On top of this sits a complete production evaluation stack: golden datasets, LLM-as-Judge pipelines, Recall@K, MRR, and NDCG@K metrics, regression quality gates, drift detection, and a full GitHub Actions CI pipeline - each chapter backed by real Python code and exercises.
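As a rough illustration of the idea described above, and not the book's actual code, the sketch below replaces the retriever and the LLM with in-memory fakes so a test can assert on exactly what crosses each boundary. All names here (`FakeRetriever`, `FakeLLM`, `answer`, `recall_at_k`, `reciprocal_rank`) are hypothetical and chosen only for this example. The two metric helpers follow the standard definitions of Recall@K and reciprocal rank, whose mean over queries is MRR.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeRetriever:
    """Hypothetical stand-in for a retriever: returns a scripted ranking per query."""
    canned: dict[str, list[str]]
    calls: list[str] = field(default_factory=list)

    def retrieve(self, query: str, k: int) -> list[str]:
        self.calls.append(query)  # record the call so the test can inspect it
        return self.canned.get(query, [])[:k]


@dataclass
class FakeLLM:
    """Hypothetical stand-in for a model: returns a fixed reply, records each prompt."""
    reply: str
    prompts: list[str] = field(default_factory=list)

    def complete(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.reply


def answer(query: str, retriever: FakeRetriever, llm: FakeLLM, k: int = 2) -> str:
    """Toy RAG step under test: retrieve top-k documents, then ask the model."""
    docs = retriever.retrieve(query, k)
    prompt = "Context:\n" + "\n".join(docs) + "\n\nQuestion: " + query
    return llm.complete(prompt)


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def test_answer_uses_only_top_k_context() -> None:
    retriever = FakeRetriever({"refund policy": ["doc-a", "doc-b", "doc-c"]})
    llm = FakeLLM(reply="Refunds take 5 days.")

    result = answer("refund policy", retriever, llm, k=2)

    # Deterministic assertions: no live model or vector store is involved.
    assert result == "Refunds take 5 days."
    assert retriever.calls == ["refund policy"]
    assert "doc-a" in llm.prompts[0] and "doc-b" in llm.prompts[0]
    assert "doc-c" not in llm.prompts[0]  # third document must not reach the model
```

Because both fakes are plain in-memory objects, a test like this runs in milliseconds and can gate a CI pipeline. A real suite would presumably swap in richer fakes, such as an in-process MCP client or an in-memory tracer, at the other boundaries the description mentions.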
The final chapters address the organization: a five-level maturity model, sprint-by-sprint roadmap, and Investment Decision Framework for building a sustainable testing program at scale. This is not a book about theory. It was born from real failures - MCP rug pull exploits, retrieval authorization bypass, silent hallucination, citation fabrication, multi-agent cascade failure. Each has a named contract and a test that catches it.
The question is not "did it pass the tests?" but "do we have the right tests?" The systems are in production. The failures are real. Now there is a framework built to catch them. Build the contracts. Gate the pipeline. Ship with confidence.

Books by Shrikant Wagh

Evaluating RAG and Agentic AI Systems - Failure Taxonomy & Contracts

Enterprise-Grade Test Automation Playbook
