Srinivas Bommena

Latest release

Evaluating Gen AI Applications - A Safety And Validation Engineering Guide

Evaluating Gen AI Applications is a practical Safety and Validation Engineering guide for teams that need to make Gen AI systems measurable, observable, secure, and production-ready. It shows how to move beyond ad-hoc prompt testing and build evaluation systems that produce evidence: test results, traces, rubrics, release gates, human-review records, red-team findings, and audit-ready governance artefacts.
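To make the idea of a release gate that produces an evidence record more concrete, here is a minimal Python sketch. It is not taken from the book. The metric names (`faithfulness`, `citation_validity`, `injection_block_rate`), their thresholds, and the `EvidenceRecord`/`release_gate` names are all hypothetical. The gate compares each score with a minimum, treats a missing score as a failure, and returns a decision together with the reasons for it, so the record can be stored as audit evidence.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone

# Hypothetical metric names and minimum scores, for illustration only.
THRESHOLDS = {"faithfulness": 0.90, "citation_validity": 0.95, "injection_block_rate": 0.99}


@dataclass
class EvidenceRecord:
    release: str
    decision: str  # "pass" or "block"
    failures: list = field(default_factory=list)
    scores: dict = field(default_factory=dict)
    created_at: str = ""


def release_gate(release: str, scores: dict, thresholds: dict = THRESHOLDS) -> EvidenceRecord:
    """Compare each metric with its minimum and record why the gate passed or blocked."""
    failures = []
    for metric, minimum in thresholds.items():
        value = scores.get(metric)
        if value is None:
            failures.append(f"{metric}: missing")  # no result counts as a failure
        elif value < minimum:
            failures.append(f"{metric}: {value:.2f} < {minimum:.2f}")
    return EvidenceRecord(
        release=release,
        decision="block" if failures else "pass",
        failures=failures,
        scores=dict(scores),
        created_at=datetime.now(timezone.utc).isoformat(),
    )


record = release_gate(
    "v1.4.0",
    {"faithfulness": 0.93, "citation_validity": 0.91, "injection_block_rate": 0.995},
)
print(json.dumps(asdict(record), indent=2))  # JSON evidence record for the audit trail
```

Because a missing score blocks the release, an evaluation step that silently fails to run cannot let a release through by default.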
The book focuses on evaluating the full application, not just the model response. A RAG system must be checked for retrieval quality, source freshness, citation validity, grounding, and hallucination risk. An agentic system must be evaluated through its plan, tool calls, permissions, handoffs, cost, and final outcome. A multimodal system must be validated across image, audio, video, OCR, JSON extraction, and cross-modal reasoning.
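Evaluating an agent through its tool calls, rather than only through its final answer, can be sketched as a check over a recorded trace. This is an illustrative assumption, not the book's method. The tool names, the allow-list, the required step order, and the cost budget are all hypothetical. The sketch checks three properties: every call stays inside the permission boundary, the required steps occur in order, and the total cost per task stays within budget.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    cost_usd: float


# Hypothetical policy for one agent task; real boundaries would come from your own configuration.
ALLOWED_TOOLS = {"fetch_claim", "lookup_policy", "draft_reply", "escalate_to_human"}
REQUIRED_ORDER = ["fetch_claim", "lookup_policy", "draft_reply"]
MAX_COST_USD = 0.50


def check_trace(trace: list[ToolCall]) -> list[str]:
    """Return the policy violations found in one recorded tool-call trace."""
    violations = []
    # Permission boundary: every call must use an allow-listed tool.
    for step, call in enumerate(trace):
        if call.tool not in ALLOWED_TOOLS:
            violations.append(f"step {step}: '{call.tool}' is outside the permission boundary")
    # Step order: required tools must appear as a subsequence of the trace
    # (`in` on an iterator consumes it, so each match must come after the previous one).
    remaining = iter(call.tool for call in trace)
    if not all(required in remaining for required in REQUIRED_ORDER):
        violations.append("required steps are missing or out of order")
    # Cost per task: total spend across all calls must stay within budget.
    total = sum(call.cost_usd for call in trace)
    if total > MAX_COST_USD:
        violations.append(f"cost ${total:.2f} exceeds budget ${MAX_COST_USD:.2f}")
    return violations


trace = [
    ToolCall("fetch_claim", 0.05),
    ToolCall("draft_reply", 0.10),
    ToolCall("lookup_policy", 0.05),
    ToolCall("delete_record", 0.01),
]
for violation in check_trace(trace):
    print(violation)
```

In this example trace, the agent drafts a reply before it consults the policy and then calls a tool it was never granted. Both problems are flagged, even though the final reply alone might look acceptable.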
A regulated enterprise system must also produce evidence that its risk, quality, and governance controls are working. Through a running Meridian Insurance scenario, the book shows how Gen AI failures happen in production: stale retrieval, unsupported claims, prompt drift, weak observability, unsafe tool use, inconsistent human review, adversarial manipulation, and missing audit evidence. Each chapter turns those risks into practical engineering controls.
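Two of those failure modes, stale retrieval and unsupported citations, can be caught with simple checks on a RAG answer. The following is a minimal sketch under stated assumptions. The source index with last-review dates, the one-year freshness window, the document IDs, and the `check_citations` name are all hypothetical. The sketch flags any cited document that was not in the retrieved context, is unknown to the index, or has not been reviewed within the window.

```python
from datetime import date

# Hypothetical source index: document ID -> date the source was last reviewed.
SOURCE_INDEX = {
    "policy-2024-07": date(2024, 7, 1),
    "policy-2022-01": date(2022, 1, 15),
}
MAX_SOURCE_AGE_DAYS = 365  # assumed freshness window


def check_citations(cited_ids: list[str], retrieved_ids: set[str], today: date) -> list[str]:
    """Flag citations that were not retrieved, are unknown, or point to stale sources."""
    issues = []
    for doc_id in cited_ids:
        # Citation validity: the answer may only cite passages the retriever actually returned.
        if doc_id not in retrieved_ids:
            issues.append(f"{doc_id}: cited but not in the retrieved context")
        reviewed = SOURCE_INDEX.get(doc_id)
        if reviewed is None:
            issues.append(f"{doc_id}: unknown source")
            continue
        # Source freshness: anything older than the window counts as stale retrieval.
        age_days = (today - reviewed).days
        if age_days > MAX_SOURCE_AGE_DAYS:
            issues.append(f"{doc_id}: stale ({age_days} days since last review)")
    return issues


issues = check_citations(
    cited_ids=["policy-2024-07", "policy-2022-01", "policy-2019-03"],
    retrieved_ids={"policy-2024-07", "policy-2022-01"},
    today=date(2025, 1, 1),
)
print("\n".join(issues))
```

These checks only show that a citation is real and current. Whether the cited passage actually supports the claim (faithfulness) needs a separate grounding evaluation.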
Inside the book, you will learn how to:
- Build a Safety and Validation Engineering approach for Gen AI applications.
- Design evaluation pipelines that produce decisions, explanations, and reusable evidence records.
- Evaluate RAG systems for retrieval relevance, source freshness, citation validity, faithfulness, and hallucination risk.
- Test agentic workflows using tool-call traces, permission boundaries, step order, escalation rules, and cost per task.
- Validate multimodal AI systems involving images, video, audio, OCR, structured extraction, and cross-modal outputs.
- Create evaluation datasets from incidents, production traces, adversarial examples, edge cases, expert review, and holdout sets.
- Use observability to monitor drift, regressions, latency, cost, model changes, and production behaviour.
- Build structured human-in-the-loop evaluation using rubrics, calibration, adjudication, and reviewer agreement.
- Apply red teaming to prompt injection, jailbreaks, prompt leakage, data exfiltration, and tool misuse.
- Turn evaluation results into governance evidence for audits, executive oversight, compliance, and release decisions.

This book is written for AI platform teams, software engineers, ML engineers, architects, product owners, risk teams, governance leaders, and technology executives responsible for shipping Gen AI systems that must be trusted in production.
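Reviewer agreement, one of the human-in-the-loop measures listed above, is often reported as a chance-corrected statistic. The blurb does not say which statistic the book uses. Cohen's kappa for two reviewers is assumed here as one common choice. Kappa is (observed agreement − chance agreement) / (1 − chance agreement). The labels and ratings below are made up for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two reviewers, corrected for chance."""
    if not rater_a or len(rater_a) != len(rater_b):
        raise ValueError("ratings must be non-empty and aligned item by item")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both reviewers pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both reviewers used one identical label throughout; kappa is 0/0 here,
        # so this sketch reports perfect agreement by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


reviewer_1 = ["pass", "pass", "fail", "pass", "fail", "pass"]
reviewer_2 = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(f"kappa = {cohens_kappa(reviewer_1, reviewer_2):.3f}")  # kappa = 0.667
```

The two reviewers agree on 5 of 6 items, but half of that agreement would be expected by chance, so kappa lands at about 0.67 rather than 0.83. In practice, `sklearn.metrics.cohen_kappa_score` from scikit-learn computes the same statistic.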
The future of Gen AI will not be won by teams that merely generate impressive outputs. It will be won by teams that can prove their systems are accurate, grounded, observable, secure, cost-aware, and operating within clearly defined boundaries. Evaluating Gen AI Applications gives you the practical Safety and Validation Engineering framework to build that proof.