Building Reliable Multi-Agent Workflows: Durable Execution with Validation Gates
📓 Try It Yourself
Follow the Quick Start instructions below to run this demo in minutes.
You'll need:
- Temporal CLI installed
- OpenAI API key
- Python 3.8+
📦 Download: durable-agentic-workflows-demo.zip
Section I: Why Agentic Workflows Fail in Production
At Yess, we build AI agents that perform actions in your CRM.
A key challenge is understanding each customer's specific CRM structure: their unique objects, fields, relationships, and business logic. We solve this using agents.
Our initial CRM agents were highly effective for small datasets. However, as we onboarded enterprise clients...
Data volumes exploded. Agentic workflows that took just a few minutes now ran 10, 20, even 45 minutes. More data meant more tool calls - querying CRM APIs, fetching schemas, analyzing records, each one another chance for failure. Each failure meant restarting from scratch, wasting time and money.
❌ The Silent Failure
The workflow completes successfully, but an agent hallucinated something midway through. Everything downstream is poisoned. Restart from scratch.
Section II: What We Needed
We couldn't keep restarting 45-minute workflows. After testing different approaches, we boiled it down to three essential requirements:
1. Durable Execution with Smart Failure Handling
Infrastructure will fail. When it does, minimize what you lose and fail fast when needed.
Checkpointing: Save progress after each step. If something breaks, resume from the last checkpoint instead of restarting from scratch.
Timeouts & Retries: Define reasonable completion times and retry limits. When exceeded, fail fast instead of waiting indefinitely.
2. Validation Checkpoints (Catch Issues Early)
An agent finishing isn't enough; we need to verify it did a good job. Validations must be:
- Fast - don't slow down the pipeline
- Early - catch problems before wasting downstream work
- Specific - allow retries on the exact failed step, not the entire workflow
- Informative - semantically analyze outputs and pass validation failure reasons to retries for targeted fixes
3. Stay Optimized
Balance reliability with efficiency - minimize execution time without compromising durability.
The Solution - Task Decomposition
The business case is straightforward: Long-running workflows are expensive to retry and validate. Task decomposition reduces both costs.
Is this solution right for you? Ask yourself:
- Do your workflows run for more than a few minutes?
- Can your workflows be broken into smaller, independent tasks?
- Would a failure halfway through waste significant time and money?
- Do you need to verify intermediate results for quality/correctness?
Answering yes to even one indicates task decomposition could benefit your workflow.
The core principle: Break complex workflows into small, discrete steps with checkpoints between them.
How to approach implementation:
- Identify natural breakpoints in your workflow where you can checkpoint state (e.g., after data fetching, after each analysis phase)
- Define validation criteria for each step's output: what makes it "good enough" to proceed?
- Determine dependencies between steps to identify what can run in parallel
- Set appropriate timeouts for each step based on expected execution time
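The checkpointing idea behind this approach can be sketched in plain Python before reaching for a full engine like Temporal: persist each step's result, and skip completed steps on re-run. Everything here (the file name, the step names, the mock step functions) is illustrative, not production code.

```python
import json
import os

CHECKPOINT_FILE = "checkpoints.json"  # illustrative path for this sketch

def load_checkpoints():
    """Return previously persisted step results, if any."""
    if os.path.exists(CHECKPOINT_FILE):
        with open(CHECKPOINT_FILE) as f:
            return json.load(f)
    return {}

def save_checkpoint(state, step, result):
    """Persist a step's result so a re-run can skip it."""
    state[step] = result
    with open(CHECKPOINT_FILE, "w") as f:
        json.dump(state, f)

def run_pipeline(steps):
    """Run steps in order, resuming from the last checkpoint on re-run."""
    state = load_checkpoints()
    for name, fn in steps:
        if name in state:
            continue  # checkpoint hit: no repeated work
        save_checkpoint(state, name, fn(state))
    return state

# Mock steps standing in for real CRM analysis phases
steps = [
    ("fetch", lambda state: {"contacts": 3}),
    ("analyze", lambda state: state["fetch"]["contacts"] * 2),
]
print(run_pipeline(steps))
```

A crash between steps loses at most one step's work: the next run reloads the file and continues from the last completed step. Temporal provides the hardened version of exactly this, with retries, timeouts, and a UI on top.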
| Aspect | Before Task Decomposition | After Task Decomposition |
|---|---|---|
| ⚠️ System Failure Impact | ❌ Lose entire workflow (45 min) | ✅ Lose one step (1-8 min) |
| 🔍 Validation Failure Impact | ❌ Re-run entire workflow | ✅ Re-run only failed step |
| 📊 Validation Scope | ❌ Validate 2000+ lines of mixed output | ✅ Validate 200 lines per step |
| ⚡ Optimization | 🐌 Everything runs sequentially | 🚀 Independent steps run in parallel |
Section III: Building the Workflow - A Complete Example
We'll build a simplified CRM analysis workflow that demonstrates all three requirements in action. The workflow analyzes mock CRM data (contacts and opportunities) to extract insights about customers and their opportunities.
What you'll see:
- Durable execution - checkpointing after each step, automatic resume on failure
- Validation - catching issues before they propagate
- Parallel optimization - running independent analyses concurrently
Workflow Architecture

Each box is a checkpoint - if any step fails, Temporal resumes from the last successful checkpoint.
What You Need to Know Before We Begin
| Component | Purpose | Type | Key Features |
|---|---|---|---|
| 🏗️ Temporal | Durable execution engine | Infrastructure (requires install) | • Checkpointing & automatic retries • Timeout policies • Web UI for monitoring |
| 🤖 Agno | AI agent framework | SDK/Library | • Lightweight & fast • Structured outputs with Pydantic • Works with any LLM provider |
| 🔑 OpenAI API | LLM provider | API (key required) | • gpt-4o-mini for cost-effective analysis • Reliable structured JSON outputs |
Prerequisites
Before starting, ensure you have:
- Temporal CLI installed (instructions)
- Python 3.8+ installed
- OpenAI API Key (from platform.openai.com)
Quick Start
Step 1: Start Temporal Server (Terminal 1)
Open a terminal and run:
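With the Temporal CLI installed, the local development server starts with:

```shell
temporal server start-dev
```

This runs an in-memory server with the Web UI on port 8233.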
Keep this terminal open - the server must stay running.
Step 2: Download and Run Workflow (Terminal 2)
Open a new terminal and run:
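Assuming the demo zip from the top of the post, with a `requirements.txt` inside it (the exact file layout is an assumption; adjust to match the download):

```shell
unzip durable-agentic-workflows-demo.zip
cd durable-agentic-workflows-demo
pip install -r requirements.txt   # temporalio, agno, openai (assumed contents)
export OPENAI_API_KEY="sk-..."    # your key from platform.openai.com
python run.py
```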
Watch it live: Open http://localhost:8233 to see the workflow executing in real-time.
Step 1/4: Define the Workflow (workflow.py)
The workflow orchestrates three processing steps with validation between them. See the full code in the repo.
💡 Key Highlights
✓ Parallel execution with `asyncio.gather()` - contacts and opportunities analyzed concurrently
✓ Checkpointing after each `execute_activity` - state persisted automatically
✓ Validation gates before final combination - catch issues early
✓ Different timeout policies - analysis gets 2 min, validation gets 30 sec
Step 2/4: Implement Activities (activities.py)
Activities do the actual work - using Agno AI agents to analyze data and validate outputs. See the full code in the repo.
💡 Key Highlights
✓ Structured outputs with Pydantic models - reliable JSON from LLMs
✓ Hybrid validation - deterministic checks first (fast), then AI validation
✓ Fail fast on simple errors - save AI costs by not running expensive checks
✓ Type safety - Pydantic validates response schemas automatically
✓ Retriable - if any activity fails, Temporal retries automatically
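The hybrid gate itself is easy to sketch without any SDK: run cheap deterministic checks first, and only invoke the expensive AI validator when they pass. The field names and the `ai_check` hook below are illustrative assumptions:

```python
def deterministic_checks(analysis):
    """Cheap structural checks that run before any LLM call."""
    errors = []
    for field in ("total_contacts", "total_pipeline_value"):
        if field not in analysis:
            errors.append("missing field: " + field)
        elif analysis[field] < 0:
            errors.append("negative value: " + field)
    return errors

def validate_analysis(analysis, ai_check=None):
    """Hybrid gate: fail fast on structural errors, then spend on AI."""
    errors = deterministic_checks(analysis)
    if errors:
        # Fail fast: no LLM cost, and the reasons can feed a retry prompt.
        return {"passed": False, "reasons": errors}
    if ai_check is not None:
        return ai_check(analysis)  # semantic / hallucination check
    return {"passed": True, "reasons": []}

print(validate_analysis({"total_contacts": -1}))
```

Because the failure reasons are returned rather than swallowed, a retry of the step can include them in the prompt for a targeted fix.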
Step 3/4: Run the Workflow (run.py)
The runner starts a worker and executes the workflow. See the full code in the repo.
💡 Key Highlights
✓ Worker registers all activities and workflows
✓ Connects to Temporal server on `localhost:7233`
✓ Workflow ID enables tracking and resumption
✓ Task queue isolates different workflow types
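In outline, a runner built on the Temporal Python SDK looks like this; the imported workflow and activity names are placeholders standing in for the repo's actual definitions in `workflow.py` and `activities.py`:

```python
import asyncio

from temporalio.client import Client
from temporalio.worker import Worker

# Placeholders for the definitions from the earlier steps.
from workflow import CRMAnalysisWorkflow
from activities import (
    analyze_contacts,
    analyze_opportunities,
    validate_analysis,
    combine_results,
)

async def main():
    client = await Client.connect("localhost:7233")
    # The worker registers all workflows and activities on one task queue.
    async with Worker(
        client,
        task_queue="crm-analysis-queue",
        workflows=[CRMAnalysisWorkflow],
        activities=[analyze_contacts, analyze_opportunities,
                    validate_analysis, combine_results],
    ):
        result = await client.execute_workflow(
            CRMAnalysisWorkflow.run,
            {"contacts": [], "opportunities": []},  # mock input
            id="crm-analysis-demo",           # enables tracking & resumption
            task_queue="crm-analysis-queue",
        )
        print(result)

if __name__ == "__main__":
    asyncio.run(main())
```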
Expected output:
Step 4/4: Explore in Temporal UI
Open http://localhost:8233 in your browser to see the workflow in action.
You'll see:
- Workflows page - your workflow with ID `crm-analysis-demo`
- Event history - every activity execution, checkpoint, retry in real-time
- Parallel execution - contacts and opportunities start at the same timestamp
- Activity inputs/outputs - full data for debugging
Try breaking it:
| Test | How | What Happens |
|---|---|---|
| Validation failure | Make `validate_analysis` return `{"passed": False}` | Workflow stops early, reports failure reason |
| Infrastructure failure | Kill process mid-workflow (Ctrl+C), then restart | Resumes from last checkpoint, not from scratch |
| Activity retry | Add `raise Exception()` randomly in an activity | Temporal automatically retries the failed activity |

Section IV: What You Just Built
✅ Durable execution in action:
- Each activity checkpointed after completion
- Kill the process mid-workflow → restart resumes from last checkpoint
- No repeated work, no lost progress
✅ Hybrid validation catching issues:
- Deterministic checks catch 90% of errors instantly (missing fields, negatives, invalid ranges)
- AI validation catches semantic issues and hallucinations
- Fail fast on simple errors, save AI costs by not running expensive checks unnecessarily
- Clear error messages show exactly what failed
✅ Parallel optimization:
- Contacts and opportunities analyzed concurrently
- Independent operations run simultaneously
- Execution time cut in half
This is the exact pattern we use in production - same structure, same checkpointing, same validation gates. The only difference: real CRM APIs instead of CSV files, and actual LLM agents doing long complex work instead of mock analysis functions.
Section V: Key Takeaways
- Task decomposition unlocks reliability: Breaking workflows allows for checkpointing, granular validation, and parallel execution.
- Durable execution + Validation: Temporal handles infrastructure failures, while validation gates catch AI hallucinations. You need both.
- Hybrid validation saves costs: Run fast, cheap deterministic checks first. Only use expensive AI validation for semantic checks.
- Tailor your policies: Don't use one-size-fits-all settings. Give analysis steps longer timeouts and validation steps shorter ones.
