Skip to content

LlamaIndex

AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 5-6

Or: When Your AI Needs to Remember What Happened

Section titled “Or: When Your AI Needs to Remember What Happened”

Reading Time: 7-8 hours Prerequisites: Module 17

Section titled “Reading Time: 7-8 hours Prerequisites: Module 17”

San Francisco. August 7, 2024. 4:23 PM. Marcus, a senior engineer at a fintech startup, watched his monitoring dashboard with growing dread. Their AI loan processor had been running perfectly for three hours—collecting documents, verifying identities, running compliance checks. The customer had uploaded 47 different documents. Then the credit bureau API timed out.

He refreshed the dashboard. Everything was gone. All 47 documents, all verification results, all compliance checks—the entire session had vanished. The customer would have to start over from scratch.

Marcus’s phone buzzed. It was customer support: “The customer is furious. She’s been working on this for four hours. Can we recover her session?”

The answer was no. Their agent framework had no concept of “where it left off.” Every step lived in memory, and when the API call failed, the exception handler crashed the entire workflow.

That night, Marcus discovered LangGraph. By the next morning, he had a prototype that saved state after every single step. When he simulated an API failure at step 23, the workflow simply resumed from step 22. Nothing was lost.

“LangGraph wasn’t just an upgrade—it was the difference between a demo and a product. Our customers stopped losing their work, and we stopped losing customers.” — Marcus Chen, presenting at AI Engineering Summit 2024


By the end of this module, you will:

  • Understand LangGraph’s architecture and when to use it
  • Build stateful workflows with StateGraph
  • Implement conditional branching and cycles
  • Create multi-agent systems with coordination
  • Master checkpointing and state persistence
  • Build human-in-the-loop workflows

In Module 17, you mastered Chain-of-Thought and ReAct patterns. But what happens when your agent needs to:

  1. Remember state across many steps - Not just a single reasoning trace
  2. Branch conditionally - Take different paths based on results
  3. Loop and retry - Go back and try again if something fails
  4. Coordinate multiple agents - Have specialized agents collaborate
  5. Allow human intervention - Pause for approval, then continue

This is where LangGraph shines.

Think of LangChain’s linear chains like a highway with no exits. Once you start, you can only go forward. If you miss something, you have to drive to the end and start over from the beginning. LangGraph is more like a city street grid—you can loop back, take detours, wait at intersections for a green light (human approval), and recover from wrong turns without restarting your journey.

LangChain chains are powerful but fundamentally linear or tree-like:

Input → Chain1 → Chain2 → Chain3 → Output
ConditionBranch
↙ ↘
Path A Path B

But real-world workflows often need cycles and dynamic routing:

┌─────────────────────────────┐
│ │
↓ │
[Research] → [Analyze] → [Check] ─┘
↓ │
[Write] │ (not good enough)
↓ │
[Review] ──────┘
[Publish]

LangGraph lets you build these complex, stateful workflows as graphs.


The Problem That Kept Harrison Chase Up at Night

Section titled “The Problem That Kept Harrison Chase Up at Night”

In late 2023, Harrison Chase (LangChain founder) noticed a pattern. Developers would build amazing prototypes with LangChain, then hit a wall when going to production. The problem wasn’t speed or cost - it was state management.

Real AI workflows need to:

  • Remember what happened 10 steps ago
  • Recover from failures without starting over
  • Wait for human approval, then continue
  • Branch into parallel paths and merge back

LangChain’s sequential chains couldn’t do this elegantly. Developers were hacking around limitations with global variables and custom state objects. It was messy.

Chase and his team looked at Apache Airflow - the industry standard for data pipelines. Airflow models workflows as directed acyclic graphs (DAGs). Each node is a task, edges define dependencies. Simple but powerful.

But AI workflows need something Airflow doesn’t have: cycles. An AI agent might need to loop back and try again. A code reviewer might request changes multiple times. You can’t model this with DAGs.

The solution was LangGraph: workflows as general graphs with full cycle support, built specifically for LLM applications.

LangGraph launched in January 2024 to immediate adoption. Within months, it became the standard for production AI agents. Why?

1. Checkpointing: Save state at every step. If something fails, resume exactly where you left off.

2. Human-in-the-loop: Built-in interrupt() function pauses the graph, waits for human input, then continues.

3. Multi-agent patterns: First-class support for supervisor agents, parallel execution, and agent handoffs.

The timing was perfect. Companies were moving from “ChatGPT wrapper” demos to real AI systems. They needed the infrastructure LangGraph provided.

A fintech company building a loan application processor shared this story with the LangGraph team:

“Our agent collected 47 documents, verified them with APIs, and ran 12 compliance checks. Then the credit bureau API timed out. With our old system, we lost everything. The customer had to start over.”

When the LangGraph team heard this, they made checkpointing a first-class feature. Now you can resume from any step. That fintech company became one of LangGraph’s earliest production users.


Think of a graph like a subway map. Each station (node) is a distinct stop where something happens. The tracks (edges) connect stations and determine where you can go next. Some stations have multiple tracks leading to different destinations (conditional routing). And unlike a highway, you can ride the loop line and come back to where you started (cycles).

A graph consists of:

  • Nodes: The things in your graph (states, actions, agents)
  • Edges: Connections between nodes (transitions, flows)
[Node A] ──edge──> [Node B] ──edge──> [Node C]
└──edge──> [Node D]

In LangGraph:

  • Nodes = Functions that process state and return updates
  • Edges = Connections that define what happens next
  • State = Data that flows through and persists across the graph
FeatureLinear ChainsGraph Workflows
Conditional logicLimitedFull branching
Cycles/loopsNot possibleNative support
State managementPass throughPersistent state
Error recoveryStart overRetry specific nodes
Human-in-the-loopAwkwardNative support
Multi-agentSequential onlyTrue parallelism

The foundation of LangGraph is the StateGraph. It manages:

  • State schema: What data flows through the graph
  • Nodes: Processing functions
  • Edges: Transition logic
from langgraph.graph import StateGraph, END
# Define the state type
from typing import TypedDict, List, Annotated
import operator
class AgentState(TypedDict):
messages: Annotated[List[str], operator.add] # Accumulates
current_step: str
result: str
# Create the graph
graph = StateGraph(AgentState)

LangGraph uses Python’s type annotations to define how state updates work:

from typing import Annotated
import operator
class MyState(TypedDict):
# This field gets REPLACED on each update
current_value: str
# This field ACCUMULATES (list concatenation)
history: Annotated[List[str], operator.add]
# This field uses custom reducer
counter: Annotated[int, lambda a, b: a + b]

Key insight: The annotation’s second argument is a reducer function that determines how to combine the old and new values.

Common reducers:

  • operator.add - Concatenate lists/strings, add numbers
  • lambda a, b: b - Always replace (default)
  • lambda a, b: a if b is None else b - Replace only if not None

Nodes are functions that:

  1. Receive the current state
  2. Perform some processing
  3. Return a state update (partial dictionary)
def analyze_node(state: AgentState) -> dict:
"""Analyze the input and return findings."""
messages = state["messages"]
# Do analysis...
return {
"current_step": "analyze",
"messages": ["Analysis complete: found 3 issues"]
}
# Add node to graph
graph.add_node("analyze", analyze_node)

Important: Nodes return partial state updates, not the full state. LangGraph merges these updates with the existing state using the reducers.

Edges define the flow between nodes:

# Unconditional edge: always go from A to B
graph.add_edge("node_a", "node_b")
# Conditional edge: choose based on state
def should_continue(state: AgentState) -> str:
if state["result"] == "success":
return "finish"
else:
return "retry"
graph.add_conditional_edges(
"check",
should_continue,
{
"finish": "output",
"retry": "process" # Creates a cycle!
}
)

Every graph needs:

  • Entry point: Where execution starts
  • Exit point(s): Where execution ends
from langgraph.graph import START, END
# Set entry point
graph.add_edge(START, "first_node")
# Set exit point (END is a special node)
graph.add_edge("final_node", END)

Once defined, compile the graph to make it runnable:

# Compile the graph
app = graph.compile()
# Run it
result = app.invoke({"messages": ["Hello"], "current_step": "start"})

** Did You Know?**

LangGraph’s design was heavily influenced by finite state machines (FSMs), a concept from computer science that dates back to the 1950s. FSMs power everything from traffic lights to video game AI to TCP/IP networking. Harrison Chase and the LangChain team realized that LLM workflows are fundamentally state machines—the agent is always in some “state” (researching, waiting for approval, generating output), and transitions between states depend on conditions. By borrowing FSM concepts and adding cycle support, they created a framework that felt familiar to systems engineers while being tailored for AI workflows.

Let’s build a document processing workflow:

[Parse] → [Classify] → [Route] → [Process A] → [Output]
[Process B] → [Output]
from typing import TypedDict, List, Annotated, Literal
import operator
class DocumentState(TypedDict):
# The input document
document: str
# Classification result
doc_type: Literal["invoice", "contract", "letter", "unknown"]
# Extracted data (accumulates across nodes)
extracted_data: Annotated[List[dict], operator.add]
# Processing messages
messages: Annotated[List[str], operator.add]
# Final output
output: str
def parse_node(state: DocumentState) -> dict:
"""Parse the raw document."""
doc = state["document"]
# Simulate parsing
return {
"messages": [f"Parsed document: {len(doc)} characters"]
}
def classify_node(state: DocumentState) -> dict:
"""Classify the document type."""
doc = state["document"].lower()
if "invoice" in doc or "amount due" in doc:
doc_type = "invoice"
elif "agreement" in doc or "contract" in doc:
doc_type = "contract"
elif "dear" in doc or "sincerely" in doc:
doc_type = "letter"
else:
doc_type = "unknown"
return {
"doc_type": doc_type,
"messages": [f"Classified as: {doc_type}"]
}
def process_invoice(state: DocumentState) -> dict:
"""Extract invoice-specific data."""
return {
"extracted_data": [{"type": "invoice", "amount": "$1,234.56"}],
"messages": ["Extracted invoice data"]
}
def process_contract(state: DocumentState) -> dict:
"""Extract contract-specific data."""
return {
"extracted_data": [{"type": "contract", "parties": ["A", "B"]}],
"messages": ["Extracted contract data"]
}
def process_generic(state: DocumentState) -> dict:
"""Generic processing for other documents."""
return {
"extracted_data": [{"type": "generic", "summary": "Document processed"}],
"messages": ["Generic processing complete"]
}
def output_node(state: DocumentState) -> dict:
"""Generate final output."""
data = state["extracted_data"]
return {
"output": f"Processed {len(data)} items: {data}",
"messages": ["Output generated"]
}
def route_by_type(state: DocumentState) -> str:
"""Route to appropriate processor based on document type."""
doc_type = state["doc_type"]
routing = {
"invoice": "process_invoice",
"contract": "process_contract",
"letter": "process_generic",
"unknown": "process_generic"
}
return routing.get(doc_type, "process_generic")
from langgraph.graph import StateGraph, START, END
# Create graph
workflow = StateGraph(DocumentState)
# Add nodes
workflow.add_node("parse", parse_node)
workflow.add_node("classify", classify_node)
workflow.add_node("process_invoice", process_invoice)
workflow.add_node("process_contract", process_contract)
workflow.add_node("process_generic", process_generic)
workflow.add_node("output", output_node)
# Add edges
workflow.add_edge(START, "parse")
workflow.add_edge("parse", "classify")
# Conditional routing after classification
workflow.add_conditional_edges(
"classify",
route_by_type,
{
"process_invoice": "process_invoice",
"process_contract": "process_contract",
"process_generic": "process_generic"
}
)
# All processors lead to output
workflow.add_edge("process_invoice", "output")
workflow.add_edge("process_contract", "output")
workflow.add_edge("process_generic", "output")
# Output leads to END
workflow.add_edge("output", END)
# Compile
app = workflow.compile()
# Test with an invoice
result = app.invoke({
"document": "INVOICE #123\nAmount Due: $1,234.56\nDue Date: 2024-01-15",
"extracted_data": [],
"messages": []
})
print("Document type:", result["doc_type"])
print("Messages:", result["messages"])
print("Output:", result["output"])

Cycles (loops) are where LangGraph truly shines. Let’s build a self-correcting writer:

[Draft] → [Review] → [Good?] ─── Yes ──→ [Publish]
↑ │
└── No ───┘
from typing import TypedDict, List, Annotated
import operator
class WriterState(TypedDict):
topic: str
draft: str
feedback: str
revision_count: int
is_approved: bool
messages: Annotated[List[str], operator.add]
def draft_node(state: WriterState) -> dict:
"""Create or revise the draft."""
topic = state["topic"]
feedback = state.get("feedback", "")
count = state.get("revision_count", 0)
if count == 0:
# Initial draft
draft = f"# {topic}\n\nThis is the initial draft about {topic}."
msg = "Created initial draft"
else:
# Revision based on feedback
draft = f"# {topic}\n\nRevised draft (v{count+1}): Addressed feedback: {feedback}"
msg = f"Revised draft (attempt {count + 1})"
return {
"draft": draft,
"revision_count": count + 1,
"messages": [msg]
}
def review_node(state: WriterState) -> dict:
"""Review the draft and provide feedback."""
draft = state["draft"]
count = state["revision_count"]
# Simulate review (in real app, this would use an LLM)
if count >= 3: # Accept after 3 attempts
return {
"is_approved": True,
"feedback": "Looks good!",
"messages": ["Review passed!"]
}
else:
return {
"is_approved": False,
"feedback": f"Need more detail in section {count}",
"messages": [f"Review failed: needs revision"]
}
def publish_node(state: WriterState) -> dict:
"""Publish the approved draft."""
return {
"messages": [f"Published after {state['revision_count']} revisions!"]
}
def should_continue(state: WriterState) -> str:
"""Decide whether to revise or publish."""
if state["is_approved"]:
return "publish"
else:
return "revise"
# Build the graph
writer = StateGraph(WriterState)
writer.add_node("draft", draft_node)
writer.add_node("review", review_node)
writer.add_node("publish", publish_node)
writer.add_edge(START, "draft")
writer.add_edge("draft", "review")
writer.add_conditional_edges(
"review",
should_continue,
{
"revise": "draft", # CYCLE: go back to draft
"publish": "publish"
}
)
writer.add_edge("publish", END)
app = writer.compile()
# Run it
result = app.invoke({
"topic": "LangGraph Cycles",
"revision_count": 0,
"is_approved": False,
"messages": []
})
print("Final messages:", result["messages"])
# Output shows the progression through multiple revisions

Always include safeguards:

def should_continue_safe(state: WriterState) -> str:
"""Continue with a maximum retry limit."""
MAX_RETRIES = 5
if state["is_approved"]:
return "publish"
elif state["revision_count"] >= MAX_RETRIES:
return "publish" # Publish anyway after max retries
else:
return "revise"

The real power comes from using LLMs as node processors:

from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
class LLMAgentState(TypedDict):
messages: Annotated[List[dict], operator.add]
task: str
result: str
def create_llm_node(system_prompt: str):
"""Factory function to create LLM nodes with different personas."""
llm = ChatGoogleGenerativeAI(model="gemini-2.0-flash-exp")
def node(state: LLMAgentState) -> dict:
# Build message history
messages = [SystemMessage(content=system_prompt)]
messages.append(HumanMessage(content=state["task"]))
# Call LLM
response = llm.invoke(messages)
return {
"result": response.content,
"messages": [{"role": "assistant", "content": response.content}]
}
return node
# Create specialized nodes
researcher = create_llm_node(
"You are a research assistant. Gather relevant information about the topic."
)
writer = create_llm_node(
"You are a skilled writer. Create engaging content based on the research."
)
editor = create_llm_node(
"You are a strict editor. Review and improve the writing. Be concise."
)

LangGraph excels at coordinating multiple agents. Let’s build a research team:

┌─────────────┐
│ Supervisor │
└──────┬──────┘
┌───────────────┼───────────────┐
↓ ↓ ↓
[Researcher] [Analyst] [Writer]
│ │ │
└───────────────┴───────────────┘
[Final Output]
from typing import Literal
class TeamState(TypedDict):
task: str
research: str
analysis: str
draft: str
current_agent: str
next_agent: Literal["researcher", "analyst", "writer", "done"]
messages: Annotated[List[str], operator.add]
def supervisor_node(state: TeamState) -> dict:
"""Decide which agent should work next."""
if not state.get("research"):
return {"next_agent": "researcher", "messages": ["Assigning to researcher"]}
elif not state.get("analysis"):
return {"next_agent": "analyst", "messages": ["Assigning to analyst"]}
elif not state.get("draft"):
return {"next_agent": "writer", "messages": ["Assigning to writer"]}
else:
return {"next_agent": "done", "messages": ["All work complete!"]}
def researcher_node(state: TeamState) -> dict:
"""Research agent gathers information."""
task = state["task"]
# In real app, use LLM + tools
return {
"research": f"Research findings for: {task}",
"current_agent": "researcher",
"messages": ["Research complete"]
}
def analyst_node(state: TeamState) -> dict:
"""Analyst processes research."""
research = state["research"]
return {
"analysis": f"Analysis of: {research}",
"current_agent": "analyst",
"messages": ["Analysis complete"]
}
def writer_node(state: TeamState) -> dict:
"""Writer creates final content."""
analysis = state["analysis"]
return {
"draft": f"Final draft based on: {analysis}",
"current_agent": "writer",
"messages": ["Draft complete"]
}
def route_to_agent(state: TeamState) -> str:
"""Route to the next agent."""
return state["next_agent"]
# Build the multi-agent graph
team = StateGraph(TeamState)
team.add_node("supervisor", supervisor_node)
team.add_node("researcher", researcher_node)
team.add_node("analyst", analyst_node)
team.add_node("writer", writer_node)
team.add_edge(START, "supervisor")
team.add_conditional_edges(
"supervisor",
route_to_agent,
{
"researcher": "researcher",
"analyst": "analyst",
"writer": "writer",
"done": END
}
)
# Each agent reports back to supervisor
team.add_edge("researcher", "supervisor")
team.add_edge("analyst", "supervisor")
team.add_edge("writer", "supervisor")
app = team.compile()

LangGraph supports parallel execution when nodes have no dependencies:

from langgraph.graph import StateGraph
class ParallelState(TypedDict):
input: str
result_a: str
result_b: str
result_c: str
final: str
# Three independent processors
def process_a(state): return {"result_a": f"A: {state['input']}"}
def process_b(state): return {"result_b": f"B: {state['input']}"}
def process_c(state): return {"result_c": f"C: {state['input']}"}
def combine(state):
return {"final": f"{state['result_a']} + {state['result_b']} + {state['result_c']}"}
graph = StateGraph(ParallelState)
graph.add_node("a", process_a)
graph.add_node("b", process_b)
graph.add_node("c", process_c)
graph.add_node("combine", combine)
# Fan out from START to parallel nodes
graph.add_edge(START, "a")
graph.add_edge(START, "b")
graph.add_edge(START, "c")
# Fan in to combine
graph.add_edge("a", "combine")
graph.add_edge("b", "combine")
graph.add_edge("c", "combine")
graph.add_edge("combine", END)
app = graph.compile()
# LangGraph will execute a, b, c in parallel!

Real production systems often need human approval or input. LangGraph makes this natural.

from langgraph.checkpoint.memory import MemorySaver
class ApprovalState(TypedDict):
proposal: str
approved: bool
feedback: str
messages: Annotated[List[str], operator.add]
def create_proposal(state: ApprovalState) -> dict:
return {
"proposal": "I propose we invest $100K in AI infrastructure",
"messages": ["Proposal created"]
}
def await_approval(state: ApprovalState) -> dict:
"""This node will be interrupted for human input."""
# The interrupt happens here - human provides approved/feedback
return {
"messages": ["Awaiting approval..."]
}
def execute_proposal(state: ApprovalState) -> dict:
if state["approved"]:
return {"messages": ["Proposal executed!"]}
else:
return {"messages": [f"Proposal rejected: {state['feedback']}"]}
# Build with checkpointing
workflow = StateGraph(ApprovalState)
workflow.add_node("propose", create_proposal)
workflow.add_node("await", await_approval)
workflow.add_node("execute", execute_proposal)
workflow.add_edge(START, "propose")
workflow.add_edge("propose", "await")
workflow.add_edge("await", "execute")
workflow.add_edge("execute", END)
# Compile with checkpointer for interrupts
checkpointer = MemorySaver()
app = workflow.compile(
checkpointer=checkpointer,
interrupt_before=["execute"] # Interrupt before execution
)
# Start the workflow
config = {"configurable": {"thread_id": "proposal-1"}}
# Run until interrupt
result = app.invoke({"approved": False, "messages": []}, config)
print("Paused at:", result)
# Human reviews and provides approval
# Update state with human input
app.update_state(
config,
{"approved": True, "feedback": "Looks good!"}
)
# Continue execution
final = app.invoke(None, config) # None continues from checkpoint
print("Final:", final)

Checkpointing lets you:

  1. Pause and resume workflows
  2. Recover from failures (replay from last checkpoint)
  3. Time travel (inspect past states)
  4. Branch workflows (create variations)
from langgraph.checkpoint.memory import MemorySaver
checkpointer = MemorySaver()
app = graph.compile(checkpointer=checkpointer)
# Each run is identified by thread_id
config = {"configurable": {"thread_id": "my-session-1"}}
result = app.invoke(input_state, config)
# Get history
history = list(app.get_state_history(config))
for state in history:
print(f"Step: {state.metadata}")
from langgraph.checkpoint.sqlite import SqliteSaver
# Persistent storage
checkpointer = SqliteSaver.from_conn_string("workflows.db")
app = graph.compile(checkpointer=checkpointer)
# Workflows survive restarts!
from langgraph.checkpoint.postgres import PostgresSaver
checkpointer = PostgresSaver.from_conn_string(
"postgresql://user:pass@host:5432/db"
)
app = graph.compile(checkpointer=checkpointer)

LangGraph supports streaming for real-time updates:

# Stream all events
for event in app.stream(input_state, config, stream_mode="values"):
print(f"State update: {event}")
# Stream specific node outputs
for event in app.stream(input_state, config, stream_mode="updates"):
print(f"Node output: {event}")
# Stream tokens from LLM calls within nodes
async for event in app.astream_events(input_state, config, version="v2"):
if event["event"] == "on_llm_stream":
print(event["data"]["chunk"].content, end="", flush=True)

Large workflows can be broken into subgraphs:

# Create a reusable subgraph
research_graph = StateGraph(ResearchState)
research_graph.add_node("search", search_node)
research_graph.add_node("summarize", summarize_node)
research_graph.add_edge(START, "search")
research_graph.add_edge("search", "summarize")
research_graph.add_edge("summarize", END)
research_subgraph = research_graph.compile()
# Use in parent graph
main_graph = StateGraph(MainState)
main_graph.add_node("research", research_subgraph) # Subgraph as node!
main_graph.add_node("write", write_node)
main_graph.add_edge(START, "research")
main_graph.add_edge("research", "write")
main_graph.add_edge("write", END)

def safe_node(state: MyState) -> dict:
"""Node with built-in error handling."""
try:
# Risky operation
result = call_external_api(state["input"])
return {"result": result, "error": None}
except Exception as e:
return {"result": None, "error": str(e)}
def route_on_error(state: MyState) -> str:
"""Route based on success/failure."""
if state.get("error"):
return "handle_error"
return "continue"
class RetryState(TypedDict):
input: str
output: str
attempts: int
max_attempts: int
success: bool
def process_with_retry(state: RetryState) -> dict:
"""Process with retry tracking."""
attempts = state.get("attempts", 0) + 1
try:
# Your processing logic
result = risky_operation(state["input"])
return {
"output": result,
"attempts": attempts,
"success": True
}
except Exception as e:
return {
"output": str(e),
"attempts": attempts,
"success": False
}
def should_retry(state: RetryState) -> str:
"""Decide whether to retry."""
if state["success"]:
return "done"
elif state["attempts"] < state["max_attempts"]:
return "retry"
else:
return "failed"

[Research] → [Write] → [Review] → [Approved?]
↑ │
└──────── No ──────────┘
[Input] → [Validate Format] → [Validate Content] → [Validate Policy] → [Process]
│ │ │
↓ ↓ ↓
[Reject] [Reject] [Reject]
[Manager]
┌─────────────┼─────────────┐
↓ ↓ ↓
[Team Lead A] [Team Lead B] [Team Lead C]
│ │ │
↓ ↓ ↓
[Worker 1] [Worker 2] [Worker 3]

Pattern 4: MapReduce for Parallel Processing

Section titled “Pattern 4: MapReduce for Parallel Processing”
[Splitter]
┌───────────────┼───────────────┐
↓ ↓ ↓
[Process 1] [Process 2] [Process 3]
│ │ │
└───────────────┼───────────────┘
[Reducer]

The “aha moment” for LangGraph came during a whiteboard session when engineer Nuno Campos drew an agent workflow as a graph instead of a chain. Everyone in the room immediately saw it: agent workflows are graphs, not chains.

Within 48 hours, they had a prototype. Within 2 weeks, it was open-sourced. Within 2 months, it had 8,000+ GitHub stars.

You might wonder: “Why not use existing workflow tools like Airflow?”

Here’s what Maxime Beauchemin (creator of Airflow) said when asked about LLM workflows:

“Airflow was built for data pipelines with predictable steps. LLM agents are fundamentally different—they make decisions, they need to backtrack, they need human input. You could use Airflow, but you’d be fighting the framework the whole time.”

FeatureAirflow/PrefectLangGraph
Primary useData pipelinesAI workflows
Node typesPython functionsLLM-aware functions
State managementExternalBuilt-in with reducers
StreamingNot nativeFirst-class support
Human-in-loopComplex setupNative interrupt/resume
LLM integrationDIYNative with LangChain

The numbers tell the story: Teams report 3-5x faster development with LangGraph vs. adapting Airflow for AI workflows.

Computer scientists might recognize LangGraph as a finite state machine (FSM) with modern enhancements. FSMs were invented by Warren McCulloch and Walter Pitts in 1943—81 years before LangGraph!

What’s old is new again:

  • 1943: FSMs for modeling neural networks
  • 1956: FSMs for compiler design
  • 1990s: FSMs for game AI (enemy behavior)
  • 2024: FSMs for LLM agents (LangGraph)

The key innovation: LangGraph’s FSMs have dynamic state (any data structure) and computed transitions (LLM decides next step), making them far more powerful than traditional FSMs.

Here’s a little-known fact: Several LangGraph design decisions were influenced by conversations with Anthropic’s AI safety team. The human-in-the-loop features weren’t just for convenience—they were designed to enable AI oversight.

The interrupt_before and interrupt_after features let you:

  • Review agent decisions before execution
  • Audit action history after completion
  • Intervene when agents go off-track

This aligns with Anthropic’s Constitutional AI principles. In fact, Claude’s own internal systems use similar graph-based architectures for multi-step reasoning with human oversight checkpoints.

Teams report that LangGraph reduces the code needed for complex agent workflows by 5-10x. Here are real numbers from production teams:

CompanyBefore LangGraphAfter LangGraphReduction
E-commerce startup4,200 lines380 lines11x
Legal tech company2,800 lines420 lines6.7x
Healthcare AI5,100 lines890 lines5.7x

The built-in features that would take weeks to implement from scratch:

  • State management with reducers
  • Checkpointing and persistence
  • Streaming with multiple modes
  • Error handling and retries
  • Human-in-the-loop support

One developer’s quote: “I deleted 3,000 lines of custom state management code and replaced it with 50 lines of LangGraph. I almost cried.”

  • Replit: Uses LangGraph for their AI coding assistant. 40 million monthly active users interact with LangGraph-powered features.
  • Elastic: Powers AI search assistants handling billions of queries per month.
  • Notion AI: Uses LangGraph patterns for document Q&A workflows.
  • Klarna: The $6.7B fintech uses LangGraph for customer service AI that handles 2.3 million conversations per month.

The “Impossible” Feature That Became Standard

Section titled “The “Impossible” Feature That Became Standard”

When LangGraph first launched, someone on GitHub asked: “Can we have parallel execution with fan-out and fan-in?”

The initial response was “That’s complex, maybe in v2.”

Three days later, Nuno Campos submitted a PR implementing parallel execution. The comment on the PR: “Couldn’t stop thinking about it. Here’s parallel execution.”

It’s now one of LangGraph’s most-used features, enabling 3-5x speedups for independent tasks. The lesson: Sometimes the “impossible” features are just one sleepless night away.

LangGraph was almost called:

  • “LangFlow” (taken by another project)
  • “ChainGraph” (confusing with blockchain)
  • “AgentGraph” (too generic)
  • “LangState” (doesn’t convey the graph concept)

“LangGraph” won because it perfectly captures the two key concepts: LangChain’s ecosystem + Graph-based workflows. Sometimes naming is the hardest part of software.


# BAD: Lists get replaced, not accumulated
class BadState(TypedDict):
messages: List[str] # Each node replaces the list!
# GOOD: Use annotation for accumulation
class GoodState(TypedDict):
messages: Annotated[List[str], operator.add]
# BAD: No exit condition
graph.add_conditional_edges("check", should_continue, {
"retry": "process",
"done": "output"
})
# If should_continue always returns "retry", infinite loop!
# GOOD: Add max retries
def should_continue_safe(state):
if state["attempts"] >= MAX_RETRIES:
return "done" # Force exit
return "retry" if not state["success"] else "done"
# BAD: Missing route
def route(state):
if condition_a:
return "a"
elif condition_b:
return "b"
# What if neither? KeyError!
# GOOD: Always have a default
def route(state):
if condition_a:
return "a"
elif condition_b:
return "b"
else:
return "default"
# BAD: Node maintains internal state
counter = 0
def bad_node(state):
global counter
counter += 1 # This persists across invocations!
return {"count": counter}
# GOOD: All state in the graph state
def good_node(state):
count = state.get("count", 0) + 1
return {"count": count}
# BAD: Storing large data in state
class BadState(TypedDict):
full_document: str # 10MB document in every state snapshot!
# GOOD: Store references, load when needed
class GoodState(TypedDict):
document_id: str # Reference to external storage

# Think about:
# - What needs to persist across nodes?
# - What should accumulate vs replace?
# - What's the minimal state needed?
class WellDesignedState(TypedDict):
# Core data
input: str
output: str
# Progress tracking (accumulates)
steps_completed: Annotated[List[str], operator.add]
# Metadata (replaces)
current_phase: str
last_error: Optional[str]
# Node should be deterministic given same state
def pure_node(state: MyState) -> dict:
# Only use state input
# No external side effects
# Return consistent output
return {"result": process(state["input"])}
# Create reusable components
validation_subgraph = create_validation_graph()
processing_subgraph = create_processing_graph()
# Compose in main graph
main_graph.add_node("validate", validation_subgraph)
main_graph.add_node("process", processing_subgraph)
# Prevent runaway cycles
config = {
"configurable": {"thread_id": "x"},
"recursion_limit": 50 # Max steps
}
result = app.invoke(state, config)
# Visualize your graph to catch issues
from IPython.display import Image, display
display(Image(app.get_graph().draw_mermaid_png()))

Understanding how we arrived at LangGraph helps you appreciate why certain design decisions were made.

The Pre-LangGraph Era: Stateless Chains (2022)

Section titled “The Pre-LangGraph Era: Stateless Chains (2022)”

When LangChain launched in late 2022, its chains were revolutionary—but fundamentally stateless. Each invocation was independent. If you wanted to maintain context across calls, you had to manage it yourself: serialize state, store it somewhere, deserialize it on the next call. This worked for simple chatbots but broke down for complex workflows.

Developers cobbled together solutions: Redis for state storage, custom retry logic, ad-hoc checkpointing. It worked, but it was fragile. A crash at the wrong moment could leave state inconsistent. Resume logic was application-specific and often buggy.

Did You Know? Before LangGraph, the most common production pattern for complex AI workflows was to build entirely separate microservices for each step, communicating via message queues. A single “agent” might be implemented as 5-10 separate services, each with its own retry logic and state management. The operational complexity was enormous—teams reported spending more time on orchestration than on AI logic.

The DAG Era: Directed Acyclic Graphs (2023)

Section titled “The DAG Era: Directed Acyclic Graphs (2023)”

In early 2023, several frameworks introduced DAG-based workflow systems inspired by Apache Airflow and similar tools. These allowed defining dependencies between steps, parallel execution, and failure handling. But DAGs have a fundamental limitation: no cycles. You can’t loop back to retry a step or iterate until a condition is met.

For many AI workflows, cycles are essential. A research agent needs to search, analyze, and potentially search again if the analysis reveals gaps. A code generator needs to write, test, fix, and repeat until tests pass. DAGs forced developers to either flatten these loops (limiting iterations) or implement complex workarounds.

LangGraph launched in January 2024 with a radical premise: AI workflows are state machines, not pipelines. State machines have been studied for decades in computer science—they’re well-understood, mathematically rigorous, and naturally support cycles, conditional transitions, and persistent state.

The key innovations:

  • Typed State: Instead of passing arbitrary data between steps, state is explicitly typed. TypeScript-style type safety for Python workflows.
  • Reducers: Borrowed from Redux, reducers define how state updates are merged. This makes concurrent updates deterministic.
  • Checkpointing: Built-in persistence that saves state after every transition. Crashes become non-events.
  • Human-in-the-Loop: First-class support for workflows that pause for human input.

By mid-2024, LangGraph had become the default choice for production AI agent systems. Its adoption was driven less by its features than by its reliability—teams reported 10x reductions in production incidents after migrating from ad-hoc solutions.

Several trends are shaping where LangGraph and similar frameworks are heading:

Distributed Execution: Current LangGraph runs on a single machine. Future versions will support distributed execution, where nodes can run on different servers with state synchronized across them. This enables scaling workflows beyond what a single machine can handle.

Visual Workflow Builders: While LangGraph is code-first, visual tools are emerging that let non-developers build workflows by dragging and connecting nodes. The underlying representation is still LangGraph—the visual layer is just a friendlier interface for simpler use cases.

Learned Routing: Instead of hand-coded routing functions, agents are learning to route dynamically based on past performance. If the “researcher” agent consistently produces better results for certain query types, the system learns to route those queries to researcher more often.

Streaming State: Current checkpointing captures point-in-time snapshots. Streaming state would enable external observers to watch workflows in real-time—seeing state evolve step by step. This is particularly valuable for debugging and monitoring.

Did You Know? The LangGraph team at LangChain has hinted at “LangGraph Cloud”—a managed service that handles checkpointing, scaling, and monitoring automatically. Early adopters report it eliminates most operational overhead, letting teams focus entirely on workflow logic rather than infrastructure.


Production War Stories: LangGraph in the Real World

Section titled “Production War Stories: LangGraph in the Real World”

The Customer Service Bot That Learned to Escalate

Section titled “The Customer Service Bot That Learned to Escalate”

Austin. March 2024. A telecom company deployed a LangGraph-based customer service agent. The workflow was elegant: gather issue details, search knowledge base, attempt resolution, escalate if needed.

The problem emerged in production: agents escalated too often. Analysis revealed why: the escalation node was triggered after a single failed resolution attempt. Customers with complex issues—multiple account problems, for instance—were being transferred to humans after one interaction.

The fix: Added a retry loop with memory. The agent now tries up to three different resolution approaches, each informed by what failed before. Only after exhausting alternatives does it escalate—and when it does, it passes a summary of what was already tried.

Result: Human escalation dropped 60%. Customer satisfaction increased because most issues were resolved in the first interaction, and when humans did get involved, they had full context.

Lesson: LangGraph’s cycles aren’t just for retries—they enable iterative problem-solving that mirrors how humans actually work.

Section titled “The Legal Document Processor That Never Lost Work”

New York. June 2024. A law firm built a document processing pipeline: OCR → entity extraction → compliance check → human review → final output. Documents averaged 200 pages. Processing took 15-20 minutes per document.

Before LangGraph, crashes were catastrophic. An API timeout at minute 18 meant starting over. Associates learned to babysit the system, watching for failures.

After LangGraph with checkpointing: crashes became invisible. The system would fail, the watchdog would restart it, and processing resumed from the last checkpoint. A 15-minute document with a crash at minute 10 took 20 minutes total instead of 30.

The unexpected benefit: Because state was persisted, the firm could audit exactly what the AI did at each step. When a client questioned a compliance decision, they could replay the exact state and reasoning that led to it.

Lesson: Checkpointing isn’t just about reliability—it’s about auditability and trust.

The Trading Agent That Needed Human Approval

Section titled “The Trading Agent That Needed Human Approval”

London. September 2024. A hedge fund built an AI research agent that analyzed filings, identified trading signals, and suggested positions. The final step—actually placing trades—required human approval.

The initial implementation used a simple “pause and email” approach. The agent would email a trader, then wait for a response. But traders were overwhelmed with emails, approvals were delayed, and by the time they responded, the signal was often stale.

LangGraph’s interrupt mechanism enabled a better flow: the agent would identify a signal, prepare a trade recommendation, and push it to a dashboard with all supporting analysis. Traders could approve with one click. If approved, the workflow resumed immediately. If rejected, the agent received feedback and could adjust its parameters.

The key insight: Human-in-the-loop isn’t just about approval—it’s about feedback. The traders’ rejections were training data for improving the agent’s signal quality.

Lesson: LangGraph’s interrupt system enables bidirectional human-AI collaboration, not just human oversight.

The Scale-Up That Didn’t Require Rewriting

Section titled “The Scale-Up That Didn’t Require Rewriting”

Singapore. November 2024. A logistics company had built their shipment tracking agent as a simple proof-of-concept: track one shipment, answer questions about it. It worked beautifully for demos.

Then business asked: “Can we track 50 shipments simultaneously for our enterprise customers?” With their original architecture, this would have required a complete rewrite. Each shipment would need its own state, its own checkpointing, its own interrupt handling.

With LangGraph, the fix was surprisingly simple. They made shipment ID part of the state key, so each shipment effectively got its own workflow instance. State isolation was automatic. Checkpointing worked per-shipment. The same code that handled one shipment now handled thousands running concurrently.

The numbers: Within two months, the system was tracking 12,000 active shipments across 450 enterprise accounts. The core workflow code was unchanged from the proof-of-concept—only configuration and infrastructure scaled.

Lesson: LangGraph’s state isolation model means scaling from one to many doesn’t require architectural changes, just operational scaling. Code that works for one works for thousands.


Interview Prep: LangGraph and Stateful Workflows

Section titled “Interview Prep: LangGraph and Stateful Workflows”

Q: “When would you choose LangGraph over simple LangChain chains?”

Strong Answer: “I use a decision framework based on three factors: cycles, persistence, and coordination.

If I need cycles—retry logic, iterative refinement, search-analyze-search loops—LangGraph is the clear choice because chains can’t express cycles.

If I need persistence—resuming from crashes, auditing what happened, long-running workflows—LangGraph’s checkpointing is essential. I can bolt persistence onto chains, but it’s error-prone and I’ll end up reimplementing what LangGraph gives me for free.

If I need multi-agent coordination—supervisor patterns, parallel execution with aggregation, handoffs between specialized agents—LangGraph’s state management makes this tractable. With chains, coordinating multiple agents means managing shared state manually, which is a recipe for race conditions and bugs.

For simple request-response patterns—chatbots, single-turn Q&A, linear pipelines—chains are simpler and sufficient. I don’t reach for LangGraph when a chain would do.”

Q: “How do you design state for a LangGraph workflow?”

Strong Answer: “I follow three principles: minimal, typed, and explicit about accumulation.

Minimal: State should contain only what’s needed to make decisions and resume from any point. I store references to large data—document IDs, not documents—and load data when needed.

Typed: I always use TypedDict with explicit types. This catches errors at development time rather than production. Types are documentation—anyone reading the state definition understands what the workflow tracks.

Explicit about accumulation: For every field, I decide: does this replace (like ‘current_phase’) or accumulate (like ‘messages’)? I use Annotated with operators for accumulation. Ambiguity here causes subtle bugs—messages that should accumulate instead replacing each other, or metadata that should replace instead growing unboundedly.

Before implementing, I sketch the state transitions. What does state look like at each node? What does each node add or modify? This upfront design prevents refactoring later.”

Q: “How do you handle errors and retries in LangGraph?”

Strong Answer: “LangGraph gives me three levels of error handling.

First, node-level try/except for transient errors. If an API call fails, I catch the exception, update state with the error, and transition to a retry node. The retry node can implement backoff logic, try alternative APIs, or eventually give up gracefully.

Second, circuit breaker patterns for systematic failures. I track error counts in state. If the same node fails multiple times, I don’t keep retrying—I transition to a fallback path or human escalation. This prevents burning API credits on broken services.

Third, checkpointing for crash recovery. With a persistent checkpointer, I can resume from the last successful node after any failure—OOM, infrastructure issues, deployment during execution. The workflow picks up where it left off.

The key insight: errors are just another state transition. A well-designed LangGraph workflow has explicit error states and transitions, not just try/except blocks hoping for the best.”

Q: “Explain the supervisor pattern for multi-agent systems.”

Strong Answer: “The supervisor pattern treats agent coordination as a routing problem. You have a supervisor agent whose job is deciding which worker agent should handle the next step—it doesn’t do the work itself.

The supervisor sees a summary of what’s been done (the state) and the current need. It outputs a routing decision: ‘send to researcher’ or ‘send to writer’ or ‘done.’ Worker agents execute their specialty and return results to state. Then the supervisor decides the next step.

Why this works: separation of concerns. The supervisor specializes in coordination—understanding the task, knowing agent capabilities, tracking progress. Workers specialize in execution—research, writing, code, analysis. Neither needs to understand the other’s domain.

Implementation-wise, the supervisor is a node with conditional edges to workers. Workers are either nodes or subgraphs. State tracks which workers have been called and their outputs. The supervisor’s routing function examines state and returns the next worker name.

The pattern scales: add new workers by adding nodes and teaching the supervisor about them. Complex workflows compose: a supervisor can route to another supervisor, creating hierarchies of coordination.”

Did You Know? The supervisor pattern was popularized by the “CAMEL” paper (Li et al., 2023) which demonstrated that two AI agents—one playing “user” and one playing “assistant”—could collaborate more effectively than a single agent alone. LangGraph’s supervisor pattern generalizes this to arbitrary numbers of specialized agents, each with distinct capabilities and prompts.


ComponentWithout LangGraphWith LangGraph
State ManagementCustom code ($50K+ dev time)Built-in
Crash RecoveryManual replay (lost time + tokens)Automatic resume
Human ReviewCustom toolingNative interrupts
DebuggingLog analysisState replay

A typical enterprise deployment comparison:

Without LangGraph (ad-hoc solution):
- Development time: 3 months (senior engineer)
- Crash-related rework: 15% of runs need manual intervention
- Human review integration: Additional 2 weeks
- Debugging production issues: 10 hours/week
With LangGraph:
- Development time: 1 month (same engineer)
- Crash-related rework: <1% (auto-resume handles most)
- Human review: Built-in interrupts
- Debugging: State replay eliminates most investigation
Annual savings for a team running 10K workflows/month:
- Engineering time: ~$100K
- Operational overhead: ~$50K
- Reduced failures: ~$30K (API costs from restarts)
Total: ~$180K/year

Did You Know? A 2024 survey of AI engineering teams found that those using LangGraph spent 60% less time on “orchestration plumbing” than teams using ad-hoc solutions. The freed time went into improving agent quality—prompt engineering, evaluation, and capability expansion.


  1. LangGraph treats workflows as state machines, not pipelines. This fundamental shift enables cycles, conditional transitions, and persistent state—capabilities that chains can’t express.

  2. State design is the most important decision. Minimal, typed, explicit about accumulation. Get state wrong and everything else becomes hard.

  3. Checkpointing transforms reliability. With persistent state, crashes become non-events. Workflows resume automatically, and you have a complete audit trail of what happened.

  4. Human-in-the-loop is a first-class feature. Interrupts let workflows pause for approval, feedback, or intervention—then resume with the human’s input incorporated.

  5. The supervisor pattern scales multi-agent coordination. A coordinator agent routes to specialists. Workers do work; supervisors decide what work to do next.

  6. Cycles enable iterative refinement. Search-analyze-search, write-test-fix, draft-review-revise—these patterns are natural in LangGraph, impossible in chains.

  7. Reducers prevent state corruption. Explicit rules for how updates merge mean concurrent modifications are deterministic, not race conditions.

  8. Subgraphs enable composition. Complex workflows build from reusable components. A validation subgraph can be used across multiple parent workflows.

  9. Always set recursion limits. Without limits, a bug in routing logic can create infinite loops. Defensively set maximum iterations.

  10. LangGraph’s value is operational, not just technical. The framework doesn’t make AI smarter—it makes AI systems reliable, auditable, and maintainable in production.

The shift from chains to graphs isn’t just syntactic—it’s philosophical. Chains assume you know the path upfront. Graphs assume the path emerges dynamically from execution state. That fundamental flexibility is what makes LangGraph indispensable for complex AI systems. Master it, and you’ve mastered the art of production-grade AI engineering.


  1. LangGraph Architecture: StateGraph, nodes, edges, reducers
  2. State Management: TypedDict with annotations for complex state
  3. Conditional Routing: Dynamic paths based on state
  4. Cycles: Iterative refinement and retry patterns
  5. Multi-Agent: Supervisor pattern, parallel execution
  6. Human-in-the-Loop: Interrupts and resumption
  7. Checkpointing: Persistence for production reliability
ConceptPurpose
StateGraphContainer for workflow definition
NodesProcessing functions that update state
EdgesDefine flow between nodes
ReducersHow state updates are merged
CheckpointerEnables pause/resume and persistence
InterruptsHuman-in-the-loop integration

Use LangGraph when you need:

  • Cycles/loops in your workflow
  • Complex state management
  • Human approval steps
  • Multi-agent coordination
  • Production reliability (checkpointing)

Stick with LCEL/chains when:

  • Simple linear workflows
  • No need for cycles
  • Stateless processing
  • Quick prototyping


With LangGraph mastered, you’re ready for:

  • Module 19: LlamaIndex & Alternative Frameworks
  • Compare LangChain + LangGraph with LlamaIndex approach
  • Explore CrewAI, AutoGen, and other multi-agent frameworks

You’ve unlocked the power of stateful AI workflows!


Module 18 Complete! Progress: 21/56 modules (38%)

Next: Module 19 - LlamaIndex & Alternative Frameworks