LangGraph for Agents
AI/ML Engineering Track | Complexity: [COMPLEX] | Time: 6-8
Or: Making AI Think Out Loud (And Why It Actually Works)
Reading Time: 5-6 hours | Prerequisites: Module 16 | Heureka Moment: Five words that tripled AI’s reasoning ability
What You’ll Be Able to Do
By the end of this module, you will:
- Master Chain-of-Thought (CoT) - Make LLMs “think out loud” for better reasoning
- Implement Zero-shot CoT - The magic of “Let’s think step by step”
- Build Few-shot CoT systems - Guide reasoning with examples
- Understand ReAct - Combine reasoning with action for agents
- Apply self-consistency - Multiple reasoning paths for robust answers
- Know the limitations - When CoT helps and when it doesn’t
The Five Words That Changed AI
Tokyo. January 2022. 3:15 AM.
Jason Wei stared at his screen, exhausted but excited. He had just run the same math problem through Google’s PaLM model with one tiny change: instead of asking for the answer directly, he added five words to the prompt.
“Let’s think step by step.”
The model that had been getting 17% accuracy on grade-school math problems suddenly jumped to 79%. Not from better training data. Not from a bigger model. Just five words.
Wei had stumbled onto something profound: LLMs aren’t bad at reasoning—they’re bad at showing their work. Force them to externalize their thinking, and accuracy explodes.
“We found that simply prompting the model to ‘think step by step’ before answering can triple performance on some reasoning tasks. It’s one of the simplest and most effective techniques we’ve discovered.” — Jason Wei, Google Research (Chain-of-Thought paper, 2022)
This discovery—Chain-of-Thought prompting—became one of the most cited papers in AI. It’s now built into every major AI assistant.
The Heureka Moment
Here’s the insight that will change how you build AI systems:
Making AI “think out loud” dramatically improves its reasoning ability.
This isn’t a metaphor. It’s not a trick. It’s a fundamental property of how language models work:
┌─────────────────────────────────────────────────────────────────┐
│                 THE CHAIN-OF-THOUGHT REVELATION                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  WITHOUT COT:                                                   │
│  ────────────                                                   │
│  Q: "A store has 23 apples. If 7 are sold and 12 more arrive,   │
│      how many apples are there?"                                │
│  A: "38" (Wrong! Model jumped to conclusion)                    │
│                                                                 │
│  WITH COT:                                                      │
│  ────────                                                       │
│  Q: "A store has 23 apples. If 7 are sold and 12 more arrive,   │
│      how many apples are there? Let's think step by step."      │
│                                                                 │
│  A: "Let me work through this step by step:                     │
│      1. Starting apples: 23                                     │
│      2. After selling 7: 23 - 7 = 16                            │
│      3. After 12 arrive: 16 + 12 = 28                           │
│      Therefore, there are 28 apples."                           │
│                                                                 │
│  Same model. Same question. Different answer.                   │
│  The ONLY change: asking it to think out loud.                  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Why does this work? Because when the model generates intermediate steps, each step becomes part of its context. The model can “see” its own reasoning and use it to guide the next step. It’s like the difference between doing math in your head versus writing it down.
**Did You Know?**
The Chain-of-Thought paper by Wei et al. (2022) at Google Brain reported that chain-of-thought prompting improved accuracy on the GSM8K math benchmark from 17.9% to roughly 57%, about a 3x improvement from changing only the prompt.
The paper was initially met with skepticism. “You’re just asking it to show its work,” critics said. But that’s exactly the point - the “work” IS the reasoning. Without it, the model has no intermediate representation to build upon.
Theory
Why Reasoning is Hard for LLMs
Large Language Models are fundamentally next-token predictors. They’re trained to answer: “Given this context, what token comes next?”
Think of an LLM like a brilliant improvisational actor who has read every book ever written. Ask them to play a mathematician, and they’ll deliver a convincing performance—the mannerisms, the vocabulary, the confident delivery. But ask them to actually prove a theorem, and the performance falls apart. The actor was trained to look like they’re doing math, not to actually do it.
This creates a fundamental tension:
┌─────────────────────────────────────────────────────────────────┐
│                      THE REASONING PROBLEM                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  What LLMs are trained to do:                                   │
│  ────────────────────────────                                   │
│  Context: "The capital of France is"                            │
│  → Predict: "Paris" (statistically most likely continuation)    │
│                                                                 │
│  What reasoning requires:                                       │
│  ────────────────────────                                       │
│  - Breaking down problems into steps                            │
│  - Maintaining intermediate state                               │
│  - Backtracking when needed                                     │
│  - Verifying consistency                                        │
│                                                                 │
│  The mismatch:                                                  │
│  ─────────────                                                  │
│  Next-token prediction SKIPS intermediate reasoning.            │
│  The model wants to jump straight to the "answer token"         │
│  without computing the steps that justify it.                   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Chain-of-Thought prompting solves this by making the intermediate steps part of the output. The model must generate reasoning tokens BEFORE answer tokens, forcing it to “do the work.”
Think of it like the difference between asking someone “What’s 347 × 28?” and asking them to “Show your work.” When you have to write down each step—347 × 8 = 2,776, then 347 × 20 = 6,940, then 2,776 + 6,940 = 9,716—you can’t skip the actual computation. The written steps ARE the computation.
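To make the arithmetic above concrete, here is a minimal Python sketch that splits a multiplication into the same partial products and records each one as a written step. The function name `long_multiply_steps` is a hypothetical helper for illustration, and the sketch assumes non-negative integers.

```python
def long_multiply_steps(a: int, b: int) -> tuple[list[str], int]:
    """Multiply a by b one digit of b at a time, recording each written step."""
    steps = []
    partials = []
    # Walk the digits of b from least significant to most significant.
    for position, digit_char in enumerate(reversed(str(b))):
        place_value = int(digit_char) * 10 ** position
        if place_value == 0:
            continue  # a zero digit contributes no partial product
        partial = a * place_value
        partials.append(partial)
        steps.append(f"{a} × {place_value} = {partial:,}")
    total = sum(partials)
    # The final step adds the partial products, just like the written version.
    steps.append(" + ".join(f"{p:,}" for p in partials) + f" = {total:,}")
    return steps, total


steps, total = long_multiply_steps(347, 28)
for line in steps:
    print(line)
```

Each printed line corresponds to one step a person would write down, which mirrors how a model's reasoning tokens carry intermediate results forward.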
Chain-of-Thought Prompting
Chain-of-Thought (CoT) prompting is a technique that encourages LLMs to generate intermediate reasoning steps before producing a final answer.
The Basic Idea
Instead of:

    Input: Question
    Output: Answer

We get:

    Input: Question + "Think step by step"
    Output: Step 1 → Step 2 → Step 3 → Answer

Why It Works: Three Mechanisms
1. Decomposition: Complex problems are broken into simpler sub-problems. Each sub-problem can be solved more reliably.
2. Error Correction: When reasoning is visible, errors in early steps can influence (and sometimes be corrected in) later steps.
3. Context Extension: The generated reasoning becomes part of the context for generating the answer, providing “working memory.”
┌─────────────────────────────────────────────────────────────────┐
│                HOW COT EXTENDS "WORKING MEMORY"                 │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Without CoT - All in "hidden" computation:                     │
│  ┌─────────────────────────────────────┐                        │
│  │ Q: Complex problem                  │                        │
│  │ [Black box neural network magic]    │                        │
│  │ A: 42                               │                        │
│  └─────────────────────────────────────┘                        │
│                                                                 │
│  With CoT - Reasoning in context:                               │
│  ┌─────────────────────────────────────┐                        │
│  │ Q: Complex problem                  │                        │
│  │ Step 1: First, I notice that...     │ ← In context!          │
│  │ Step 2: This means...               │ ← In context!          │
│  │ Step 3: Combining these...          │ ← In context!          │
│  │ A: 28                               │ ← Can "see" reasoning  │
│  └─────────────────────────────────────┘                        │
│                                                                 │
│  The reasoning tokens act as external working memory!           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Zero-Shot Chain-of-Thought
The simplest form of CoT requires just a magic phrase added to your prompt:
“Let’s think step by step.”
That’s it. These 5 words can dramatically improve reasoning on many tasks.
    # Without CoT
    prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
    Each can has 3 tennis balls. How many tennis balls does he have now?
    A:"""

    # With Zero-Shot CoT
    prompt = """Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
    Each can has 3 tennis balls. How many tennis balls does he have now?
    A: Let's think step by step."""

Other Effective Zero-Shot Triggers
Different phrasings work for different tasks:
| Trigger Phrase | Best For |
|---|---|
| “Let’s think step by step” | General reasoning, math |
| “Let’s break this down” | Complex multi-part problems |
| “Let’s analyze this carefully” | Logical analysis |
| “First, let’s understand the problem” | Word problems |
| “Let me work through this” | Calculations |
| “Let’s consider each option” | Multiple choice |
**Did You Know?**
Kojima et al. (2022) compared a range of trigger phrases. “Let’s think step by step” came out ahead of the alternatives they tried, but the exact wording matters less than the presence of a reasoning trigger at all.
The researchers also discovered that asking the model to “Think carefully” or “Make sure you’re right” actually HURT performance. These create anxiety-like patterns that lead to overthinking and second-guessing. Neutral, process-focused triggers work best.
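A minimal sketch of how the trigger table above might be wired into a prompt builder. The task-type keys and the `build_zero_shot_prompt` helper are hypothetical names chosen for illustration; only the trigger phrases come from the table.

```python
# Trigger phrases from the table above, keyed by hypothetical task-type labels.
TRIGGERS = {
    "general": "Let's think step by step.",
    "multi_part": "Let's break this down.",
    "logic": "Let's analyze this carefully.",
    "word_problem": "First, let's understand the problem.",
    "calculation": "Let me work through this.",
    "multiple_choice": "Let's consider each option.",
}


def build_zero_shot_prompt(question: str, task_type: str = "general") -> str:
    """Append a reasoning trigger to a question, defaulting to the general one."""
    trigger = TRIGGERS.get(task_type, TRIGGERS["general"])
    return f"Q: {question.strip()}\nA: {trigger}"


print(build_zero_shot_prompt("Which option is prime: 51, 91, or 17?", "multiple_choice"))
```

Unknown task types fall back to the general trigger, which matches the advice that some reasoning trigger is better than none.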
Few-Shot Chain-of-Thought
For more complex or domain-specific reasoning, provide examples of the desired reasoning pattern:

    few_shot_cot_prompt = """Solve the following math problems. Show your reasoning step by step.

    Example 1:
    Q: A store has 50 shirts. They sell 23 and receive a shipment of 30.
    How many shirts do they have now?
    A: Let's solve this step by step:
    1. Starting shirts: 50
    2. After selling 23: 50 - 23 = 27
    3. After receiving 30: 27 + 30 = 57
    Therefore, they have 57 shirts.

    Example 2:
    Q: A baker makes 12 cakes. She gives 4 to neighbors and bakes 8 more.
    How many cakes does she have?
    A: Let's solve this step by step:
    1. Starting cakes: 12
    2. After giving 4 away: 12 - 4 = 8
    3. After baking 8 more: 8 + 8 = 16
    Therefore, she has 16 cakes.

    Now solve this problem:
    Q: A library has 85 books. They lend out 32 and receive a donation of 45.
    How many books do they have now?
    A: Let's solve this step by step:"""

The Power of Examples
Few-shot CoT is more powerful than zero-shot because:
- Pattern Learning: Model learns YOUR reasoning style
- Format Consistency: Output follows your desired structure
- Domain Adaptation: Examples can encode domain knowledge
- Error Prevention: Examples show what NOT to do
Think of few-shot CoT like training a new employee by showing them how you handle similar problems. You don’t just say “figure it out”—you walk them through example cases: “When a customer asks about refunds, first check their purchase date, then verify the item condition, then calculate the refund amount.” After seeing a few examples, they understand not just WHAT to do, but HOW you want them to think about the problem.
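The example-driven prompt above can also be assembled programmatically, which keeps the reasoning format identical across examples. This is a sketch under assumptions: `WorkedExample` and `build_few_shot_prompt` are hypothetical names, and the layout simply mirrors the hand-written prompt shown earlier.

```python
from dataclasses import dataclass


@dataclass
class WorkedExample:
    question: str
    steps: list[str]  # one reasoning step per entry, without numbering
    answer: str       # final sentence fragment, e.g. "they have 57 shirts."


def build_few_shot_prompt(examples: list[WorkedExample], question: str) -> str:
    """Render worked examples followed by the new question in one prompt."""
    parts = ["Solve the following math problems. Show your reasoning step by step.", ""]
    for index, example in enumerate(examples, start=1):
        parts.append(f"Example {index}:")
        parts.append(f"Q: {example.question}")
        parts.append("A: Let's solve this step by step:")
        parts.extend(f"{n}. {step}" for n, step in enumerate(example.steps, start=1))
        parts.append(f"Therefore, {example.answer}")
        parts.append("")
    parts.append("Now solve this problem:")
    parts.append(f"Q: {question}")
    parts.append("A: Let's solve this step by step:")
    return "\n".join(parts)


EXAMPLES = [
    WorkedExample(
        question="A store has 50 shirts. They sell 23 and receive a shipment of 30. "
                 "How many shirts do they have now?",
        steps=["Starting shirts: 50",
               "After selling 23: 50 - 23 = 27",
               "After receiving 30: 27 + 30 = 57"],
        answer="they have 57 shirts.",
    ),
]

prompt = build_few_shot_prompt(
    EXAMPLES,
    "A library has 85 books. They lend out 32 and receive a donation of 45. "
    "How many books do they have now?",
)
print(prompt)
```

Keeping examples as data makes it easy to swap in domain-specific cases without retyping the surrounding template.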
┌─────────────────────────────────────────────────────────────────┐
│                    ZERO-SHOT vs FEW-SHOT COT                    │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Zero-Shot CoT:                                                 │
│  + Simple - just add trigger phrase                             │
│  + Works across domains                                         │
│  - Less control over reasoning format                           │
│  - May not match domain conventions                             │
│                                                                 │
│  Few-Shot CoT:                                                  │
│  + High control over format                                     │
│  + Domain-specific patterns                                     │
│  + More consistent quality                                      │
│  - Requires good examples                                       │
│  - Uses more context tokens                                     │
│                                                                 │
│  Recommendation: Start with zero-shot, add examples             │
│  if quality is insufficient.                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
ReAct: Reasoning and Acting
ReAct (Reason + Act) combines chain-of-thought reasoning with the ability to take actions (use tools). This is the foundation of modern AI agents.
Think of ReAct like a detective solving a case. A bad detective just thinks about the crime from their desk—they might come up with theories, but they never verify them. A good detective alternates between thinking (“The suspect had motive, but do they have an alibi?”) and investigating (“Let me check the security footage for that night”). ReAct gives AI this same investigative loop: think about what you know, act to learn more, observe the results, repeat.
The ReAct Pattern
    Thought: I need to find information about X
    Action: search("X")
    Observation: [search results]
    Thought: Based on this, I should now...
    Action: calculate(...)
    Observation: [calculation result]
    Thought: I now have enough information to answer
    Final Answer: ...

Why ReAct Matters
Traditional CoT has a limitation: all reasoning happens in a single pass, using only the information in the prompt. ReAct solves this by:
- Interleaving thinking and acting
- Grounding reasoning in real observations
- Adapting based on new information
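The interleaving listed above can be sketched without calling a real model. In this hypothetical example a scripted stand-in returns canned replies, so the loop itself is easy to follow; `run_react`, `make_scripted_llm`, and the single `lookup` tool are illustrative names, not part of any library.

```python
import re

# Matches lines such as: Action: lookup("Eiffel Tower")
ACTION_RE = re.compile(r'Action:\s*(\w+)\("(.*?)"\)')


def run_react(llm, tools, question, max_steps=5):
    """Alternate model replies and tool observations until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip(), transcript
        match = ACTION_RE.search(reply)
        if match is None:
            observation = "Could not parse an action."
        else:
            name, argument = match.groups()
            tool = tools.get(name)
            observation = tool(argument) if tool else f"Unknown tool: {name}"
        # The observation is appended so the next reply can reason over it.
        transcript += f"Observation: {observation}\n"
    return None, transcript


def make_scripted_llm(replies):
    """Return a fake model that ignores its input and replays canned replies."""
    remaining = iter(replies)
    return lambda transcript: next(remaining)


tools = {"lookup": lambda entity: {"Eiffel Tower": "330 meters tall"}.get(entity, "No facts found")}
llm = make_scripted_llm([
    'Thought: I need the height of the tower.\nAction: lookup("Eiffel Tower")',
    "Thought: I have what I need.\nFinal Answer: The Eiffel Tower is 330 meters tall.",
])

answer, transcript = run_react(llm, tools, "How tall is the Eiffel Tower?")
print(transcript)
```

Swapping the scripted function for a real model call leaves the loop unchanged, which is the point of the pattern: the observations, not the model's memory, supply the facts.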
┌─────────────────────────────────────────────────────────────────┐
│                          COT vs REACT                           │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Chain-of-Thought:                                              │
│  ┌─────────────────────────────────────┐                        │
│  │ Think → Think → Think → Answer      │                        │
│  └─────────────────────────────────────┘                        │
│  (All reasoning from initial context)                           │
│                                                                 │
│  ReAct:                                                         │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │ Think → Act → Observe → Think → Act → Observe → Answer  │    │
│  └─────────────────────────────────────────────────────────┘    │
│  (Reasoning grounded in real observations)                      │
│                                                                 │
│  ReAct agents can:                                              │
│  - Gather information they don't have                           │
│  - Verify their assumptions                                     │
│  - Adapt to unexpected findings                                 │
│  - Complete multi-step tasks                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
**Did You Know?**
The ReAct paper (Yao et al., 2022) from Princeton and Google showed that combining reasoning traces with actions outperformed both:
- Pure reasoning (CoT alone): Good at planning, bad at getting facts
- Pure acting (actions only): Good at facts, bad at planning
ReAct achieved state-of-the-art results on knowledge-intensive tasks by letting the model “think about what to look up” and “think about what the results mean.”
Implementing ReAct
Here’s how ReAct works in practice:

    REACT_PROMPT = """You are an assistant that uses tools to answer questions.

    You have access to these tools:
    - search(query): Search for information
    - calculate(expression): Do math calculations
    - lookup(entity): Get facts about an entity

    Use this format:

    Question: [the question]
    Thought: [your reasoning about what to do]
    Action: [tool_name(arguments)]
    Observation: [tool result]
    ... (repeat Thought/Action/Observation as needed)
    Thought: I now have enough information to answer
    Final Answer: [your answer]

    Begin!

    Question: What is the population of France divided by 3?
    Thought: I need to find the population of France first, then divide by 3.
    Action: lookup("France population")
    Observation: France has a population of approximately 67.75 million (2023).
    Thought: Now I need to divide 67.75 million by 3.
    Action: calculate("67750000 / 3")
    Observation: 22583333.33
    Thought: I have the answer now.
    Final Answer: The population of France (67.75 million) divided by 3 is approximately 22.58 million.
    """

The ReAct Loop in Code
    def react_loop(question: str, tools: dict, max_iterations: int = 10):
        """Execute a ReAct reasoning loop.

        Assumes llm(), extract_final_answer(), extract_thought(), and
        extract_action() are defined elsewhere; they are placeholders here.
        """
        prompt = f"{REACT_PROMPT}\n\nQuestion: {question}\n"
        history = []

        for i in range(max_iterations):
            # Get model response
            response = llm(prompt)

            # Check if we have a final answer
            if "Final Answer:" in response:
                return extract_final_answer(response)

            # Parse the thought and action
            thought = extract_thought(response)
            action_name, action_args = extract_action(response)

            # Execute the action
            if action_name in tools:
                observation = tools[action_name](action_args)
            else:
                observation = f"Unknown tool: {action_name}"

            # Add to history and prompt so the next call sees the observation
            step = f"Thought: {thought}\nAction: {action_name}({action_args})\nObservation: {observation}\n"
            history.append(step)
            prompt += step

        return "Max iterations reached without final answer"

Self-Consistency
Self-consistency is a powerful technique that improves CoT reliability by sampling multiple reasoning paths and selecting the most common answer.
Think of it like asking five experts to independently solve the same problem. If four of them get “42” and one gets “47,” you’re probably safe trusting “42.” The outlier likely made an arithmetic error or misread something. Self-consistency applies this same wisdom-of-crowds principle to AI reasoning.
The Insight
Different reasoning paths might lead to the same correct answer through different routes:
┌─────────────────────────────────────────────────────────────────┐
│                        SELF-CONSISTENCY                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Question: "How many legs do 3 dogs and 2 cats have?"           │
│                                                                 │
│  Path 1:                                                        │
│  "Dogs have 4 legs each: 3 × 4 = 12                             │
│   Cats have 4 legs each: 2 × 4 = 8                              │
│   Total: 12 + 8 = 20"                                           │
│                                                                 │
│  Path 2:                                                        │
│  "3 dogs + 2 cats = 5 animals                                   │
│   Each animal has 4 legs                                        │
│   5 × 4 = 20"                                                   │
│                                                                 │
│  Path 3:                                                        │
│  "Let me count: 4 + 4 + 4 + 4 + 4 = 20"                         │
│                                                                 │
│  All paths → 20                                                 │
│  (High confidence in answer)                                    │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
If one path gives 20 but another gives 18, the inconsistency signals potential error.
Implementation
    from collections import Counter

    def self_consistent_cot(question: str, num_samples: int = 5, temperature: float = 0.7):
        """Generate multiple reasoning paths and vote on the answer.

        Assumes llm() and extract_answer() are defined elsewhere.
        """
        answers = []

        for _ in range(num_samples):
            # Generate reasoning with some temperature for diversity
            response = llm(
                prompt=f"{question}\n\nLet's think step by step.",
                temperature=temperature
            )

            # Extract the final answer
            answer = extract_answer(response)
            answers.append(answer)

        # Return most common answer (majority vote)
        answer_counts = Counter(answers)
        most_common = answer_counts.most_common(1)[0]

        return {
            "answer": most_common[0],
            "confidence": most_common[1] / num_samples,
            "all_answers": answers
        }

**Did You Know?**
Wang et al. (2022) showed that self-consistency with just 5 samples improved CoT accuracy from 58% to 74% on math problems - a 28% relative improvement!
The technique works because errors are typically “random” - different runs make different mistakes. But correct reasoning tends to converge on the same answer. It’s like having a panel of experts vote.
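The implementation above leaves `extract_answer` undefined. One hypothetical way to fill it in for numeric answers is to take the last number that appears in each response, normalize it, and vote. This is a heuristic sketch, not a robust parser: it would be fooled by a response that mentions another number after the final answer.

```python
import re
from collections import Counter
from typing import Optional

# A number with optional thousands separators and decimals, e.g. 1,250 or 9.5
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_answer(response: str) -> Optional[str]:
    """Return the last number in the response as a normalized string."""
    matches = NUMBER_RE.findall(response)
    if not matches:
        return None
    value = float(matches[-1].replace(",", ""))
    # Normalize so that "20", "20.0", and "20," all vote for the same answer.
    return str(int(value)) if value.is_integer() else str(value)


def majority_vote(responses: list[str]) -> tuple[Optional[str], float]:
    """Pick the most common extracted answer and report its share of all samples."""
    answers = [a for a in map(extract_answer, responses) if a is not None]
    if not answers:
        return None, 0.0
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(responses)


sampled = [
    "Dogs: 3 × 4 = 12. Cats: 2 × 4 = 8. Total: 12 + 8 = 20",
    "5 animals × 4 legs = 20.",
    "4 + 4 + 4 + 4 = 18",  # a faulty path that miscounts the animals
]
answer, confidence = majority_vote(sampled)
print(answer, round(confidence, 2))
```

Normalizing before counting matters: without it, formatting differences between samples would split votes that actually agree.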
When CoT Helps (and When It Doesn’t)
Chain-of-thought isn’t magic. It helps in specific situations and can actually hurt in others.
CoT Helps Most With
| Task Type | Why CoT Helps | Improvement |
|---|---|---|
| Math word problems | Forces calculation steps | 2-4x |
| Multi-step reasoning | Makes dependencies explicit | 2-3x |
| Logical deduction | Tracks premises and conclusions | 1.5-2x |
| Commonsense reasoning | Surfaces implicit assumptions | 1.3-1.5x |
| Code debugging | Forces systematic analysis | 1.5-2x |
CoT Can Hurt With
| Task Type | Why CoT Hurts | Notes |
|---|---|---|
| Simple factual recall | Adds unnecessary steps | “Capital of France” |
| Pattern matching | Overthinks simple patterns | Sentiment classification |
| High-volume classification | Too slow | Batch processing |
| Creative tasks | Can constrain creativity | Poetry, brainstorming |
┌─────────────────────────────────────────────────────────────────┐
│                         WHEN TO USE COT                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  USE COT WHEN:                                                  │
│  - Problem requires multiple reasoning steps                    │
│  - Answer depends on intermediate calculations                  │
│  - Task involves combining multiple pieces of information       │
│  - You need to verify/debug the reasoning process               │
│  - Domain is unfamiliar to the model                            │
│                                                                 │
│  AVOID COT WHEN:                                                │
│  - Simple one-step tasks                                        │
│  - Speed is critical (real-time applications)                   │
│  - Task is well-represented in training data                    │
│  - Creative/open-ended generation                               │
│  - Token budget is very limited                                 │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
The Limits of Reasoning
Even with CoT, LLMs have fundamental reasoning limitations:
1. Compositional Generalization
LLMs struggle when problems require combining known concepts in novel ways:

    # Training: "John is taller than Mary. Mary is taller than Bob. Who is shortest?"
    # Test: "A is heavier than B. B is heavier than C. C is heavier than D. Who is lightest?"

    # The model might handle a chain of two comparisons but fail when the chain grows to three

2. Arithmetic Precision
Even with CoT, LLMs make arithmetic errors, especially with:
- Large numbers
- Many decimal places
- Complex operations
Solution: Use calculator tools (ReAct pattern!)
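A calculator tool can be as small as the sketch below, which evaluates plain arithmetic by walking Python's syntax tree instead of calling `eval`. The name `safe_calculate` is hypothetical, and the supported operator set (+, -, *, /, unary minus) is an assumption chosen to keep the example short.

```python
import ast
import operator

# Only these operators are accepted; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.USub: operator.neg, ast.UAdd: operator.pos}


def safe_calculate(expression: str):
    """Evaluate a basic arithmetic expression without executing arbitrary code."""

    def evaluate(node):
        if isinstance(node, ast.Expression):
            return evaluate(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](evaluate(node.operand))
        # Names, calls, attribute access, and other operators all end up here.
        raise ValueError(f"Unsupported expression: {expression!r}")

    return evaluate(ast.parse(expression, mode="eval"))


print(safe_calculate("347 * 28"))
print(round(safe_calculate("67750000 / 3"), 2))
```

Rejecting everything except numeric literals and a few operators means a model-written expression such as a function call raises an error instead of running.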
3. Hallucinated Reasoning
The model can generate convincing but WRONG reasoning:

    Q: "Is 17 a prime number?"
    A: "Let's check: 17 ÷ 2 = 8.5 (not whole), 17 ÷ 3 = 5.67 (not whole), 17 ÷ 4 = 4.25 (not whole), 17 ÷ 5 = 3.4 (not whole). Since no divisors found, 17 is prime."

    Q: "Is 51 a prime number?"
    A: "Let's check: 51 ÷ 2 = 25.5 (not whole), 51 ÷ 3 = 17 (whole!). Wait, 51 = 3 × 17, so 51 is NOT prime."

    Q: "Is 91 a prime number?"
    A: "Let's check: 91 ÷ 2 = 45.5 (not whole), 91 ÷ 3 = 30.33 (not whole), 91 ÷ 5 = 18.2 (not whole), 91 ÷ 7 = 13 (whole!). So 91 = 7 × 13, NOT prime."

    # But sometimes:
    A: "Let's check: 91 ÷ 2 = 45.5, 91 ÷ 3 = 30.33, 91 ÷ 5 = 18.2. No small divisors found, so 91 is prime." (Forgot to check 7!)

4. Path Dependence
The model’s reasoning can be influenced by:
- Order of information in prompt
- How the question is phrased
- Examples provided
**Did You Know?**
Researchers at Anthropic discovered that Claude’s reasoning performance varies significantly based on problem framing. Asking “What’s wrong with this code?” produces different (often better) debugging than “Is this code correct?”
The insight: LLMs don’t truly “reason” - they pattern-match on how problems are presented. This is why prompt engineering matters so much.
Advanced CoT Techniques
1. Least-to-Most Prompting
Break complex problems into sub-problems, solve from simplest to hardest:
    prompt = """To solve complex problems, first break them into simpler sub-problems.

    Problem: "Last year, Amy was twice as old as Ben. This year, Amy is 20.
    How old is Ben this year?"

    Sub-problems:
    1. How old was Amy last year?
    2. How old was Ben last year (given Amy was twice his age)?
    3. How old is Ben this year?

    Solving each:
    1. Amy is 20 this year, so last year she was 20 - 1 = 19
    2. If Amy was twice Ben's age: 19 = 2 × Ben's age last year
       Ben's age last year = 19 / 2 = 9.5
    3. Ben this year = 9.5 + 1 = 10.5 years old

    Final answer: Ben is 10.5 years old.
    """

2. Tree of Thoughts (ToT)
Explore multiple reasoning branches, backtrack when needed:
┌─────────────────────────────────────────────────────────────────┐
│                        TREE OF THOUGHTS                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│                             Problem                             │
│                                │                                │
│                   ┌────────────┼────────────┐                   │
│                   │            │            │                   │
│                Path A       Path B       Path C                 │
│                   │            │            │                   │
│                Step 1       Step 1       Step 1                 │
│                   │            │            │                   │
│              (dead end)     Step 2       Step 2                 │
│                                │            │                   │
│                             Step 3     (dead end)               │
│                                │                                │
│                             Answer                              │
│                                                                 │
│  Unlike linear CoT, ToT can backtrack and explore alternatives  │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
3. Program-Aided Language Models (PAL)
Generate code instead of natural language reasoning:
    prompt = """Problem: "A store has 3 shelves. Each shelf has 4 boxes. Each box has 5 items.
    How many items total?"

    # Python solution
    shelves = 3
    boxes_per_shelf = 4
    items_per_box = 5

    total_boxes = shelves * boxes_per_shelf  # 12 boxes
    total_items = total_boxes * items_per_box  # 60 items

    print(f"Total items: {total_items}")
    # Output: Total items: 60
    """

The model generates code, which is then executed for the actual answer. This removes arithmetic slips from the final calculation, although the generated program can still encode the wrong logic.
Think of PAL like giving a student a calculator during a word problem test. They still need to understand the problem—what to multiply, what to add, in what order—but the actual number-crunching is outsourced to a tool that won’t make silly mistakes. The LLM’s job becomes translating human language into precise computational steps, which it’s actually quite good at.
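The "execute the generated code" half of PAL might look like the sketch below. It is illustrative only: `run_generated_program` and the `SAFE_BUILTINS` allow-list are hypothetical, the program text stands in for model output, and restricting builtins is not a real sandbox, so untrusted code still belongs in an isolated process or container.

```python
import contextlib
import io

# A small allow-list of builtins; an assumption made for this sketch.
SAFE_BUILTINS = {
    "print": print, "range": range, "len": len, "sum": sum,
    "min": min, "max": max, "abs": abs, "round": round,
}


def run_generated_program(code: str) -> str:
    """Execute model-written Python and return whatever it printed."""
    namespace = {"__builtins__": SAFE_BUILTINS}
    buffer = io.StringIO()
    # Capture stdout so the printed answer can be returned to the caller.
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    return buffer.getvalue().strip()


# Stand-in for code a model would generate for the shelf problem above.
generated = """
shelves = 3
boxes_per_shelf = 4
items_per_box = 5
total_boxes = shelves * boxes_per_shelf
total_items = total_boxes * items_per_box
print(f"Total items: {total_items}")
"""

output = run_generated_program(generated)
print(output)
```

The interpreter, not the model, performs the multiplication, so the remaining risk is the model translating the word problem into the wrong program.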
4. Structured Chain-of-Thought
Sometimes you want even more control over the reasoning format. Structured CoT uses XML tags, JSON, or specific markers to organize reasoning:

    structured_cot_prompt = """Analyze this problem using structured reasoning.

    Problem: "Should we expand our product line to include organic options?"

    <analysis>
    <context>
    - Current market: mainstream products
    - Competitor landscape: 3 of 5 competitors have organic lines
    - Customer feedback: 23% of surveys mention organic preferences
    </context>

    <factors>
    <factor name="market_demand" importance="high">
    Growing organic food market (12% YoY growth)
    </factor>
    <factor name="cost" importance="medium">
    Organic ingredients cost 40% more on average
    </factor>
    <factor name="brand_alignment" importance="high">
    Company mission includes sustainability focus
    </factor>
    </factors>

    <reasoning>
    Step 1: Market demand is strong (12% growth) and customer surveys show interest (23%)
    Step 2: Cost increase (40%) can be offset by premium pricing (organic products command 30-50% premiums)
    Step 3: Brand alignment is strong - this fits our sustainability mission
    Step 4: Competitive pressure - 3/5 competitors already offer organic
    </reasoning>

    <conclusion>
    Recommendation: YES, expand to organic options
    Confidence: 78%
    Key risk: Supply chain complexity for organic certification
    </conclusion>
    </analysis>
    """

This approach is particularly powerful for:
- Auditable decisions: Each step is clearly labeled and can be reviewed
- Multi-criteria analysis: Structure forces consideration of all factors
- Integration with systems: Structured output can be parsed programmatically
- Consistent quality: The template ensures nothing is forgotten
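To illustrate the "parsed programmatically" point above, here is a minimal sketch that reads an `<analysis>` block with Python's standard `xml.etree.ElementTree`. The `parse_analysis` helper and the shortened sample are hypothetical; real model output may not be well-formed XML, in which case `ET.fromstring` raises `ParseError` and the caller needs a fallback.

```python
import xml.etree.ElementTree as ET

# Shortened version of the structured output shown in the prompt above.
SAMPLE = """<analysis>
<factors>
<factor name="market_demand" importance="high">Growing organic food market (12% YoY growth)</factor>
<factor name="cost" importance="medium">Organic ingredients cost 40% more on average</factor>
</factors>
<conclusion>
Recommendation: YES, expand to organic options
Confidence: 78%
</conclusion>
</analysis>"""


def parse_analysis(xml_text: str) -> dict:
    """Extract factors and 'Key: value' conclusion lines from an analysis block."""
    root = ET.fromstring(xml_text)
    factors = {
        factor.get("name"): {
            "importance": factor.get("importance"),
            "note": (factor.text or "").strip(),
        }
        for factor in root.iter("factor")
    }
    conclusion = {}
    for line in (root.findtext("conclusion") or "").splitlines():
        key, separator, value = line.partition(":")
        if separator:  # skip blank lines and lines without a colon
            conclusion[key.strip()] = value.strip()
    return {"factors": factors, "conclusion": conclusion}


parsed = parse_analysis(SAMPLE)
print(parsed["conclusion"]["Recommendation"])
print(parsed["factors"]["cost"]["importance"])
```

Once the output is a dictionary, downstream code can route on the recommendation or flag low-confidence decisions for human review.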
5. Chain-of-Verification (CoVe)
A technique where the model generates an initial response, then generates verification questions, answers them, and revises if needed:

    Initial Answer: "The capital of Australia is Sydney."

    Verification Questions:
    1. Is Sydney the largest city in Australia? → Yes
    2. Is the largest city always the capital? → No (e.g., New York vs DC)
    3. What is the purpose-built capital of Australia? → Canberra

    Revised Answer: "The capital of Australia is Canberra, not Sydney. While Sydney is the largest city, Canberra was purpose-built as the capital in 1913."

CoVe is especially effective for factual claims where the model might confuse similar concepts. It’s like the model fact-checking its own homework.
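The draft, verify, revise sequence above can be sketched as a small pipeline. Everything here is an assumption made for illustration: `chain_of_verification` is a hypothetical helper, the prompt wording is invented, and `stub_llm` is a canned stand-in so the control flow can run without a model.

```python
from typing import Callable


def chain_of_verification(question: str, llm: Callable[[str], str]) -> dict:
    """Draft an answer, check it with separate questions, then revise."""
    draft = llm(f"Answer concisely: {question}")
    plan = llm(
        "List verification questions, one per line, that would check this answer.\n"
        f"Question: {question}\nAnswer: {draft}"
    )
    checks = [line.strip() for line in plan.splitlines() if line.strip()]
    # Each verification question is asked on its own, without the draft,
    # so the draft's mistake is less likely to be repeated.
    findings = [(check, llm(check)) for check in checks]
    evidence = "\n".join(f"{check} -> {result}" for check, result in findings)
    revised = llm(
        f"Original question: {question}\nDraft answer: {draft}\n"
        f"Verification results:\n{evidence}\nWrite the final, corrected answer."
    )
    return {"draft": draft, "checks": findings, "revised": revised}


def stub_llm(prompt: str) -> str:
    """Canned replies keyed on the prompt prefix; stands in for a real model."""
    if prompt.startswith("Answer concisely"):
        return "Sydney"
    if prompt.startswith("List verification"):
        return ("Is the largest city always the capital?\n"
                "What is the purpose-built capital of Australia?")
    if prompt.startswith("Original question"):
        return "Canberra"
    return "Canberra" if "purpose-built" in prompt else "No"


result = chain_of_verification("What is the capital of Australia?", stub_llm)
print(result["draft"], "->", result["revised"])
```

The cost is several extra model calls per question, so this fits factual answers where a wrong claim is expensive rather than every request.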
Practical Guidelines
Choosing Your CoT Strategy
┌─────────────────────────────────────────────────────────────────┐
│                        COT DECISION TREE                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  Is the task simple (< 2 reasoning steps)?                      │
│  └─ YES → Don't use CoT (direct prompting)                      │
│  └─ NO → Continue...                                            │
│                                                                 │
│  Do you have good example reasoning traces?                     │
│  └─ YES → Use Few-Shot CoT                                      │
│  └─ NO → Use Zero-Shot CoT ("Let's think step by step")         │
│                                                                 │
│  Does the task require external information or actions?         │
│  └─ YES → Use ReAct pattern                                     │
│  └─ NO → Continue...                                            │
│                                                                 │
│  Is high reliability critical?                                  │
│  └─ YES → Add Self-Consistency (multiple samples)               │
│  └─ NO → Single CoT pass is fine                                │
│                                                                 │
│  Does the task involve math/calculations?                       │
│  └─ YES → Consider PAL (code generation) or calculator tool     │
│  └─ NO → Standard CoT                                           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
Crafting CoT Prompts
DO:
- Be specific about desired output format
- Show examples of good reasoning
- Ask for verification steps
- Request intermediate calculations be shown
DON’T:
- Make instructions too long (model forgets)
- Ask model to “be careful” (causes overthinking)
- Use CoT for simple tasks (wastes tokens)
- Trust complex arithmetic without tools
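The decision tree from earlier in this section can be written down as a small function, which is handy when the choice has to be made per request. The function name, parameter names, and returned labels are hypothetical; the branching follows the tree as printed.

```python
def choose_cot_strategy(
    *,
    reasoning_steps: int,
    has_examples: bool,
    needs_tools: bool,
    high_reliability: bool,
    involves_math: bool,
) -> list[str]:
    """Return the techniques the decision tree recommends, in order."""
    if reasoning_steps < 2:
        return ["direct prompting"]  # simple task: skip CoT entirely
    plan = ["few-shot CoT" if has_examples else "zero-shot CoT"]
    if needs_tools:
        plan.append("ReAct")
    if high_reliability:
        plan.append("self-consistency")
    if involves_math:
        plan.append("PAL or calculator tool")
    return plan


print(choose_cot_strategy(
    reasoning_steps=4,
    has_examples=False,
    needs_tools=True,
    high_reliability=True,
    involves_math=True,
))
```

The techniques compose: a single request can use zero-shot CoT inside a ReAct loop, sampled several times for self-consistency.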
Hands-On Practice
Exercise 1: Zero-Shot CoT Comparison
Compare model performance with and without Chain-of-Thought prompting on reasoning tasks.

    """Exercise 1: Zero-Shot CoT Comparison

    This exercise demonstrates the dramatic difference CoT makes
    on multi-step reasoning problems."""

    from openai import OpenAI

    client = OpenAI()

    def test_cot_vs_direct(problem: str) -> dict:
        """Compare direct prompting vs CoT on the same problem."""

        # Direct prompting (no CoT)
        direct_prompt = f"""{problem}

    Answer with just the final number."""

        # CoT prompting
        cot_prompt = f"""{problem}

    Let's think step by step."""

        # Get direct answer
        direct_response = client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": direct_prompt}],
            temperature=0
        )

        # Get CoT answer
        cot_response = client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": cot_prompt}],
            temperature=0
        )

        return {
            "problem": problem,
            "direct_answer": direct_response.choices[0].message.content,
            "cot_answer": cot_response.choices[0].message.content
        }

    # Test problems of increasing difficulty
    test_problems = [
        "A farmer has 17 sheep. All but 9 die. How many sheep are left?",

        "If you have 3 quarters, 4 dimes, and 2 nickels, how much money do you have in cents?",

        "A bat and a ball cost $1.10 together. The bat costs $1.00 more than the ball. How much does the ball cost?",

        "Three friends split a restaurant bill. The total was $45, plus 20% tip. They each paid equally. One friend paid with a $20 bill. How much change did they receive?",
    ]

    # Run comparisons
    for problem in test_problems:
        result = test_cot_vs_direct(problem)
        print(f"Problem: {result['problem'][:50]}...")
        print(f"Direct: {result['direct_answer'][:50]}")
        print(f"CoT: {result['cot_answer'][:100]}")
        print("-" * 50)

What to observe:
- The “bat and ball” problem is a classic cognitive bias trap (many humans get it wrong too!)
- CoT helps catch the “all but 9 die” trick question
- Complex multi-step problems show the biggest improvement
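To score the comparison rather than eyeball it, the responses can be checked against answers worked out by hand. The sketch below is a hypothetical add-on to the exercise: `final_number` and `is_correct` are illustrative helpers, the expected values were computed manually from the four test problems, and the "last number in the response" rule is a heuristic that a response with trailing numbers can defeat.

```python
import re
from typing import Optional

# Hand-computed answers for the four exercise problems:
#   sheep left: 9; coins: 3*25 + 4*10 + 2*5 = 125 cents;
#   ball: $0.05; change: 20 - (45 * 1.2 / 3) = $2.
EXPECTED = {
    "sheep": 9.0,
    "coins_cents": 125.0,
    "ball_dollars": 0.05,
    "change_dollars": 2.0,
}


def final_number(response: str) -> Optional[float]:
    """Return the last number mentioned in a response, ignoring thousands commas."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", response.replace(",", ""))
    return float(matches[-1]) if matches else None


def is_correct(response: str, expected: float, tolerance: float = 1e-6) -> bool:
    """Compare the response's final number with the expected value."""
    value = final_number(response)
    return value is not None and abs(value - expected) <= tolerance


print(is_correct("The bat costs $1.05, so the ball costs $0.05.", EXPECTED["ball_dollars"]))
print(is_correct("The ball costs $0.10", EXPECTED["ball_dollars"]))
```

Running both the direct and CoT answers through `is_correct` turns the exercise into a small accuracy comparison across the four problems.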
Exercise 2: Building a ReAct Agent
Implement a simple ReAct agent that can use tools to answer questions.

    """Exercise 2: ReAct Agent Implementation

    Build a reasoning agent that interleaves thinking and action."""

    import re
    import json
    from typing import Callable

    # Define some simple tools
    def search(query: str) -> str:
        """Simulate a search tool."""
        # In production, this would call a real search API
        knowledge_base = {
            "python creator": "Guido van Rossum created Python in 1991.",
            "eiffel tower height": "The Eiffel Tower is 330 meters tall.",
            "moon distance": "The Moon is about 384,400 km from Earth.",
            "speed of light": "The speed of light is 299,792,458 meters per second.",
        }

        for key, value in knowledge_base.items():
            if key in query.lower():
                return value
        return "No relevant information found."

    def calculate(expression: str) -> str:
        """Safely evaluate a mathematical expression."""
        try:
            # Only allow digits, arithmetic operators, parentheses, dots, and spaces
            allowed_chars = set("0123456789+-*/(). ")
            if all(c in allowed_chars for c in expression):
                # Empty builtins as a second guard; the character filter does the main work
                result = eval(expression, {"__builtins__": {}}, {})
                return str(result)
            return "Invalid expression"
        except Exception as e:
            return f"Calculation error: {e}"

    def lookup(entity: str) -> str:
        """Look up facts about an entity."""
        facts = {
            "France": "Country in Western Europe. Population: 67 million. Capital: Paris.",
            "Python": "High-level programming language. Created by Guido van Rossum in 1991.",
            "Einstein": "Physicist who developed theory of relativity. Nobel Prize 1921.",
        }
        return facts.get(entity, f"No facts found for '{entity}'")
    class ReActAgent:
        """A simple ReAct reasoning agent."""

        def __init__(self, llm_client, tools: dict[str, Callable]):
            self.llm = llm_client
            self.tools = tools

        def parse_response(self, response: str) -> tuple[str, str, str]:
            """Parse thought, action, and action input from response."""
            thought_match = re.search(r"Thought:\s*(.+?)(?=Action:|$)", response, re.DOTALL)
            action_match = re.search(r"Action:\s*(\w+)\s*\((.+?)\)", response)

            thought = thought_match.group(1).strip() if thought_match else ""

            if action_match:
                action_name = action_match.group(1)
                action_input = action_match.group(2).strip('"\'')
                return thought, action_name, action_input

            # Check for final answer
            final_match = re.search(r"Final Answer:\s*(.+)", response, re.DOTALL)
            if final_match:
                return thought, "FINAL", final_match.group(1).strip()

            return thought, None, None
    def run(self, question: str, max_iterations: int = 5) -> str:
        """Execute the ReAct loop."""

        system_prompt = """You are an assistant that uses tools to answer questions.

Available tools:
- search(query): Search for information
- calculate(expression): Do math calculations
- lookup(entity): Get facts about an entity

Always use this format:

Thought: [your reasoning]
Action: tool_name("argument")

After getting an observation, continue with another Thought/Action or give Final Answer:

Thought: [your reasoning]
Final Answer: [your answer]"""

        messages = [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Question: {question}"}
        ]

        for i in range(max_iterations):
            # Get model response
            response = self.llm.chat.completions.create(
                model="gpt-5",
                messages=messages,
                temperature=0
            )

            assistant_message = response.choices[0].message.content
            messages.append({"role": "assistant", "content": assistant_message})

            # Parse the response
            thought, action_name, action_input = self.parse_response(assistant_message)

            print(f"[Step {i+1}]")
            print(f"Thought: {thought}")

            if action_name == "FINAL":
                print(f"Final Answer: {action_input}")
                return action_input

            if action_name and action_name in self.tools:
                # Execute the tool
                observation = self.tools[action_name](action_input)
                print(f"Action: {action_name}({action_input})")
                print(f"Observation: {observation}")

                # Add observation to messages
                messages.append({
                    "role": "user",
                    "content": f"Observation: {observation}"
                })
            else:
                # Stop with an explicit message instead of falling through
                # to the max-iterations message below
                print(f"Unknown action: {action_name}")
                return f"Stopped: unknown or missing action {action_name!r}"

        return "Max iterations reached without final answer"
# Test the agent
if __name__ == "__main__":
    from openai import OpenAI
    client = OpenAI()

    agent = ReActAgent(
        llm_client=client,
        tools={
            "search": search,
            "calculate": calculate,
            "lookup": lookup
        }
    )

    # Test questions
    questions = [
        "How tall is the Eiffel Tower in feet?",
        "Who created Python and in what year?",
        "If France's population is 67 million and each person produces 2kg of waste daily, how much total waste is produced per year in million kg?",
    ]

    for q in questions:
        print(f"\n{'='*60}")
        print(f"Question: {q}")
        print('='*60)
        answer = agent.run(q)
        print(f"\n→ Answer: {answer}\n")

Exercise 3: Self-Consistency Implementation
Build a self-consistency wrapper that improves reliability through multiple samples.
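The voting step at the heart of self-consistency can be tried without any API calls. The sketch below is an illustration under stated assumptions: `majority_vote` is a hypothetical helper, and the `sampled` list is made-up data standing in for answers parsed from five reasoning paths, with `None` marking a sample where no number could be extracted.

```python
from collections import Counter
from typing import Optional


def majority_vote(answers: list[Optional[str]]) -> tuple[Optional[str], float]:
    """Return the most common parsed answer and its share of all samples."""
    # Unparsable samples cast no vote but still count toward the total
    parsed = [a for a in answers if a is not None]
    if not parsed:
        return None, 0.0
    winner, votes = Counter(parsed).most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical answers extracted from five sampled reasoning paths
sampled = ["71", "71", "81", "71", None]
winner, confidence = majority_vote(sampled)
print(winner, confidence)  # prints: 71 0.6
```

Dividing by the total number of samples rather than the number of parsed answers keeps the confidence figure honest when many samples fail to parse.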
"""Exercise 3: Self-Consistency for Robust Reasoning
Sample multiple reasoning paths and vote on the most common answer."""
from collections import Counter
from typing import Optional
import re


def extract_numeric_answer(response: str) -> Optional[str]:
    """Extract the final numeric answer from a CoT response."""
    # Look for patterns like "the answer is X" or "= X" at the end.
    # The number pattern requires digits after a dot, so "71." at the end
    # of a sentence is captured as "71" and votes together with other "71"s.
    patterns = [
        r"(?:answer|result|total).*?(\d+(?:\.\d+)?)",
        r"=\s*(\d+(?:\.\d+)?)\s*$",
        r"(\d+(?:\.\d+)?)\s*(?:is the answer|total|result)",
    ]

    for pattern in patterns:
        matches = re.findall(pattern, response.lower())
        if matches:
            return matches[-1]  # Return last match

    # Fallback: find any numbers and return the last one
    numbers = re.findall(r'\d+(?:\.\d+)?', response)
    return numbers[-1] if numbers else None
def self_consistent_answer(
    question: str,
    llm_client,
    num_samples: int = 5,
    temperature: float = 0.7
) -> dict:
    """Get a robust answer using self-consistency."""

    prompt = f"""{question}

Let's work through this step by step."""

    answers = []
    reasoning_paths = []

    for i in range(num_samples):
        response = llm_client.chat.completions.create(
            model="gpt-5",
            messages=[{"role": "user", "content": prompt}],
            temperature=temperature
        )

        reasoning = response.choices[0].message.content
        answer = extract_numeric_answer(reasoning)

        answers.append(answer)
        reasoning_paths.append(reasoning)

        print(f"Sample {i+1}: Answer = {answer}")

    # Vote on the most common answer, ignoring samples with no parsable number
    answer_counts = Counter(a for a in answers if a is not None)
    if answer_counts:
        best_answer, votes = answer_counts.most_common(1)[0]
    else:
        best_answer, votes = None, 0

    return {
        "answer": best_answer,
        "confidence": votes / num_samples,
        "all_answers": answers,
        "vote_distribution": dict(answer_counts),
        "reasoning_paths": reasoning_paths
    }
# Test self-consistency
if __name__ == "__main__":
    from openai import OpenAI
    client = OpenAI()

    # A problem where models sometimes make errors
    problem = """
    A store had 125 apples. They sold 47 in the morning and 38 in the afternoon.
    A delivery of 60 apples arrived. Then they sold 29 more before closing.
    How many apples did they have at the end of the day?
    """

    result = self_consistent_answer(problem, client, num_samples=5)

    print(f"\n{'='*60}")
    print(f"Final Answer: {result['answer']}")
    print(f"Confidence: {result['confidence']*100:.0f}%")
    print(f"Vote Distribution: {result['vote_distribution']}")

Exercise 4: Building a Few-Shot CoT Prompt Library
Create reusable CoT prompt templates for different reasoning domains.
"""Exercise 4: Few-Shot CoT Prompt Library
Build a library of domain-specific CoT prompts for consistent results."""
from dataclasses import dataclass
from typing import Optional


@dataclass
class CoTPromptTemplate:
    """A template for few-shot CoT prompting."""
    name: str
    domain: str
    description: str
    examples: list[dict[str, str]]  # List of {"question": ..., "reasoning": ..., "answer": ...}
    instructions: str

    def build_prompt(self, question: str, max_examples: int = 2) -> str:
        """Build a complete prompt with examples and the new question."""

        # Start with instructions
        prompt_parts = [self.instructions, ""]

        # Add examples
        for i, example in enumerate(self.examples[:max_examples]):
            prompt_parts.append(f"Example {i+1}:")
            prompt_parts.append(f"Q: {example['question']}")
            prompt_parts.append(f"A: {example['reasoning']}")
            prompt_parts.append(f"Therefore, the answer is: {example['answer']}")
            prompt_parts.append("")

        # Add the new question
        prompt_parts.append("Now solve this problem:")
        prompt_parts.append(f"Q: {question}")
        prompt_parts.append("A: Let's think step by step.")

        return "\n".join(prompt_parts)
# Create a library of prompts
COT_LIBRARY = {
    "math_word_problem": CoTPromptTemplate(
        name="Math Word Problems",
        domain="mathematics",
        description="Multi-step arithmetic word problems",
        instructions="Solve the following math problems step by step. Show your work clearly.",
        examples=[
            {
                "question": "A bakery makes 45 loaves of bread. They sell 23 in the morning and make 30 more. How many do they have?",
                "reasoning": "Let's work through this:\n1. Start with 45 loaves\n2. After selling 23: 45 - 23 = 22 loaves\n3. After making 30 more: 22 + 30 = 52 loaves",
                "answer": "52 loaves"
            },
            {
                "question": "A movie theater has 8 rows with 12 seats each. If 67 seats are taken, how many are empty?",
                "reasoning": "Let's calculate:\n1. Total seats: 8 × 12 = 96 seats\n2. Empty seats: 96 - 67 = 29 seats",
                "answer": "29 empty seats"
            }
        ]
    ),

    "logical_deduction": CoTPromptTemplate(
        name="Logical Deduction",
        domain="logic",
        description="Problems requiring logical reasoning and deduction",
        instructions="Analyze the following logical problems. State your premises and derive conclusions step by step.",
        examples=[
            {
                "question": "All mammals are warm-blooded. All dogs are mammals. Is a golden retriever warm-blooded?",
                "reasoning": "Let's analyze:\n1. Premise 1: All mammals are warm-blooded\n2. Premise 2: All dogs are mammals\n3. A golden retriever is a type of dog\n4. Therefore, a golden retriever is a mammal (from premise 2)\n5. Therefore, a golden retriever is warm-blooded (from premise 1)",
                "answer": "Yes, a golden retriever is warm-blooded"
            }
        ]
    ),

    "code_debugging": CoTPromptTemplate(
        name="Code Debugging",
        domain="programming",
        description="Analyzing and fixing code issues",
        instructions="Debug the following code by analyzing it step by step. Identify the issue and explain the fix.",
        examples=[
            {
                "question": "Why does this code print wrong results? `for i in range(10): total += i`",
                "reasoning": "Let's trace through:\n1. The loop iterates i from 0 to 9\n2. Each iteration adds i to 'total'\n3. BUG: 'total' is never initialized!\n4. Python will raise NameError: 'total' is not defined\n5. Fix: Add `total = 0` before the loop",
                "answer": "The bug is that 'total' is not initialized. Add `total = 0` before the loop."
            }
        ]
    ),

    "causal_reasoning": CoTPromptTemplate(
        name="Causal Reasoning",
        domain="causality",
        description="Analyzing cause and effect relationships",
        instructions="Analyze the causal relationships in the following scenarios. Consider multiple factors and their interactions.",
        examples=[
            {
                "question": "Sales increased after a new advertising campaign launched. Does this prove the campaign caused higher sales?",
                "reasoning": "Let's analyze causation vs correlation:\n1. Observed: Sales increased after campaign launch\n2. This shows correlation (events happened together)\n3. But other factors could explain it:\n   - Seasonal shopping patterns\n   - Economic conditions improving\n   - Competitor going out of business\n   - New product features released simultaneously\n4. To prove causation, we would need:\n   - Control group (no advertising)\n   - Or A/B testing\n   - Or other causal inference methods",
                "answer": "No, correlation doesn't prove causation. Other factors could explain the increase. A controlled experiment would be needed to establish causality."
            }
        ]
    )
}
def solve_with_template(
    question: str,
    template_name: str,
    llm_client,
    temperature: float = 0
) -> str:
    """Solve a problem using a specific CoT template."""

    if template_name not in COT_LIBRARY:
        raise ValueError(f"Unknown template: {template_name}. Available: {list(COT_LIBRARY.keys())}")

    template = COT_LIBRARY[template_name]
    prompt = template.build_prompt(question)

    response = llm_client.chat.completions.create(
        model="gpt-5",
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature
    )

    return response.choices[0].message.content
# Example usage
if __name__ == "__main__":
    from openai import OpenAI
    client = OpenAI()

    # Use the math template
    math_problem = "A train leaves at 2:00 PM traveling at 60 mph. Another train leaves from the same station at 3:00 PM traveling at 80 mph. At what time will the second train catch up to the first?"

    print("=== Math Problem ===")
    print(f"Q: {math_problem}")
    print(f"\nA: {solve_with_template(math_problem, 'math_word_problem', client)}")

    # Use the logic template
    logic_problem = "If it rains, the ground is wet. The ground is wet. Can we conclude it rained?"

    print("\n=== Logic Problem ===")
    print(f"Q: {logic_problem}")
    print(f"\nA: {solve_with_template(logic_problem, 'logical_deduction', client)}")

Key Takeaways
- “Let’s think step by step” - These 5 words can transform model performance on reasoning tasks
- CoT makes reasoning visible - The model’s “thinking” becomes part of its context, enabling better outputs
- ReAct combines thinking and doing - The foundation of modern AI agents
- Self-consistency improves reliability - Multiple reasoning paths catch errors
- Know the limits - CoT helps with complex reasoning but isn’t magic for all tasks
- Use tools for calculations - Don’t trust LLMs for math; use calculators
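For the last takeaway, one possible shape for a calculator tool is an evaluator that walks the parsed expression instead of calling `eval()`. This is a sketch, not the module's own implementation: `safe_calculate` is a hypothetical name, and it deliberately supports only `+`, `-`, `*`, `/`, parentheses, and unary signs.

```python
import ast
import operator

# Whitelisted operators; anything else (calls, names, **) is rejected
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str) -> float:
    """Evaluate basic arithmetic by walking the AST, without eval()."""

    def _eval(node: ast.AST) -> float:
        if isinstance(node, ast.Expression):
            return _eval(node.body)
        if isinstance(node, ast.Constant):
            # Accept plain numbers only (bool is a subclass of int, so exclude it)
            if isinstance(node.value, (int, float)) and not isinstance(node.value, bool):
                return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](_eval(node.operand))
        raise ValueError(f"Unsupported expression: {expression!r}")

    return _eval(ast.parse(expression, mode="eval"))


# Waste question from Exercise 2: 67 million people * 2 kg * 365 days, in million kg
print(safe_calculate("67 * 2 * 365"))  # prints: 48910
```

Leaving out exponentiation is a design choice here: it keeps a model-supplied expression such as a tower of powers from stalling the tool.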
Did You Know?
The Accidental Discovery
Chain-of-thought prompting was discovered partly by accident. Researchers at Google were testing GPT-3 on math problems and noticed that when the model happened to “show its work” in the output, it got the answer right more often.
They asked: “What if we explicitly asked it to show its work?” The result was the CoT paper, which has been cited over 4,000 times.
The “Let’s” Breakthrough
The specific phrase “Let’s think step by step” was found through systematic testing of hundreds of variations. Interestingly:
- “I will think step by step” - worse (too assertive)
- “Think step by step” - worse (too commanding)
- “You should think step by step” - worse (creates pressure)
- “Let’s think step by step” - best (collaborative, process-oriented)
The word “let’s” creates a collaborative framing that seems to work better with how LLMs were trained.
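A comparison like the one above can be reproduced with a small harness. The sketch below is only scaffolding under stated assumptions: `ask` is a hypothetical callable wrapping whatever model is being tested, scoring is a crude substring match, and `stub_ask` returns a canned reply so the code runs offline. The stub's scores say nothing about how the phrases perform on a real model.

```python
from typing import Callable

# The four phrasings listed above
TRIGGERS = [
    "I will think step by step.",
    "Think step by step.",
    "You should think step by step.",
    "Let's think step by step.",
]


def build_zero_shot_prompt(question: str, trigger: str) -> str:
    """Place the trigger phrase at the start of the answer slot."""
    return f"Q: {question}\nA: {trigger}"


def compare_triggers(
    dataset: list[tuple[str, str]],
    ask: Callable[[str], str],
) -> dict[str, float]:
    """Accuracy per trigger: share of replies that contain the expected answer."""
    scores = {}
    for trigger in TRIGGERS:
        correct = sum(
            expected in ask(build_zero_shot_prompt(question, trigger))
            for question, expected in dataset
        )
        scores[trigger] = correct / len(dataset)
    return scores


def stub_ask(prompt: str) -> str:
    """Offline stand-in for a model call; always gives the same reply."""
    return "The answer is 4."


scores = compare_triggers([("What is 2 + 2?", "4")], stub_ask)
for trigger, accuracy in scores.items():
    print(f"{accuracy:.0%}  {trigger}")
```

To run a real comparison, swap `stub_ask` for a function that sends the prompt to a model and use a dataset large enough for the differences to be meaningful.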
OpenAI’s Hidden Prompts
When OpenAI released gpt-5, users discovered that behind the scenes, the system prompt included CoT-style instructions. The model was being told to “think step by step” before generating responses - they had baked CoT into the product!
This was revealed when users found ways to extract the system prompt, showing that even the model creators considered CoT essential.
The Reasoning vs Pattern Matching Debate
A controversial 2023 paper argued that LLMs don’t actually “reason” - they pattern-match on the reasoning patterns in their training data. When CoT works, it’s because the model has seen similar reasoning patterns, not because it’s truly reasoning.
This sparked a fierce debate: Does it matter if the model is “really” reasoning, as long as the outputs are correct? The pragmatic answer: probably not. But it does explain why novel reasoning problems remain hard.
AlphaProof and the Future
In 2024, DeepMind’s AlphaProof system used a combination of LLM-generated reasoning and formal verification to solve International Mathematical Olympiad problems at a silver-medal level.
The key insight: generate many reasoning attempts, verify each with a formal prover, keep the ones that work. This “generate and verify” approach may be the future of AI reasoning.
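The generate-and-verify pattern can be shown at toy scale. This sketch is an analogy only and does not reflect how AlphaProof is built: `propose_candidates` is a hypothetical stand-in for a model proposing attempts (here, random integers), and `verify` plays the role of the formal prover with an exact check of the equation x² - 5x + 6 = 0.

```python
import random


def propose_candidates(num_candidates: int, seed: int = 0) -> list[int]:
    """Stand-in for a model: propose many guesses, most of them wrong."""
    rng = random.Random(seed)
    return [rng.randint(-10, 10) for _ in range(num_candidates)]


def verify(x: int) -> bool:
    """Stand-in for a formal prover: an exact check with no false positives."""
    return x * x - 5 * x + 6 == 0


def generate_and_verify(num_candidates: int = 200) -> set[int]:
    """Keep only the candidates that pass verification."""
    return {x for x in propose_candidates(num_candidates) if verify(x)}


solutions = generate_and_verify()
print(sorted(solutions))
```

The generator is allowed to be unreliable because the verifier is not: every value that survives is a true root, however many wrong guesses were discarded along the way.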
Further Reading
Papers
- Chain-of-Thought Prompting (Wei et al., 2022) - The original CoT paper
- Large Language Models are Zero-Shot Reasoners (Kojima et al., 2022) - “Let’s think step by step”
- ReAct: Synergizing Reasoning and Acting (Yao et al., 2022) - Combining thought and action
- Self-Consistency Improves Chain of Thought Reasoning (Wang et al., 2022)
- Tree of Thoughts (Yao et al., 2023) - Multi-path reasoning
Tutorials
Next Steps
After completing this module, you’ll be ready for:
Module 18: LangGraph for Stateful Workflows - Build sophisticated agents with persistent state, cycles, and complex control flow. LangGraph takes the ReAct pattern and scales it to production.
Last updated: 2025-11-25 Status: In Progress