Build a Multi-Agent Growth System from Scratch
Build a multi-agent growth system by defining hard agent boundaries (Scout → Researcher → Copy → Compliance → Publisher), wiring deterministic orchestration with quality gates, and shipping a single “growth loop” you can run daily. This runbook gives you the tool stack, prompts, code, schemas, and QA rubric to deploy multi-agent AI marketing in hours.
Key takeaways:
- Use specialized agents with strict I/O schemas, not one “do everything” model.
- Put quality gates between agents or your pipeline will silently degrade.
- Start with one channel + one loop, then expand agents and sources.
I’ve built growth engines where the bottleneck wasn’t ideas. It was throughput and correctness: spotting intent signals early, turning them into campaigns fast, and keeping brand/compliance risk near zero. At Uber and Postmates, the teams that won were the ones with repeatable systems and tight feedback loops. Today, multi-agent systems let you compress that cycle further, but only if you build them like production software, not prompt art.
Most teams fail in two places. First, they don’t constrain agents, so outputs become inconsistent and impossible to QA. Second, they skip gating, so low-quality inputs cascade into expensive mistakes (bad claims, off-brand copy, wrong targeting, duplicated content). This runbook fixes both.
You’ll ship a deterministic multi-agent AI marketing workflow with: (1) a Scout agent that extracts weekly intent signals, (2) a Research agent that validates and enriches them, (3) a Copy agent that generates channel-ready assets, (4) a Compliance agent that blocks risky outputs, and (5) an Orchestrator that enforces schemas and QA thresholds. You can run it manually in ChatGPT/Claude today, or automate it via CrewAI with the included code.
1. Objective
Deploy a deterministic multi-agent growth system that turns intent signals into approved, channel-ready marketing assets (ads, emails, landing page briefs) with enforced QA gates.
2. Inputs Required
- Your business context
  - Product description, ICP, pricing model, geos served
  - Primary conversion event (signup, demo, purchase)
  - Brand voice notes (3–7 bullets) and “never say” list
- Channel constraints
  - Target channel(s) for this first loop: choose one (e.g., Google Search ads, outbound email, LinkedIn posts)
  - Any regulatory constraints (health, finance, employment, minors)
- Data + access
  - Access to at least one intent signal source:
    - Google Search Console (preferred), or
    - Website analytics (GA4), or
    - CRM (HubSpot/Salesforce), or
    - Support tickets / call transcripts, or
    - Community/Reddit/LinkedIn scraping (manual is fine to start)
  - Access to one LLM: Claude or ChatGPT (API optional)
- Operational assumptions
  - You can commit to a daily or 3x/week run cadence
  - You will start with one “growth loop” and one channel to keep it deterministic
3. Tool Stack
LLMs
- Primary: Claude 3.5 Sonnet (Alt: Claude 3 Opus; GPT-4.1; GPT-4o)
- Secondary “cheap” model for preprocessing: GPT-4o-mini (Alt: Claude Haiku)
Orchestration
- Primary: CrewAI (Python) (Alt: LangGraph; LangChain Agents; PydanticAI)
- Manual mode: ChatGPT/Claude Projects + copy/paste between agents (Alt: Google Docs + checklists)
Data ingestion
- Primary: Google Search Console API (Alt: GA4 API; HubSpot API; Salesforce reports export)
- Scraping: Apify (Alt: Firecrawl; Playwright)
Execution environment
- Primary: Cursor (Alt: VS Code)
- Runtime: Python 3.11+
Quality + safety
- Pydantic for schema validation (Alt: JSON Schema)
- Policy checks: OpenAI/Anthropic moderation endpoints (Alt: custom regex + forbidden-claims list)
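If you take the custom-gate route, the check can be as small as a Pydantic model plus a regex pass. A minimal sketch, assuming the Copy Agent variant JSON from Section 4 and a placeholder forbidden-terms list you would replace with your own:

```python
# Hypothetical gate: validate Copy Agent variants against the schema and a
# forbidden-terms list. The terms below are examples, not a real policy.
import re
from pydantic import BaseModel, Field, ValidationError

FORBIDDEN = [r"\bguarantee[ds]?\b", r"\binstant(ly)?\b", r"\bcure[sd]?\b"]

class Variant(BaseModel):
    variant_id: str
    headline: str
    primary_text: str
    cta: str
    claims: list[str]
    compliance_notes: list[str] = Field(default_factory=list)

def lint_variant(raw: dict) -> list[str]:
    """Return a list of problems; an empty list means the variant passes."""
    try:
        v = Variant(**raw)
    except ValidationError as e:
        return [f"schema: {e}"]
    text = f"{v.headline} {v.primary_text} {v.cta}".lower()
    return [f"forbidden term: {p}" for p in FORBIDDEN if re.search(p, text)]
```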
4. Prompt Pack
Use these exactly. They are designed to force deterministic outputs and clean handoffs.
# PROMPT 1 (Claude 3.5 Sonnet) — SYSTEM: Scout Agent (Intent Signal Extractor)
You are the Scout Agent in a multi-agent ai marketing pipeline.
Goal: extract high-intent growth signals from the provided data dump, then output ONLY valid JSON that matches the schema.
Hard rules:
- Output MUST be valid JSON only. No prose.
- No invented facts. If uncertain, set "confidence" <= 0.4 and add "assumptions".
- Each signal must be directly grounded in the input text. Quote evidence.
- Create 8–15 signals, deduplicate aggressively.
INPUTS YOU WILL RECEIVE:
- company_context (text)
- data_dump (text: queries, CRM notes, transcripts, comments)
OUTPUT JSON SCHEMA:
{
  "run_id": "string",
  "generated_at": "ISO-8601 string",
  "signals": [
    {
      "signal_id": "S-###",
      "type": "search_intent|competitor_mention|pain_point|feature_request|objection|use_case",
      "summary": "string (<= 18 words)",
      "evidence_quotes": ["string", "string"],
      "audience": "string (ICP segment guess)",
      "job_to_be_done": "string",
      "implied_stage": "aware|considering|evaluating|ready_to_buy",
      "confidence": 0.0,
      "assumptions": ["string"]
    }
  ]
}
Now wait for company_context and data_dump.
# PROMPT 2 (ChatGPT GPT-4.1 or Claude) — Research Agent (Validate + Enrich)
Role: Research Agent. You take Scout signals and convert them into executable growth hypotheses.
You MUST follow the steps and output ONLY YAML.
Steps:
1) Cluster signals into 3–6 themes.
2) For each theme, produce 2 hypotheses with clear causal logic.
3) For each hypothesis, propose 3 experiments for ONE chosen channel: {CHANNEL}.
4) Each experiment must include targeting, message angle, and a "kill criterion".
5) Add "risk_flags" if any claim could be regulated, sensitive, or unverifiable.
Constraints:
- No stats unless provided in inputs.
- No references to tools not in the tool stack unless optional.
- Keep each experiment runnable in < 2 days.
INPUTS:
- company_context
- scout_output_json
- channel = {CHANNEL}
OUTPUT YAML SCHEMA:
themes:
  - theme_id: T-1
    name: ""
    signals: ["S-001", "S-002"]
    hypotheses:
      - hypothesis_id: H-1A
        statement: ""
        why_now: ""
        audience: ""
        mechanism: ""
        experiments:
          - experiment_id: E-1
            channel: ""
            targeting: ""
            offer: ""
            creative_brief: ""
            kill_criterion: ""
            measurement: ""
            risk_flags: [""]
Return only YAML.
# PROMPT 3 (Claude 3.5 Sonnet) — Copy Agent (Asset Generator + Variants)
You are the Copy Agent. Convert ONE experiment into channel-ready assets.
Output ONLY JSON. No prose.
Rules:
- Follow brand_voice and "never_say".
- No superlatives you can't prove.
- Provide 5 variants per asset.
- Include a "claims" array: every factual claim in the copy, so Compliance can check it.
INPUTS:
- company_context
- brand_voice (bullets)
- never_say (bullets)
- experiment_yaml (single experiment selected)
- channel_specs (character limits, format requirements)
OUTPUT JSON:
{
  "experiment_id": "string",
  "channel": "string",
  "assets": [
    {
      "asset_type": "ad_copy|email|landing_page_outline|linkedin_post",
      "format_notes": "string",
      "variants": [
        {
          "variant_id": "V1",
          "headline": "string",
          "primary_text": "string",
          "cta": "string",
          "claims": ["string"],
          "compliance_notes": ["string"]
        }
      ]
    }
  ]
}
# PROMPT 4 (ChatGPT GPT-4.1) — Compliance Agent (Brand + Safety Gate)
Role: Compliance Agent. You review Copy Agent JSON and either APPROVE or BLOCK.
Output ONLY JSON.
Checks:
- Unverifiable claims
- Regulated/sensitive categories
- Brand violations (never_say)
- Ambiguous promises ("guaranteed", "instant", etc.)
- Missing disclaimers when needed (only if clearly required)
OUTPUT JSON:
{
  "experiment_id": "",
  "decision": "APPROVE|BLOCK",
  "blocked_reasons": ["string"],
  "required_edits": [
    {
      "variant_id": "V1",
      "edit_instructions": ["string"]
    }
  ],
  "approved_variant_ids": ["V1", "V2"]
}
If BLOCK, be specific and minimal. No prose.
5. Execution Steps
Follow this exact sequence. Don’t reorder it. Determinism comes from fixed handoffs and gates.
1. Pick one channel for v1
   - Choose exactly one: `google_search_ads` or `outbound_email` or `linkedin_organic`.
   - Decision rule: pick the channel with the fastest path to shipping within your team today.
2. Create your “Company Context” doc (1 page)
   - Product: what it does, who it’s for, top 3 use cases.
   - ICP: roles, company size, geo.
   - Differentiators: 3 bullets.
   - Disallowed claims: anything you can’t prove.
   - I did this at Postmates constantly: one source of truth stopped teams from inventing positioning mid-sprint.
3. Collect a raw data dump (30–90 minutes)
   - Minimum viable: export top queries from Search Console for the last 28 days + 20 recent sales call notes.
   - If you don’t have those: paste 50 support tickets or 50 “why are you switching?” snippets.
   - Keep it messy. The Scout agent’s job is structure.
4. Run Scout Agent (Prompt 1)
   - Input: `company_context` + `data_dump`.
   - Output: `scout_output_json`.
   - Gate: validate that the JSON parses (see the gate sketch after these steps). If it doesn’t parse, rerun Scout with “Output valid JSON only.”
5. Run Research Agent (Prompt 2)
   - Input: `company_context` + `scout_output_json` + chosen `channel`.
   - Output: `themes` + `hypotheses` + `experiments` YAML.
   - Gate: ensure 3–6 themes, 2 hypotheses per theme, and 3 experiments per hypothesis.
6. Select exactly ONE experiment
   - Selection rubric (simple, deterministic): choose the experiment with the highest stated confidence, then the lowest implementation time, then the lowest risk_flags count.
   - Don’t boil the ocean.
7. Run Copy Agent (Prompt 3)
   - Provide `brand_voice`, `never_say`, and `channel_specs`.
   - Output: `copy_assets_json`.
8. Run Compliance Agent (Prompt 4)
   - Output: `decision_json`.
   - If BLOCK: apply the required edits and rerun Compliance until APPROVE.
9. Publish
   - Google Ads: create one ad group per hypothesis; pin headlines only if required.
   - Outbound: create one sequence with 2 follow-ups; keep personalization minimal until it works.
   - LinkedIn: schedule 5 posts; do not post all variants the same day.
10. Log the run
    - Save artifacts: scout JSON, research YAML, copy JSON, compliance JSON.
    - You need lineage for debugging. In practice, most “AI broke” moments are traceability problems.
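Steps 4 and 5 carry the gates that matter most. If you are running the loop manually, a small script keeps those checks honest. A sketch, assuming you save the raw Scout JSON and Research YAML to local variables or files:

```python
# Gate checks for the manual loop. Counts mirror the prompts in Section 4.
import json
import yaml  # pip install pyyaml

def gate_scout(raw: str) -> dict:
    """Step 4 gate: Scout output must parse as JSON and contain 8-15 signals."""
    data = json.loads(raw)
    n = len(data.get("signals", []))
    assert 8 <= n <= 15, f"expected 8-15 signals, got {n}"
    return data

def gate_research(raw: str) -> dict:
    """Step 5 gate: 3-6 themes, 2 hypotheses per theme, 3 experiments per hypothesis."""
    data = yaml.safe_load(raw)
    themes = data["themes"]
    assert 3 <= len(themes) <= 6, f"expected 3-6 themes, got {len(themes)}"
    for t in themes:
        assert len(t["hypotheses"]) == 2, f"theme {t['theme_id']} needs exactly 2 hypotheses"
        for h in t["hypotheses"]:
            assert len(h["experiments"]) == 3, f"hypothesis {h['hypothesis_id']} needs exactly 3 experiments"
    return data
```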
Optional: Automate with CrewAI (Executable Code)
This gets you from prompts to a runnable pipeline with schema validation. Copy-paste and run.
# multi_agent_growth.py
# pip install crewai pydantic python-dotenv openai anthropic pyyaml
import os, json, yaml, datetime
from pydantic import BaseModel, Field, ValidationError
from crewai import Agent, Task, Crew, Process

RUN_ID = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")

class Signal(BaseModel):
    signal_id: str
    type: str
    summary: str
    evidence_quotes: list[str]
    audience: str
    job_to_be_done: str
    implied_stage: str
    confidence: float = Field(ge=0.0, le=1.0)
    assumptions: list[str]

class ScoutOutput(BaseModel):
    run_id: str
    generated_at: str
    signals: list[Signal]

def must_parse_json(s: str) -> dict:
    # Strip whitespace and accidental markdown fences before parsing.
    s = s.strip()
    if s.startswith("```"):
        s = s.strip("`").strip()
        if s.lower().startswith("json"):
            s = s[4:]
    return json.loads(s.strip())

company_context = open("company_context.txt", "r", encoding="utf-8").read()
data_dump = open("data_dump.txt", "r", encoding="utf-8").read()
brand_voice = open("brand_voice.txt", "r", encoding="utf-8").read()
never_say = open("never_say.txt", "r", encoding="utf-8").read()
channel = os.getenv("CHANNEL", "outbound_email")

scout_agent = Agent(
    role="Scout Agent",
    goal="Extract grounded intent signals as strict JSON.",
    backstory="Disciplined growth signal extraction with evidence quotes.",
    allow_delegation=False,
    verbose=True
)
research_agent = Agent(
    role="Research Agent",
    goal="Turn signals into themed hypotheses and runnable experiments in YAML.",
    backstory="Senior growth strategist who writes testable hypotheses.",
    allow_delegation=False,
    verbose=True
)
copy_agent = Agent(
    role="Copy Agent",
    goal="Generate channel-ready creative variants as strict JSON with claim extraction.",
    backstory="Performance copywriter who respects constraints and format.",
    allow_delegation=False,
    verbose=True
)
compliance_agent = Agent(
    role="Compliance Agent",
    goal="Block risky/unverifiable claims and enforce never_say list.",
    backstory="Brand safety reviewer with zero patience for sloppy claims.",
    allow_delegation=False,
    verbose=True
)

scout_task = Task(
    description=f"""
Return ONLY JSON per schema. run_id={RUN_ID}.
company_context:
{company_context}
data_dump:
{data_dump}
""",
    agent=scout_agent,
    expected_output="Valid JSON with signals"
)
research_task = Task(
    description=f"""
Return ONLY YAML per schema.
channel = {channel}
company_context:
{company_context}
scout_output_json:
{{{{SCOUT_JSON}}}}
""",
    agent=research_agent,
    expected_output="Valid YAML with themes/hypotheses/experiments"
)
copy_task = Task(
    description=f"""
Return ONLY JSON.
company_context:
{company_context}
brand_voice:
{brand_voice}
never_say:
{never_say}
experiment_yaml:
{{{{ONE_EXPERIMENT_YAML}}}}
channel_specs:
- outbound_email: subject<=60 chars, body<=120 words, plain text
- google_search_ads: headlines<=30 chars, descriptions<=90 chars
- linkedin_organic: <=1800 chars, hook in first 200
""",
    agent=copy_agent,
    expected_output="Valid JSON with 5 variants and claims array"
)
compliance_task = Task(
    description=f"""
Return ONLY JSON.
Inputs:
never_say:
{never_say}
copy_assets_json:
{{{{COPY_JSON}}}}
""",
    agent=compliance_agent,
    expected_output="APPROVE or BLOCK with required edits"
)

# v1 automates only the Scout stage; the research/copy/compliance tasks above
# are defined so you can extend the crew once your prompts stabilize.
crew = Crew(
    agents=[scout_agent, research_agent, copy_agent, compliance_agent],
    tasks=[scout_task],
    process=Process.sequential
)

scout_raw = crew.kickoff()
scout_json = must_parse_json(str(scout_raw))
try:
    ScoutOutput(**scout_json)
except ValidationError as e:
    raise SystemExit(f"Scout output failed schema validation: {e}")

# Deterministic experiment selection happens outside the agent:
# pick the highest-confidence signal theme manually or implement a selector.
os.makedirs("artifacts", exist_ok=True)
open(f"artifacts/{RUN_ID}_scout.json", "w", encoding="utf-8").write(json.dumps(scout_json, indent=2))
print("Saved scout artifact. Next: run research prompt with SCOUT_JSON inserted.")
Deterministic note: I keep experiment selection outside the model for v1. Models are good at generating options; they’re worse at consistent ranking unless you strictly specify scoring and parse it. If you want full automation, add a “Selector” agent with a fixed scoring formula and JSON output.
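If you do automate selection, here is a minimal sketch of that fixed rule (highest confidence, then fastest to ship, then fewest risk flags). Note that `confidence` and `implementation_time_days` are not in the Research YAML schema above; treat them as fields you would add to that schema or to a Selector agent's output.

```python
# Deterministic experiment selector, kept outside the model.
# Assumes each experiment dict carries "confidence" and "implementation_time_days",
# which are NOT in the Research schema above and must be added before using this.
def select_experiment(research: dict) -> dict:
    experiments = [
        e
        for theme in research["themes"]
        for hyp in theme["hypotheses"]
        for e in hyp["experiments"]
    ]
    if not experiments:
        raise ValueError("Research output contained no experiments")
    return sorted(
        experiments,
        key=lambda e: (
            -float(e.get("confidence", 0.0)),              # highest confidence first
            float(e.get("implementation_time_days", 99)),  # then fastest to ship
            len(e.get("risk_flags", [])),                  # then fewest risk flags
        ),
    )[0]
```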
6. Output Schema
This is the strict artifact format you should store per run. If you keep this consistent, you can replay runs, diff outputs, and train internal standards.
{
  "run_metadata": {
    "run_id": "20260219-153000",
    "channel": "outbound_email",
    "owner": "name@company.com",
    "created_at": "2026-02-19T15:30:00Z"
  },
  "inputs": {
    "company_context_doc": "sha256:...",
    "data_dump_doc": "sha256:...",
    "brand_voice_doc": "sha256:...",
    "never_say_doc": "sha256:..."
  },
  "stage_outputs": {
    "scout": {
      "signals": [
        {
          "signal_id": "S-001",
          "type": "pain_point",
          "summary": "Ops teams can’t reconcile invoices across vendors",
          "evidence_quotes": ["..."],
          "audience": "Ops manager, mid-market",
          "job_to_be_done": "Close books faster without manual reconciliation",
          "implied_stage": "evaluating",
          "confidence": 0.7,
          "assumptions": ["Assumes invoice reconciliation is a core workflow"]
        }
      ]
    },
    "research": {
      "themes": [
        {
          "theme_id": "T-1",
          "name": "Manual reconciliation pain",
          "hypotheses": [
            {
              "hypothesis_id": "H-1A",
              "experiments": [
                {
                  "experiment_id": "E-1",
                  "channel": "outbound_email",
                  "targeting": "Controllers at 200-2000 employee companies",
                  "offer": "15-min teardown of your reconciliation workflow",
                  "kill_criterion": "If <X replies after Y sends",
                  "measurement": "reply rate + meeting rate",
                  "risk_flags": []
                }
              ]
            }
          ]
        }
      ]
    },
    "copy": {
      "assets": [
        {
          "asset_type": "email",
          "variants": [
            {
              "variant_id": "V1",
              "headline": "Quick question on reconciliation",
              "primary_text": "…",
              "cta": "Open to a 15-min teardown?",
              "claims": ["…"],
              "compliance_notes": []
            }
          ]
        }
      ]
    },
    "compliance": {
      "decision": "APPROVE",
      "approved_variant_ids": ["V1", "V2"]
    }
  }
}
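The `sha256:` values under `inputs` are content hashes of the docs you fed in, so you can tell later exactly which versions produced a run. A sketch of one way to compute them (file names match the CrewAI example above; adjust to your layout):

```python
import hashlib

def doc_hash(path: str) -> str:
    # Hash the input doc so the run artifact records exactly which version was used.
    with open(path, "rb") as f:
        return "sha256:" + hashlib.sha256(f.read()).hexdigest()

inputs = {
    "company_context_doc": doc_hash("company_context.txt"),
    "data_dump_doc": doc_hash("data_dump.txt"),
    "brand_voice_doc": doc_hash("brand_voice.txt"),
    "never_say_doc": doc_hash("never_say.txt"),
}
```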
7. QA Rubric
Run this after each stage. If you skip it, you will ship junk faster.
| Stage | Check | Pass/Fail Criteria | Score (0-5) | Threshold |
|---|---|---|---|---|
| Scout | Evidence grounding | Every signal has ≥1 direct quote from data_dump | 0-5 | ≥4 |
| Scout | Deduplication | No two signals describe the same intent in different words | 0-5 | ≥4 |
| Scout | Coverage | 8–15 signals; includes at least 3 pain points and 2 objections (if present in data) | 0-5 | ≥3 |
| Research | Hypothesis testability | Hypotheses have mechanism + audience + falsifiable outcome | 0-5 | ≥4 |
| Research | Experiment executability | Each experiment runnable in <2 days with your stack | 0-5 | ≥4 |
| Copy | Format compliance | Meets channel_specs limits; provides 5 variants | 0-5 | ≥5 |
| Copy | Claim extraction | All factual claims listed in claims array | 0-5 | ≥4 |
| Compliance | Safety | No unverifiable/regulated claims ship | 0-5 | ≥5 |
| Overall | Determinism | Outputs match schemas; no prose in JSON/YAML | 0-5 | ≥5 |
Operational rule: if any stage fails threshold, you rerun only that stage with corrected inputs. Don’t regenerate upstream unless the upstream artifact is wrong.
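If you want that rule to be mechanical rather than a judgment call, the thresholds are small enough to encode. A sketch, where the check names are shorthand for the rows above and the scores come from whoever reviews the stage:

```python
# Encode the rubric so "which stage do I rerun?" is a lookup, not a debate.
THRESHOLDS = {
    ("scout", "evidence_grounding"): 4,
    ("scout", "deduplication"): 4,
    ("scout", "coverage"): 3,
    ("research", "hypothesis_testability"): 4,
    ("research", "experiment_executability"): 4,
    ("copy", "format_compliance"): 5,
    ("copy", "claim_extraction"): 4,
    ("compliance", "safety"): 5,
    ("overall", "determinism"): 5,
}

def stages_to_rerun(scores: dict[tuple[str, str], int]) -> set[str]:
    """Return the stages whose score fell below threshold."""
    return {stage for (stage, check), minimum in THRESHOLDS.items()
            if scores.get((stage, check), 0) < minimum}
```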
8. Failure Modes
These are the ones I see every week when teams spin up multi-agent AI marketing systems.
- Agents “helpfully” add prose around JSON
  - Symptom: parsing errors, broken automation.
  - Fix: put “Output ONLY valid JSON. No prose.” at the top and rerun. If it still happens, add “If you output anything other than JSON, the run fails.”
- Scout invents intent not present in the dump
  - Symptom: signals sound plausible but lack hard quotes.
  - Fix: enforce the `evidence_quotes` requirement and fail the stage if it is empty. In practice, I also cap Scout at 15 signals to reduce creativity drift.
- Research outputs generic experiments
  - Symptom: “Run A/B test messaging” without targeting, offer, or kill criteria.
  - Fix: require the fields targeting, offer, creative_brief, kill_criterion, and measurement. If any is missing, fail the stage.
- Copy outputs “marketing filler” or banned words
  - Symptom: vague promises, fluffy adjectives, off-brand tone.
  - Fix: add a `never_say` list and enforce it in Compliance. Also pass 3 “good examples” and 3 “bad examples” into the Copy input if your brand is picky.
- Compliance blocks everything
  - Symptom: Compliance becomes a brake because product claims aren’t documented.
  - Fix: create a “claims library” doc with what you can safely say (pricing, integrations, guarantees = none unless legal approves). At Uber, this was effectively a whitelist; it sped up launches.
- Orchestrator can’t choose an experiment consistently
  - Symptom: different runs pick different experiments.
  - Fix: use a deterministic selection rule (highest confidence → lowest effort → lowest risk_flags count). Keep the selector outside the model or force a numeric scoring formula.
- Pipeline produces assets but no shipping behavior
  - Symptom: lots of docs, no campaigns live.
  - Fix: define “Done” as published artifacts + logged run + next run scheduled. A growth system that doesn’t ship is a content generator.
9. Iteration Loop
Run this loop 3 times before you add more agents.
1. Daily run cadence (or 3x/week)
   - Same input types, same schemas, same QA. Consistency beats novelty early.
2. Add one new data source per week
   - Week 1: Search Console queries
   - Week 2: CRM lost reasons
   - Week 3: support tickets
   - Each new source increases Scout signal quality without changing the rest of the system.
3. Create a “Gold Standard” library
   - Store 10 approved experiments and 20 approved copy variants.
   - Feed 2–3 of them into Copy as examples. Models track patterns well when you show them.
4. Tighten gates as you learn
   - Add regex blocks for forbidden claims (e.g., “guarantee”, “cure”, “instant”).
   - Add brand lint checks: reading level, tone constraints, punctuation rules.
5. Instrument outcomes
   - Append performance back into artifacts (a helper sketch follows this list):
     - sent, opens, replies, meetings (email)
     - impressions, CTR, CVR (ads)
   - Next run: include last run’s results in the Research input. That’s how the system stops repeating losing angles.
6. Only then expand agents
   - Add a “Publisher” agent to format for tools (Google Ads Editor CSV, HubSpot sequence JSON).
   - Add a “Debugger” agent that reads failures and proposes prompt fixes.
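For step 5, a tiny helper makes appending performance a one-liner per run. The file name and outcome numbers below are illustrative:

```python
import json

def log_outcomes(run_path: str, outcomes: dict) -> None:
    # Append observed performance onto a saved run artifact so the next
    # Research run can see what actually happened.
    with open(run_path, "r", encoding="utf-8") as f:
        run = json.load(f)
    run["outcomes"] = outcomes
    with open(run_path, "w", encoding="utf-8") as f:
        json.dump(run, f, indent=2)

# Example (hypothetical run file and numbers):
log_outcomes("artifacts/20260219-153000_run.json",
             {"sent": 200, "opens": 90, "replies": 11, "meetings": 3})
```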
Frequently Asked Questions
What’s the smallest version of this I can ship today?
Scout → Research → Copy → Compliance for one outbound email experiment. Publish one 3-step sequence. Log artifacts in a folder with run_id timestamps.
Should I orchestrate with CrewAI or just run prompts manually?
Manual is faster for the first 1–3 runs. Once your prompts stabilize and schemas stop changing, automate with CrewAI or LangGraph so you can run on a schedule.
How do I keep multi-agent AI marketing outputs from going off-brand?
Keep a tight brand_voice doc and a strict never_say list, then enforce it in Compliance. Also store “approved examples” and feed them into Copy as few-shot context.
What’s the right number of agents?
Start with 4–5. More agents increase coordination overhead and widen the failure surface unless you have strong schemas and gating.
What if Compliance blocks copy because we can’t substantiate claims?
Build a claims whitelist: integrations, customer types, security statements, and pricing language legal approves. Copy can only use items in the whitelist, everything else gets rewritten as a question or a qualitative benefit.
How do I ensure determinism across runs?
Freeze schemas, cap output counts (signals 8–15), and use deterministic selection rules outside the model. If you use APIs, set temperature low (0–0.3) and keep prompts versioned.
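For the API path, a minimal sketch of a low-temperature, versioned-prompt call using the OpenAI Python SDK (the model name and the `prompts/` layout are placeholders; the Anthropic SDK works the same way in spirit):

```python
from pathlib import Path
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment
scout_prompt = Path("prompts/scout_v3.md").read_text(encoding="utf-8")  # versioned prompt file

response = client.chat.completions.create(
    model="gpt-4.1",
    temperature=0.2,  # low temperature for repeatable structure
    messages=[
        {"role": "system", "content": scout_prompt},
        {"role": "user", "content": "company_context:\n...\ndata_dump:\n..."},
    ],
)
print(response.choices[0].message.content)
```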
Ready to build your AI growth engine?
I help CEOs use AI to build the growth engine their board is asking for.
Talk to Isaac