Inside Deep Mode — 18 Minutes, Two Full Ensembles, One Decisive Call

What actually happened, stage by stage

Deep mode doesn't just run one debate for longer — it re-runs research when the debate is unresolved, forces real clarifying questions instead of guessing, and composes a fresh panel each time. Here's the actual log.

Adaptive research ran four separate times — each one informed by what came before

Discovery didn't just gather facts once. It built a research plan (classifying the decision type and its stakes), ran targeted web-grounded research against that plan, then repeated the whole pass three more times — once after each round of clarifying answers, and once more after the debate itself flagged what was still unresolved. That last pass didn't just re-ask; it fed the panel's own live disagreements about reimbursement risk and physician retention back into fresh research questions before anyone answered them.

Discovery pass	Triggered by	Stakes classification	Research angles run
1 — intake	Initial submission	4/5 · build-vs-buy	5, plus 1 targeted follow-up
2 — after round 1	Client clarification #1	5/5 · reclassified as market-entry / M&A	5, plus 1 targeted follow-up
3 — after round 2	Client clarification #2	5/5 · build-vs-buy	5 — completeness hit 100%
4 — after the debate	Post-Nexus clarification, informed by unresolved panel dissent	Same brief, enriched twice mid-debate with panel-generated research questions	2 new targeted questions per enrichment cycle

The clarification loop asked real questions — and didn't repeat itself

Deep mode won't let a thin brief through. It asked two rounds before Nexus even started, then one more after the first debate, each time targeting the specific number the panel needed and had not yet gotten.

Round 1: "Does your hospital currently have the $8–11M available or pre-approved... and at roughly what interest rate?"

Answered: ~$5M cash reserves plus bond capacity at an estimated 5.4% cost of capital.

Round 2: "What percentage of your 1,400 cath procedures were billed outpatient vs. inpatient?"

Answered: ~65% outpatient — meaning the proposed CMS reimbursement cut would hit the majority of the hospital's volume.

Post-Nexus round (asked only after the panel debated and still couldn't resolve it): "Do the two senior cardiologists have signed, binding retention agreements... before capital is committed?"

Answered: no binding agreement existed yet — which became the central tension in the final call.

Two complete 3-panel ensembles ran — six panels, no two alike

Deep mode doesn't re-run the same panel for confidence theater. Each panel was freshly composed — different size, different bench models pulled in, different seat framing — and each one argued the case from scratch before a fourth pass (Wisdom-of-Crowds) reconciled all three.

Ensemble	Panel	Seats	Notable roster	Result
Cycle 1	1	11	Kimi K2, Qwen3-235B, Nova 2 Lite, Nova Micro	Split — REJECT vs. APPROVE_WITH_MODIFICATIONS, 0% raw agreement
	2	13	Nova Lite, Haiku 4.5, Qwen3-32B, Palmyra X5, Pixtral Large	79% conf · DEFER minority surfaced
	3	12	GLM-5, Kimi K2, Nova Lite, Nova Micro	82% conf · 100% agreement, one DEFER holdout
→ Cycle 1 meta-synthesis: APPROVE_WITH_MODIFICATIONS · 78% confidence · 0.83 cross-panel agreement
Cycle 2 (post-clarification rerun)	1	11	Kimi K2, Qwen3-235B, Nova Lite	82% conf, after 2 initial REJECTs flipped under challenge
	2	15	Qwen3-32B, Pixtral Large, GLM-5, Palmyra X5 (Red-Team Adversary), Haiku 4.5	79% conf · REJECT + DEFER minorities
	3	11	Nova 2 Lite, Nova Lite, Qwen3-32B, Nova Micro	82% conf · 100% agreement, 5 analysts changed position
→ Cycle 2 meta-synthesis: APPROVE_WITH_MODIFICATIONS · 74% confidence (discounted below panel average) · 0.67 cross-panel agreement

⚑ Dissent was preserved, not smoothed over, every single round. Every panel that split triggered a "high-confidence dissent" flag — a structural check that forces the synthesis to treat a confident minority objection as a live one, not noise to average away.

Real infrastructure limits showed up — and the system worked around them live

Large panels push against real-world API rate limits. Three different models (Llama 4 Maverick, Pixtral Large, Palmyra X5) got throttled mid-debate across the run. Each time, the system detected the failure, ejected that model for the remainder of the query, and automatically rerouted its seat to Haiku 4.5 rather than dropping the voice or crashing the debate. On the largest panel (15 seats, three models rerouted to Haiku simultaneously) Haiku itself briefly got congested and a handful of individual challenge exchanges failed outright — the debate absorbed the loss and kept going instead of stalling.

3 models throttled and auto-ejected Zero panels failed to complete Known limit: shared account-level rate quota, actively being raised

This is a known, tracked limitation, not a surprise — it's on our roadmap. We already have an active engagement with AWS to raise these account-level rate limits, with a live call scheduled specifically to size up additional processing throughput. And because of 3Dogs' own Evolution process — the same self-auditing system that caught a real gap in its own output below — the platform is measurably better every day, not just at release time. This case study is today's ceiling. It won't be tomorrow's.

The final call — and an honest confidence number, not an averaged one

The panels' own average confidence across the second ensemble was 82% / 79% / 82%. The meta-synthesis didn't just report that average — it identified that all three panels converged on the same structural fault line (whether cardiologist retention can actually be secured before capital is spent) and discounted the final confidence to 74% to reflect that unresolved risk.

● The Call

"Rebuild the cath lab — commit to a phased $8M renovation now and lock mobile coverage for 18 months while construction runs."

Recommendation confidence74% · APPROVE_WITH_MODIFICATIONS

Decision-context "rigor" trio all fired: tradeoff named, valuation bridge shown, dominant risk isolated Evidence classifier caught 1 contradicted claim and applied a confidence penalty

Bonus: the system audited itself — and found a real gap

After the PDF was delivered, 3Dogs' own Evolution process reviewed the case and flagged something genuinely useful: the pipeline's modification-list dedupe only catches exact-string duplicates, so two panelists issuing literally opposite instructions ("require binding retention agreements first" vs. "don't wait for binding contracts") could both survive to the client-facing report unreconciled. It filed that as a HIGH-severity fix proposal — propose-only, nothing changed automatically — and this exact demo query is what surfaced it.

Self-flagged, not client-flaggedFix proposed, not auto-appliedHuman review required before any change ships

We forced the deepest setting on a real capital decision — and watched everything it did.

What actually happened, stage by stage

Adaptive research ran four separate times — each one informed by what came before

The clarification loop asked real questions — and didn't repeat itself

Two complete 3-panel ensembles ran — six panels, no two alike

Real infrastructure limits showed up — and the system worked around them live

The final call — and an honest confidence number, not an averaged one

Bonus: the system audited itself — and found a real gap

What this is NOT

What deep mode is for

It didn't converge because we asked it to. It converged because the evidence did.

See the actual delivered report