Most of what 3Dogs does happens behind one PDF. This time we ran it in deep mode โ the setting that forces multiple independently-composed panels and won't skip the hard questions โ and logged every stage: the research, the follow-ups, the disagreement, and a moment where the system caught a flaw in its own output.
Deep mode doesn't just run one debate for longer โ it re-runs research when the debate is unresolved, forces real clarifying questions instead of guessing, and composes a fresh panel each time. Here's the actual log.
Discovery didn't just gather facts once. It built a research plan (classifying the decision type and its stakes), ran targeted web-grounded research against that plan, then repeated the whole pass three more times โ once after each round of clarifying answers, and once more after the debate itself flagged what was still unresolved. That last pass didn't just re-ask; it fed the panel's own live disagreements about reimbursement risk and physician retention back into fresh research questions before anyone answered them.
| Discovery pass | Triggered by | Stakes classification | Research angles run |
|---|---|---|---|
| 1 โ intake | Initial submission | 4/5 ยท build-vs-buy | 5, plus 1 targeted follow-up |
| 2 โ after round 1 | Client clarification #1 | 5/5 ยท reclassified as market-entry / M&A | 5, plus 1 targeted follow-up |
| 3 โ after round 2 | Client clarification #2 | 5/5 ยท build-vs-buy | 5 โ completeness hit 100% |
| 4 โ after the debate | Post-Nexus clarification, informed by unresolved panel dissent | Same brief, enriched twice mid-debate with panel-generated research questions | 2 new targeted questions per enrichment cycle |
Deep mode won't let a thin brief through. It asked two rounds before Nexus even started, then one more after the first debate, each time targeting the specific number the panel needed and had not yet gotten.
Deep mode doesn't re-run the same panel for confidence theater. Each panel was freshly composed โ different size, different bench models pulled in, different seat framing โ and each one argued the case from scratch before a fourth pass (Wisdom-of-Crowds) reconciled all three.
| Ensemble | Panel | Seats | Notable roster | Result |
|---|---|---|---|---|
| Cycle 1 | 1 | 11 | Kimi K2, Qwen3-235B, Nova 2 Lite, Nova Micro | Split โ REJECT vs. APPROVE_WITH_MODIFICATIONS, 0% raw agreement |
| 2 | 13 | Nova Lite, Haiku 4.5, Qwen3-32B, Palmyra X5, Pixtral Large | 79% conf ยท DEFER minority surfaced | |
| 3 | 12 | GLM-5, Kimi K2, Nova Lite, Nova Micro | 82% conf ยท 100% agreement, one DEFER holdout | |
| โ Cycle 1 meta-synthesis: APPROVE_WITH_MODIFICATIONS ยท 78% confidence ยท 0.83 cross-panel agreement | ||||
| Cycle 2 (post-clarification rerun) | 1 | 11 | Kimi K2, Qwen3-235B, Nova Lite | 82% conf, after 2 initial REJECTs flipped under challenge |
| 2 | 15 | Qwen3-32B, Pixtral Large, GLM-5, Palmyra X5 (Red-Team Adversary), Haiku 4.5 | 79% conf ยท REJECT + DEFER minorities | |
| 3 | 11 | Nova 2 Lite, Nova Lite, Qwen3-32B, Nova Micro | 82% conf ยท 100% agreement, 5 analysts changed position | |
| โ Cycle 2 meta-synthesis: APPROVE_WITH_MODIFICATIONS ยท 74% confidence (discounted below panel average) ยท 0.67 cross-panel agreement | ||||
Large panels push against real-world API rate limits. Three different models (Llama 4 Maverick, Pixtral Large, Palmyra X5) got throttled mid-debate across the run. Each time, the system detected the failure, ejected that model for the remainder of the query, and automatically rerouted its seat to Haiku 4.5 rather than dropping the voice or crashing the debate. On the largest panel (15 seats, three models rerouted to Haiku simultaneously) Haiku itself briefly got congested and a handful of individual challenge exchanges failed outright โ the debate absorbed the loss and kept going instead of stalling.
The panels' own average confidence across the second ensemble was 82% / 79% / 82%. The meta-synthesis didn't just report that average โ it identified that all three panels converged on the same structural fault line (whether cardiologist retention can actually be secured before capital is spent) and discounted the final confidence to 74% to reflect that unresolved risk.
After the PDF was delivered, 3Dogs' own Evolution process reviewed the case and flagged something genuinely useful: the pipeline's modification-list dedupe only catches exact-string duplicates, so two panelists issuing literally opposite instructions ("require binding retention agreements first" vs. "don't wait for binding contracts") could both survive to the client-facing report unreconciled. It filed that as a HIGH-severity fix proposal โ propose-only, nothing changed automatically โ and this exact demo query is what surfaced it.
A single model answering once, a canned demo script, or a case where dissent got quietly averaged into a tidy number. Every stage above came out of the actual run logs โ including the parts that show real strain (the 429 throttling) and a real, previously-uncaught gap (the modifications dedupe).
Not every decision needs this. Deep mode is the setting for capital commitments, board-level calls, and anything where "the model sounded confident" isn't a good enough reason โ where you want the disagreement on the record, the confidence explicitly justified (or discounted), and a documented basis for the number you're about to defend.
The full PDF this run produced โ case 2026-0078, generated on 3Dogs' AWS development environment.