๐Ÿพ3Dogs NexusStructured Decision Intelligence
Case study ยท inside deep mode

We forced the deepest setting on a real capital decision โ€” and watched everything it did.

Most of what 3Dogs does happens behind one PDF. This time we ran it in deep mode โ€” the setting that forces multiple independently-composed panels and won't skip the hard questions โ€” and logged every stage: the research, the follow-ups, the disagreement, and a moment where the system caught a flaw in its own output.

The decision, word for word: โ€œWe run a 200-bed regional hospital. Our cardiac catheterization lab is 12 years old and needs either a full renovation and equipment refresh (estimated $8โ€“11M capital) or we could contract cath lab services to a mobile provider... The board wants a recommendation on capital investment vs. contracting within the current fiscal year. What should we do?โ€
Capital investment ยท healthcare Deep mode โ€” forced 18 minutes ยท 2 full ensembles ยท 6 panels
This is a simulated demonstration case, not a live client engagement โ€” run specifically to show deep mode's internals end to end on our development environment. The clarification answers were supplied by us, playing the hospital's side, so the mechanics below are exactly what a real deep-mode case produces; the underlying scenario is illustrative rather than an actual hospital's numbers.
18m 2s
Total runtime
6
Independently-composed debate panels
17
Distinct models seated
~1,000
API calls, research through synthesis

What actually happened, stage by stage

Deep mode doesn't just run one debate for longer โ€” it re-runs research when the debate is unresolved, forces real clarifying questions instead of guessing, and composes a fresh panel each time. Here's the actual log.

1

Adaptive research ran four separate times โ€” each one informed by what came before

Discovery didn't just gather facts once. It built a research plan (classifying the decision type and its stakes), ran targeted web-grounded research against that plan, then repeated the whole pass three more times โ€” once after each round of clarifying answers, and once more after the debate itself flagged what was still unresolved. That last pass didn't just re-ask; it fed the panel's own live disagreements about reimbursement risk and physician retention back into fresh research questions before anyone answered them.

Discovery passTriggered byStakes classificationResearch angles run
1 โ€” intakeInitial submission4/5 ยท build-vs-buy5, plus 1 targeted follow-up
2 โ€” after round 1Client clarification #15/5 ยท reclassified as market-entry / M&A5, plus 1 targeted follow-up
3 โ€” after round 2Client clarification #25/5 ยท build-vs-buy5 โ€” completeness hit 100%
4 โ€” after the debatePost-Nexus clarification, informed by unresolved panel dissentSame brief, enriched twice mid-debate with panel-generated research questions2 new targeted questions per enrichment cycle
2

The clarification loop asked real questions โ€” and didn't repeat itself

Deep mode won't let a thin brief through. It asked two rounds before Nexus even started, then one more after the first debate, each time targeting the specific number the panel needed and had not yet gotten.

Round 1: "Does your hospital currently have the $8โ€“11M available or pre-approved... and at roughly what interest rate?"
Answered: ~$5M cash reserves plus bond capacity at an estimated 5.4% cost of capital.
Round 2: "What percentage of your 1,400 cath procedures were billed outpatient vs. inpatient?"
Answered: ~65% outpatient โ€” meaning the proposed CMS reimbursement cut would hit the majority of the hospital's volume.
Post-Nexus round (asked only after the panel debated and still couldn't resolve it): "Do the two senior cardiologists have signed, binding retention agreements... before capital is committed?"
Answered: no binding agreement existed yet โ€” which became the central tension in the final call.
3

Two complete 3-panel ensembles ran โ€” six panels, no two alike

Deep mode doesn't re-run the same panel for confidence theater. Each panel was freshly composed โ€” different size, different bench models pulled in, different seat framing โ€” and each one argued the case from scratch before a fourth pass (Wisdom-of-Crowds) reconciled all three.

EnsemblePanelSeatsNotable rosterResult
Cycle 1111Kimi K2, Qwen3-235B, Nova 2 Lite, Nova MicroSplit โ€” REJECT vs. APPROVE_WITH_MODIFICATIONS, 0% raw agreement
213Nova Lite, Haiku 4.5, Qwen3-32B, Palmyra X5, Pixtral Large79% conf ยท DEFER minority surfaced
312GLM-5, Kimi K2, Nova Lite, Nova Micro82% conf ยท 100% agreement, one DEFER holdout
โ†’ Cycle 1 meta-synthesis: APPROVE_WITH_MODIFICATIONS ยท 78% confidence ยท 0.83 cross-panel agreement
Cycle 2 (post-clarification rerun)111Kimi K2, Qwen3-235B, Nova Lite82% conf, after 2 initial REJECTs flipped under challenge
215Qwen3-32B, Pixtral Large, GLM-5, Palmyra X5 (Red-Team Adversary), Haiku 4.579% conf ยท REJECT + DEFER minorities
311Nova 2 Lite, Nova Lite, Qwen3-32B, Nova Micro82% conf ยท 100% agreement, 5 analysts changed position
โ†’ Cycle 2 meta-synthesis: APPROVE_WITH_MODIFICATIONS ยท 74% confidence (discounted below panel average) ยท 0.67 cross-panel agreement
โš‘ Dissent was preserved, not smoothed over, every single round. Every panel that split triggered a "high-confidence dissent" flag โ€” a structural check that forces the synthesis to treat a confident minority objection as a live one, not noise to average away.
4

Real infrastructure limits showed up โ€” and the system worked around them live

Large panels push against real-world API rate limits. Three different models (Llama 4 Maverick, Pixtral Large, Palmyra X5) got throttled mid-debate across the run. Each time, the system detected the failure, ejected that model for the remainder of the query, and automatically rerouted its seat to Haiku 4.5 rather than dropping the voice or crashing the debate. On the largest panel (15 seats, three models rerouted to Haiku simultaneously) Haiku itself briefly got congested and a handful of individual challenge exchanges failed outright โ€” the debate absorbed the loss and kept going instead of stalling.

3 models throttled and auto-ejected Zero panels failed to complete Known limit: shared account-level rate quota, actively being raised
This is a known, tracked limitation, not a surprise โ€” it's on our roadmap. We already have an active engagement with AWS to raise these account-level rate limits, with a live call scheduled specifically to size up additional processing throughput. And because of 3Dogs' own Evolution process โ€” the same self-auditing system that caught a real gap in its own output below โ€” the platform is measurably better every day, not just at release time. This case study is today's ceiling. It won't be tomorrow's.
5

The final call โ€” and an honest confidence number, not an averaged one

The panels' own average confidence across the second ensemble was 82% / 79% / 82%. The meta-synthesis didn't just report that average โ€” it identified that all three panels converged on the same structural fault line (whether cardiologist retention can actually be secured before capital is spent) and discounted the final confidence to 74% to reflect that unresolved risk.

โ— The Call
"Rebuild the cath lab โ€” commit to a phased $8M renovation now and lock mobile coverage for 18 months while construction runs."
Recommendation confidence74% ยท APPROVE_WITH_MODIFICATIONS
Decision-context "rigor" trio all fired: tradeoff named, valuation bridge shown, dominant risk isolated Evidence classifier caught 1 contradicted claim and applied a confidence penalty
6

Bonus: the system audited itself โ€” and found a real gap

After the PDF was delivered, 3Dogs' own Evolution process reviewed the case and flagged something genuinely useful: the pipeline's modification-list dedupe only catches exact-string duplicates, so two panelists issuing literally opposite instructions ("require binding retention agreements first" vs. "don't wait for binding contracts") could both survive to the client-facing report unreconciled. It filed that as a HIGH-severity fix proposal โ€” propose-only, nothing changed automatically โ€” and this exact demo query is what surfaced it.

Self-flagged, not client-flaggedFix proposed, not auto-appliedHuman review required before any change ships

What this is NOT

A single model answering once, a canned demo script, or a case where dissent got quietly averaged into a tidy number. Every stage above came out of the actual run logs โ€” including the parts that show real strain (the 429 throttling) and a real, previously-uncaught gap (the modifications dedupe).

What deep mode is for

Not every decision needs this. Deep mode is the setting for capital commitments, board-level calls, and anything where "the model sounded confident" isn't a good enough reason โ€” where you want the disagreement on the record, the confidence explicitly justified (or discounted), and a documented basis for the number you're about to defend.

See the actual delivered report

The full PDF this run produced โ€” case 2026-0078, generated on 3Dogs' AWS development environment.