LLM AB Test: /api/ask/stream, Sonnet 4.6 vs DeepSeek Chat
- Date: 2026-05-14
- Session: #8 (session8/w2e-vps-llm)
- Tester: Local CC
- Endpoint: POST https://beta.structural.bytedance.city/api/ask/stream
- Trigger: User request to upgrade the VPS structural-web ask LLM from `deepseek/deepseek-chat` to `anthropic/claude-sonnet-4.6`.
Setup
- Plumbing: `web/backend/services/ask_orchestrator.py:45` already reads the `ASK_LLM_MODEL` env var with fallback `deepseek/deepseek-chat`. No code change required.
- VPS env before: `ASK_LLM_MODEL` unset → defaults to `deepseek/deepseek-chat` for ask streaming.
- VPS env after: appended `ASK_LLM_MODEL=anthropic/claude-sonnet-4.6` to `/root/Projects/structural-isomorphism/web/backend/.env` (backup: `.env.bak-1778767603`), then `systemctl restart structural-web`.
- `/api/health` `llm_model` field: always reported `anthropic/claude-sonnet-4.6` (this field describes the embedding/synthesis model, not the ask LLM; the two were decoupled).
- Streamed meta event `model` field: pre = `deepseek/deepseek-chat`, post = `anthropic/claude-sonnet-4.6` ✅
- Region note: VPS in Singapore → OpenRouter → Anthropic. No CN region-block (the constraint applies only to China-region traffic; Anthropic via OpenRouter from SG is fine).
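The env-var fallback described in the plumbing note can be sketched as below. This is a minimal illustration, not the actual contents of `ask_orchestrator.py`: the function name `resolve_ask_model` and the constant name are hypothetical, and treating a blank value as unset is an added assumption. Only the variable name `ASK_LLM_MODEL` and the default model ID come from this report.

```python
import os

# Default model ID as stated in this report; the constant name is hypothetical.
DEFAULT_ASK_LLM_MODEL = "deepseek/deepseek-chat"


def resolve_ask_model(env=None):
    """Return the ask-LLM model ID, falling back to the default when unset.

    `env` is any mapping of environment variables; it defaults to os.environ.
    A blank value is treated like an unset one (an assumption of this sketch).
    """
    env = os.environ if env is None else env
    value = env.get("ASK_LLM_MODEL", "").strip()
    return value or DEFAULT_ASK_LLM_MODEL


if __name__ == "__main__":
    # Before the change: variable unset, so the fallback applies.
    print(resolve_ask_model({}))
    # After the change: the .env line overrides the fallback.
    print(resolve_ask_model({"ASK_LLM_MODEL": "anthropic/claude-sonnet-4.6"}))
```

Passing the environment as a mapping keeps the lookup testable without mutating the process environment.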
Test Methodology
- 3 representative queries (2 EN concept queries + 1 ZH analytic query). Same `lang=zh` payload for all 3, matching the prod default.
- Each query run once per LLM (4 measurements = 2 LLMs × 2 phases would be more rigorous, but a smoke AB is sufficient signal for this go/no-go decision).
- Raw SSE captured to `/tmp/llm-ab-test/{baseline,sonnet}-q{1..3}.txt`.
- Quality eyeballed on: answer depth, citation density, KB grounding, and structural-isomorphism framing (cross-domain analogies, the product's whole point).
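A raw SSE capture of this kind can be summarized with a small parser. The sketch below follows the standard `event:` / `data:` line format with blank-line event boundaries. It is not the `extract.py` used for this report. The event name `meta` and a JSON payload carrying a `model` key are assumptions about the stream, and `summarize` is a hypothetical helper.

```python
import json


def parse_sse(raw):
    """Split a raw SSE capture into a list of {"event", "data"} dicts."""
    events = []
    name, data_lines = "message", []
    for line in raw.replace("\r\n", "\n").split("\n"):
        if line == "":
            # A blank line terminates the current event.
            if data_lines:
                events.append({"event": name, "data": "\n".join(data_lines)})
            name, data_lines = "message", []
        elif line.startswith(":"):
            continue  # comment / keep-alive line
        elif line.startswith("event:"):
            name = line[len("event:"):].strip()
        elif line.startswith("data:"):
            value = line[len("data:"):]
            # Per the SSE format, drop one leading space after the colon.
            data_lines.append(value[1:] if value.startswith(" ") else value)
    if data_lines:  # capture ended without a trailing blank line
        events.append({"event": name, "data": "\n".join(data_lines)})
    return events


def summarize(raw):
    """Count events and read the model ID from the (assumed) meta event."""
    events = parse_sse(raw)
    model = None
    for ev in events:
        if ev["event"] == "meta":  # assumed event name
            model = json.loads(ev["data"]).get("model")
    return {"events": len(events), "model": model}


if __name__ == "__main__":
    sample = (
        'event: meta\ndata: {"model": "anthropic/claude-sonnet-4.6"}\n\n'
        "event: token\ndata: hello\n\n"
    )
    print(summarize(sample))
```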
Quantitative
| Metric | DeepSeek (baseline) | Sonnet 4.6 | Δ |
|---|---|---|---|
| Q1 answer chars | 244 | 617 | +153% |
| Q2 answer chars | 256 | 620 | +142% |
| Q3 answer chars | 326 | 652 | +100% |
| Q1 citations | 1 | 1 | = |
| Q2 citations | 1 | 2 | +1 |
| Q3 citations | 3 | 4 | +1 |
| SSE events per response | 37–47 | 84–88 | ~2x |
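The Δ column for answer length is a plain relative change against the DeepSeek baseline. A short check, using only the character counts from the table (the helper name `pct_change` is illustrative):

```python
def pct_change(baseline, candidate):
    """Relative change of candidate vs baseline, as a rounded percentage."""
    return round(100 * (candidate - baseline) / baseline)


# (DeepSeek chars, Sonnet 4.6 chars) per query, copied from the table.
answer_chars = {"Q1": (244, 617), "Q2": (256, 620), "Q3": (326, 652)}

if __name__ == "__main__":
    for query, (deepseek, sonnet) in answer_chars.items():
        print(f"{query}: {pct_change(deepseek, sonnet):+d}%")
    # Prints:
    # Q1: +153%
    # Q2: +142%
    # Q3: +100%
```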
Qualitative
Q1: "What is self-organized criticality and where does it appear in nature?"
- DeepSeek: Correct but textbook-level. Cited 1 KB entry (`5k-26-071`), which concerns a political-science topic, the capture risk of decentralization reform (分权改革的俘获风险). This is the wrong domain, and the citation is irrelevant to SOC. KB retrieval returned 5 cards, but the LLM picked the highest-similarity one without a semantic check.
- Sonnet 4.6: Mentioned Per Bak and the 1987 origin, the sandpile metaphor, and the Gutenberg-Richter law, then contrasted SOC with `bio-025` (Boolean phosphorylation switch) as its structural opposite. This is exactly the product's structural-isomorphism mission, and the citation is justified as a contrast.
- Verdict: Sonnet much better (proper domain selection from KB cards, structural framing, scientifically accurate origin attribution).
Q2: "How does the BTW sandpile model show power-law avalanches?"
- DeepSeek: Brief mechanism description. Cited `5k-09-080` (OSPF flooding) as "类似于" ("analogous to"). The mention is correct but shallow: no explanation of why it is an isomorphism.
- Sonnet 4.6: Same OSPF citation, plus `5k-09-097` (PFC Pause Storm) as a second isomorphic case. Explained the deep structural similarity (local threshold → neighbor cascade → no characteristic scale) and the key difference (OSPF terminates via sequence dedup, BTW terminates via a dissipation boundary → bounded vs power-law).
- Verdict: Sonnet much better (this is the product's killer feature; Sonnet uses it, DeepSeek doesn't).
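Q3: "为什么基于幂律分布的critical point判定在金融市场容易失效?" (Why does critical-point detection based on power-law distributions tend to fail in financial markets?)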
- DeepSeek: Listed factors (market sentiment, policy, black swans) but largely surface-level. 3 citations, decent grounding.
- Sonnet 4.6: 4 distinct mechanisms: (1) power-law index drift across regimes; (2) hub-dominated networks ≠ SOC auto-criticality; (3) sensitivity of extreme-tail probability to α; (4) a Goodhart's-law observer effect. 4 citations, each with an explanatory label connecting back to the structural argument.
- Verdict: Sonnet much better (multi-mechanism analysis with named principles like Goodhart's law).
Summary
- Q1: Sonnet better (DeepSeek picked wrong-domain KB citation)
- Q2: Sonnet better (Sonnet exploits structural-isomorphism framing, DeepSeek doesn't)
- Q3: Sonnet better (depth + named mechanisms)
Cost Estimate
Per ~5k-tok query (3k context + 2k generation, approximate):
- DeepSeek Chat via OpenRouter: input $0.14/M + output $0.28/M ≈ **$0.00098/query** (rounded to $0.001).
- Claude Sonnet 4.6 via OpenRouter: input $3/M + output $15/M ≈ **$0.039/query** (rounded to $0.04).
- Multiplier: ~40x more expensive per query.
- Daily cost at 100 queries/day: $0.10 → $4.00 (additional ~$120/month).
- Daily cost at 1000 queries/day: $1 → $40 (additional ~$1200/month).
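The per-query figures can be recomputed from the per-million-token prices and the approximate 3k input / 2k output split stated above. The sketch below does only that arithmetic; the function and variable names are illustrative, and the prices are the ones quoted in this report, not fetched from any API.

```python
def cost_per_query(input_tokens, output_tokens, input_price_per_m, output_price_per_m):
    """USD cost of one query given token counts and per-million-token prices."""
    total = input_tokens * input_price_per_m + output_tokens * output_price_per_m
    return total / 1_000_000


# Approximate token split from this report.
INPUT_TOKENS, OUTPUT_TOKENS = 3_000, 2_000

deepseek = cost_per_query(INPUT_TOKENS, OUTPUT_TOKENS, 0.14, 0.28)
sonnet = cost_per_query(INPUT_TOKENS, OUTPUT_TOKENS, 3.00, 15.00)

if __name__ == "__main__":
    print(f"DeepSeek:   ${deepseek:.5f}/query")
    print(f"Sonnet 4.6: ${sonnet:.5f}/query")
    print(f"Multiplier: {sonnet / deepseek:.1f}x")
    for queries_per_day in (100, 1000):
        extra_per_month = (sonnet - deepseek) * queries_per_day * 30
        print(f"{queries_per_day} q/day: +${extra_per_month:.2f}/month")
```

Because the daily and monthly figures in the list use the rounded per-query costs ($0.001 and $0.04), the unrounded output of this script differs from them slightly.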
Recommendation: KEEP (with tier-aware safeguard for scale)
Rationale:
1. Quality differential is large and consistent: Sonnet wins 3/3 on the product's core promise (cross-domain structural analogy). DeepSeek occasionally picks wrong-domain KB citations, which is a credibility-breaking error for a "structural isomorphism" engine.
2. Current scale is small (private beta, <100 queries/day): at most $4/day, which is negligible vs the quality lift.
3. Scale plan: when traffic exceeds 500 queries/day, revisit with tier-aware routing (free tier → DeepSeek, paid/research tier → Sonnet). The `auth_tier` middleware already exists; routing in `ask_orchestrator.py` is a 10-line patch.
4. Latency: Sonnet streamed about 2x as many events but felt similar end-to-end (qualitative). No SLA breach observed.
5. Revert mechanism: unset `ASK_LLM_MODEL` (or comment out the `.env` line) and restart → instant rollback.
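The tier-aware routing mentioned in the scale plan could look roughly like the sketch below. It is a hypothetical illustration, not the planned patch: `select_ask_model` and `TIER_MODELS` are invented names, and how the existing `auth_tier` middleware exposes the tier is an assumption (here, a plain string). The tier-to-model mapping follows the report: free → DeepSeek, paid/research → Sonnet.

```python
import os

# Tier-to-model mapping taken from the scale plan; the dict name is hypothetical.
TIER_MODELS = {
    "free": "deepseek/deepseek-chat",
    "paid": "anthropic/claude-sonnet-4.6",
    "research": "anthropic/claude-sonnet-4.6",
}


def select_ask_model(auth_tier, env=None):
    """Pick the ask LLM for a request based on the caller's tier.

    Unknown or missing tiers fall back to ASK_LLM_MODEL, and then to the
    DeepSeek default, so the existing env-var override keeps working.
    """
    env = os.environ if env is None else env
    fallback = env.get("ASK_LLM_MODEL", "deepseek/deepseek-chat")
    return TIER_MODELS.get(auth_tier, fallback)


if __name__ == "__main__":
    print(select_ask_model("free", {}))
    print(select_ask_model("research", {}))
    print(select_ask_model(None, {"ASK_LLM_MODEL": "anthropic/claude-sonnet-4.6"}))
```

Keeping the env-var fallback for unrecognized tiers means the rollback path described in item 5 is unaffected by the routing table.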
Action Items
- Append `ASK_LLM_MODEL=anthropic/claude-sonnet-4.6` to VPS `.env` (backup kept as `.env.bak-1778767603`).
- Restart the `structural-web` systemd unit, verify `/api/health` and the meta event.
- Run AB test on 3 queries, save raw SSE to `/tmp/llm-ab-test/`.
- Write this report.
- At 500 q/day: implement tier-aware ask LLM routing (free=DeepSeek, paid=Sonnet).
- Quarterly: re-AB-test as new DeepSeek/Sonnet versions ship.
Artifacts
- Raw SSE: `/tmp/llm-ab-test/{baseline,sonnet}-q{1..3}.txt` (local Mac, not committed).
- Comparison summaries: `/tmp/llm-ab-test/q{1,2,3}.compare.txt`.
- Extract script: `/tmp/llm-ab-test/extract.py`.
- VPS env backup: `/root/Projects/structural-isomorphism/web/backend/.env.bak-1778767603`.