Skip to content

LLM AB Test — /api/ask/stream Sonnet 4.6 vs DeepSeek Chat

Date: 2026-05-14 Session: #8 (session8/w2e-vps-llm) Tester: Local CC Endpoint: POST https://beta.structural.bytedance.city/api/ask/stream Trigger: User request to upgrade VPS structural-web ask LLM from deepseek/deepseek-chat to anthropic/claude-sonnet-4.6.

Setup

  • Plumbing: web/backend/services/ask_orchestrator.py:45 already reads ASK_LLM_MODEL env var with fallback deepseek/deepseek-chat. No code change required.
  • VPS env before: ASK_LLM_MODEL unset → defaults to deepseek/deepseek-chat for ask streaming.
  • VPS env after: appended ASK_LLM_MODEL=anthropic/claude-sonnet-4.6 to /root/Projects/structural-isomorphism/web/backend/.env (backup: .env.bak-1778767603), systemctl restart structural-web.
  • /api/health llm_model field: always reported anthropic/claude-sonnet-4.6 (this field describes the embedding/synthesis model, not the ask LLM — they were decoupled).
  • Streamed meta event model field: pre = deepseek/deepseek-chat, post = anthropic/claude-sonnet-4.6
  • Region note: VPS in Singapore → OpenRouter → Anthropic. No CN region-block (constraint applies only to China-region traffic; Anthropic via OpenRouter from SG is fine).

Test Methodology

  • 3 representative queries (2 EN concept queries + 1 ZH analytic query). Same lang=zh payload for all 3, matching the prod default.
  • Each query run once per LLM (4 measurements = 2 LLMs × 2 phases would be more rigorous, but smoke AB is sufficient signal for this go/no-go decision).
  • Raw SSE captured to /tmp/llm-ab-test/{baseline,sonnet}-q{1..3}.txt.
  • Quality eyeballed on: answer depth, citation density, KB grounding, structural-isomorphism framing (the product's whole point — cross-domain analogies).

Quantitative

Metric DeepSeek (baseline) Sonnet 4.6 Δ
Q1 answer chars 244 617 +153%
Q2 answer chars 256 620 +142%
Q3 answer chars 326 652 +100%
Q1 citations 1 1 =
Q2 citations 1 2 +1
Q3 citations 3 4 +1
SSE events per response 37–47 84–88 ~2x

Qualitative

Q1 — "What is self-organized criticality and where does it appear in nature?"

  • DeepSeek: Correct but textbook-level. Cited 1 KB entry (5k-26-071) which is politically about decentralization risk (分权改革的俘获风险) — wrong domain, the citation is irrelevant to SOC. KB retrieval gave 5 cards but LLM picked the highest-similarity one without semantic check.
  • Sonnet 4.6: Mentioned Per Bak + 1987 origin, sandpile metaphor, Gutenberg-Richter, then contrasted SOC with bio-025 (Boolean phosphorylation switch) as structural opposite — this is exactly the product's structural-isomorphism mission. Citation is justified as a contrast.
  • Verdict: Sonnet much better (proper domain selection from KB cards, structural framing, scientifically accurate origin attribution).

Q2 — "How does the BTW sandpile model show power-law avalanches?"

  • DeepSeek: Brief mechanism description. Cited 5k-09-080 (OSPF flooding) as "类似于" (analogous). Mention is correct but shallow — no explanation of why it's an isomorphism.
  • Sonnet 4.6: Same OSPF citation + added 5k-09-097 (PFC Pause Storm) as second isomorphic case. Explained the deep structural similarity (local threshold → neighbor cascade → no characteristic scale) and the key difference (OSPF terminates via sequence dedup, BTW terminates via dissipation boundary → bounded vs power-law).
  • Verdict: Sonnet much better (this is the product's killer feature — Sonnet uses it, DeepSeek doesn't).

Q3 — "为什么基于幂律分布的critical point判定在金融市场容易失效?"

  • DeepSeek: Listed factors (market sentiment, policy, black swans) but largely surface-level. 3 citations, decent grounding.
  • Sonnet 4.6: 4 distinct mechanisms (1) power-law index drift across regimes (2) hub-dominated networks ≠ SOC auto-criticality (3) extreme tail probability sensitivity to α (4) Goodhart-law observer effect. 4 citations, each with explanatory label connecting back to the structural argument.
  • Verdict: Sonnet much better (multi-mechanism analysis with named principles like Goodhart's law).

Summary

  • Q1: Sonnet better (DeepSeek picked wrong-domain KB citation)
  • Q2: Sonnet better (Sonnet exploits structural-isomorphism framing, DeepSeek doesn't)
  • Q3: Sonnet better (depth + named mechanisms)

Cost Estimate

Per ~5k-tok query (3k context + 2k generation, approximate): - DeepSeek Chat via OpenRouter: input $0.14/M + output \(0.28/M ≈ **\)0.0014/query** (rounded to $0.001). - Claude Sonnet 4.6 via OpenRouter: input $3/M + output \(15/M ≈ **\)0.039/query** (rounded to $0.04). - Multiplier: ~28x more expensive per query. - Daily cost at 100 queries/day: $0.10 → \(4.00 (additional ~\)120/month). - Daily cost at 1000 queries/day: $1 → \(40 (additional ~\)1200/month).

Recommendation: KEEP (with tier-aware safeguard for scale)

Rationale: 1. Quality differential is large and consistent — Sonnet wins 3/3 on the product's core promise (cross-domain structural analogy). DeepSeek occasionally picks wrong-domain KB citations, which is a credibility-breaking error for a "structural isomorphism" engine. 2. Current scale is small (private beta, <100 queries/day) — $4/day at most is negligible vs the quality lift. 3. Scale plan: when traffic exceeds 500 queries/day, revisit with tier-aware routing — free tier → DeepSeek, paid/research tier → Sonnet. The auth_tier middleware already exists; routing in ask_orchestrator.py is a 10-line patch. 4. Latency: Sonnet streamed 2x more events but felt similar end-to-end (qualitative). No SLA breach observed. 5. Revert mechanism: simple unset ASK_LLM_MODEL (or comment out the .env line) + restart → instant rollback.

Action Items

  • Append ASK_LLM_MODEL=anthropic/claude-sonnet-4.6 to VPS .env (backup kept as .env.bak-1778767603).
  • Restart structural-web systemd unit, verify /api/health and meta event.
  • Run AB test on 3 queries, save raw SSE to /tmp/llm-ab-test/.
  • Write this report.
  • At 500 q/day: implement tier-aware ask LLM routing (free=DeepSeek, paid=Sonnet).
  • Quarterly: re-AB-test as new DeepSeek/Sonnet versions ship.

Artifacts

  • Raw SSE: /tmp/llm-ab-test/{baseline,sonnet}-q{1..3}.txt (local Mac, not committed).
  • Comparison summaries: /tmp/llm-ab-test/q{1,2,3}.compare.txt.
  • Extract script: /tmp/llm-ab-test/extract.py.
  • VPS env backup: /root/Projects/structural-isomorphism/web/backend/.env.bak-1778767603.