Phase Detector data-count audit (2026-05-14)

Session #9, Wave 1, sub-agent B. Trigger: the README claims "500 publicly listed companies", but the live phase.bytedance.city hero copy reads "100 家全球公司" ("100 global companies"). Same product, two numbers on user-visible surfaces: a trust killer. This doc establishes the ground truth and reconciles all surfaces to one consistent story.

TL;DR

| Surface | Number before | Number after | Rationale |
| --- | --- | --- | --- |
| Frontend `app/page.tsx` hero | 100 家全球公司 ("100 global companies") | unchanged | Matches production DB |
| Frontend `app/about/page.tsx` | 当前覆盖 100 家上市公司 ("currently covering 100 listed companies") | unchanged | Matches production DB |
| Frontend `app/methodology/page.tsx` | 当前 100 家公司 ("currently 100 companies") | unchanged | Matches production DB |
| README.md § "Phase Detector product" | tags 500 publicly listed companies | tags 100 publicly listed companies (with a 500-ticker S&P 500 walk-forward backtest universe) | Aligns with production DB; calls out the larger backtest universe separately |
| README.md § "Live demos" table | 500 companies tagged ... | 100 companies tagged ... (500-ticker S&P 500 backtest universe in v0.1) | Same as above |
| README.md § "Status snapshot" | Live at 500 companies + backtest v0.1 | Live at 100 companies + 500-ticker S&P 500 backtest v0.1 | Same as above |
| `v4/product/d1_phase_detector/README_BACKTEST.md` | "497/500 tickers" / "500 SP500 StructTuples" | unchanged | These numbers describe the backtest universe and are accurate (see below). |

Ground truth: where each number comes from

Production product universe = 100 companies

This is what phase.bytedance.city actually serves.

| Evidence | Value |
| --- | --- |
| `v4/product/d1_phase_detector/companies.jsonl` (`wc -l`) | 100 rows |
| `v4/product/d1_phase_detector/structtuples_2026-05-13.jsonl` (`wc -l`) | 100 rows (97 ok + 3 failed, per stats md) |
| `v4/product/d1_phase_detector/batch_run_2026-05-13_stats.md` line 5 | "rows processed: 100" |
| `v4/product/d1_phase_detector/STATUS.md` line 11 | "`companies.jsonl` \| 100 \| Curated by hand for session-3 dogfood" |
| `v4/product/d1_phase_detector/STATUS.md` line 12 | "`structtuples_2026-05-13.jsonl` \| 100 \| Output of `extract_structtuple.py` on the 100 row set (deepseek-v4-pro, 2026-05-13 batch)" |
| `v4/product/d1_phase_detector/scripts/ingest_to_postgres.py` defaults | reads `structtuples_2026-05-13.jsonl` → ingests to `d1_companies` table consumed by `/screener` API |

The production extraction used deepseek-v4-pro (richer reasoning, higher cost ≈ $0.05). The 100-row roster was hand-curated for diverse sector coverage and carries explicit a-priori expected dynamics families for calibration.
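The row counts in the table above were taken with `wc -l`. A minimal Python sketch of the same check, which also splits rows into ok and failed, might look like the following. It assumes each JSONL record carries a boolean `ok` field (the audit only mentions "ok=true" for `companies_500.jsonl`, so the field name is an assumption for the other files), and `audit_jsonl` is a hypothetical helper, not a script in the repo.

```python
import json
from collections import Counter
from pathlib import Path


def audit_jsonl(path: Path, ok_field: str = "ok") -> Counter:
    """Count non-empty rows in a JSONL file, split by an assumed boolean status field."""
    counts = Counter()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue  # blank lines are skipped here, unlike `wc -l`
            counts["rows"] += 1
            record = json.loads(line)
            # `ok_field` is an assumption; a missing field is counted as failed.
            counts["ok" if record.get(ok_field) else "failed"] += 1
    return counts
```

Calling `audit_jsonl(Path("v4/product/d1_phase_detector/structtuples_2026-05-13.jsonl"))` gives figures to compare against the 100 rows (97 ok + 3 failed) recorded in the stats file, provided the `ok` assumption holds.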

Backtest universe = 500 tickers (497 with fetched prices)

This is a different artefact: a walk-forward statistical backtest run against a much larger universe to test whether the near_critical label predicts forward returns. The 500-ticker run uses the cheaper deepseek-v4-**flash** (~$0.50 vs ~$2.50).

| Evidence | Value |
| --- | --- |
| `v4/product/d1_phase_detector/sp500_tickers.json` `count` field | 503 (Wikipedia scrape, 2026-05-14) |
| `v4/product/d1_phase_detector/companies_500_input.jsonl` (`wc -l`) | 500 (100 hand-curated + 400 SP500 dedup additions) |
| `v4/product/d1_phase_detector/companies_500.jsonl` (`wc -l`) | 500 (deepseek-v4-flash, all ok=true) |
| `v4/product/d1_phase_detector/prices.meta.json` `yfinance_tickers` | 497 fetched (3 missing: `RE` delisted, `BF.B` + `BRK.B` dotted-ticker yfinance bug) |
| `v4/product/d1_phase_detector/README_BACKTEST.md` line 76 | "walk-forward backtest on real SP500 monthly prices (yfinance, 5y history, 497/500 tickers)" |
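Two of the three missing tickers are dotted share-class symbols. Yahoo Finance generally lists such tickers with a hyphen (for example `BRK-B`), so one possible workaround is to normalise symbols before fetching. Whether this is the actual cause of the two fetch failures was not verified in this audit; the sketch below is a hypothetical helper, and both function names are invented for illustration.

```python
def to_yahoo_symbol(ticker: str) -> str:
    """Map an index-style share-class ticker (BRK.B) to the hyphenated form (BRK-B)."""
    return ticker.strip().upper().replace(".", "-")


def missing_tickers(requested: list[str], fetched: set[str]) -> list[str]:
    """Return requested tickers whose hyphenated symbol is absent from the fetched set."""
    return [t for t in requested if to_yahoo_symbol(t) not in fetched]
```

With this normalisation applied, only a genuinely unavailable ticker such as the delisted `RE` would be expected to remain in the missing list.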

Backtest result (backtest_result.json / cumulative_return.png): the near_critical label produced 6-month forward returns statistically indistinguishable from those of the other group on the SP500 universe (p ≫ 0.05). README_BACKTEST.md documents this as a negative result ("商业化路径暂未打开", roughly "the commercialization path has not opened yet"). The 500-ticker universe is therefore a research artefact, not a product feature.
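The audit does not say which test produced the p-value. Purely as an illustration, a two-sided permutation test on the difference in mean forward returns is one generic way to compare the two groups. The function name, the default parameters, and the choice of test are all assumptions here, not the method used in README_BACKTEST.md.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean returns."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same estimate
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # Re-split the shuffled pool into two groups of the original sizes.
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

A large p-value from a test like this means the label split cannot be distinguished from a random split of the same returns, which is the sense in which the result above is negative.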

Why the README said 500

The README figure was likely written aspirationally during the M11+M12 scale-up sprint (the commit log references companies_500.jsonl work). The intent was presumably to ship a 500-company product once the backtest validated the signal, but the backtest came back negative, so the production deploy stayed on the 100-row curated set and the README was never walked back. A classic case of documentation lagging what actually shipped.

Decision

Frontend wins — every user-visible surface on phase.bytedance.city already says 100. The data layer agrees. Don't touch the frontend.
Fix the README to match production reality, but don't erase the 500: the 500-ticker S&P 500 backtest universe is a real research artefact, just not the product. Re-frame as "100 in product + 500 in backtest universe" so the cross-references in README_BACKTEST.md stay coherent.
Leave README_BACKTEST.md alone: the "497/500 tickers" and "500 SP500 StructTuples" figures there describe the backtest run accurately.

Future-proofing

If we ever ship a 500-company production roster (i.e. ingest companies_500.jsonl outputs into d1_companies and serve them via /screener), update:

README.md § "Phase Detector product"
README.md § "Live demos" table
README.md § "Status snapshot"
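web/phase-detector/app/page.tsx line 150 ("100 家全球公司", i.e. "100 global companies")
web/phase-detector/app/about/page.tsx line 46 ("当前覆盖 100 家上市公司", i.e. "currently covering 100 listed companies")
web/phase-detector/app/methodology/page.tsx line 181 ("覆盖：当前 100 家公司", i.e. "Coverage: currently 100 companies")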
v4/product/d1_phase_detector/README.md (every "100-company" mention)
v4/product/d1_phase_detector/STATUS.md inventory table

In other words: do not change one surface in isolation. The "100" string lives in 8 places and they must move together.
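One way to keep the surfaces moving together is a small check that extracts the count from each file and fails when they disagree. The sketch below is hypothetical: the regex patterns are derived from the three frontend strings quoted above, the function names are invented for illustration, and the README surfaces are left out because their exact wording is elided in this audit.

```python
import re
from pathlib import Path

# Hypothetical patterns: each captures the company count from one surface.
SURFACES = {
    "web/phase-detector/app/page.tsx": r"(\d+) 家全球公司",
    "web/phase-detector/app/about/page.tsx": r"当前覆盖 (\d+) 家上市公司",
    "web/phase-detector/app/methodology/page.tsx": r"当前 (\d+) 家公司",
}


def collect_counts(root: Path, surfaces: dict[str, str]) -> dict[str, set[int]]:
    """Return every count matched per surface; an empty set means no match."""
    found = {}
    for rel_path, pattern in surfaces.items():
        text = (root / rel_path).read_text(encoding="utf-8")
        found[rel_path] = {int(m) for m in re.findall(pattern, text)}
    return found


def is_consistent(found: dict[str, set[int]]) -> bool:
    """True only if every surface matched and all surfaces agree on one count."""
    if any(not counts for counts in found.values()):
        return False
    return len(set().union(*found.values())) == 1
```

Run from the repository root as `is_consistent(collect_counts(Path("."), SURFACES))`. A surface whose pattern no longer matches is treated as a failure, so a reworded string is flagged rather than silently skipped.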

References

README change committed in this PR (session-9/w1-b-data-count-reconcile)
Audit performed in worktree /tmp/structural-w1-b-*
All numbers verified with wc -l / head / grep on the actual files (not LLM-inferred), 2026-05-14.