F3 — Family-wise error rate (FWER) correction (W7-D)¶
Date: 2026-05-15 Reviewer concern (W5-A §3.3, §7.1): "Multiple comparisons / familywise error rate is nowhere controlled. 13 systems × at least 2 LR tests each + 7-system BIC + universal collapse + 4 nulls = at minimum 30 statistical decisions in the manuscript. No Bonferroni, no Benjamini-Hochberg, no α-inflation discussion. At nominal α = 0.05 per test, an FWER above 0.5 is likely. This is the single most important reviewer-pass issue for PRE / Chaos."
Implementation¶
Utility module: v4/lib/multitest_correction.py¶
Three correction procedures, pure-Python / numpy-only, no statsmodels dependency:
| Method | Procedure | Controls | Conservatism |
|---|---|---|---|
| Bonferroni | p_adj_i = min(1, n * p_i) | FWER | Most conservative |
| Bonferroni-Holm | Sort p ascending; p_adj_(i) = max_{j<=i} (n-j+1) * p_(j) step-down | FWER | Strict but more powerful than Bonferroni |
| Benjamini-Hochberg | Sort p ascending; step-up via (n/i) * p_(i) | FDR | Least conservative; FDR not FWER |
All three return a CorrectionResult dataclass with the original raw p, the adjusted p, and a boolean reject decision at the requested alpha.
Harvester: v4/scripts/F3_apply_fwer_correction.py¶
Walks the per-system validation result JSONs and harvests every Vuong LR p-value plus the Scheffer block-bootstrap AR1/Var p-values. Currently collects 20 hypothesis tests across 13 system-level fits.
Test family composition (20 tests as harvested 2026-05-15):
| Source | n |
|---|---|
| Vuong vs lognormal | 10 (stockmarket, defi_aave, defi_compound, defi_maker, wildfire, solar, bank_failure, github_stars, wikipedia, power_grid_mw, power_grid_cust) |
| Vuong vs exponential | 8 (same systems where available) |
| Scheffer block-bootstrap AR1 | 1 |
| Scheffer block-bootstrap Var | 1 |
| Total | 20 |
(Earthquake b-value test and Phase 4 mouse cortex scaling-relation test are not Vuong LR tests so they are not in the LR family. They could be added to a wider FWER family in a future revision.)
Results¶
Naive FWER without correction¶
At nominal alpha = 0.05 per test with 20 tests:
This is the scholar reviewer's "above 0.5" concern made concrete: roughly 64% probability of at least one spurious rejection under H0.
Correction outcomes¶
Run PYTHONPATH=. python v4/scripts/F3_apply_fwer_correction.py:
| Method | n_significant @ alpha=0.05 | Verdict flips vs raw |
|---|---|---|
| Raw p (no correction) | 14 of 20 | — |
| Bonferroni | 14 | 0 |
| Bonferroni-Holm | 14 | 0 |
| Benjamini-Hochberg (FDR) | 14 | 0 |
Why no flips¶
Every "significant" Vuong LR p in the v4 manuscript is extremely small (typical magnitudes 1e-9 to 1e-93), far below any sane Bonferroni threshold (alpha / 20 = 0.0025). The "inconclusive" tests have raw p > 0.1, which neither Bonferroni-Holm nor BH-FDR push back below alpha. So the verdict column in Table 1 of the C1 manuscript is invariant under FWER correction — this is the single most-important and most-positive finding for the paper's defensibility.
Concrete reviewer-defensible statement¶
Across the 20 Vuong likelihood-ratio + Scheffer block-bootstrap hypothesis tests reported in the manuscript (rsmall = 0.05 per test, FWER_naive = 0.64), Bonferroni-Holm step-down correction at FWER = 0.05 does not change any rejection decision. The "rejected" verdicts (p_raw in [1e-93, 1e-6]) survive with p_holm <= 1e-5; the "inconclusive" verdicts (p_raw > 0.1) remain inconclusive under FDR (BH adjusted p > 0.1). The paper's statistical verdicts are robust to family-wise error correction.
Recommended manuscript edit (per W5-A Appendix A)¶
In C1 §6.5 Limitations, add point (ix):
Multiple-testing correction is not applied to the 20 statistical tests reported (Vuong likelihood-ratio per system + Scheffer block-bootstrap). Under nominal alpha = 0.05 per test, FWER_naive ≈ 0.64. Applying Bonferroni-Holm step-down correction at FWER = 0.05 does not change any rejection decision: the rejected lognormal-vs-power-law tests survive with p_holm < 1e-5, and the inconclusive tests remain inconclusive (BH-adjusted p > 0.1). Source:
v4/lib/multitest_correction.py, run viav4/scripts/F3_apply_fwer_correction.py, outputv4/results/F3_fwer_corrected.jsonlandv4/results/F3_fwer_summary.json.
Outputs¶
v4/lib/multitest_correction.py— correction utility (3 methods)v4/tests/sanity/test_multitest_correction.py— 15 unit testsv4/scripts/F3_apply_fwer_correction.py— harvester + applierv4/results/F3_fwer_corrected.jsonl— per-test row with raw + 3 adjusted pv4/results/F3_fwer_summary.json— aggregate summary
References¶
- Holm S (1979). "A simple sequentially rejective multiple test procedure." Scand. J. Stat. 6, 65-70.
- Benjamini Y, Hochberg Y (1995). "Controlling the false discovery rate." J. R. Stat. Soc. B 57, 289-300.
- Wright SP (1992). "Adjusted P-values for simultaneous inference." Biometrics 48, 1005-1013.