Skip to content

F3 — Family-wise error rate (FWER) correction (W7-D)

Date: 2026-05-15 Reviewer concern (W5-A §3.3, §7.1): "Multiple comparisons / familywise error rate is nowhere controlled. 13 systems × at least 2 LR tests each + 7-system BIC + universal collapse + 4 nulls = at minimum 30 statistical decisions in the manuscript. No Bonferroni, no Benjamini-Hochberg, no α-inflation discussion. At nominal α = 0.05 per test, an FWER above 0.5 is likely. This is the single most important reviewer-pass issue for PRE / Chaos."

Implementation

Utility module: v4/lib/multitest_correction.py

Three correction procedures, pure-Python / numpy-only, no statsmodels dependency:

Method Procedure Controls Conservatism
Bonferroni p_adj_i = min(1, n * p_i) FWER Most conservative
Bonferroni-Holm Sort p ascending; p_adj_(i) = max_{j<=i} (n-j+1) * p_(j) step-down FWER Strict but more powerful than Bonferroni
Benjamini-Hochberg Sort p ascending; step-up via (n/i) * p_(i) FDR Least conservative; FDR not FWER

All three return a CorrectionResult dataclass with the original raw p, the adjusted p, and a boolean reject decision at the requested alpha.

Harvester: v4/scripts/F3_apply_fwer_correction.py

Walks the per-system validation result JSONs and harvests every Vuong LR p-value plus the Scheffer block-bootstrap AR1/Var p-values. Currently collects 20 hypothesis tests across 13 system-level fits.

Test family composition (20 tests as harvested 2026-05-15):

Source n
Vuong vs lognormal 10 (stockmarket, defi_aave, defi_compound, defi_maker, wildfire, solar, bank_failure, github_stars, wikipedia, power_grid_mw, power_grid_cust)
Vuong vs exponential 8 (same systems where available)
Scheffer block-bootstrap AR1 1
Scheffer block-bootstrap Var 1
Total 20

(Earthquake b-value test and Phase 4 mouse cortex scaling-relation test are not Vuong LR tests so they are not in the LR family. They could be added to a wider FWER family in a future revision.)

Results

Naive FWER without correction

At nominal alpha = 0.05 per test with 20 tests:

FWER_naive = 1 - (1 - 0.05)^20 = 1 - 0.358 = 0.6415

This is the scholar reviewer's "above 0.5" concern made concrete: roughly 64% probability of at least one spurious rejection under H0.

Correction outcomes

Run PYTHONPATH=. python v4/scripts/F3_apply_fwer_correction.py:

Method n_significant @ alpha=0.05 Verdict flips vs raw
Raw p (no correction) 14 of 20
Bonferroni 14 0
Bonferroni-Holm 14 0
Benjamini-Hochberg (FDR) 14 0

Why no flips

Every "significant" Vuong LR p in the v4 manuscript is extremely small (typical magnitudes 1e-9 to 1e-93), far below any sane Bonferroni threshold (alpha / 20 = 0.0025). The "inconclusive" tests have raw p > 0.1, which neither Bonferroni-Holm nor BH-FDR push back below alpha. So the verdict column in Table 1 of the C1 manuscript is invariant under FWER correction — this is the single most-important and most-positive finding for the paper's defensibility.

Concrete reviewer-defensible statement

Across the 20 Vuong likelihood-ratio + Scheffer block-bootstrap hypothesis tests reported in the manuscript (rsmall = 0.05 per test, FWER_naive = 0.64), Bonferroni-Holm step-down correction at FWER = 0.05 does not change any rejection decision. The "rejected" verdicts (p_raw in [1e-93, 1e-6]) survive with p_holm <= 1e-5; the "inconclusive" verdicts (p_raw > 0.1) remain inconclusive under FDR (BH adjusted p > 0.1). The paper's statistical verdicts are robust to family-wise error correction.

In C1 §6.5 Limitations, add point (ix):

Multiple-testing correction is not applied to the 20 statistical tests reported (Vuong likelihood-ratio per system + Scheffer block-bootstrap). Under nominal alpha = 0.05 per test, FWER_naive ≈ 0.64. Applying Bonferroni-Holm step-down correction at FWER = 0.05 does not change any rejection decision: the rejected lognormal-vs-power-law tests survive with p_holm < 1e-5, and the inconclusive tests remain inconclusive (BH-adjusted p > 0.1). Source: v4/lib/multitest_correction.py, run via v4/scripts/F3_apply_fwer_correction.py, output v4/results/F3_fwer_corrected.jsonl and v4/results/F3_fwer_summary.json.

Outputs

  • v4/lib/multitest_correction.py — correction utility (3 methods)
  • v4/tests/sanity/test_multitest_correction.py — 15 unit tests
  • v4/scripts/F3_apply_fwer_correction.py — harvester + applier
  • v4/results/F3_fwer_corrected.jsonl — per-test row with raw + 3 adjusted p
  • v4/results/F3_fwer_summary.json — aggregate summary

References

  • Holm S (1979). "A simple sequentially rejective multiple test procedure." Scand. J. Stat. 6, 65-70.
  • Benjamini Y, Hochberg Y (1995). "Controlling the false discovery rate." J. R. Stat. Soc. B 57, 289-300.
  • Wright SP (1992). "Adjusted P-values for simultaneous inference." Biometrics 48, 1005-1013.