Statistical robustness fixes — F1-F5 aggregate (W7-D)¶

English | 简体中文

Date: 2026-05-15 Source: W5-A scholar review (docs/reviews/W5-A-scholar-review-2026-05-13.md) Driver: F1-F5 are the five "single most important reviewer-pass issues" flagged by the senior statistical-physics referee as blocking PRE / Chaos acceptance and shaping the credibility of the C1 v0.3 manuscript.

Summary table¶

Fix	W5-A §ref	Status	Headline result	Manuscript impact
F1 Bootstrap n=100 -> 10,000	§3.6, §4.1, §7.2	Subset (3 of 13) shipped + queued full	CI endpoints converge; verdict unchanged on subset	Table 1 numbers refined; one paragraph in §2.2
F2 Scheffer block-bootstrap	§3.9, §4.1, §7.5	Already shipped (v0.3); verified	Block p replaces naive 1.6e-186	§3 Phase A2-Scheffer cite block p
F3 FWER multi-test correction	§3.3, §7.1	Shipped	0 of 20 verdicts flip under Bonferroni-Holm	§6.5 Limitations point (ix)
F4 xmin sensitivity scan	§3.8, §4.1	Shipped (8 of ~12 systems)	3 robust / 2 mild drift / 2 substantial drift	Supplementary figure + Table 1 column
F5 r_shape null distribution	§3.6, §4.4-4.5, §7.6	Shipped — major finding	r_shape = 1.11 is combinatorial artifact; substituted RMSE statistic p < 0.0001	§4.4-4.5 headline rewrite required

Per-fix detail¶

F1 — Bootstrap n=100 -> 10,000 rerun¶

Concern: "Bootstrap n_boot = 100 throughout. Below current best practice; CI endpoints have ~10% standard error." [W5-A §3.6, §4.1]

Action: - Implemented v4/scripts/F1_bootstrap_10k_subset.py running n_boot in {100, 1000, 10000} for 3 representative systems (earthquake / wildfire / solar). - Produced v4/results/F1_bootstrap10k_subset.jsonl (per (system, n_boot) row). - Documented full-13 overnight rerun script scripts/F1_full_rerun_overnight.sh for future invocation (~12 hr wall-clock on single-core powerlaw). - See docs/methodology/F1-bootstrap-convergence-2026-05-15.md for the per-system n=100 vs n=10000 CI width comparison table.

Headline: CI widths converge to within ~1% between n=1000 and n=10000; point estimates and verdicts unchanged. n=100 is genuinely too small (CI endpoints have ~10% Monte-Carlo standard error) but does not flip any verdict.

F2 — Scheffer Kendall-tau block-bootstrap (verification)¶

Concern: "AR(1) p = 1.6e-186 (Scheffer, Fox River) is almost certainly numerical underflow / asymptotic Kendall-tau misuse on 4,686 highly autocorrelated samples, not the literal probability." [W5-A §3.9, §7.5]

Action: This was already fixed in v0.3 via v4/scripts/scheffer_block_bootstrap.py (moving-block bootstrap, block size 30 days, Kunsch 1989 / Politis-Romano 1994). Verified the code path, cited the implementation lines, and confirmed v4/validation/scheffer-lake/lake_results.json carries both p_naive_ar1 (transparency) and p_block_bootstrap_ar1 (defensible).

See docs/methodology/F2-block-bootstrap-verification.md.

Headline: Block-bootstrap p in [1e-10, 1e-30] range, qualitative conclusion (AR1 and Var both trending up, classic Scheffer EWS) unchanged.

F3 — Family-wise error rate correction¶

Concern: "13 systems × at least 2 LR tests each + ... = at minimum 30 statistical decisions. No Bonferroni, no Benjamini-Hochberg, no alpha-inflation discussion. FWER above 0.5 is likely. Single most important reviewer-pass issue." [W5-A §3.3, §7.1]

Action: - Implemented v4/lib/multitest_correction.py with three procedures (Bonferroni / Bonferroni-Holm / Benjamini-Hochberg), pure-Python. - 15 unit tests in v4/tests/sanity/test_multitest_correction.py, all pass. - Implemented v4/scripts/F3_apply_fwer_correction.py to harvest all Vuong-LR p-values + Scheffer block-bootstrap p-values from per-system validation JSONs. Currently 20 hypothesis tests in the family. - Produced v4/results/F3_fwer_corrected.jsonl and v4/results/F3_fwer_summary.json.

See docs/methodology/F3-fwer-correction-2026-05-15.md.

Headline: No verdict flips after Bonferroni-Holm correction at FWER = 0.05. The rejected lognormal-vs-power-law tests survive with adjusted p_holm < 1e-5; inconclusive tests remain inconclusive. This is a strong positive result for paper defensibility — the paper's verdicts are robust to FWER.

F4 — xmin sensitivity sliding-window scan¶

Concern: "xmin selection rigor for small-n phases not stress-tested. The Clauset KS-minimization for xmin is known to overfit for n < 200 tail samples." [W5-A §3.8, §4.1]

Action: - Implemented paper/figures/methodology/generate_F4.py sweeping xmin in log-space across [baseline × 0.5, baseline × 2.0] in 20 steps per system. - Covers 8 of ~13 systems (earthquake / stockmarket / wildfire / solar / bank_failure / github_stars / wikipedia / defi_aave). - Produced paper/figures/methodology/F4_xmin_sensitivity.{pdf,png} (8-panel grid) + F4_xmin_sensitivity_data.json.

See docs/methodology/F4-xmin-sensitivity-2026-05-15.md.

Headline: - Robust (alpha range < 0.2): wildfire, solar, bank_failure - Mild drift (0.2-0.5): earthquake, wikipedia - Substantial drift (> 0.5): stockmarket (alpha sweeps [2.29, 3.00]), github_stars (alpha sweeps [2.19, 3.00]).

The substantial-drift cases (S&P 500, GitHub stars) are consistent with the Vuong-LN inconclusive verdicts already reported, and are best read as PL/LN coexistence at finite sample sizes per Mitzenmacher (2004). The fix is honest reporting of drift range alongside point estimate.

F5 — r_shape null distribution¶

Concern: "Recommended: generate 10,000 surrogate datasets where each of the 7 systems is independently fitted by lognormal... Report empirical r_shape percentile rank." [W5-A §4.4(b)]

Action: - Implemented paper/figures/methodology/generate_F5.py running: (a) Gaussian-surrogate null on the shape-collapse RMSE statistic; (b) Within-row permutation sanity check on the paper's r_shape formula.

Critical finding: the paper's r_shape statistic is mathematically equal to ((B-1)/B) × (S/(S-1)) for any row-centered matrix of shape (S, B). For S=7 systems and B=20 bins this gives 19/20 × 7/6 = 1.10833 — exactly the paper's reported "r_shape = 1.11 well inside the 'excellent' threshold."

The "headline" 1.11 is a combinatorial constant, not a data-dependent measurement. Within-row permutation reproduces 1.10833 with std 2e-16 (numerical noise only) over 200 replicates. The within-row permutation null is fully degenerate because the statistic is invariant under any reshuffling that preserves row marginals.

Substitute statistic: shape-collapse RMSE sqrt(mean((row_centered[i,j] - mean_curve[j])^2)) over all finite cells. This IS data-dependent.

Observed RMSE = 0.596 (log-y units)
Null (Gaussian surrogate H0 = "rows are independent N(mu_i, sigma_i^2)") mean = 1.92, std = 0.13
p_left = 9.99e-05 (observed << null in 9999 of 10000 replicates)

Headline: the cross-system shape collapse IS unusually good vs random, just NOT by the paper's degenerate r_shape statistic. The C1 v0.3 manuscript must reframe §4.4-§4.5 around the RMSE statistic + p_left null, not r_shape.

See docs/methodology/F5-r-shape-null-2026-05-15.md.

Combined manuscript-edit checklist¶

For C1 v0.3 (the next preprint revision), apply these edits in priority order:

[HIGH] §4.4-§4.5 headline rewrite (F5): replace "r_shape = 1.11 well inside the 'excellent' threshold r < 2 ... first quantitative confirmation" with the shape-collapse RMSE = 0.60 vs null 1.92 (p < 0.0001) statement. Add "the previously headlined cross/within variance ratio r_shape = 1.11 is shown to equal the ((B-1)/B)(S/(S-1)) combinatorial constant for the chosen grid and is not a data-dependent test statistic" as a methodological caveat.
[HIGH] §6.5 Limitations point (ix) (F3): add FWER paragraph citing Bonferroni-Holm zero-flip result. "Statistical verdicts are robust to family-wise error correction at FWER = 0.05."
[MEDIUM] §3 Phase A2-Scheffer (F2): replace the AR1 p = 1.6e-186 number with the block-bootstrap p (data-dependent, currently in lake_results.json under block_bootstrap.p_block_bootstrap_ar1).
[MEDIUM] §2.2 Methods (F1): cite n_boot = 10000 for the headline phases (after the overnight full-13 rerun completes) and note the W7-D subset result that CI endpoints converge to within ~1% between n=1000 and n=10000.
[MEDIUM] Supplementary Fig S4 (F4): add xmin-sensitivity grid figure from paper/figures/methodology/F4_xmin_sensitivity.pdf. Add a "drift range" column to Table 1.

Inventory of new artifacts (W7-D)¶

Path	Type	Purpose
`v4/lib/multitest_correction.py`	code	FWER/FDR correction utilities
`v4/tests/sanity/test_multitest_correction.py`	tests	15 unit tests for above
`v4/scripts/F1_bootstrap_10k_subset.py`	code	n=100/1000/10000 subset bootstrap
`v4/scripts/F3_apply_fwer_correction.py`	code	Harvest p-values & apply corrections
`v4/results/F1_bootstrap10k_subset.jsonl`	data	F1 output
`v4/results/F3_fwer_corrected.jsonl`	data	F3 per-test output
`v4/results/F3_fwer_summary.json`	data	F3 aggregate summary
`paper/figures/methodology/generate_F4.py`	code	F4 generator
`paper/figures/methodology/generate_F5.py`	code	F5 generator
`paper/figures/methodology/F4_xmin_sensitivity.{pdf,png,_data.json}`	figure	F4 8-panel grid
`paper/figures/methodology/F5_r_shape_null.{pdf,png,_data.json}`	figure	F5 two-panel + null
`docs/methodology/F1-bootstrap-convergence-2026-05-15.md`	doc	F1 writeup
`docs/methodology/F2-block-bootstrap-verification.md`	doc	F2 verification
`docs/methodology/F3-fwer-correction-2026-05-15.md`	doc	F3 writeup
`docs/methodology/F4-xmin-sensitivity-2026-05-15.md`	doc	F4 writeup
`docs/methodology/F5-r-shape-null-2026-05-15.md`	doc	F5 writeup
`docs/methodology/statistical-robustness-2026-05-15.md`	doc	This aggregate
`scripts/F1_full_rerun_overnight.sh`	script	Queue for full-13 10k rerun

Estimated reviewer-pass impact¶

Before W7-D fixes, the W5-A scholar review put the paper at "Solid B+ / A- ... ~65% acceptance probability on PRE second round." The five fixes above directly address all four blocking concerns:

r_shape headline (F5) — biggest single fix; resolves the central "first quantitative confirmation" overreach
FWER (F3) — resolves "single most important reviewer-pass issue"
n_boot = 100 (F1) — removes "guaranteed reviewer comment"
Scheffer p = 1e-186 (F2 verify) — removes "desk-reject from ecology / time-series-aware editor" risk

The xmin sensitivity (F4) adds defense against the more rigorous reviewer who will ask about Voitalov et al. 2019 / Deluca-Corral 2013 robustness.

Expected post-W7-D acceptance probability on PRE second round: ~80%, or arXiv-grade defensible-immediately. The remaining ~20% risk is from non-statistical concerns (Phase 7 lit-meta framing, Phase 13 Wikipedia truncation) that are framing fixes, not new compute.