Problem
Teams running A/B tests often rely on short-term proxy metrics (e.g., click-through rate) to make shipping decisions about long-term outcomes (e.g., retention). If the proxy is unreliable — due to sign flips, segment-level paradoxes, or weak correlation — teams unknowingly ship regressions. There is no standard framework for quantifying whether a proxy metric is trustworthy enough to base decisions on.
Approach
PROXIMA introduces a formal framework for proxy metric validation. It computes a composite reliability score R ∈ [0,1] based on three pillars: directional agreement (does the proxy agree with the oracle on which variant wins?), rank preservation (Kendall's τ_b across experiments), and segment-level fragility detection (Simpson's Paradox reversals where the proxy says one thing for the whole population but the opposite for subgroups). A shipping decision simulator then estimates the expected regret of trusting the proxy.
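The decision simulator's regret estimate can be sketched in a few lines: compare the oracle value of proxy-driven ship decisions against the oracle-optimal decisions. This is an illustrative reconstruction, not the paper's Algorithm 1; the decision rule (ship when the proxy effect is positive) and the zero baseline for not shipping are assumptions.

```python
import numpy as np

def expected_regret(proxy_effects, oracle_effects):
    """Regret of a proxy-driven policy (ship when the proxy effect is
    positive) versus the oracle-optimal policy, in oracle units."""
    o = np.asarray(oracle_effects, dtype=float)
    ship = np.asarray(proxy_effects, dtype=float) > 0  # proxy says treatment wins
    realized = np.where(ship, o, 0.0)  # oracle value of the proxy's decision
    best = np.maximum(o, 0.0)          # oracle value of the optimal decision
    return float(np.mean(best - realized))

# Hypothetical per-experiment effects (treatment minus control).
proxy  = [0.02, -0.01, 0.03, 0.005, -0.02]
oracle = [0.01, -0.02, 0.04, -0.003, -0.01]
# Experiment 4 is a false ship (proxy positive, oracle negative),
# so the average regret is small but nonzero.
print(expected_regret(proxy, oracle))
```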
┌─────────────────────────────────────────────────────────┐
│ A/B Test Platform │
│ (proxy metric P, long-term oracle O) │
└──────────────────────┬──────────────────────────────────┘
│
┌──────────────────────▼──────────────────────────────────┐
│ PROXIMA Pipeline │
│ │
│ ┌─────────────┐ ┌──────────────┐ ┌───────────────┐ │
│ │ Directional │ │ Rank │ │ Segment │ │
│ │ Agreement │ │ Preservation │ │ Fragility │ │
│ │ (DA) │ │ (τ_b) │ │ (Simpson's) │ │
│ └──────┬──────┘ └──────┬───────┘ └──────┬────────┘ │
│ │ │ │ │
│ ┌──────▼────────────────▼──────────────────▼────────┐ │
│ │ Composite Reliability Score │ │
│ │ R ∈ [0, 1] │ │
│ │ R = w₁·DA + w₂·τ_b + w₃·(1 - fragility) │ │
│ └──────────────────────┬────────────────────────────┘ │
│ │ │
│ ┌──────────────────────▼────────────────────────────┐ │
│ │ Shipping Decision Simulator │ │
│ │ Estimates regret and false-ship rate │ │
│ └───────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
How it works
Directional Agreement
For each simulated experiment, PROXIMA checks whether the proxy metric P and the long-term oracle O agree on which variant (treatment vs. control) is better. Across 80 simulated experiments on two datasets, proxy and oracle agree on the winning variant in 98.4% of cases; the remaining disagreements are precisely the experiments where trusting the proxy alone would have led to the wrong shipping decision, and PROXIMA flags them.
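The directional-agreement pillar reduces to a sign comparison of per-experiment treatment effects. A minimal sketch with hypothetical data (treatment-minus-control deltas; the effect values below are made up for illustration):

```python
import numpy as np

def directional_agreement(proxy_effects, oracle_effects):
    """Fraction of experiments where proxy and oracle agree on the
    sign of the treatment effect, i.e., on which variant wins."""
    return float(np.mean(np.sign(proxy_effects) == np.sign(oracle_effects)))

# Hypothetical per-experiment effects (treatment minus control).
proxy  = np.array([0.02, -0.01, 0.03, 0.005, -0.02])
oracle = np.array([0.01, -0.02, 0.04, -0.003, -0.01])
print(directional_agreement(proxy, oracle))  # signs match in 4 of 5 -> 0.8
```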
Simpson's Paradox Detection
PROXIMA segments users by covariates (e.g., device type, region, tenure) and checks for sign reversals: cases where the proxy shows a positive effect overall but a negative effect in one or more subgroups. These reversals indicate structural fragility in the metric. The framework reports which segments are reversed and their sample sizes.
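A simplified sketch of the reversal check, assuming per-user proxy lifts in a DataFrame (the actual framework's covariate handling follows the paper; the column names and data here are hypothetical):

```python
import pandas as pd

def simpson_reversals(df, effect_col, segment_col):
    """Segments whose mean effect sign disagrees with the pooled effect,
    reported together with their sample sizes."""
    overall = df[effect_col].mean()
    per_seg = df.groupby(segment_col)[effect_col].agg(["mean", "size"])
    return per_seg[per_seg["mean"] * overall < 0]

users = pd.DataFrame({
    "segment": ["mobile"] * 4 + ["desktop"] * 4,
    "lift":    [0.05, 0.04, 0.06, 0.05, -0.02, -0.01, -0.03, -0.02],
})
flagged = simpson_reversals(users, "lift", "segment")
print(flagged)  # desktop: mean -0.02 vs pooled +0.015 -> sign reversal
```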
Composite Reliability Score
The three pillars are combined into a single score R ∈ [0,1] via a weighted formula: R = w₁·DA + w₂·τ_b + w₃·(1 - fragility). Proposition 1 in the paper proves that R is monotonically decreasing in proxy unreliability under mild regularity conditions. Algorithm 1 provides the full computation procedure.
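Putting the three pillars together, the composite score can be sketched as below. The equal weights and the clipping of τ_b at zero (to keep R in [0,1]) are assumptions of this sketch; the paper's Algorithm 1 specifies the actual procedure.

```python
import numpy as np
from scipy.stats import kendalltau

def reliability_score(proxy_effects, oracle_effects, fragility,
                      weights=(1 / 3, 1 / 3, 1 / 3)):
    """Composite reliability R = w1*DA + w2*tau_b + w3*(1 - fragility)."""
    w1, w2, w3 = weights
    da = float(np.mean(np.sign(proxy_effects) == np.sign(oracle_effects)))
    tau_b, _ = kendalltau(proxy_effects, oracle_effects)  # variant 'b' by default
    tau_b = max(0.0, tau_b)  # clip so R stays in [0, 1]; an assumption here
    return w1 * da + w2 * tau_b + w3 * (1.0 - fragility)

proxy  = np.array([0.02, -0.01, 0.03, 0.005, -0.02])
oracle = np.array([0.01, -0.02, 0.04, -0.003, -0.01])
R = reliability_score(proxy, oracle, fragility=0.1)
print(round(R, 3))  # DA = 0.8, tau_b = 0.8, fragility = 0.1 -> R ~ 0.833
```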
Evaluation Datasets
Evaluated on two public datasets: Criteo Uplift (14 million observations, 50 simulated experiments) for e-commerce and KuaiRec (7,176 users, 30 simulated experiments) for recommendation systems. Both datasets provide ground-truth long-term outcomes, enabling direct oracle comparison.
Lessons learned
The hardest part wasn't the statistics — it was defining what 'reliable' means precisely enough to compute. The framework went through three complete redesigns of the scoring function before converging on the composite R. I also learned that evaluation on public datasets is necessary for credibility, but the datasets that have both proxy and long-term outcomes are rare. Criteo and KuaiRec were the only viable options.
Timeline
Research started in June 2025. The result is a 14-page paper with formal definitions, Proposition 1, and Algorithm 1. A Zenodo DOI has been registered, and an arXiv submission to stat.ME is pending. Code and data are publicly available on GitHub.