The Uber experimentation team leverages two main methodologies to perform sequential testing for metrics monitoring purposes: the mixture sequential probability ratio test (mSPRT) and variance estimation with FDR.
Mixture Sequential Probability Ratio Test
The most common method they use for monitoring is mSPRT. This test builds on the likelihood ratio test by incorporating an extra specification of mixing distribution H. Suppose we are testing the metric difference with the null hypothesis being $\theta$ , then the test statistics could be written as:
Since we have large sample sizes and the central limit theorem can be applied to most cases, we use normal distribution as our mixing distribution, $H ~ N(0,r^2)$. This leads to easy computation and a closed form expression for
Another useful property about this method is under null hypothesis, nH, 0 is proven to be a martingale:
. Following this, we could construct $(1 - \alpha)$ confidence interval.
Variance estimation with FDR control
To apply sequential testing correctly, we need to estimate variance as accurately as possible. Since we monitor the cumulative difference between our control and treatment groups on a daily basis, observations from the same users introduce correlations which violate the assumption of the mSPRT test. For example, if we are monitoring click through rates, then the metric from one user across multiple days may be correlated. To overcome this, we use delete-a-group jackknife variance estimation/block bootstrap methods to generalize mSPRT test under correlated data.
Since our monitoring system wants to evaluate the overall health of an ongoing experiment, we monitor many business metrics at the same time, potentially leading to false alarms. In theory, either the Bonferroni or BH correction could be applied in this scenario. However, since the potential loss of missing business degradations can be substantial, we apply BH correction here and also tune in parameters (MDE, power, tolerance for practical significance, etc.) for metrics with varying levels of importance and sensitivity.