The p-value calculation is central to Uber's statistics engine, since the p-value directly determines whether the XP reports a result as significant. In a typical A/B test, the XP compares the p-value to the desired false positive rate (Type I error), commonly 0.05. The XP leverages several procedures for p-value calculation, including:
- Welch’s t-test, the default test used for continuous metrics, e.g., completed trips.
- The Mann-Whitney U test, a nonparametric rank sum test used when the data exhibit severe skewness. It requires weaker assumptions than the t-test and performs better on skewed data.
- The Chi-squared test, used for proportion metrics, e.g., rider retention rate.
- The Delta method (Deng et al. 2011) and bootstrap methods, used for standard error estimation where appropriate to produce robust results for experiments with ratio metrics (e.g., the fraction of trips cancelled by riders) or with small sample sizes.
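To make the first three procedures concrete, here is a minimal sketch of how each test might be run with SciPy. The metric data and contingency counts are invented for illustration; they are not Uber's data or pipeline code.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical continuous metric (e.g., completed trips per user)
# for a control group and a treatment group.
control = rng.normal(10.0, 2.0, size=500)
treatment = rng.normal(10.4, 2.5, size=500)

# Welch's t-test: equal_var=False drops the equal-variance assumption
# of the classic two-sample t-test.
t_stat, t_p = stats.ttest_ind(treatment, control, equal_var=False)

# Mann-Whitney U test: rank-based, so it is robust to heavy skew.
u_stat, u_p = stats.mannwhitneyu(treatment, control, alternative="two-sided")

# Chi-squared test for a proportion metric (e.g., rider retention):
# rows are groups, columns are (retained, churned) counts.
contingency = np.array([[440, 60],    # control
                        [455, 45]])   # treatment
chi2, chi_p, dof, expected = stats.chi2_contingency(contingency)

print(f"Welch p={t_p:.4f}, Mann-Whitney p={u_p:.4f}, chi-squared p={chi_p:.4f}")
```

Each p-value would then be compared against the chosen Type I error rate (e.g., 0.05) to decide significance.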
On top of these calculations, Uber uses multiple comparison correction (the Benjamini-Hochberg procedure) to control the overall false discovery rate (FDR) when there are two or more treatment groups (e.g., in an A/B/C test or an A/B/N test).
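The Benjamini-Hochberg step can be sketched as follows. This is a textbook implementation of the procedure, not Uber's code: sort the p-values, find the largest rank k with p_(k) &lt;= (k/m)·alpha, and reject the k smallest hypotheses.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean mask of hypotheses rejected at FDR level alpha."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order]
    # Largest k such that p_(k) <= (k/m) * alpha; reject hypotheses 1..k.
    thresholds = (np.arange(1, m + 1) / m) * alpha
    below = ranked <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

# p-values from, say, an A/B/C test comparing several treatments to control.
print(benjamini_hochberg([0.001, 0.02, 0.04, 0.30]))
# → [ True  True False False]
```

Note that 0.04 is below the unadjusted 0.05 threshold but is not rejected, because its BH threshold at rank 3 of 4 is (3/4)·0.05 = 0.0375; this is exactly the FDR control the correction provides.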