Designing an A/B Testing
Selecting Metrics for the Experiment
Tip
Experiment Metrics Criteria:
- Measurable within the experiment period.
- Attributable to the change of the experiment variants.
- Sensitive enough to detect change that matter in timely fashion
Focus on Driver metrics and Guardrail metrics not the Goal metrics
Driver Metrics
Surrogate or Indirect metrics, leading indicators, focus on short term objective. align with goal metrics, more sensitive and actionable
Examples: Marketing goal: acquire new user Driver metric: # of acquired user / day
source: https://www.convert.com/blog/a-b-testing/ab-testing-metrics-guide
Guardrail Metrics
Guard from harming the business and violating assumption
-
Organizational Guardrail if negative, the business will suffer significant loss example:
- Website/App performance Latency: wait times for pages to load.
- Error Logs: number of error messages.
- Client Crashes: crashes per user.
- Business goals
- Revenue: revenue per user and total revenue.
- Engagement: e.g., time spent per user, daily active users (DAU), and page views per user
-
Trustworthy Related Check violation of Assumption
- check if randomization units is truly random (use t-test or chi-squared test)
- Sample Rate Missmatch (SRM)
In practice, context is important for metrics. one team driver metrics can become other team guardrail metrics. For example:
Front-end team: Goal: reducing latency Driver metric: time to interactive (TTI) When running A/B test, Product Team can use FE team driver metric as the guardrail metric.
Selecting Randomization Unit
Choosing Target Population
Computing Sample Size
Determine Test Duration
Analyze Results
Sanity Check
Hypothesis test
Statistical and Practical Significance
- Practically Significant
(p-value < 0.5, and practically significant)
- Decision: Launch
- Not practically significant
- Case1: Change is not statistically significant and not practically significant (iterate or abandon this idea)
- Case2: Statistically significant but not practically significant. If Implementation is costly, maybe it’s not worth it to launch, but if it’s low then no harm on launching.
- Likely statistically/practically Significant
- Case1
- statistically significant and likely practically significant