Ab-Testing

July 23, 2023 2-minute read

product-analyst

Designing an A/B Testing

Selecting Metrics for the Experiment

Tip

Experiment Metrics Criteria:

Measurable within the experiment period.
Attributable to the change of the experiment variants.
Sensitive enough to detect change that matter in timely fashion

Focus on Driver metrics and Guardrail metrics not the Goal metrics

Driver Metrics

Surrogate or Indirect metrics, leading indicators, focus on short term objective. align with goal metrics, more sensitive and actionable

Examples: Marketing goal: acquire new user Driver metric: # of acquired user / day

source: https://www.convert.com/blog/a-b-testing/ab-testing-metrics-guide

Guardrail Metrics

Guard from harming the business and violating assumption

Organizational Guardrail if negative, the business will suffer significant loss example:
- Website/App performance Latency: wait times for pages to load.
- Error Logs: number of error messages.
- Client Crashes: crashes per user.
- Business goals
- Revenue: revenue per user and total revenue.
- Engagement: e.g., time spent per user, daily active users (DAU), and page views per user
Trustworthy Related Check violation of Assumption
- check if randomization units is truly random (use t-test or chi-squared test)
- Sample Rate Missmatch (SRM)

In practice, context is important for metrics. one team driver metrics can become other team guardrail metrics. For example:

Front-end team: Goal: reducing latency Driver metric: time to interactive (TTI) When running A/B test, Product Team can use FE team driver metric as the guardrail metric.

Selecting Randomization Unit

Choosing Target Population

Computing Sample Size

Determine Test Duration

Analyze Results

Sanity Check

Hypothesis test

Statistical and Practical Significance

Practically Significant (p-value < 0.5, and practically significant)
1. Decision: Launch
Not practically significant
1. Case1: Change is not statistically significant and not practically significant (iterate or abandon this idea)
2. Case2: Statistically significant but not practically significant. If Implementation is costly, maybe it’s not worth it to launch, but if it’s low then no harm on launching.
Likely statistically/practically Significant
1. Case1
statistically significant and likely practically significant