Always check whether the result is practically significant, not just statistically significant. A 0.01% improvement might be "real" but not worth shipping.
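The distinction can be shown with a minimal Python sketch. The conversion counts and the 0.5-percentage-point practical-significance threshold below are assumed for illustration, not taken from any real test:

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Pooled two-sided z-test for a difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return p_b - p_a, p_value

# Hypothetical numbers: with a million users per arm, a 0.1-point lift
# clears p < 0.05 yet may still not be worth shipping.
lift, p_value = two_proportion_ztest(100_000, 1_000_000, 101_000, 1_000_000)
MIN_PRACTICAL_LIFT = 0.005  # assumed threshold: 0.5 percentage points
print(f"lift={lift:.4f}, p={p_value:.4f}")
print("statistically significant:", p_value < 0.05)
print("practically significant:", lift >= MIN_PRACTICAL_LIFT)
```

With a large enough sample, almost any nonzero lift becomes statistically significant, which is exactly why the practical threshold has to be set separately.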
Look for: defining success metrics before the test, calculating sample size for statistical power, checking for statistical significance (p-values, confidence intervals), watching for novelty effects, ensuring proper randomisation, considering practical significance vs statistical significance, and segmenting results. Pitfalls: peeking at results early, multiple comparisons, Simpson's paradox.
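The sample-size step above can be sketched with the standard normal-approximation formula for a two-proportion test; the 10% baseline rate and 1-point minimum detectable effect are assumed for illustration:

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_base, mde, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided two-proportion test
    (normal approximation, equal allocation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_variant = p_base + mde
    variance = p_base * (1 - p_base) + p_variant * (1 - p_variant)
    n = variance * (z_alpha + z_beta) ** 2 / mde ** 2
    return math.ceil(n)

# Hypothetical baseline: 10% conversion, detecting a 1-point absolute lift.
n = sample_size_per_arm(0.10, 0.01)
print(n)  # roughly 15k users per arm
```

Running this before launch also gives a fixed stopping point, which is the simplest guard against the peeking pitfall.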
Core data analyst skill. Strong candidates mention both statistical and practical significance. Ask: "The test is significant at p=0.04 but the effect size is tiny. What do you recommend?"