
In Part 1 of this series, we introduced the within-unit peeking problem that we call the "peeking problem 2.0". We showed that moving from single to multiple observations per unit in analyses of experiments introduces new challenges and pitfalls with regard to sequential testing. We discussed the importance of being clear about the distinctions between measurement, metric, estimand, and estimator, and we exemplified those distinctions with so-called cohort-based metrics and open-ended metrics. In a small simulation, we showed that standard sequential tests combined with open-ended metrics can lead to inflated false positive rates (FPRs).

In Part 2 of the series, we share some learnings from implementing sequential tests based on data with multiple observations per unit, i.e., so-called longitudinal data. We cover:

- Why we need to pay extra attention to longitudinal data in sequential testing.
- The mental switch from "metrics" to "models".
- The challenges of compressing data to scale result calculations across thousands of experiments.
- How group sequential tests (GSTs) apply to longitudinal data models, and some GST-specific aspects to be mindful of.
- Practical considerations for running sequential tests with longitudinal data.
- Some concluding remarks on how Spotify applies sequential tests on longitudinal data experiments.

Why do we need to pay extra attention to longitudinal data in sequential testing?

It's not uncommon to use longitudinal data in online experimentation, and typically this requires no special attention when using fixed-horizon analysis. However, combined with sequential testing, longitudinal data quickly introduces particular challenges.

To derive fixed-horizon tests for any test statistic, it's sufficient to know its marginal sampling distribution. For sequential tests, we instead need to know the dependence structure of the vector of test statistics over the intermittent analyses. With one observation per unit, the dependence between test statistics only comes from the overlapping sample of independent units. With longitudinal data, the covariance between the test statistics at two intermittent analyses comes both from the overlapping sample between the analyses and from new measurements of the same units.

We typically operate under an assumption of independent measurements between units - but not within units. This might seem like a subtle difference, but for sequential testing it complicates things: the dependency within units complicates the covariance structure of the test statistics in a sequential test, which, in turn, makes it more complicated to derive valid and efficient ways of conducting sequential inference.
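To make this concrete, here is a minimal Monte Carlo sketch - our own illustration with made-up sample sizes and a made-up within-unit correlation, not Spotify's implementation. Every unit contributes one measurement per week and no new units arrive, so any extra dependence between the two analyses comes purely from new measurements of the same units.

```python
# Minimal illustration (hypothetical setup): how within-unit correlation changes
# the dependence between the test statistics of two intermittent analyses.
# Analysis 1 uses the week-1 measurements; analysis 2 uses all measurements so far.
import numpy as np

rng = np.random.default_rng(7)
n_units, n_sims = 500, 5_000

def corr_between_analyses(within_unit_corr: float) -> float:
    cov = [[1.0, within_unit_corr], [within_unit_corr, 1.0]]
    # Shape: (simulations, units, weeks). Zero-mean measurements, purely illustrative.
    y = rng.multivariate_normal([0.0, 0.0], cov, size=(n_sims, n_units))
    z1 = y[:, :, 0].mean(axis=1)   # analysis 1: mean of week-1 measurements
    z2 = y.mean(axis=(1, 2))       # analysis 2: mean of all measurements so far
    return float(np.corrcoef(z1, z2)[0, 1])

# Independent measurements: correlation ~ sqrt(1/2) ≈ 0.71, driven only by
# the overlap in information between the two analyses.
print("corr, no within-unit correlation: ", round(corr_between_analyses(0.0), 3))

# Within-unit correlation rho: correlation grows to ~ sqrt((1 + rho) / 2),
# i.e., ≈ 0.89 for rho = 0.6.
print("corr, within-unit correlation 0.6:", round(corr_between_analyses(0.6), 3))
```

The point of the sketch is simply that the joint distribution of the test statistics depends on the within-unit dependence, which is exactly the part a sequential test needs to get right.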

The mental switch from "metrics" to "models"

When extending experimentation with sequential testing from cross-sectional to longitudinal data, several things change. In the case of a single observation per unit, it's often possible to have the same "metric" (e.g., a mean or sum of some aspect of a unit) in a dashboard as in an experiment. We can use the difference-in-means estimator to estimate the average treatment effect, and there is a simple and beautiful mapping between the change in our experiment and the expected shift in the dashboard metric if we roll out the A/B-tested experience to the population. When diving into the longitudinal data literature, however, it's clear that this current view of metrics in the online experimentation community is insufficient.
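Here is a minimal sketch of that mapping in the single-observation case. The numbers are entirely made up (the "minutes played" framing, sample sizes, and effect size are hypothetical), and it assumes the dashboard metric is the same per-unit mean over the same population as the experiment metric.

```python
# Hypothetical single-observation-per-unit example: the difference-in-means
# estimate from an experiment maps directly onto the expected shift in the
# matching dashboard metric under a full rollout.
import numpy as np

rng = np.random.default_rng(11)
control = rng.normal(loc=10.0, scale=3.0, size=50_000)     # e.g., minutes played per user
treatment = rng.normal(loc=10.4, scale=3.0, size=50_000)   # true effect of +0.4 (made up)

ate_hat = treatment.mean() - control.mean()                 # difference-in-means estimator
dashboard_now = control.mean()                              # today's population-level metric
dashboard_after_rollout = dashboard_now + ate_hat           # expected metric after full rollout

print(f"estimated ATE: {ate_hat:.2f}")
print(f"dashboard metric: {dashboard_now:.2f} -> expected {dashboard_after_rollout:.2f}")
```

With multiple observations per unit, that one-to-one correspondence between the experiment "metric" and the dashboard metric is no longer automatic, which is one reason the metric-centric view becomes insufficient.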
