Customizing Algorithms to Drive Better Outcomes

By Dr. Douglas Popken,
SVP of Analytics


Changing consumer behavior is complicated. Many factors can affect health-related behavior, both systematically (weather, changing regulations, general economic conditions, cultural influences, etc.) and through individual random variation (health status, mood, previously held beliefs, financial circumstances, etc.). Healthcare marketing teams craft messages to influence consumer behavior, but even when they see an impact, they often cannot pinpoint whether their messages or some other factor caused the change. It then becomes difficult, if not impossible, to know how to allocate intervention resources efficiently.

At NextHealth, we designed our automated platform and methodology to address this critical issue. NextLift, a major component of the platform, includes a module that calculates both the lift and the statistical significance of each campaign’s (or “nudge’s”) impact on consumer behavior. By using randomized controlled trials (RCTs), we can isolate whether a change is due to the nudge itself or to some other systematic factor. By considering significance, we can also judge whether the measured lift simply reflects random variation in the data.

In statistics, one way to distinguish an underlying effect from random variation is to use a measure called the “p-value”. The p-value is a probability: the probability of observing a lift at least as extreme as the one actually observed if the “null hypothesis” (that the true lift is equal to zero) were true. Combined with a randomized design, the p-value provides a scientific basis for claiming causality. To claim success, the p-value needs to be small; at NextHealth, our baseline statistical standard is that if p <= .05, the observed lift can be attributed to the nudge.
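The definition above can be written compactly. This is only a notational sketch: the symbols are assumptions introduced here, with L the true lift, \hat{L} the lift estimated from a hypothetical repetition of the trial, and \hat{L}_{obs} the lift actually observed. The two-sided form (absolute values) is also an assumption, since the text does not say whether a one-sided or two-sided test is used.

```latex
p \;=\; \Pr\!\left( \lvert \hat{L} \rvert \,\ge\, \lvert \hat{L}_{\mathrm{obs}} \rvert \;\middle|\; H_0 : L = 0 \right),
\qquad \text{attribute the lift to the nudge if } p \le 0.05 .
```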

Lift is measured via a statistical “comparison of means” test, in which the mean KPI (Key Performance Indicator) for the trial group is compared with that of the control group. The standard statistical approach for determining the significance (and confidence intervals) of a comparison of means is some form of “two-sample t-test”. The drawback to unmodified use of these traditional techniques is that they rely on assumptions about the underlying data, in particular that the data follow a normal distribution. The medical utilization and cost data we typically encounter with our clients, however, are often highly skewed, with both a long right-hand tail and many zero values. The two-sample t-test is known to be highly robust to non-normality when the data sets are large enough and/or the data are not too severely non-normal, but in practice it is difficult to know when these conditions have been met. For these reasons, NextHealth now uses a modern, robust methodology known as the bootstrap technique (Efron and Tibshirani, 1993) to determine the p-value and confidence intervals for the lift value.

The bootstrap method is based on repeated resampling to simulate comparison outcomes, with samples drawn from the empirical distribution of the observed trial and control data. Its key advantage is that it requires no assumption of normality and is therefore highly appropriate for skewed medical data. It reduces the motivation to transform the data before analysis to make them less non-normal (e.g., log transformations or trimmed-mean approaches), allowing direct comparison of the true means of the two groups. To compute p, we use bootstrap samples of the t-statistic (each of which is computed as described below) for comparison of the population means of the two groups. A variance stabilization technique is automatically applied to achieve the highest accuracy. To compute confidence intervals on the lift, we use a bootstrap distribution of mean lift values with the bias and skewness corrections known as BCa. The specific techniques we use are described in greater detail in Barber and Thompson (2000). See also Efron (1987).
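To make the resampling idea concrete, here is a minimal sketch of a textbook bootstrap-t test for a difference in means. It is not NextHealth's implementation: it omits the variance stabilization, the BCa confidence intervals, and any duration weighting, and it assumes a two-sided test. The function names, the simulated data, and the choice to shift both groups to the pooled mean (so that the null hypothesis holds during resampling) are all assumptions made for illustration.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def welch_t(a, b):
    """Welch's t-statistic for a difference in means (unequal variances)."""
    diff = mean(a) - mean(b)
    se = math.sqrt(sample_variance(a) / len(a) + sample_variance(b) / len(b))
    if se == 0.0:  # degenerate resample: no spread in either group
        return 0.0 if diff == 0.0 else math.copysign(math.inf, diff)
    return diff / se


def bootstrap_t_pvalue(trial, control, n_boot=2000, seed=0):
    """Two-sided bootstrap p-value for H0: the population means are equal."""
    rng = random.Random(seed)
    t_obs = welch_t(trial, control)

    # Shift each group to the pooled mean so that H0 is true in the
    # resampling world, while each group keeps its own shape and variance.
    pooled = mean(list(trial) + list(control))
    m_trial, m_control = mean(trial), mean(control)
    trial0 = [x - m_trial + pooled for x in trial]
    control0 = [x - m_control + pooled for x in control]

    extreme = 0
    for _ in range(n_boot):
        # Resample each group with replacement from its empirical distribution.
        t_star = welch_t(rng.choices(trial0, k=len(trial0)),
                         rng.choices(control0, k=len(control0)))
        if abs(t_star) >= abs(t_obs):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return t_obs, (extreme + 1) / (n_boot + 1)


# Hypothetical annual cost data: many zeros plus a long right-hand tail.
_rng = random.Random(42)


def simulate_costs(n, p_zero, mean_cost):
    return [0.0 if _rng.random() < p_zero else _rng.expovariate(1.0 / mean_cost)
            for _ in range(n)]


trial = simulate_costs(200, 0.6, 800.0)
control = simulate_costs(200, 0.6, 1000.0)
t_obs, p = bootstrap_t_pvalue(trial, control)
print(f"t = {t_obs:.3f}, bootstrap p = {p:.4f}")
```

Because the reference distribution of the t-statistic is built from the data themselves, no normality assumption enters the p-value, which is the property the paragraph above relies on.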

To compute the t-statistic for each bootstrap sample, NextHealth uses a modification of a well-known statistical test for comparing the means of two populations with unequal variances, “Welch’s t-test”. Other methodologies are available, but research has shown that Welch’s t-test generally provides the most accurate results for the type of data we typically work with (Fagerland and Sandvik, 2009). The modification NextHealth has made to the standard Welch’s t-test is to weight the observations in each group by the duration of the observation period (in years). One immediate advantage is that the duration-weighted mean is equivalent to the population mean KPI per member year (total KPI value/total member years), which is the most relevant statistic for our analyses. Another motivation for weighting is that our underlying observations are annualized, but the observation periods themselves vary in length. Without weighting, a non-zero observation from a short observation period can become relatively large when annualized, giving the dataset an unusually high variance, particularly in the early stages of a campaign/program. A similar weighting approach is described in Bland and Kerry (2008) for a t-test that assumes a single pooled variance for the samples. To compute the weighted mean and weighted variance of each group, we rely on the equations for reliability weights (the subsection “Weighted Sample Variance…Reliability Weights” in the original reference). The weighted means and weighted variances computed from these equations then replace the unweighted mean and variance parameters in Welch’s t-test.
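The weighting can be sketched in a few lines. This is an illustration under stated assumptions rather than NextHealth's code: the function names and the sample data are hypothetical, the variance uses the standard unbiased estimator for reliability weights (weighted sum of squared deviations divided by V1 - V2/V1, where V1 is the sum of the weights and V2 the sum of their squares), and the group sizes in the Welch denominator are assumed to stay as simple observation counts, since the text only says that the mean and variance are replaced.

```python
import math


def weighted_mean(values, weights):
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def weighted_variance(values, weights):
    """Unbiased weighted sample variance for reliability weights."""
    v1 = sum(weights)
    v2 = sum(w * w for w in weights)
    mu = weighted_mean(values, weights)
    ss = sum(w * (x - mu) ** 2 for x, w in zip(values, weights))
    return ss / (v1 - v2 / v1)


def weighted_welch_t(x1, w1, x2, w2):
    """Welch's t-statistic with duration-weighted means and variances.

    Assumption: n1 and n2 remain plain observation counts.
    """
    m1, m2 = weighted_mean(x1, w1), weighted_mean(x2, w2)
    s1, s2 = weighted_variance(x1, w1), weighted_variance(x2, w2)
    return (m1 - m2) / math.sqrt(s1 / len(x1) + s2 / len(x2))


# Hypothetical members: raw KPI totals and observation periods in years.
kpi_totals = [100.0, 0.0, 300.0, 50.0]
years = [0.25, 1.0, 0.5, 1.0]
annualized = [k / y for k, y in zip(kpi_totals, years)]

# The duration-weighted mean of the annualized values equals
# total KPI / total member years; the unweighted mean does not.
print(weighted_mean(annualized, years))
print(sum(kpi_totals) / sum(years))
print(sum(annualized) / len(annualized))
```

In this made-up data, the member observed for a quarter of a year contributes an annualized value of 400 from a raw total of 100. Weighting by duration keeps that short observation from dominating the group mean, which is the concern about early-stage campaigns raised above.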

Measuring program success in the face of skewed outcome data with a high degree of random variation is difficult.  NextHealth has employed a combination of the best statistical methodologies available to achieve the highest degree of accuracy.


  • Barber, J.A. and Thompson, S.G. 2000. Analysis of cost data in randomized trials: an application of the non-parametric bootstrap. Statistics in Medicine, 19, 3219-3236.
  • Bland, J.M. and Kerry, S. 2008. Weighted comparison of means. BMJ, 316, 129.
  • Efron, B. 1987. Better bootstrap confidence intervals (with comments). Journal of the American Statistical Association, 82(397), 171-200.
  • Efron, B. and Tibshirani, R.J. 1993. An Introduction to the Bootstrap. Chapman and Hall, New York.
  • Fagerland, M.W. and Sandvik, L. 2009. Performance of five two-sample location tests for skewed distributions with unequal variance. Contemporary Clinical Trials, 30, 490-496.