Customizing NextLift™ Algorithms to Drive Better Outcomes
By Dr. Douglas Popken,
SVP of Analytics
Changing consumer behavior is complicated. Many factors influence health-related behavior, both systematic (weather, changing regulations, general economic conditions, cultural influences) and individual sources of random variation (health status, mood, prior-held beliefs, financial circumstances). Healthcare marketing teams craft messages targeted at consumers to try to influence their behavior, but even when the teams observe behavioral changes in the population, they often cannot pinpoint whether their messages or some other factor caused the change. That makes it difficult, if not impossible, to know how to allocate intervention resources efficiently for maximum impact.
At NextHealth Technologies, we designed our automated platform and methodology to address this critical issue. NextLift™ is a module in the platform that precisely calculates both the lift and its statistical significance for each campaign or “nudge”. By using randomized controlled trials, or RCTs, we can isolate whether a change is due to the nudge itself or to some other systematic factor. By considering significance, we can also judge whether the measured lift simply reflects random variation in the data.
On a more technical level, the statistical tool for distinguishing an underlying effect from random variation is the p-value. The p-value is a probability: the probability of observing a lift at least as extreme as the one actually observed if the “null hypothesis” (that the true lift is zero) were true. The p-value provides a scientific basis for claiming causality. To claim success, the p-value needs to be small. At NextHealth, our baseline statistical standard is that if p <= .05, then the observed lift can be attributed to the nudge. Put another way, if a nudge’s p-value is calculated to be .05, there is only a 5% chance that random variation alone would have produced a lift this extreme, so we can be reasonably confident that the nudge caused the change in consumer behavior.
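The idea behind the p-value can be made concrete with a small simulation. The sketch below (illustrative only, with made-up numbers; it is not part of NextLift) asks: if the null hypothesis were true and both groups were drawn from the same distribution, how often would random variation alone produce a lift as extreme as a hypothetical observed lift?

```python
# Illustrative simulation of what a p-value measures. Under the null
# hypothesis (zero true lift), both groups come from the SAME distribution;
# we count how often chance alone yields a lift as extreme as the observed one.
import random

random.seed(42)

observed_lift = 0.9          # hypothetical observed difference in group means
n_trials, n_controls = 50, 50

n_sims = 10_000
extreme = 0
for _ in range(n_sims):
    # Draw both groups from the same distribution: the null hypothesis.
    trials = [random.gauss(5.0, 2.0) for _ in range(n_trials)]
    controls = [random.gauss(5.0, 2.0) for _ in range(n_controls)]
    lift = sum(trials) / n_trials - sum(controls) / n_controls
    if abs(lift) >= observed_lift:
        extreme += 1

# Two-sided empirical p-value: the fraction of null-hypothesis worlds that
# produced a lift at least as extreme as the one observed.
p_value = extreme / n_sims
print(p_value)
```

With these particular numbers the empirical p-value comes out small (a few percent), so a lift of 0.9 would be unlikely to arise from random variation alone.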
To determine the p-value, NextHealth uses a modification of Welch’s t-test, a well-known statistical test for comparing the means of two populations with unequal variances. Other methodologies are available, but research has shown that Welch’s t-test generally provides the most accurate results for the type of data we typically work with (Fagerland and Sandvik, 2009).
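For reference, the standard (unweighted) Welch statistic for two groups with sample means, sample variances, and sizes $\bar{x}_i$, $s_i^2$, $n_i$ is the textbook form below; the degrees of freedom come from the Welch–Satterthwaite approximation:

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{s_1^2/n_1 + s_2^2/n_2}},
\qquad
\nu \approx
\frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}
     {\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1}
      + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
```

The p-value is then read from a t distribution with $\nu$ degrees of freedom.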
The modification NextHealth has made to the standard Welch’s t-test is to weight the observations in each group by the duration of the observation period (in years). A similar approach is described in Bland and Kerry (2008) for a t-test that assumes a single pooled variance across the samples. The motivation for weighting is that our underlying observations are annualized, but they cover observation periods of varying length. Without weighting, a non-zero count from a short observation period can become disproportionately large when annualized, inflating the variance of the data set, particularly in the early stages of a campaign or program.
For example, assume we are comparing the number of ER visits made by trials and controls, and consider two members, A and B. Member A was nudged 3 months ago, while member B was nudged 6 months ago. Both have experienced 2 ER visits since their nudge date (or assignment date, in the case of controls). Member A then has an annualized score of 8 visits/year with a weight of .25, and member B has an annualized score of 4 visits/year with a weight of .50. The weighted average for the two members is ((.25)(8) + (.5)(4))/(.25 + .5) = (2 + 2)/.75 = 5.33 visits/year. Equivalently, we can simply divide the total visits (4) by the total observation period (.75 years) to arrive at the same result. Without weighting, we would say the two members average 6 visits/year.
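The arithmetic in the worked example above can be checked in a few lines of Python (a quick sketch with the same made-up numbers, not NextLift code):

```python
# Duration-weighted vs. unweighted average of annualized ER-visit rates
# for the two hypothetical members A and B.

visits = [2, 2]            # ER visits since nudge/assignment for members A, B
periods = [0.25, 0.50]     # observation periods in years (3 and 6 months)

# Annualize each member's count: visits per year.
annualized = [v / p for v, p in zip(visits, periods)]   # A: 8.0, B: 4.0

# Duration-weighted mean: weight each annualized score by its period.
weighted_mean = sum(p * a for p, a in zip(periods, annualized)) / sum(periods)

# Equivalent shortcut: total visits divided by total observation time.
pooled_rate = sum(visits) / sum(periods)

# Unweighted mean, for comparison.
unweighted_mean = sum(annualized) / len(annualized)

print(round(weighted_mean, 2), round(pooled_rate, 2), unweighted_mean)
# 5.33 5.33 6.0
```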
Let the annualized values for trials and controls be denoted Trial Scores and Control Scores. Without duration weighting, we could simply feed these two sets of values into a standard Python library function, ttest_ind, which would compute the p-value for us. (The unequal-variances case is handled automatically within this function by setting the appropriate input parameter.) Instead, we have customized this function to take two additional parameters, Trial Periods and Control Periods, corresponding to the observation periods that produced the underlying scores. To compute the weighted mean and weighted variance of each group, we rely on the standard equations for weighted sample variance with reliability weights. Our parameters Trial Scores and Trial Periods replace the xi and wi parameters of those equations (and similarly for Control Scores and Control Periods). The weighted means and weighted variances computed from these equations then replace the unweighted mean and variance parameters in Welch’s t-test. The calculation of mean lift, confidence intervals, and p-value then proceeds in the standard fashion for that test.
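To make the approach concrete, here is a minimal sketch of a duration-weighted Welch test in pure Python. This is not NextHealth's production code: the function names are our own, and the use of the Kish effective sample size (sum of weights squared over sum of squared weights) to plug the weighted variances into the Welch formula is one common choice and an assumption on our part.

```python
# Sketch of a duration-weighted Welch's t-test. Weighted means and variances
# use the "reliability weights" form of the unbiased weighted sample variance;
# the effective sample size n_eff = (sum w)^2 / (sum w^2) is assumed here when
# substituting into the Welch statistic and degrees of freedom.
import math

def weighted_mean_var(scores, periods):
    """Weighted mean, unbiased (reliability-weights) variance, effective n."""
    v1 = sum(periods)                      # sum of weights
    v2 = sum(w * w for w in periods)       # sum of squared weights
    mean = sum(w * x for w, x in zip(periods, scores)) / v1
    var = sum(w * (x - mean) ** 2
              for w, x in zip(periods, scores)) / (v1 - v2 / v1)
    n_eff = v1 * v1 / v2                   # Kish effective sample size
    return mean, var, n_eff

def weighted_welch_t(trial_scores, trial_periods,
                     control_scores, control_periods):
    """Welch statistic and degrees of freedom on duration-weighted data."""
    m1, s1, n1 = weighted_mean_var(trial_scores, trial_periods)
    m2, s2, n2 = weighted_mean_var(control_scores, control_periods)
    se2 = s1 / n1 + s2 / n2                # squared standard error of the lift
    t = (m1 - m2) / math.sqrt(se2)
    # Welch-Satterthwaite degrees of freedom, with n_eff in place of n.
    df = se2 ** 2 / ((s1 / n1) ** 2 / (n1 - 1) + (s2 / n2) ** 2 / (n2 - 1))
    return t, df

# The p-value would then come from a t distribution with df degrees of
# freedom, e.g. p = 2 * scipy.stats.t.sf(abs(t), df).
```

A useful sanity check: when every weight equals 1, the weighted mean, variance, and effective sample size reduce to their ordinary unweighted counterparts, so the function reproduces the standard Welch's t-test.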
In summary, we have found that Welch’s t-test, adjusted for duration-weighted observations, provides a superior, stable, and reliable measure of statistical significance.
We would be happy to provide a demo of the system so you can learn more about our advanced analytics.
Bland, J.M. and S. Kerry. 2008. Weighted comparison of means. BMJ, 316, 129.
Fagerland, M.W. and L. Sandvik. 2009. Performance of five two-sample location tests for skewed distributions with unequal variance. Contemporary Clinical Trials, 30, 490-496.