4.4. Power
After completing the steps in the previous section, businesses face an additional decision: how much data to collect in their experiment. Of course, more data is always better, but there are usually economic limitations that prevent one from collecting an arbitrarily large sample. Therefore, we face a trade-off: large samples may be prohibitively expensive, but if our sample is too small we may not have enough “power” to detect a significant difference between our treatment and control groups. That is, we may not have enough data to reject the null when there truly is a difference between the two groups. Power analysis allows us to calculate the minimum sample size required to detect a difference, given that one really exists.
Using the language of hypothesis testing from Hypothesis Testing, Musicfi plans to test the following hypotheses, where \(\mu_0\) is the mean of the control group and \(\mu_1\) is the mean of the treatment group:
\(H_0: \mu_0 = \mu_1\)
\(H_a: \mu_0 \ne \mu_1\)
Imagine that the alternative hypothesis (\(H_a\)) is true, i.e., the mean streaming minutes is different for customers in the control and treatment groups. Even under this scenario, there is no guarantee that our analysis will detect the difference because we have a sample of customers (fortunately a random sample) and not the population of all possible customers. The question then becomes: what is the smallest sample we can collect that will still provide a reasonable chance of detecting a difference if one exists?
To tackle this question, think through the following scenarios. What do you think would require more data:
(a) Detecting a difference in streaming time between free and premium users when the true average difference is five minutes, or (b) Detecting a difference when the true average difference is only two minutes?
Intuitively, the larger the true difference, the more likely that difference will manifest itself in our random sample. Therefore, we would need more data in scenario (b) to pick up on the difference than we would in scenario (a).
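To make this intuition concrete, the sketch below compares the two scenarios using base R's power.t.test() function (from the stats package, not the pwr package used later in this section). The standard deviation of 5.7 minutes is a made-up value chosen purely for illustration; it is not estimated from Musicfi's data.

```r
# Hypothetical standard deviation of streaming minutes (an assumption for illustration)
sdAssumed = 5.7

# Per-group sample size needed for 80% power at a 5% significance level
nFiveMinutes = power.t.test(delta = 5, sd = sdAssumed, power = 0.8, sig.level = 0.05)$n
nTwoMinutes = power.t.test(delta = 2, sd = sdAssumed, power = 0.8, sig.level = 0.05)$n

# Round up, since we cannot recruit a fraction of a participant
ceiling(c(scenarioA = nFiveMinutes, scenarioB = nTwoMinutes))
```

Under this assumed standard deviation, scenario (b) requires several times as many participants per group as scenario (a).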
Let’s imagine that, based on the business context, Musicfi does not care if the difference between free and premium users is less than two minutes; in other words, Musicfi would consider any true difference below two minutes negligible. Therefore, they want to determine the smallest possible sample size that still has a good chance of detecting a difference of two minutes or greater. Note this implies that if the true difference is less than two minutes, the experiment will likely not pick up on that difference. Determining the minimum detectable difference is an important step in a power analysis, and one that requires input from managers who have business area expertise.
To conduct a power analysis, we need several pieces of information:
1. The significance level of the test (\(\alpha\)). As before, we will use a significance level of 0.05.
2. The power of the test (\(1-\beta\)). This is the probability that our test will detect the difference given that there is a difference in the population. A common choice for \(1-\beta\) is 0.8. Note that this implies there is still a \(\beta\) = 20% chance our test will not detect the difference when a difference actually exists.
3. The treatment effect, defined as the difference between the two groups (\(\mu_1 - \mu_0\)). Of course, we don’t know the true population treatment effect; this is the quantity we are trying to estimate from the experiment. However, to determine how big our sample size should be, we need a plausible guess of the smallest effect that we would want to detect in our study. Typically, the estimate is based on historical data; we usually shrink the estimate closer to zero because our data is likely subject to unobserved confounding.
4. The pooled standard deviation of the two groups, usually estimated from historical data.
We then combine the treatment effect and the pooled standard deviation (items 3 and 4) to compute the normalized difference in means \(d\) (known as the effect size, or Cohen’s \(d\)) by dividing our estimate of the treatment effect by the pooled standard deviation.
For example, with the Musicfi data we want to observe a minimum difference of two minutes, so we would calculate \(d\) as:
\(d = \frac{2}{s}\)
The quantity \(s\) is the pooled standard deviation of the two groups, and is calculated from the standard deviations of the control group and the treatment group (\(s_0\) and \(s_1\) respectively) as:
\(s = \sqrt{\frac{s_0^2 + s_1^2}{2}}\)
These quantities typically need to be estimated from historical data; in our case, we can use the data from Musicfi’s previous experiment, which is stored in musicfiExp.
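Before turning to software, it can help to see how the significance level, the power, and the effect size jointly drive the sample size. A common textbook normal approximation for a two-sided, two-sample comparison with equal group sizes is \(n \approx 2(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2\) participants per group. The sketch below implements that approximation; the function name approxSampleSize is our own, and the result is only a rough guide rather than the exact t-based calculation performed by pwr.t.test().

```r
# Approximate per-group sample size for a two-sided, two-sample comparison,
# using the normal approximation n ~ 2 * (z_{1-alpha/2} + z_{1-beta})^2 / d^2
approxSampleSize = function(d, power = 0.8, sig.level = 0.05) {
  zAlpha = qnorm(1 - sig.level / 2)  # critical value for the two-sided test
  zPower = qnorm(power)              # normal quantile for the desired power
  2 * (zAlpha + zPower)^2 / d^2
}

# Cohen's conventional "small", "medium", and "large" effect sizes
approxSampleSize(c(small = 0.2, medium = 0.5, large = 0.8))
```

Because \(d\) is squared in the denominator, halving the effect size roughly quadruples the required sample size. The approximation ignores the extra uncertainty from estimating the standard deviation, so it tends to come out slightly below the t-based answer.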
After determining the three inputs using our historical data, we use the pwr.t.test() function from R’s pwr package to calculate the required sample size.
Syntax
pwr::pwr.t.test(d, power, sig.level=0.05)
Required arguments
d: The effect size, i.e., Cohen’s \(d\).
power: The power of the test (\(1-\beta\)).
Optional arguments
sig.level: The significance level of the test (\(\alpha\)). Defaults to 0.05.
Let’s say we want to design a new experiment that will have an 80% chance of detecting a difference of two minutes or greater at a 5% significance level. We can use the pwr.t.test() function to calculate the minimum sample size we will need for this experiment.
First, we need to calculate \(s\) based on our historical data:
# Create separate data frames for free and premium users
musicfiExpFree = subset(musicfiExp, AccountType=="Free")
musicfiExpPremium = subset(musicfiExp, AccountType=="Premium")
# Calculate the sample standard deviation of streaming minutes for free and premium users
s0 = sd(musicfiExpFree$StreamingMinutes)
s1 = sd(musicfiExpPremium$StreamingMinutes)
# Calculate the estimate of the pooled standard deviation
s = sqrt((s0^2 + s1^2)/2)
Now we can use s to calculate Cohen’s \(d\):
d = 2/s
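The formula for s above gives the two groups equal weight, which is reasonable when they are of similar size. If the historical groups differ noticeably in size, a common alternative weights each group's variance by its degrees of freedom. The sketch below shows that variant; the function name pooledSD and the simulated vectors are hypothetical and are not part of Musicfi's data.

```r
# Pooled standard deviation that weights each group's variance by its degrees of freedom
pooledSD = function(x0, x1) {
  n0 = length(x0)
  n1 = length(x1)
  sqrt(((n0 - 1) * var(x0) + (n1 - 1) * var(x1)) / (n0 + n1 - 2))
}

# Made-up streaming minutes for illustration only (not Musicfi's data)
set.seed(1)
freeMinutes = rnorm(200, mean = 30, sd = 5)
premiumMinutes = rnorm(300, mean = 32, sd = 6)

sWeighted = pooledSD(freeMinutes, premiumMinutes)
dWeighted = 2 / sWeighted  # Cohen's d for a two-minute minimum detectable difference
```

When the two groups have the same number of observations, this weighted version reduces to the simpler formula used above.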
Finally, we can apply the pwr.t.test() function to determine our sample size:
# Load the pwr package, which provides pwr.t.test()
library(pwr)
pwr.t.test(d=d, power=0.8, sig.level=0.05)
Two-sample t test power calculation
n = 127.7748
d = 0.3518408
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
These results indicate that (rounding up) we need at least 128 participants in each of the control and treatment groups, for a total sample size of 256.
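As a sanity check, the sketch below verifies this result in two ways using only base R. It first repeats the calculation with power.t.test() from the stats package, and then simulates many experiments of 128 participants per group to estimate how often a t-test detects the difference. The variable names, the seed, and the number of simulations are our own choices; the effect size is copied from the output above. Working in standardized units (standard deviation of 1) means the true mean difference equals Cohen's \(d\).

```r
# Inputs copied from the pwr.t.test() output above
dCheck = 0.3518408  # Cohen's d
nCheck = 128        # participants per group, rounded up

# Cross-check with base R: with sd = 1, delta is the same as Cohen's d
nBase = power.t.test(delta = dCheck, sd = 1, power = 0.8, sig.level = 0.05)$n
ceiling(nBase)

# Simulation: repeat the experiment many times with a true standardized
# difference of dCheck and record how often the null is rejected at the 5% level
set.seed(42)
nSims = 2000
rejected = replicate(nSims, {
  control = rnorm(nCheck, mean = 0, sd = 1)
  treatment = rnorm(nCheck, mean = dCheck, sd = 1)
  t.test(treatment, control, var.equal = TRUE)$p.value < 0.05
})

# Share of simulated experiments that detected the difference (estimated power)
estPower = mean(rejected)
estPower
```

The estimated power should land close to the 0.8 target, with some simulation noise that shrinks as nSims grows.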