Reading about power, I found an old World Bank Impact Evaluations blog post by Berk Özler on the perils of basing your power calcs in standard deviations without relating those SDs back to the real-life context.
Özler summarizes his main points quite succinctly himself:
- “Think about the meaningful effect size in your context and given program costs and aims.
- Power your study for large effects, which are less likely to disappear in the longer run.
- Try to use all the tricks in the book to improve power and squeeze more out of every dollar you’re spending.”
He gives a nice, clear example to demonstrate: a 0.3 SD detectable effect size sounds impressive, but in some datasets it would amount to only a 5% improvement, which might not be meaningful in context:
“If, in the absence of the program, you would have made $1,000 per month, now you’re making $1,050. Is that a large increase? I guess, we could debate this, but I don’t think so: many safety net cash transfer programs in developing countries are much more generous than that. So, we could have just given that money away in a palliative program – but I’d want much more from my productive inclusion program with all its bells and whistles.”
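To make the arithmetic concrete, here is a minimal sketch of converting an effect size in SD units back into a percent change. The $1,000 mean comes from the quote; the standard deviation is my own assumption, backed out so that 0.3 SD equals $50, and the function name is hypothetical.

```python
def sd_effect_to_percent(effect_sd, control_mean, control_sd):
    """Convert an effect in SD units into a percent change relative to the control mean."""
    absolute_effect = effect_sd * control_sd  # effect in outcome units (here, dollars)
    return 100.0 * absolute_effect / control_mean

# The $1,000 control mean is from the quote. The SD is an assumption,
# chosen so that a 0.3 SD effect works out to $50 per month.
control_mean = 1000.0
control_sd = 50.0 / 0.3  # roughly $167

pct = sd_effect_to_percent(0.3, control_mean, control_sd)
print(f"0.3 SD is a {pct:.1f}% increase over the control mean")
```

With a noisier outcome (a larger SD relative to the mean), the same 0.3 SD would be a bigger percent change, which is exactly why the SD figure alone tells you so little.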
Usually (in an academic setting), your goal is to have enough power to detect even a really small effect size, so you can get a significant result. But Özler makes the opposite point: it can be advantageous to power your study only for an effect size that is meaningful in context, giving up power against smaller effects in exchange for lower cost.
He also advises, like the article I posted about yesterday, that piloting could help improve power calculations via better ICC estimates: “Furthermore, try to get a good estimate of the ICC – perhaps during the pilot phase by using a few clusters rather than just one: it may cost a little more at that time, but could save a lot more during the regular survey phase.”
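To see why the ICC estimate matters so much, here is a rough sketch of how the number of clusters moves with the ICC and the target effect size. It is not Özler's calculation: it assumes two equal arms, equal cluster sizes, a normal approximation, and the standard design effect 1 + (m - 1) × ICC. The function name and the example inputs are hypothetical.

```python
import math
from statistics import NormalDist

def clusters_per_arm(effect_sd, icc, cluster_size, alpha=0.05, power=0.80):
    """Approximate clusters needed PER ARM in a two-arm cluster RCT.

    Assumes equal arms, equal cluster sizes, and a normal approximation.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    # Individuals per arm if there were no clustering at all.
    n_individual = 2 * (z_alpha + z_power) ** 2 / effect_sd ** 2
    # Inflate for clustering: design effect = 1 + (m - 1) * ICC.
    design_effect = 1 + (cluster_size - 1) * icc
    return math.ceil(n_individual * design_effect / cluster_size)

# Hypothetical inputs: 20 respondents per cluster, two candidate ICCs.
for icc in (0.05, 0.10):
    for effect in (0.2, 0.3):
        per_arm = clusters_per_arm(effect, icc, cluster_size=20)
        print(f"ICC={icc:.2f}, effect={effect} SD: "
              f"{per_arm} clusters per arm, {2 * per_arm} in total")
```

Printing both the per-arm and the total figure is deliberate: it removes any doubt about which one a number refers to.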
My only issue with Özler’s post is his chart, which shows the tradeoffs between effect size and the number of clusters. His horizontal axis is labeled “Total number of clusters” – per arm or in total, Berk?!? It’s per arm, not total across all arms. There should be more standardized and intuitive language for describing sample size in power calcs.